What is a machine learning project? What exactly is a predictive analytics problem statement, and how do we go about solving it? This article answers these questions and raises a few new ones that will help us understand the problem better and solve it more accurately.
At its core, machine learning uses past information to make predictions about the future. Extracting information and learning trends from data is called training or modeling, and using those trends to say something about the future is called prediction. Building an ML model involves several steps: exploratory data analysis, data handling, feature engineering, training, prediction, model evaluation, and performance improvement. We will go through a very basic structure for creating any model and explain the meaning of the basic terms.
Generally, a data set comes in the form of a CSV (comma-separated values) file, an Excel file, a text file, or as images or audio.
For instance, a CSV file or an Excel file has rows and columns. Each row is an observation, and the columns are the features that determine our target. For other types of data, we need to extract the feature information by some prescribed method. For example, the 3-D matrix that defines an image contains pixel values, and it can be converted into a 2-D matrix for training a deep learning model.
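As a rough illustration of both cases, the sketch below loads a tiny CSV into a pandas DataFrame and flattens an array that stands in for an image. The column names, the values, and the 4x4 image size are all made up for this example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; in practice this would be read from a file on disk.
csv_text = """size_sqft,rooms,location,price
850,2,suburb,120000
1200,3,city,210000
1500,4,city,260000
"""

# Each row is one observation; the columns are the features plus the target (price).
df = pd.read_csv(io.StringIO(csv_text))

# A random 4x4 RGB array (height, width, channels) standing in for real pixel data.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3))

# Merge the two spatial axes so that each pixel becomes one row of channel values.
pixels_2d = image.reshape(-1, image.shape[2])

print(df.shape, pixels_2d.shape)
```

How an image is flattened in practice depends on the model being trained; this is only one possible layout.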
For ease, let us assume that we want to predict the price of a house. What do you think are the basic factors that affect the price? (We will come back to this later in the article.)
Read the problem statement very carefully. Many people, including unsuccessful data scientists, make the mistake of skipping this step. Reading the problem is crucial because a lot of information can be extracted from it.
Once you have read it, ask yourself these basic questions:
- What is the problem?
- Why does the problem need to be solved?
- How can we solve the problem?
I find answering these questions crucial to understanding the problem. We can use the best algorithms available and reach top accuracy, but that is meaningless if we solve the wrong problem.
These questions not only introduce us to the problem but also help us understand and validate the collected data and improve the results.
Once you have read the problem statement, it is time to use your knowledge and experience to list the factors on which the target variable depends. This is a kind of brainstorming. It gives us an idea of which factors are important and which are not, which can later be checked in the EDA section.
We were predicting house prices, so what does the price depend on? Maybe the location, the size, the age of the house, the number of rooms, whether central AC is present, parking availability, the population density of the area, and a lot more. Maybe the color of the house is also crucial? These questions can be answered from the data.
It is worth appreciating how vital this step is, and beginners in particular should not skip it. At this first stage the focus should not be on accuracy; the intention should be to become a good data scientist, not merely a problem solver!
Read your data and get comfortable with it. Look at the central tendencies of your data. Ask yourself which features are continuous, which are categorical, whether there are missing values, which data types are involved, and so on.
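These first checks can be sketched with a few pandas calls. The DataFrame below is a made-up house data set, and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical house data with one missing value in each of two columns.
df = pd.DataFrame({
    "size_sqft": [850, 1200, np.nan, 1500, 2000],
    "rooms": [2, 3, 3, 4, 5],
    "location": ["suburb", "city", "city", None, "suburb"],
    "price": [120000, 210000, 200000, 260000, 310000],
})

print(df.describe())  # count, mean, quartiles, etc. of the numeric columns
print(df.dtypes)      # data type of each column

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Split the feature names by type (the target "price" is left out).
features = df.drop(columns="price")
numeric_features = features.select_dtypes(include="number").columns.tolist()
categorical_features = features.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(numeric_features, categorical_features)
```

Note that a numeric dtype does not always mean a continuous feature: a column such as the number of rooms is numeric but discrete, so the split above is only a starting point.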
There are two types of visual analysis: univariate analysis and bivariate analysis.
Univariate analysis visualizes one feature at a time. It shows how the continuous features are distributed and whether there are outliers or missing values. For categorical features, it gives the count of each category. Histograms and box plots are mostly used for continuous features, and count plots are used for categorical data.
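The numbers behind these plots can also be computed directly. The sketch below, on made-up data, applies the 1.5 x IQR rule that a standard box plot uses to flag outliers and counts the categories that a count plot would display. It is a numeric stand-in for the plots, not a replacement for looking at them.

```python
import pandas as pd

# Hypothetical data: one continuous and one categorical feature.
df = pd.DataFrame({
    "price": [100, 110, 120, 130, 140, 150, 1000],
    "location": ["city", "city", "suburb", "city", "rural", "suburb", "city"],
})

# Quartiles and interquartile range of the continuous feature.
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# Frequency of each category, as a count plot would show it.
counts = df["location"].value_counts()

print(outliers)
print(counts)
```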
When we examine the trend of each feature against the target variable, it is called bivariate analysis. How the data are correlated and what impact each feature has on the target variable are the basic questions answered during this analysis.
Bivariate analysis on pairs of features can also reveal trends and patterns that are useful for imputing missing values.
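One way such a pattern might be used: if house size varies clearly with location, a missing size can be filled with the median size for that location. The data and column names below are made up, and the median is just one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the size is missing for one house in each location.
df = pd.DataFrame({
    "location": ["city", "city", "city", "suburb", "suburb", "suburb"],
    "size_sqft": [800.0, 1000.0, np.nan, 1500.0, np.nan, 1700.0],
})

# Median size per location, aligned row by row with the original frame.
group_median = df.groupby("location")["size_sqft"].transform("median")

# Fill each missing size with the median of its own location.
df["size_sqft"] = df["size_sqft"].fillna(group_median)

print(df)
```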
For continuous-continuous pairs, we use scatter plots, which show how strong the linear relationship between the features is. For continuous-categorical pairs, we use violin plots, as they show both the range and the distribution of the feature. For categorical-categorical pairs, we can use a cross table, which is available in the pandas library as the crosstab function.
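Each of the three cases has a simple numeric counterpart in pandas, sketched below on made-up data: a correlation coefficient for the scatter plot, a per-category summary for the violin plot, and a cross table for the two categorical features. The column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical house data with two continuous and two categorical columns.
df = pd.DataFrame({
    "size_sqft": [800, 1000, 1200, 1500, 1700, 2000],
    "price": [100, 130, 150, 200, 220, 260],
    "location": ["city", "city", "suburb", "suburb", "city", "suburb"],
    "parking": ["no", "yes", "yes", "yes", "no", "yes"],
})

# Continuous-continuous: Pearson correlation measures the strength of the linear relationship.
corr = df["size_sqft"].corr(df["price"])

# Continuous-categorical: range and center of the price within each location.
price_by_location = df.groupby("location")["price"].agg(["min", "median", "max"])

# Categorical-categorical: counts for every combination of the two categories.
table = pd.crosstab(df["location"], df["parking"])

print(corr)
print(price_by_location)
print(table)
```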
Feature engineering is the most creative and decisive part of modeling a problem. But just as every difficult problem can be subdivided into smaller, easier problems, this step can be split into two major parts: feature generation and encoding.
Feature generation is the process of creating new features from existing ones. For example, given a date stored as a string, the year and month can be extracted. Another good example: when predicting sales in a mart, price and weight may be given as features, and we can generate a new feature, price per unit weight. When a combined feature like this replaces the columns it was built from, it reduces the dimensionality of the data while retaining its quality.
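Both examples can be sketched in a few lines of pandas. The column names (sale_date, price, weight_kg) and the values are hypothetical.

```python
import pandas as pd

# Hypothetical sales data with the date stored as a string.
df = pd.DataFrame({
    "sale_date": ["2021-01-15", "2021-06-30", "2022-11-05"],
    "price": [200.0, 150.0, 90.0],
    "weight_kg": [4.0, 2.5, 3.0],
})

# Parse the string dates, then pull out the year and the month as new features.
dates = pd.to_datetime(df["sale_date"])
df["year"] = dates.dt.year
df["month"] = dates.dt.month

# Combine two existing features into one: price per unit weight.
df["price_per_kg"] = df["price"] / df["weight_kg"]

print(df)
```

Whether to drop the original price and weight columns afterwards is a modeling decision; the sketch keeps them.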
After all the above steps, our data set is almost ready for model building, but most machine learning algorithms cannot read categorical values such as Male, Female, No, or Yes. We use encoding methods to convert them into numerical values. The common ones are label encoding, one-hot encoding, and count encoding.
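The three encodings named above can be sketched with plain pandas as follows. The data is made up, and libraries such as scikit-learn offer their own encoder classes; this is only one way to do it.

```python
import pandas as pd

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "owns_car": ["No", "Yes", "No", "No", "Yes"],
})

# Label encoding: each category becomes an integer code
# (codes follow the sorted category order, so Female -> 0, Male -> 1).
df["gender_label"] = df["gender"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["owns_car"], prefix="owns_car")

# Count encoding: each category is replaced by how often it occurs.
df["gender_count"] = df["gender"].map(df["gender"].value_counts())

print(df)
print(one_hot)
```

Label encoding imposes an order on the categories, which some algorithms may interpret as meaningful, so one-hot encoding is often the safer choice for unordered categories.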
This step is about selecting a machine learning algorithm and fitting it to our training data. The choice of algorithm depends on whether the problem is supervised or unsupervised and, if supervised, whether it is a classification or a regression problem. Various algorithms are available:
- Linear algorithms: Gradient Descent, Linear Regression, Logistic Regression, Linear Discriminant Analysis
- Nonlinear algorithms: Classification and Regression Trees, Naive Bayes, K-Nearest Neighbors, Learning Vector Quantization, Support Vector Machines
- Ensemble algorithms: Bagging and Random Forest, Boosting and AdaBoost
After fitting a model to the training data, we make predictions on data that the model has not seen before.
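A minimal sketch of fitting and predicting with scikit-learn is shown below. The data is synthetic, generated under the assumption that price depends linearly on size and number of rooms, and linear regression is picked only because it is the simplest choice for such data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price is a linear function of size and rooms, plus some noise.
rng = np.random.default_rng(42)
size = rng.uniform(500, 2500, 200)
rooms = rng.integers(1, 6, 200)
price = 100 * size + 5000 * rooms + rng.normal(0, 1000, 200)

# Feature matrix with one row per house and one column per feature.
X = np.column_stack([size, rooms])

# Hold out a quarter of the data so the model can be tested on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Training: fit the model to the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction: apply the fitted model to the held-out rows.
predictions = model.predict(X_test)

print(model.coef_, predictions[:3])
```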
Several algorithms can produce predictions, but which one is the best? To answer that, we need to evaluate our models. There are various evaluation metrics, such as MSE and MAE for regression, and the ROC curve, F1 score, and log loss for classification, among many others.
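The metrics named above are available in scikit-learn's metrics module. The sketch below computes them on small hand-made label and prediction lists; the numbers are arbitrary, and the 0.5 threshold used to turn probabilities into class labels is an assumption.

```python
from sklearn.metrics import (
    f1_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    roc_auc_score,
)

# Regression: compare predicted values with the true values.
y_true_reg = [100, 150, 200, 250]
y_pred_reg = [110, 140, 210, 230]
mse = mean_squared_error(y_true_reg, y_pred_reg)   # mean of squared errors
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean of absolute errors

# Classification: true labels and predicted probabilities of class 1.
y_true_clf = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.7, 0.9]
y_pred_clf = [int(p >= 0.5) for p in y_prob]  # threshold the probabilities at 0.5

f1 = f1_score(y_true_clf, y_pred_clf)    # harmonic mean of precision and recall
auc = roc_auc_score(y_true_clf, y_prob)  # area under the ROC curve
ll = log_loss(y_true_clf, y_prob)        # penalizes confident wrong probabilities

print(mse, mae, f1, auc, ll)
```

Lower is better for MSE, MAE, and log loss, while higher is better for the F1 score and the area under the ROC curve.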
Scikit-learn's Cheat Sheet
The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
This was a very basic and naive approach to a problem, but it is good enough for beginners. Once you are comfortable with these steps and have developed an understanding of predictive modeling, there are several dimensions along which you can work to improve your skills.
Please share your opinions about the post; I would love to hear them!