Many companies use predictive models to provide better customer service, sell more products and services, manage the risk of fraudulent activity, plan the use of their human resources more effectively, and so on. How does predictive analysis provide all these opportunities? In this article we will walk through the process of predictive analytics.


The starting point of your analysis is determining the goal of the project. A clearly defined task is important both for choosing the right predictive analytics methodology and for selecting the necessary data sets.


The next step is to collect data. Depending on your goal, various sources are used, such as web archives, transaction data, CRM data, customer service data, digital marketing and advertising data, demographic data, machine-generated data (for example, telemetry or sensor data), geographical data, and others. It is important to have accurate and up-to-date information. Most of the time, you will have information from multiple sources, and quite often in a raw state. Some of it will be structured in tables, while other parts will be semi-structured or even unstructured, like social media comments.
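As a minimal sketch of what pulling such data together can look like in Python with pandas, assuming hypothetical file names and a shared customer identifier:

```python
import pandas as pd

# Hypothetical sources: a structured CSV of transactions and a
# semi-structured JSON export of customer service tickets.
transactions = pd.read_csv("transactions.csv", parse_dates=["date"])
tickets = pd.read_json("tickets.json")                    # nested fields stay as dicts
tickets = pd.json_normalize(tickets.to_dict("records"))   # flatten nested fields

# Combine everything into a single raw table keyed by customer.
raw = transactions.merge(tickets, on="customer_id", how="left")
print(raw.shape)
```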


So, the third step seems obvious: the analysts must organize the data. This is called data preprocessing. Data preprocessing includes cleaning, normalization, transformation, feature extraction, and feature selection. Cleaning is applied to raw data, which can be incomplete, noisy (e.g., containing errors or outlier values), or inconsistent. If your data comes from multiple sources, you also need to combine it into a coherent store. Feature selection is needed to reduce computation time, since complex analysis of a huge amount of data can take a very long time. Preprocessing usually takes about 80% of the time and effort of the whole analysis.
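Here is a minimal preprocessing sketch using pandas and scikit-learn; the column names, target label, and thresholds are assumptions made for illustration, not part of any particular project:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# 'raw' is the combined table from the previous step (hypothetical columns).
df = raw.drop_duplicates()

# Cleaning: fill missing numeric values and drop rows without a target label.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df = df.dropna(subset=["churned"])  # hypothetical target column

# Normalization: put numeric features on a comparable scale.
features = df[numeric_cols].drop(columns=["churned"], errors="ignore")
scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                      columns=features.columns, index=features.index)

# Simple feature selection: drop near-constant columns.
selector = VarianceThreshold(threshold=0.01)
selected = selector.fit_transform(scaled)
print(selected.shape)
```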


Once you have the data in a clean, well-prepared format, you need to develop a model. In most cases, the best solution is to use existing techniques, such as Decision Trees, various linear models, Logistic Regression, Neural Networks, and many more. You can find these tools in libraries built on open-source programming languages (for example, R and Python). The data scientist's task is to know the available model types and choose the best one for the job. In general, they can be grouped into three main types:


  1. Classification is used when we need to predict the category of a new data point based on its characteristics. Classification algorithms are useful for customer segmentation, spam detection, text analysis, etc. Classification methods include Decision Trees, Random Forests, Naïve Bayes, k-Nearest Neighbors, and other techniques.

  2. Regression is used for predicting outputs that are continuous. Price optimization and stock price prediction are typical cases where regression algorithms are used. Common regression methods include Linear Regression, Polynomial Regression, and Ridge Regression.

  3. Combining models: several techniques can be applied together to produce better predictions than any single model alone (see the sketch after this list).
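As a rough illustration of the first and third types, the sketch below fits two individual classifiers and combines them in a simple voting ensemble with scikit-learn; the synthetic data set and parameter values are assumptions, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real customer data set.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
logreg = LogisticRegression(max_iter=1_000)

# Combining models: the ensemble averages the two classifiers' votes.
ensemble = VotingClassifier([("tree", tree), ("logreg", logreg)], voting="soft")

for name, model in [("tree", tree), ("logreg", logreg), ("ensemble", ensemble)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```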

Then the training data set is used to train the model and optimize its parameters. When the training is complete, you try the model on new data to see how well it performs (model validation).
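A minimal sketch of this train-then-validate step, reusing the classification setup from the sketch above; the size of the held-out split is an arbitrary choice:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold back 25% of the data to stand in for "new" data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

ensemble.fit(X_train, y_train)              # training
predictions = ensemble.predict(X_valid)     # try the model on unseen data
print("validation accuracy:", accuracy_score(y_valid, predictions))
```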


Then you must ensure that your model makes business sense and deploy the analytics results into your production systems: software programs or devices, web apps, and so on. The model is only valid for a certain period, since reality is not static and the environment can change significantly. For example, customer preferences may change so fast that previous expectations become outdated. So, it is important to monitor the model periodically.
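One simple way to implement such periodic monitoring is to track the model's accuracy on freshly labeled data and flag when it drifts too far from the validation baseline; the tolerance below and the "recent data" inputs are hypothetical:

```python
from sklearn.metrics import accuracy_score

def performance_has_degraded(model, X_recent, y_recent,
                             baseline_accuracy, tolerance=0.05):
    """Return True if recent accuracy falls notably below the baseline."""
    recent_accuracy = accuracy_score(y_recent, model.predict(X_recent))
    return recent_accuracy < baseline_accuracy - tolerance

# Hypothetical usage with newly collected, labeled data:
# if performance_has_degraded(ensemble, X_recent, y_recent, baseline_accuracy=0.91):
#     print("Model drift detected: consider retraining.")
```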