Machine Learning Work-Flow (Part 1)

I am planning to write a series of posts explaining a basic Machine Learning work-flow (mostly supervised). In this post, my aim is to give the bird's-eye view; I'll delve into each of the components in detail in later posts. I decided to write this series for two reasons: the first is self-education, to get all my bits and pieces together after a period of theoretical research and industrial practice; the second is to present a simple guide for beginners and enthusiasts.

Below is the overview of the proposed work-flow. The color code indicates the knowledge base each component draws on. Each box has a color tone from YELLOW to RED. The yellower the box, the more the component relies on the Statistics knowledge base. As the box turns red (gets darker), the component depends more heavily on the Machine Learning knowledge base. By this I also imply that, without a good statistical understanding, we are not able to construct a sound machine learning pipeline. As a footnote, this schema has been changed by the post-modernism of Representation Learning algorithms, and I'll touch on this in later posts.


Let's go through the schema. We start with the raw data (documents, images, sound recordings, etc.), from which we need to extract a feature representation. Feature Extraction is necessary to convert each raw data item into a normalized form that the ML algorithm is able to understand. For example, the Bag of Words representation converts documents of different lengths into vectors of the same length.
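To make the Bag of Words idea concrete, here is a minimal sketch in plain Python. The function names (build_vocabulary, bag_of_words) and the whitespace tokenizer are my own assumptions for illustration; a real pipeline would use a proper tokenizer and a library implementation.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase token to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Two documents of different lengths (hypothetical toy data).
docs = ["the cat sat", "the dog chased the cat around the yard"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Both documents end up as vectors of the same length (the vocabulary size), no matter how many words each one contains.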

The next step is the Sanity-Check, which ensures the quality and suitability of the feature values. I believe that this step is ignored by most ML courses as well as by the community itself. However, mistakes at this stage are costly precisely because they are so darn hard to spot. For instance, you do everything right but your final prediction accuracy is very low compared to your expectation. You debug the whole pipeline over and over again but nothing seems flawed. Then, after many hours, you notice a Null value in the 2965th feature vector. Yes, you f**ked up! :) I am quite sure that, if you have some level of experience in the field, you have suffered from the same thing or at least something similar.
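A sanity check of this kind can be sketched in a few lines. This is only an illustration under my own assumptions: the features are a list of rows of numbers, and the helper names (check_shape, find_bad_values) are hypothetical.

```python
import math


def check_shape(features):
    """Raise if the rows do not all have the same length."""
    lengths = {len(row) for row in features}
    if len(lengths) > 1:
        raise ValueError(f"inconsistent row lengths: {sorted(lengths)}")


def find_bad_values(features):
    """Return (row, column, value) for every missing or non-finite entry."""
    problems = []
    for r, row in enumerate(features):
        for c, value in enumerate(row):
            # None stands for a missing value; NaN and +/-inf are non-finite.
            if value is None or not math.isfinite(value):
                problems.append((r, c, value))
    return problems


# Hypothetical feature matrix with three planted problems.
X = [[0.5, 1.2], [float("nan"), 3.0], [2.0, None], [1.0, float("inf")]]
check_shape(X)
bad = find_bad_values(X)
```

Reporting the row and column of each offending value is the point: it turns hours of debugging into a single lookup.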

Next, we need to split our dataset into train, validation and held-out sets. How you split depends on the methods you choose for the Model Training and Evaluation steps. For instance, if you expect to use a Neural Network at the Model Training step and Cross-Validation at the Evaluation step, then a train, validation, held-out split might be the best choice. Use the train data with K-fold Cross-Validation for training and hyper-parameter optimization, then assess the final model with the validation data. Use the held-out set only at the very last stage, for the final performance measurement.
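A three-way split could look like the sketch below. The 60/20/20 proportions, the fixed seed and the function name three_way_split are assumptions of mine, not a recommendation; the right proportions depend on how much data you have.

```python
import random


def three_way_split(samples, val_fraction=0.2, held_out_fraction=0.2, seed=0):
    """Shuffle once, then cut into train, validation and held-out parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(len(shuffled) * val_fraction)
    n_held = int(len(shuffled) * held_out_fraction)
    held_out = shuffled[:n_held]
    validation = shuffled[n_held:n_held + n_val]
    train = shuffled[n_held + n_val:]
    return train, validation, held_out


# 100 hypothetical sample ids.
train, validation, held_out = three_way_split(range(100))
```

The three parts are disjoint by construction, so nothing from the held-out set can leak into training.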

Feature Preprocessing is another block tied to your ML model. At this step we transform the train data into the scales that the targeted ML algorithm expects. For instance, SVMs generally work best with features standardized to mean=0, std=1. If you forget to preprocess your data, you are very likely to see awkward behaviour from the ML algorithm. Maybe it takes too long to converge, or the model weights fluctuate over a very strange range of values. Moreover, do not forget to apply another sanity-check to the feature values after preprocessing. For instance, it is very common to divide by 0 during normalization, as happens when a feature is constant and its standard deviation is 0. The result is a NaN or infinite value.
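Here is a minimal sketch of standardization that guards against the division by zero mentioned above. The statistics are computed on the train data only and then reused for the other sets. Mapping constant columns to 0.0 is my own choice for the sketch; dropping such columns is an equally reasonable option.

```python
import statistics


def fit_scaler(train_rows):
    """Compute per-column mean and standard deviation on the train data only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows; constant columns (std == 0) are mapped to 0.0."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical data; the third column is constant on purpose.
train = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
validation = [[2.0, 40.0, 5.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
validation_scaled = transform(validation, means, stds)  # same means and stds
```

Note that the validation rows are scaled with the train statistics, not with their own; otherwise the model would see differently scaled inputs at evaluation time.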

Model Training is the fun part. You have a bunch of algorithms waiting to be applied. Unfortunately, this is just 5% of your hands-on work in the overall work-flow. From the execution perspective, however, it is by far the most time-consuming step. One of the common mistakes at this step is to try a random set of algorithms without any reasoning. This is like looking for a needle in a haystack. To find the correct algorithm, we should investigate our problem first. The choice of loss function, learning algorithm, regularizer term and all the other parameters is totally problem specific. I will explain more in the coming posts... Model Training includes hyper-parameter optimization, and it gives the first insights about your data and the quality of your preceding steps. For example, suppose you apply Grid-Search for parameter selection with K-fold Cross-Validation and you observe very different scores across folds for each candidate value. That is an early indicator of insufficient training data or an inappropriate feature representation. Then you need to go back and iterate on the earlier steps.
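The Grid-Search plus K-fold idea can be sketched as follows. To keep it self-contained I use a deliberately silly toy model (predict a shrunken mean of the train targets), where "shrinkage" is the hyper-parameter; the model, the candidate values and all function names are assumptions for illustration only. The folds are contiguous, so the data is assumed to be shuffled already.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def fold_mse(y, train_idx, test_idx, shrinkage):
    """Toy model: predict a shrunken mean of the train targets, score by MSE."""
    prediction = sum(y[i] for i in train_idx) / (len(train_idx) + shrinkage)
    return statistics.fmean([(y[i] - prediction) ** 2 for i in test_idx])


def grid_search(y, candidates, k=5):
    """Return {candidate: (mean MSE, std of MSE across folds)}."""
    results = {}
    for shrinkage in candidates:
        scores = [
            fold_mse(y, tr, te, shrinkage)
            for tr, te in k_fold_indices(len(y), k)
        ]
        results[shrinkage] = (statistics.fmean(scores), statistics.pstdev(scores))
    return results


# Hypothetical targets clustered around 2.0.
y = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]
results = grid_search(y, candidates=[0.0, 1.0, 10.0])
best = min(results, key=lambda c: results[c][0])  # lowest mean MSE wins
```

Keeping the per-candidate standard deviation next to the mean is what lets you spot the warning sign described above: a large spread across folds suggests the data or the features are the problem, not the hyper-parameter.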

Model Evaluation measures the quality of your freshly trained model on the validation set, which has not been touched at any of the former steps. First, we preprocess the validation set with the same method applied to the train data, then we feed it to our model. This step is defined by the choice of quality measure (ROC, RMSE, F-Score), which should be the same measure used for training. Keep in mind that different measures give different insights about your model.
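To see how different measures can tell different stories, consider this small sketch comparing accuracy and F-Score on an imbalanced toy problem. The labels and the "always predict negative" model are made up for illustration, and returning 0.0 when precision and recall are undefined is a convention I am assuming here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when it is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 2 positives out of 10.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # a lazy model that always predicts the negative class

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

The lazy model looks respectable by accuracy yet gets an F-Score of zero, because it never finds a single positive example.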

Don't throw your trained models away. Suppose you trained 5 different NNs with small performance gaps; you can ensemble them with average, max, or any other voting schema. If you look at the Kaggle winners, most of them are ensembles of NNs, SVMs or Random Forests. They even include the poorer models in the ensemble. This is because an ensemble usually generalizes better, to a degree that depends on your ensemble schema. There are some constraints here. Consider the run-time expectations of the final model. As you increase the size of your ensemble, the computational and memory needs also increase. We are talking here about the Scalability vs Accuracy trade-off. Maybe the best example of this is the $1 million Netflix Challenge, where Netflix did not put the winning algorithm into production because of scalability and engineering costs.
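The three simplest ensemble schemas can be sketched like this. The per-model probabilities are invented numbers, and the 0.5 threshold used to turn probabilities into votes is an assumption for the sketch.

```python
from collections import Counter


def average_ensemble(probabilities):
    """Average the per-sample probabilities across models."""
    return [sum(col) / len(col) for col in zip(*probabilities)]


def max_ensemble(probabilities):
    """Take the most confident probability per sample across models."""
    return [max(col) for col in zip(*probabilities)]


def majority_vote(labels):
    """Pick the most frequent label per sample across models."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*labels)]


# One row per model, one column per sample (hypothetical outputs).
model_probs = [
    [0.9, 0.4, 0.2],
    [0.8, 0.6, 0.1],
    [0.7, 0.3, 0.6],
]

avg = average_ensemble(model_probs)
top = max_ensemble(model_probs)
model_labels = [[int(p >= 0.5) for p in probs] for probs in model_probs]
votes = majority_vote(model_labels)
```

With an odd number of models and binary labels the vote can never tie; with an even number you would need an explicit tie-breaking rule.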

Now we are at the final step. This is where you find out whether you get to power off your PC. Measure the performance of your final model (or ensemble of models) with the held-out set. Do not forget to preprocess it just like the validation and train sets. This is the most crucial step, since it indicates the real-life performance of your final model. If your measures are very different from those on the validation data, it means your ensemble schema does not work and you need to change it. However, first be sure that nothing went wrong earlier and that all the models in the ensemble are sound from every possible perspective. Otherwise, poor performance at this step leaves you in a very dangerous situation, with many possible combinations of flaws coming from any model in your ensemble at any step.

This is it for the time being. I tried to explain my proposed ML work-flow with a small amount of nuance. I intend to explain each of these steps in detail, from top to bottom, in new posts released each week (at least that is the plan). More importantly, please leave any kind of comment on this post so we can discuss and re-shape the ideas. Do not hesitate to punch me in the face with my mistakes.