Suppose you have a problem that you like to tackle with machine learning and use the resulting system in a real-life project. I like to share my simple pathway for such purpose, in order to provide a basic guide to beginners and keep these things as a reminder to myself. These rules are tricky since even-thought they are simple, it is not that trivial to remember all and suppress your instinct which likes to see a running model as soon as possible.
When we confronted any problem, initially we have numerous learning algorithms, many bytes or gigabytes of data and already established knowledge to apply some of these models to particular problems. With all these in mind, we follow a three stages procedure;
We already discussed first four steps of ML work-flow. So far, we preprocessed crude data by DICTR (Discretization, Integration, Cleaning, Transformation, Reduction), then applied a way of feature extraction procedure to convert data into machine understandable representation, and finally divided data into different bunches like train and test sets . Now, it is time to preprocess feature values and make them ready for the state of art ML model ;).
We need Feature Preprocessing in order to:
Evade scale differences between dimensions.
Convey instances into a bounded region in the space.
Remove correlations between different dimensions.
You may ask “Why are we so concerned about these?” Because
Evading scale differences reduces unit differences between particular feature dimensions. Think about Age and Height of your customers. Age is scaled in years and Height is scaled in cm's. Therefore, these two dimension values are distributed in different manners. We need to resolve this and convert data into a scale invariant representation before training your ML algorithm, especially if you are using one of the linear models like Logistic Regression or SVM (Tree based models are more robust to scale differences).
Conveying instances into a bounded region in the space resolves the representation biases between instances. For instance, if you work on a document classification problem with bag of words representation then you should care about document length since longer documents include more words which result in more crowded feature histograms. One of the reasonable ways to solve this issue is to divide each word frequency by the total word frequency in the document so that we can convert each histogram value into a probability of seeing that word in the document. As a result, document is represented with a feature vector that is 1 in total of its elements. This new space is called vector space model in the literature.
Removing correlations between dimensions cleans your data from redundant information exposed by multiple feature dimensions. Hence data is projected into a new space where each dimension explains something independently important from the other feature dimensions.
Okay, I hope now we are clear why we are concerned about these. Henceforth, I'll try to emphasis some basic stuff in our toolkit for feature preprocessing.
Can be applied to both feature dimensions or data instances.
If we apply to dimensions, it reduces unit effect and if we apply to instances then we solve instance biases as in the case of the document classification problem.
The result of standardization is that each feature dimension (instance) is scaled into defined mean and variance so that we fix the unit differences between dimensions.
: for each dimension (instance), subtract the mean and divide by the variance of that dimension (instance) so that each dimension is kept inside a mean = 0 , variance = 1 curve.
Min Max Scaling
Personally, I've not applied Min-Max Scaling to instances,
It is still useful for unit difference problem.
Instead of distributional consideration, it hinges the values in the range [0,1].
: Find max and min values of the feature dimension and apply the formula.
Caveat 1: One common problem of Scaling and Standardization is you need to keep min and max for Scaling, mean and variance values for Standardization for the novel data and the test time. We estimate these values from only the training data and assume that these are still valid for the test and real world data. This assumption might be true for small problems but especially for online environment this caveat should be dealt with a great importance.
Sigmoid function naturally fetches given values into a [0, 1] range
Does not need any assumption about the data like mean and variance
It penalizes large values more than the small ones.
You can use other activation functions like tanh.
Caveat 2: How to choose and what to choose are very problem dependent questions. However, if you have a clustering problem then standardization seems more reasonable for better similarity measure between instance and if you intend to use Neural Networks then some particular kind of NN demands [0,1] scaled data (or even more interesting scale ranges for better gradient propagation on the NN model). Also, I personally use sigmoid function for simple problems in order to get fast result by SVM without complex investigation.
Zero Phase Component Analysis (ZCA Whitening)
As I explained before, whitening is a process to reduce redundant information by decorrelating data with a final diagonal correlation matrix with preferable all diagonals are one.
It has especially very important implications in Image Recognition and Feature Learning so as to make visual cues more concrete on images.
Instead of formula, it is more intuitive to wire some code
I tried to touch some methods and common concerns of feature preprocessing, by no means complete. Nevertheless, a couple of takeaways from this post are; do not ignore normalizing your feature values before going into training phase and choose the correct method by investigating the values painstakingly.
PS: I actually promised to write a post per week but I am as busy as a bee right now and I barely find some time to write a new stuff. Sorry about it 🙁
We are now one step ahead of Feature Extraction and we extracted statistically important (covariate) representation of the given raw data. Just after Feature Extraction, first thing we need to do is to check the values of the new representation. In general, people are keen on avoiding this and regarding it as a waste of time. However, I believe this is a serious mistake. As I stated before, a single NULL value, or skewed representation might cause a very big pain at the end and it can leave you in very hazy conditions.
Let’s start our discussion. I list here my Sanity Check steps;
In this post, I'll talk about the details of Feature Extraction (aka Feature Construction, Feature Aggregation …) in the path of successful ML. Finding good feature representations is a domain related process and it has an important influence on your final results. Even if you keep all the settings same, with different Feature Extraction methods you would observe drastically different results at the end. Therefore, choosing the correct Feature Extraction methodology requires painstaking work.
Feature Extraction is a process of conveying the given raw data into set of instance points embedded in a standardized, distinctive and machine understandable space. Standardized means comparable representations with same length; so you can compute similarities or differences of the instances that have initially very versatile structural differences (like different length documents). Distinctive means having different feature values for different class instances so that we can observe clusters of different classes in the new data space. Machine understandable representation is mostly the numerical representation of the given instances. You can understand any document by reading it but machines only understand semantics implied by the numbers. Continue reading ML Work-Flow (Part 3) - Feature Extraction→