ML Work-Flow (Part 5) – Feature Preprocessing

We have already discussed the first four steps of the ML work-flow. So far, we preprocessed the raw data with DICTR (Discretization, Integration, Cleaning, Transformation, Reduction), then applied a feature extraction procedure to convert the data into a machine-understandable representation, and finally divided the data into separate portions such as train and test sets. Now it is time to preprocess the feature values and make them ready for the state-of-the-art ML model ;).

We need Feature Preprocessing in order to:

  1. Avoid scale differences between dimensions.
  2. Map instances into a bounded region of the space.
  3. Remove correlations between different dimensions.

You may ask “Why are we so concerned about these?” Because

  1. Avoiding scale differences removes the unit differences between particular feature dimensions. Think about the Age and Height of your customers: Age is measured in years and Height in centimeters, so these two dimensions are distributed very differently. We need to resolve this and convert the data into a scale-invariant representation before training the ML algorithm, especially if you are using one of the linear models like Logistic Regression or SVM (tree-based models are more robust to scale differences).
  2. Mapping instances into a bounded region of the space resolves representation biases between instances. For instance, if you work on a document classification problem with a bag-of-words representation, you should care about document length, since longer documents contain more words and therefore produce more crowded feature histograms. One reasonable way to solve this is to divide each word frequency by the total word count of the document, so that each histogram value becomes the probability of seeing that word in the document. As a result, the document is represented by a feature vector whose elements sum to 1 (see the sketch after this list). This new space is called the vector space model in the literature.
  3. Removing correlations between dimensions cleans your data of redundant information exposed by multiple feature dimensions. The data is projected into a new space where each dimension explains something independently important, separate from the other feature dimensions.
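To make item 2 concrete, here is a minimal NumPy sketch of the document-length normalization described above; the word counts are made up for illustration:

```python
import numpy as np

# Hypothetical word-count histograms over a 4-word vocabulary.
short_doc = np.array([2.0, 1.0, 0.0, 1.0])    # 4 words in total
long_doc = np.array([20.0, 10.0, 5.0, 15.0])  # 50 words in total

# Divide each count by the document's total count, so that every document
# becomes a probability distribution over the vocabulary (elements sum to 1).
short_norm = short_doc / short_doc.sum()
long_norm = long_doc / long_doc.sum()

print(short_norm.sum(), long_norm.sum())  # both sum to 1.0
```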

Okay, I hope it is now clear why we are concerned about these. Henceforth, I'll try to emphasize some basic tools in our toolkit for feature preprocessing.

Standardization

  • Can be applied to either feature dimensions or data instances.
  • If we apply it to dimensions, it reduces the unit effect; if we apply it to instances, we resolve instance biases as in the document classification problem above.
  • The result of standardization is that each feature dimension (or instance) is rescaled to a defined mean and variance, so that the unit differences between dimensions are fixed.
  • z = (x - \mu)/\sigma : for each dimension (instance), subtract the mean and divide by the standard deviation of that dimension (instance), so that each dimension ends up with mean = 0 and variance = 1 (a minimal sketch follows below).
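A minimal sketch of standardization with NumPy, using made-up Age/Height values in the spirit of the example above:

```python
import numpy as np

# Toy data: rows are customers, columns are Age (years) and Height (cm).
X = np.array([[25.0, 170.0],
              [40.0, 180.0],
              [33.0, 165.0]])

# Standardize each column: subtract its mean and divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

print(Z.mean(axis=0))  # ~0 for each dimension
print(Z.std(axis=0))   # ~1 for each dimension
```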

Min Max Scaling

  • Personally, I've not applied Min-Max Scaling to instances.
  • It is still useful for the unit difference problem.
  • Instead of making a distributional assumption, it confines the values to the range [0,1].
  • x_{norm} = (x - x_{min})/(x_{max} - x_{min}) : find the max and min values of the feature dimension and apply the formula (a short sketch follows below).
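And a corresponding min-max scaling sketch, again with toy values:

```python
import numpy as np

x = np.array([25.0, 40.0, 33.0])  # e.g. the Age dimension from above

x_min, x_max = x.min(), x.max()
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm)  # all values now lie in [0, 1]
```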

Caveat 1: One common pitfall of Scaling and Standardization is that you need to keep the min and max values (for Scaling) or the mean and variance values (for Standardization) around for novel data at test time. We estimate these values from the training data only and assume that they are still valid for the test and real-world data. This assumption might hold for small problems, but especially in an online environment this caveat should be treated with great care.
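This is why libraries such as scikit-learn separate fit and transform. The sketch below (with hypothetical train/test arrays) estimates the statistics on the training set only and reuses them at test time:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[25.0, 170.0], [40.0, 180.0], [33.0, 165.0]])
X_test = np.array([[52.0, 190.0]])  # hypothetical unseen customer

scaler = StandardScaler()
scaler.fit(X_train)                     # estimate mean/variance from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuse the *training* statistics at test time
```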

Sigmoid Functions

  • The sigmoid function naturally maps given values into the [0, 1] range (a tiny sketch follows below).
  • It does not need any assumption about the data, such as mean and variance.
  • It compresses large values much more strongly than small ones.
  • You can also use other activation functions, like tanh.
[Figure: the sigmoid function]
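A tiny sketch of sigmoid squashing, with arbitrarily chosen values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(sigmoid(x))  # large magnitudes are squashed towards 0 or 1
```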

Caveat 2: How to choose and what to choose are very problem-dependent questions. However, if you have a clustering problem, standardization seems more reasonable for a better similarity measure between instances, and if you intend to use Neural Networks, certain kinds of NNs demand [0,1]-scaled data (or even more interesting scale ranges for better gradient propagation through the NN model). Also, I personally use the sigmoid function for simple problems in order to get a fast result from an SVM without complex investigation.

Zero Phase Component Analysis (ZCA Whitening)

  • As I explained before, whitening is a process that reduces redundant information by decorrelating the data, so that the final covariance matrix is diagonal, preferably with all diagonal entries equal to one.
  • It has especially important implications in Image Recognition and Feature Learning, making visual cues more apparent in images.
  • Instead of a formula, it is more intuitive to write some code.
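Below is a minimal NumPy sketch of ZCA whitening (the toy data and variable names are purely illustrative; the small eps term keeps the inverse square root numerically stable):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten X of shape (n_samples, n_features)."""
    X = X - X.mean(axis=0)                           # centre each feature
    cov = np.cov(X, rowvar=False)                    # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)                     # eigen-decomposition of the covariance
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T    # ZCA whitening matrix
    return X @ W

# Toy correlated data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.randn(1000, 3) @ np.array([[1.0, 0.5, 0.2],
                                   [0.0, 1.0, 0.3],
                                   [0.0, 0.0, 1.0]])
X_white = zca_whiten(X)
print(np.cov(X_white, rowvar=False).round(2))        # close to the identity matrix
```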
[Figure: covariance matrices before and after ZCA]

I tried to touch on some methods and common concerns of feature preprocessing; this is by no means complete. Nevertheless, a couple of takeaways from this post: do not skip normalizing your feature values before going into the training phase, and choose the correct method by investigating the values painstakingly.

PS: I actually promised to write a post per week, but I am as busy as a bee right now and I barely find time to write new stuff. Sorry about that :(

 


"No free lunch" theorem lies about Random Forests


I've read a great paper by Delgado et al., namely "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?", in which they compare 179 different classifiers from 17 families on 121 data sets composed of the whole UCI database and some real-world problems. The classifiers come from R (with and without the caret package), C and Matlab (I wish I could have seen Sklearn as well).

I really recommend you read the paper in detail, but I will share some of the highlights here. The most impressive result is the performance of the Random Forest (RF) implementations. For each dataset, RF is always near the top. It achieves 94.1% of the maximum accuracy and exceeds 90% of it on 84.3% of the data sets. Also, 3 out of the 5 best classifiers are RF versions. This is pretty impressive, I guess. The runner-up is the SVM with Gaussian kernel implemented in LibSVM, which achieves 92.3% of the maximum accuracy. After their experiments, the paper points to RF, SVM with Gaussian and Polynomial kernels, Extreme Learning Machines with Gaussian kernel, C5.0 and avNNet (a committee of MLPs implemented in R with the caret package) as the top algorithms.

One shortcoming of the paper, from my beloved NN perspective, is that the Neural Network models used are not very up-to-date versions, such as drop-out or max-out networks. Therefore, it is hard to evaluate the algorithms against these more advanced NN models. However, for anyone in the dark about which algorithm to pick, it is quite a good guideline that shows the power of RF and SVM against the others.

ML Work-Flow (Part 4) – Sanity Checks and Data Splitting

SANITY CHECK

We are now one step past Feature Extraction, and we have extracted a statistically important (covariate) representation of the given raw data. Just after Feature Extraction, the first thing we need to do is to check the values of the new representation. In general, people are keen to skip this and regard it as a waste of time. However, I believe this is a serious mistake. As I stated before, a single NULL value or a skewed representation might cause a very big pain at the end and can leave you in very hazy conditions.

Let’s start our discussion. I list here my Sanity Check steps;

Continue reading

ML Work-Flow (Part 3) - Feature Extraction

In this post, I'll talk about the details of Feature Extraction (aka Feature Construction, Feature Aggregation …) in the path of successful ML. Finding good feature representations is a domain related process and it has an important influence on your final results. Even if you keep all the settings same, with different Feature Extraction methods you would observe drastically different results at the end. Therefore, choosing the correct Feature Extraction methodology requires painstaking work.

Feature Extraction is the process of converting the given raw data into a set of instance points embedded in a standardized, distinctive and machine-understandable space. Standardized means comparable representations with the same length, so you can compute similarities or differences between instances that initially have very diverse structures (like documents of different lengths). Distinctive means having different feature values for different class instances, so that we can observe clusters of different classes in the new data space. Machine-understandable representation mostly means a numerical representation of the given instances. You can understand any document by reading it, but machines only understand the semantics implied by the numbers. Continue reading

ML WORK-FLOW (Part2) - Data Preprocessing

I try to keep to my promised schedule as much as possible. Here is the detailed discussion of the first step of my proposed Machine Learning Work-Flow, namely Data Preprocessing.

Data Preprocessing is an important step which mostly aims to improve raw data quality before you delve into the technical concerns. Even though this step involves very easy tasks, without it you might observe very misleading or even bizarre results at the end.

I also stated in the work-flow that Data Preprocessing is a statistical job rather than an ML one. By this I mean that Data Preprocessing demands good data inference and analysis before any decision you make. These components are not the subject of an ML course but of a Statistics one. Hence, if you aim to be cannier at ML as a whole, do not ignore statistics.

We can divide Data Preprocessing into 5 different headings;

  1. Data Integration
  2. Data Cleaning
  3. Data Transformation
  4. Data Discretization
  5. Data Reduction

Continue reading

Machine Learning Work-Flow (Part 1)

I am planning to write a series of posts explaining a basic Machine Learning work-flow (mostly supervised). In this post, my target is to propose the bird's-eye view; I'll dwell on the details in later posts, explaining each of the components in detail. I decided to write this series for two reasons: the first is self-education - to get all my bits and pieces together after a period of theoretical research and industrial practice - and the second is to present a naive guide for beginners and enthusiasts.

Below, we have the overview of the proposed work-flow. We have a color code indicating knowledge bases. Each box has a color tone from YELLOW to RED. The yellower the box, the more that component relies on the Statistics knowledge base. As the box turns red (gets darker), the component depends more heavily on the Machine Learning knowledge base. By saying this, I also imply that without a good statistical understanding we are not able to construct a convenient machine learning pipeline. As a footnote, this schema is being changed by the post-modernism of Representation Learning algorithms, and I'll touch on this in later posts.

 

Continue reading

Why I chose industry over academy

In general, if I need to choose something over something else, I list the positive and negative facts about the options and do a basic summation to find the right one.

Here I itemize my subjective pros and cons list. You might find it skewed or ridiculous, but it is based on my 3 years of hard-core academic effort and 2 years in industry (the sum of my partial efforts). I think it presents at least some of the obstacles you would see in both worlds.

 

First, I start with the academy;

Pros--

  1. Academic life is the best in terms of freedom at work. You choose your study topic, at least to some extent, team up, and follow the boundaries of human knowledge so as to extend them a bit. This is a very respectful and curious pursuit. For sure, it is better than having a boss choose the way for you. However, even this freedom is limited, as in the comic below :)
  2. Dress code. Yes, academia usually does not define a particular dress code for you. You are free to put on your comfortable shorts and flip-flops and go to your office to work. However, it should be pointed out that modern industry has also realized the idiocy of strict dress codes and provides better conditions for employees as well. Still, business is not comparable with academia on this front.
  3. Travel around the world with conferences, summer schools, meetings and internships at low cost. Meet people around the globe and get an international sense.
  4. Respectful job. It commands a sense of respect: when you say you are an academic, people usually assume you are more intelligent than most, thanks to our great scientist ancestors.
  5. Set your own schedule. An academic's schedule is more flexible, and you have a bit of freedom to define your work time.
  6. Teaching. It is really great to inspire young people with your knowledge and experience. Even more, it is a vital role in society, since you are able to shape the future through the young people you touch.
  7. Elegant social circle. Being an academic ties you to a social circle of people with a similar education level and supposedly a similar level of cultivation. That of course does not mean that industry consists of the ignorant, but in corporate life you are more likely to encounter unfortunate minds.

Continue reading

I presented my Master's Dissertation!

I am glad to have presented my master's dissertation at last, and I collected valuable feedback from my community. Previously, I shared what I had done for the thesis in different posts. CMAP (Concept Map) and FAME (Face Association through Model Evolution) (I called it AME in the thesis to be more generic) are basically two different methods for mining visual concepts from noisy image sources such as Google Image Search or Flickr. You might prefer to look at those posts for details, or see the presentation posted here for a brief view of my work.