ML Work-Flow (Part 2) - Data Preprocessing

I am trying to keep to my promised schedule as much as possible. Here is a detailed discussion of the first step of my proposed Machine Learning Work-Flow: Data Preprocessing.

Data Preprocessing is an important step that mostly aims to improve the quality of the raw data before you delve into the technical concerns. Even though this step involves fairly easy tasks, skipping it can leave you with badly wrong or even bizarre results at the end.

I also stated in the work-flow that Data Preprocessing is a statistical job rather than an ML one. By this I mean that Data Preprocessing demands good data inference and analysis before any decision you make. These skills are the subject of a Statistics course, not an ML course. Hence, if you aim to get better at ML as a whole, do not ignore statistics.

We can divide Data Preprocessing into 5 headings:

  1. Data Integration
  2. Data Cleaning
  3. Data Transformation
  4. Data Discretization
  5. Data Reduction

Data Integration

Put data in different formats from various sources into a uniform shape suitable for the upcoming processes. These sources might be different databases, streams, or even Excel tables. Despite the simplicity of the idea, it has given rise to a whole class of commercial software, namely ETL (Extract - Transform - Load) tools. These tools let you reach different sources from a single point and merge the data through a defined, homogeneous data flow. Incidentally, data integration includes the other headings recursively. More explicitly, any sub-component of your integration flow can include one or more of the Data Preprocessing processes explained below.

It is important to define the data format clearly in advance, according to your problem. If you are not sure about the appropriate format, investigate it first. Otherwise, integration might become too obscure. It is very time consuming, especially for big data, and a failed attempt kills the motivation for the next steps.
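As a rough illustration of the idea, here is a minimal Python sketch that maps two sources into one agreed format. Everything in it is hypothetical: the two sources (a CSV export and a list of records), their field names, and the target schema (`customer_id`, `height_m`, `weight_kg`) are assumptions made up for the example, not part of any real ETL tool.

```python
import csv
import io

# Hypothetical source 1: a CSV export using centimetres and kilograms.
CRM_CSV = "customer_id,height_cm,weight_kg\n1,175,70\n2,180,90\n"

# Hypothetical source 2: records using inches and pounds.
SHOP_RECORDS = [{"id": "3", "height_in": 65.0, "weight_lb": 150.0}]


def from_crm(text):
    """Yield rows of the CSV source in the target schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["customer_id"]),
            "height_m": float(row["height_cm"]) / 100,
            "weight_kg": float(row["weight_kg"]),
        }


def from_shop(records):
    """Yield records of the second source in the target schema."""
    for rec in records:
        yield {
            "customer_id": int(rec["id"]),
            "height_m": rec["height_in"] * 0.0254,       # inches to metres
            "weight_kg": rec["weight_lb"] * 0.45359237,  # pounds to kilograms
        }


# One uniform list of rows, whatever the original source was.
integrated = list(from_crm(CRM_CSV)) + list(from_shop(SHOP_RECORDS))
```

The point of the sketch is that the target format is fixed first, and each source only gets its own small adapter.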

Data Cleaning

Fill in the missing values in the data, whether attributes or class labels. The simplest approach is to use the mean or median value of the other rows, or the mean or median of the instances of the same class. (The median is generally more robust to outliers.) Another approach is to train a model that predicts the missing values, just as you would predict class labels.
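A minimal sketch of median filling, both over all rows and per class, could look like the following. The row layout (a list of dicts with `None` for a missing value) and the function name `impute_with_median` are assumptions for illustration only.

```python
from statistics import median


def impute_with_median(rows, column, label_column=None):
    """Return copies of rows where None in `column` is replaced by a median.

    With `label_column` set, the median is taken over rows of the same class.
    Assumes every class has at least one observed value in `column`.
    """
    observed = {}
    for row in rows:
        if row[column] is not None:
            key = row[label_column] if label_column else None
            observed.setdefault(key, []).append(row[column])
    medians = {key: median(values) for key, values in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[column] is None:
            key = new_row[label_column] if label_column else None
            new_row[column] = medians[key]
        filled.append(new_row)
    return filled


rows = [
    {"age": 20, "label": "a"},
    {"age": None, "label": "a"},
    {"age": 30, "label": "a"},
    {"age": 50, "label": "b"},
    {"age": None, "label": "b"},
]
filled_global = impute_with_median(rows, "age")
filled_by_class = impute_with_median(rows, "age", label_column="label")
```

In this toy data the global median is 30, while the per-class medians are 25 for class "a" and 50 for class "b", so the two strategies fill the gaps differently.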

Identify outliers and smooth out noisy instances. Outliers and noisy instances mislead many ML algorithms, AdaBoost among them. Therefore, you need to rectify the data before proceeding any further. You may even need to repeat the whole preprocessing after you remove outliers: for instance, if you filled the missing values while the outliers were still included, the filled values are also off and need to be recomputed.

For outlier removal, one common way is to cluster the data and remove the poor clusters. You can also use a dedicated outlier detection algorithm (such as my baby RSOM, or LOF). Another option is to fit a regression model and align the data to it to remove the outlier effect.

Correct inconsistencies in the data. This generally requires expert knowledge. You should consult your business partner or the customer.
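The methods above need a clustering or modelling library. As a much simpler stand-in (not RSOM, LOF, or the clustering approach), here is a sketch of a univariate rule that flags values by their modified z-score, computed from the median and the median absolute deviation (MAD). The threshold of 3.5 is a commonly used default, and the function name is hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    Modified z-score: 0.6745 * (x - median) / MAD.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half of the values are identical; the rule is undefined.
        return [False] * len(values)
    return [abs(0.6745 * (v - center) / mad) > threshold for v in values]


values = [10, 11, 9, 10, 12, 11, 95]
flags = mad_outliers(values)
cleaned = [v for v, is_outlier in zip(values, flags) if not is_outlier]
```

Because it relies on the median rather than the mean, the rule itself is not dragged around by the very outliers it is looking for.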

Data Transformation

Normalization - Scaling - Standardization. Depending on your further steps, such as feature extraction, you may need to transform the data into different scales or domains. This is crucial for getting high-quality, discriminative features at the end. In particular, automated feature extraction algorithms generally expect certain data formats and are very fragile about them.

Construct new attributes. For instance, if you have the weight and height values of the customers, adding BMI as a new attribute is very reasonable. Such attribute construction needs some experience and statistical knowledge of the domain, but it can yield very large performance improvements.
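Two of the most common transformations can be sketched in a few lines: min-max scaling to the range [0, 1], and standardization to zero mean and unit variance. The function names are illustrative, and the population standard deviation is used here as an assumption; some tools use the sample version instead.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so that min maps to 0 and max maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


values = [2, 4, 6, 8]
scaled = min_max_scale(values)
z_scores = standardize(values)
```

In practice the scaling parameters (min, max, mean, standard deviation) should be computed on the training data and reused on new data, rather than recomputed each time.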
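For the BMI example, the constructed attribute is simply weight in kilograms divided by the square of height in metres. A minimal sketch follows, where the field names `weight_kg` and `height_m` are assumptions about how the customer rows are stored.

```python
def add_bmi(customers):
    """Return copies of the rows with a derived 'bmi' attribute added."""
    enriched = []
    for customer in customers:
        row = dict(customer)  # do not modify the input rows
        row["bmi"] = row["weight_kg"] / row["height_m"] ** 2
        enriched.append(row)
    return enriched


customers = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]
with_bmi = add_bmi(customers)
```

The new attribute combines two raw columns into one value that a linear model could not easily derive on its own.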

Data Discretization

Continuous values are problematic in some cases and for particular ML algorithms. Even though I try to avoid discretizing data when the algorithm allows it, discretization is essential, especially for inference purposes.

Use unsupervised equal binning. Divide the numeric data into bins of equal size (equal frequency) or equal range (equal width) without any further consideration.

Supervised discretization. Use class boundaries: sort the values and place cut points between them by observing the class distribution over the values. You can also use an entropy measure to define the partition. In that case you define a candidate set of value partitions and pick the one with the best entropy-based information gain. My choice is to use a Decision Tree that handles continuous values and to take the value partitions from the nodes of the constructed tree.
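The two flavours of equal binning can be sketched as below. Both functions return a bin index per value; the names are illustrative, and the equal-frequency version splits by rank, so tied values may land in different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding about equally many values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


values = [1, 2, 3, 4, 50, 100]
by_width = equal_width_bins(values, 3)
by_frequency = equal_frequency_bins(values, 3)
```

On skewed data like this the difference is visible: equal width puts four of the six values into the first bin, while equal frequency puts two values into each bin.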
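The entropy-based choice of a cut point can be sketched as follows. This is a single binary split of the kind a decision tree node would make, not a full tree; applying it recursively to each side would give more bins. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, information_gain) of the best binary cut point.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_gain = gain
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_gain


values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["low", "low", "low", "high", "high", "high"]
threshold, gain = best_split(values, labels)
```

Here the classes separate cleanly between 3.0 and 8.0, so the chosen threshold is their midpoint and the gain equals the full entropy of the labels.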

Data Reduction

Reduce the number of instances. Sometimes you prefer to use a subset of the data instead of the whole chunk. In that case, a sampling scheme works for you. Even though there are many different sampling methods, I prefer the most naive one, Random Sampling. If I need more robust results from multiple subsets, I prefer bootstrapping (sampling with replacement).

Reduce the number of attributes. Please do not try to predict the number of Nobel Prizes of a country from its chocolate consumption (this is a real story).
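Both options are a few lines with Python's standard `random` module. In this sketch the function names and the `seed` parameter are illustrative; a fixed seed is used only to make the draws reproducible.

```python
import random


def random_subset(data, size, seed=None):
    """Draw `size` distinct instances uniformly at random (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(data, size)


def bootstrap_samples(data, n_samples, seed=None):
    """Draw n_samples resamples, each of len(data), with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]


data = list(range(100))
subset = random_subset(data, 10, seed=0)
resamples = bootstrap_samples(data, 5, seed=0)
```

A bootstrap resample has the same size as the original data but usually contains repeated instances, which is what lets you estimate the variability of a result across the resamples.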

Nobel Prize vs Chocolate Consumption

Although this needs some level of expertise, if you are sure that an attribute is irrelevant, remove it from your attribute list. However, if you are hesitant, then wait for the Feature Selection step to do its magic.

As a side note, there is also a sub-topic in ML that applies the reduction paradigm to complex problems, so that the whole problem is solved by working through simpler sub-problems. MORE...


Machine Learning Work-Flow (Part 1)

I am planning to write a series of posts explaining a basic Machine Learning work-flow (mostly supervised). In this post, my aim is to give the bird's-eye view; I will delve into the details in later posts, explaining each of the components. I decided to write this series for two reasons: the first is self-education (to get all my bits and pieces together after a period of theoretical research and industrial practice), and the second is to present a naive guide to beginners and enthusiasts.

Below is the overview of the proposed work-flow. A color code indicates the knowledge base each component relies on. Each box has a color tone from YELLOW to RED. The yellower the box, the more the component relies on the Statistics knowledge base. As the box turns red (gets darker), the component depends more heavily on the Machine Learning knowledge base. By this I also imply that, without good statistical understanding, we cannot construct a sound machine learning pipeline. As a footnote, this scheme is changed by the post-modernism of Representation Learning algorithms, and I will touch on this in later posts.


Continue reading

Why I chose industry over academy

In general, if I need to choose one thing over another, I list the positive and negative facts about the options and do a basic summation to find the right one.

Here I itemize my subjective list of pros and cons. You might find it skewed or ridiculous, but it is based on my 3 years of hard-core academic effort and 2 years in industry (the sum of my partial efforts). I think the items present at least some of the obstacles you would see in both worlds.


First, I start with the academy:


  1. Academic life is the best in terms of freedom at work. You choose your study topic, at least to some extent, you team up, and you follow the boundaries of human knowledge so as to extend them a bit. This is a very respectable and curious search. For sure, it is better than having a boss choose your way for you. However, even this freedom is limited, as in the comic below :)
  2. Dress code. Yes, in most cases the academy is not strict about defining a particular dress code for you. You are free to put on your comfortable shorts and flip-flops and go to your office to work. However, it should be pointed out that present-day industry has also realized the idiocy of strict dress codes and provides better conditions for employees as well. Yet, business is still not comparable with the academy.
  3. Travel around the world at low cost through conferences, summer schools, meetings, and internships. Meet people around the globe and get an international feel.
  4. Respected job. Saying that you are an academic earns respect, and people usually assume you are more intelligent than most, thanks to the great scientists who came before you.
  5. Set your schedule. The schedule of an academic is more flexible, and you have a bit of freedom to define your work time.
  6. Teaching. It is really great to inspire young people with your knowledge and experience. Moreover, it is a vital role in society, since you are able to shape the future through the young people you touch.
  7. Elegant social circle. Being an academic ties you to a social circle of people with a similar education level and supposedly a similar level of cultivation. That of course does not mean that industry consists of the ignorant, but corporate life makes you more likely to run into unfortunate minds.

Continue reading

I presented my Master Dissertation?

I am glad to have finally presented my master's dissertation, and I collected valuable feedback from my community. I have already shared what I have done for the thesis in different posts. CMAP (Concept Map) and FAME (Face Association through Model Evolution) (I called it AME in the thesis to be more generic) are basically two different methods for mining visual concepts from noisy image sources such as Google Image Search or Flickr. You might prefer to look at those posts for details; I have also posted the presentation here for a brief view of my work.



FAME: Face Association through Model Evolution

Here, I summarize a new method called FAME for learning face models from a noisy set of web images. I am studying this for my MS thesis. As a brief intro, the thesis is titled "Mining Web Images for Concept Learning" and it introduces two new methods for automatically learning visual concepts from noisy web images. The first proposed method is FAME; the other, ConceptMap, was presented here before and has been accepted to ECCV14 (self promotion :)).

Before I start, I should note that FAME is not a fully finished work and is waiting for your valuable comments. Please leave a comment about anything you find useful, ridiculous, awkward or great.

In this work, we tackle the problem of learning face models for public figures from images collected from the web by querying a particular person's name. The collected images are called weakly labelled, since the query gives only a rough label. However, the data is very noisy even after face detection, with false detections or several irrelevant faces Continue reading

Our ECCV2014 work "ConceptMap: Mining noisy web data for concept learning"

---- I am enjoying the sight of my paper title on the list of accepted ECCV14 papers :). Seeing the outcome of your work makes all your day-and-night efforts worthwhile, REALLY!!! Before I start, I would like to thank my supervisor Pinar Duygulu for her great guidance. ----

In this post, I would like to summarize the work in the title, since I believe a friendly blog post can sometimes be more expressive than a solid scientific article.

"ConceptMap: Mining noisy web data for concept learning" proposes a pipeline to learn a wide range of visual concepts by only defining a query to an image search engine. The idea is to query a concept at the service and download a huge bunch of images, cluster the images while removing the irrelevant instances, and learn a model from each of the clusters. At the end, each concept is represented by the ensemble of these classifiers. Continue reading

Large data really helps for Object Detection ?

I stumbled upon an interesting BMVC 2012 paper (Do We Need More Training Data or Better Models for Object Detection? -- Xiangxin Zhu, Carl Vondrick, Deva Ramanan, Charless Fowlkes). It claims something contrary to the current big data notion, which advocates large data-sets on the grounds that models get better as the training data grows. The paper states instead that large training data is not that helpful for learning better models; indeed, more data is harmful without careful tuning of your system!! Continue reading

How does Feature Extraction work on Images?

Here I share an enhanced version of one of my Quora answers to a similar question ...

There is no single answer to this question, since there is a diverse set of methods for extracting features from an image.

First, what is a feature? "A distinctive attribute or aspect of something." So the point is to have a set of values for a particular instance that distinguishes that instance from its counterparts. In the field of images, features might be raw pixels for simple problems like digit recognition on the well-known MNIST dataset. However, in natural images, raw pixels are not descriptive enough. Instead there are two main streams to follow. One is to use hand-engineered feature extraction methods (e.g. SIFT, VLAD, HOG, GIST, LBP), and the other is to learn features that are discriminative in the given context (e.g. Sparse Coding, Auto Encoders, Restricted Boltzmann Machines, PCA, ICA, K-means). Note that second alternative, Continue reading