We are now one step ahead of Feature Extraction and we extracted statistically important (covariate) representation of the given raw data. Just after Feature Extraction, first thing we need to do is to check the values of the new representation. In general, people are keen on avoiding this and regarding it as a waste of time. However, I believe this is a serious mistake. As I stated before, a single NULL value, or skewed representation might cause a very big pain at the end and it can leave you in very hazy conditions.
Let’s start our discussion. I list here my Sanity Check steps;
Continue reading ML Work-Flow (Part 4) – Sanity Checks and Data Splitting
In this post, I'll talk about the details of Feature Extraction (aka Feature Construction, Feature Aggregation …) in the path of successful ML. Finding good feature representations is a domain related process and it has an important influence on your final results. Even if you keep all the settings same, with different Feature Extraction methods you would observe drastically different results at the end. Therefore, choosing the correct Feature Extraction methodology requires painstaking work.
Feature Extraction is a process of conveying the given raw data into set of instance points embedded in a standardized, distinctive and machine understandable space. Standardized means comparable representations with same length; so you can compute similarities or differences of the instances that have initially very versatile structural differences (like different length documents). Distinctive means having different feature values for different class instances so that we can observe clusters of different classes in the new data space. Machine understandable representation is mostly the numerical representation of the given instances. You can understand any document by reading it but machines only understand semantics implied by the numbers. Continue reading ML Work-Flow (Part 3) - Feature Extraction