# ML Work-Flow (Part 5) – Feature Preprocessing

We have already discussed the first four steps of the ML work-flow. So far, we preprocessed the raw data with DICTR (Discretization, Integration, Cleaning, Transformation, Reduction), then applied a feature extraction procedure to convert the data into a machine-understandable representation, and finally divided the data into different sets such as train and test. Now it is time to preprocess the feature values and make them ready for the state-of-the-art ML model ;).

We need Feature Preprocessing in order to:

1. Remove scale differences between dimensions.
2. Map instances into a bounded region of the space.
3. Remove correlations between different dimensions.

1. Removing scale differences reduces the effect of units on particular feature dimensions. Think about the Age and Height of your customers. Age is measured in years and Height in centimeters, so the values of these two dimensions are spread over different ranges. We need to resolve this and convert the data into a scale-invariant representation before training the ML algorithm, especially if you are using a linear model such as Logistic Regression or SVM (tree-based models are more robust to scale differences).
2. Mapping instances into a bounded region of the space resolves representation biases between instances. For example, if you work on a document classification problem with a bag-of-words representation, you should care about document length, since longer documents include more words, which results in more crowded feature histograms. One reasonable way to solve this is to divide each word frequency by the total word frequency of the document, so that each histogram value becomes the probability of seeing that word in the document. As a result, each document is represented by a feature vector whose elements sum to 1. This new space is called the vector space model in the literature.
3. Removing correlations between dimensions cleans your data of redundant information that is exposed by multiple feature dimensions. The data is projected into a new space where each dimension explains something important independently of the other feature dimensions.
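
To make the document length example from point 2 concrete, here is a minimal NumPy sketch. The function name `l1_normalize` and the toy word counts are assumptions made for illustration, not something from a particular library.

```python
import numpy as np

def l1_normalize(counts):
    """Divide each row (one document) by its total word count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty documents to avoid division by zero.
    totals[totals == 0] = 1.0
    return counts / totals

# Two hypothetical documents over a 3-word vocabulary; the second one is
# just the first repeated ten times, so it is longer but says the same thing.
docs = np.array([[2, 1, 1],
                 [20, 10, 10]])
normalized = l1_normalize(docs)
print(normalized)
```

After the normalization both documents end up with the same feature vector, and each row sums to 1, so the length bias is gone.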

Okay, I hope it is now clear why we are concerned about these. From here on, I'll try to emphasize some basic tools in our toolkit for feature preprocessing.

Standardization

• Can be applied to either feature dimensions or data instances.
• If we apply it to dimensions, it reduces the unit effect, and if we apply it to instances, it solves instance biases as in the document classification problem.
• The result of standardization is that each feature dimension (instance) is scaled to a defined mean and variance, so that we fix the unit differences between dimensions.
• $z = (x - \mu)/\sigma$ : for each dimension (instance), subtract the mean $\mu$ and divide by the standard deviation $\sigma$ of that dimension (instance), so that each dimension ends up with mean = 0 and variance = 1.
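
A small sketch of the formula above in NumPy. The helper name `standardize` and the Age/Height numbers are made up for illustration. Note that the division is by the standard deviation, which is what gives unit variance.

```python
import numpy as np

def standardize(X, axis=0):
    """Scale to zero mean and unit variance along the given axis.

    axis=0 standardizes each feature dimension (column),
    axis=1 standardizes each data instance (row).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    # Constant dimensions have zero spread; leave them at 0 instead of dividing by 0.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Hypothetical customers: Age in years, Height in cm.
X = np.array([[25.0, 160.0],
              [35.0, 170.0],
              [45.0, 180.0]])
Z = standardize(X, axis=0)
print(Z.mean(axis=0), Z.std(axis=0))
```

After this step Age and Height live on the same scale, even though they started in different units.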

Min Max Scaling

• Personally, I have not applied Min-Max Scaling to instances.
• It is still useful for the unit difference problem.
• Instead of considering the distribution, it squeezes the values into the range [0,1].
• $x_{norm} = (x - x_{min})/(x_{max} - x_{min})$ : find the max and min values of the feature dimension and apply the formula.
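
The same idea as a minimal NumPy sketch, applied per feature dimension. The name `min_max_scale` and the sample values are assumptions for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Map every feature dimension (column) into the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # A constant column has max == min; use 1.0 there to avoid dividing by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Hypothetical customers: Age in years, Height in cm.
X = np.array([[25.0, 160.0],
              [35.0, 175.0],
              [45.0, 180.0]])
X_norm = min_max_scale(X)
print(X_norm)
```

The smallest value of each column maps to 0 and the largest to 1, whatever the original unit was.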

Caveat 1: One common problem of Scaling and Standardization is that you need to keep the min and max values (for Scaling) or the mean and variance values (for Standardization) so that you can apply them to novel data at test time. We estimate these values from the training data only and assume that they are still valid for the test and real-world data. This assumption might hold for small problems, but especially in an online environment this caveat should be handled with great care.
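
Here is a rough sketch of what this caveat looks like in practice, using Min-Max Scaling. The function names `fit_min_max` and `apply_min_max` and the numbers are hypothetical. The statistics come from the training data only and are then reused for unseen data.

```python
import numpy as np

def fit_min_max(X_train):
    """Estimate the scaling parameters from the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return x_min, span

def apply_min_max(X, x_min, span):
    """Scale any data (train, test, live) with the stored parameters."""
    return (np.asarray(X, dtype=float) - x_min) / span

train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])  # 50 is outside the range seen in training

x_min, span = fit_min_max(train)
test_scaled = apply_min_max(test, x_min, span)
print(test_scaled)
```

The second test value lands at 1.5, outside [0,1], because the training range no longer covers the new data. This is exactly the situation you need to watch for when the data drifts.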

Sigmoid Functions

• The sigmoid function naturally squashes the given values into the (0, 1) range.
• It does not need any estimates from the data, such as mean and variance.
• It compresses large-magnitude values much more than small ones.
• You can use other activation functions like tanh, which maps the values into (-1, 1) instead.
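
A minimal NumPy sketch of this squashing. The function name and the sample values are illustrative. It is written through the identity sigmoid(x) = 0.5 (1 + tanh(x/2)), which is the same function but avoids overflow in `exp` for large inputs.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, computed via tanh for numerical stability."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * x))

values = np.array([-100.0, -1.0, 0.0, 1.0, 10.0, 100.0])
squashed = sigmoid(values)
print(squashed)  # sigmoid(0) is exactly 0.5
```

Notice that 10 and 100 end up almost at the same point near 1, while -1, 0 and 1 stay clearly separated. That is the compression of large values mentioned above.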

Caveat 2: What to choose and how to choose it are very problem-dependent questions. However, if you have a clustering problem, standardization seems more reasonable for a better similarity measure between instances, and if you intend to use Neural Networks, some particular kinds of NN demand [0,1]-scaled data (or even more interesting scale ranges for better gradient propagation through the NN model). Also, I personally use the sigmoid function for simple problems in order to get a fast result with an SVM without a complex investigation.

Zero Phase Component Analysis (ZCA Whitening)

• As I explained before, whitening is a process that reduces redundant information by decorrelating the data, so that the final covariance matrix is diagonal, preferably with all diagonal entries equal to one.
• It has especially important implications in Image Recognition and Feature Learning, where it makes the visual cues in images more pronounced.
• Instead of a formula, it is more intuitive to write some code.
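
What follows is not the original code of this post but a minimal NumPy sketch of ZCA whitening under common assumptions: rows are instances, the covariance is estimated from the data itself, and a small `eps` (a hypothetical default of 1e-5) keeps the division stable for tiny eigenvalues.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Decorrelate the columns of X and scale them to (almost) unit variance."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the feature dimensions.
    cov = X_centered.T @ X_centered / X_centered.shape[0]
    # Eigen-decomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate to the eigenbasis, rescale, and rotate back (the ZCA part).
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_centered @ W

# Hypothetical correlated 2-D data.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = rng.normal(size=(1000, 2)) @ A.T
X_white = zca_whiten(X)

cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))
```

The covariance of the whitened data comes out very close to the identity matrix. Rotating back with the eigenvectors is what separates ZCA from plain PCA whitening: the whitened data stays as close as possible to the original axes, which is why whitened images still look like images.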

I tried to touch on some methods and common concerns of feature preprocessing, and this list is by no means complete. Nevertheless, a couple of takeaways from this post are: do not skip normalizing your feature values before going into the training phase, and choose the correct method by investigating the values painstakingly.

PS: I actually promised to write a post per week, but I am as busy as a bee right now and I can barely find time to write new stuff. Sorry about that 🙁

# Normalize Your Database with NF rules.

The Normal Form rules are the basic rules created by Edgar F. Codd, the creator of the Relational Database concept. Their main purpose is to produce database tables that do not hold redundant data. There are basically three NF variations: 1NF, 2NF and 3NF (other theoreticians have since extended them up to 6NF, but the basic structure is built on 1NF, 2NF and 3NF).

1NF points out:
- A table should not have multiple columns that hold the same kind of data.
- Create separate tables for each group of related data and identify each row with a unique column (the primary key).

Example:
(Manager, Employee1, Employee2) are the columns, but there are two Employee columns, which violates the first rule.

(Manager, Employee) is a better solution, but it still violates the second rule.

(Manager, EmployeeID) is the intuitive solution: since each Employee has only one Manager, EmployeeID identifies each row uniquely. Now our table conforms to 1NF.
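
As a rough illustration of the final (Manager, EmployeeID) layout, here is a small sketch with Python's built-in `sqlite3` module. The table name `reports_to` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per employee; employee_id is the unique column (the primary key).
conn.execute(
    "CREATE TABLE reports_to ("
    " employee_id INTEGER PRIMARY KEY,"
    " manager TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO reports_to (employee_id, manager) VALUES (?, ?)",
    [(1, "Alice"), (2, "Alice"), (3, "Bob")],
)

# A manager can have any number of employees without adding new columns.
team = [
    row[0]
    for row in conn.execute(
        "SELECT employee_id FROM reports_to WHERE manager = ? ORDER BY employee_id",
        ("Alice",),
    )
]
print(team)
```

Compared with the (Manager, Employee1, Employee2) layout, a third employee under the same manager is just another row, and the primary key rejects a second row for the same employee.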

2NF points out:
- Remove subsets of data that cause multiple rows to hold the same information, and create separate tables for them.
- Use foreign keys to relate the newly created tables to the previous one.

Example:
(EmpID, FirstName, LastName, City, Zipcode) is our table, but if two employees live in the same city, and therefore share the same zipcode, we get two rows with redundant data. So it is more appropriate to use another table that holds the cities and their corresponding zipcodes. In addition, the new structure makes it easier to add, remove or update address information. Instead of updating all the rows that have the same city and zipcode, we just need to update one row in the new table.
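
The split described above can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.executescript("""
CREATE TABLE zipcodes (
    zipcode TEXT PRIMARY KEY,
    city    TEXT NOT NULL
);
CREATE TABLE employees (
    emp_id     INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    zipcode    TEXT NOT NULL REFERENCES zipcodes(zipcode)
);
""")

conn.execute("INSERT INTO zipcodes VALUES ('12345', 'Springfield')")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ada", "Lovelace", "12345"), (2, "Alan", "Turing", "12345")],
)

# The city name is stored once, so a single-row update fixes it for everyone.
conn.execute("UPDATE zipcodes SET city = 'New Springfield' WHERE zipcode = '12345'")

rows = conn.execute(
    "SELECT e.first_name, z.city FROM employees e"
    " JOIN zipcodes z ON e.zipcode = z.zipcode ORDER BY e.emp_id"
).fetchall()
print(rows)
```

Both employees see the new city name after one update, and the foreign key stops an employee row from pointing at a zipcode that does not exist.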