Tag Archives: data science

ML WORK-FLOW (Part2) - Data Preprocessing

I try to keep my promised schedule on as much as possible. Here is the detailed the first step discussion of my proposed Machine Learning Work-Flow, that is Data Preprocessing.

Data Preprocessing is an important step in which mostly aims to improve raw data quality before you dwell into the technical concerns. Even-though this step involves very easy tasks to do, without this, you might observe very false or even freaking results at the end.

I also stated at the work-flow that, Data Preprocessing is statistical job other than ML. By saying this, Data Preprocessing demands good data inference and analysis just before any possible decision you made. These components are not the subjects of a ML course but are for a Statistics. Hence, if you aim to be cannier at ML as a whole, do not ignore statistics.

We can divide Data Preprocessing into 5 different headings;

  1. Data Integration
  2. Data Cleaning
  3. Data Transformation
  4. Data Discretization
  5. Data Reduction

Continue reading ML WORK-FLOW (Part2) - Data Preprocessing


Machine Learning Work-Flow (Part 1)

So far, I am planning to write a serie of posts explaining a basic Machine Learning work-flow (mostly supervised). In this post, my target is to propose the bird-eye view, as I'll dwell into details at the latter posts explaining each of the components in detail. I decide to write this serie due to two reasons; the first reason is self-education -to get all my bits and pieces together after a period of theoretical research and industrial practice- the second is to present a naive guide to beginners and enthusiasts.

Below, we have the overview of the proposed work-flow. We have a color code indicating bases. Each box has a color tone from YELLOW to RED. The yellower the box, the more this component relies on Statistics knowledge base. As the box turns into red[gets darker], the component depends more heavily on Machine Learning knowledge base. By saying this, I also imply that, without good statistical understanding, we are not able to construct a convenient machine learning pipeline. As a footnote, this schema is changed by post-modernism of Representation Learning algorithms and I'll touch this at the latter posts.


Continue reading Machine Learning Work-Flow (Part 1)