Suppose you have a problem that you like to tackle with machine learning and use the resulting system in a real-life project. I like to share my simple pathway for such purpose, in order to provide a basic guide to beginners and keep these things as a reminder to myself. These rules are tricky since even-thought they are simple, it is not that trivial to remember all and suppress your instinct which likes to see a running model as soon as possible.
When we confronted any problem, initially we have numerous learning algorithms, many bytes or gigabytes of data and already established knowledge to apply some of these models to particular problems. With all these in mind, we follow a three stages procedure;
Here we have a very good slide summarizing performance measures, statistical tests and sampling for model comparison and evaluation. You can refer it when you have some couple of classifiers on different datasets and you want to see which one is better and why?
Selection of your final machine learning model is a vital part of your project. Using the accurate metric and the selection paradigm might give very good results even you use very simple or even wrong learning algorithm. Here, I explain a very parsimonious and plane way.
The metric you choose is depended to your problem end expectations. Some common alternatives are F1 score (combination of precision and recall), accuracy (ratio of correctly classified instances to all instances), ROC curve or error rate (1-accuracy).
For being an example I use error rate (at the below figure). First divide the data into 3 as train set, held-out set, test set. We will use held-out set as an objective guidance of hyper-parameters of your algorithm. You might also prefer to use K-fold X-validation but my choice is to keep a held-out set, if I have enough number of instances.
Following procedure can be used for parameter selection and the selection of the final model. The idea is, plotting the performance of the model with the lines of test fold accuracy (held-out set) and the train fold accuracy. This plot should be met at a certain point where both of the curves consistent in some sense (training fold and test fold scores are at reasonable levels) and after a slight step they start to be stray away from each other (train fold score increases still and test fold score starts to be dropped down). This straying effect might be underfitting or after a numerous learning iterations likely to be overfitting. Choice the best trade-off point on the plot as the correct model.
Another caveat, do not use so much folds for x-validation since some of the papers (that cannot come up the name right now:( ), asymptotic behaviour of cross validation is likely to tout over-fitting therefore use of leave-multiple out procedure instead of leave-one out if you propose to use large fold number.