Here I summarize a new method called FAME for learning face models from a noisy set of web images. I am studying this for my MS thesis. As a short intro, the thesis is titled "Mining Web Images for Concept Learning" and it introduces two new methods for automatically learning visual concepts from noisy web images. The first proposed method is FAME; the other, ConceptMap, was presented here before and has been accepted to ECCV14 (self promotion :)).
Before I start, I should note that FAME is not yet a finished work and is waiting for your valuable comments. Please leave a comment about anything you find useful, ridiculous, awkward or great.
In this work, we tackle the problem of learning face models for public figures from images collected from the web by querying a particular person's name. The collected images are weakly labelled, since the query gives only a rough description of their content. Even after face detection the data is very noisy, with false detections and irrelevant faces of other people. The proposed method, FAME (Face Association through Model Evolution), prunes the data iteratively so that the face models associated with a name evolve. The idea is to quantify the representativeness and discriminativeness of each image against a vast number of random images and to eliminate the instances that score poorly on these two measures. At the end, the final clean data is used to train models for face identification on novel images. We believe FAME is a generic method that could be used in domains other than vision, but we did not test this claim in this work.
To put it more casually, the purpose is to query someone's name on an image search engine like Google Image Search and then, without any human effort, filter out the irrelevant instances and train good-quality face models. The crux of the pipeline is the data pruning procedure of FAME. Let's look at how FAME eliminates spurious instances through evolving models.
The idea is very simple and intuitive. Assume that we have a queried set of images of Adam Sandler that contains some spurious instances. We also have a vast number of random face images, again collected from the web. The first thing FAME does is learn a linear model separating the Sandler images (+ class) from the random images (- class). The hyperplane of this first model gives a discriminativeness measure of the Sandler images against the rest of the world (the random images), namely the distance to the hyperplane. Following this intuition, we select the top Sandler images that are farthest from the hyperplane on the positive side. These are supposed to be the most discriminative (iconic) images of Sandler compared to the random images. Second, we train another linear model that separates the selected images (+ class) from the remaining Sandler images (- class). This second model measures the representativeness of the instances by examining them in relation to the iconic images selected by the first model. Thereafter, we eliminate the instances that are farthest from the second hyperplane on the negative side. We believe these are the images that differ most from the iconic images and deviate from the rest of the class. This is a single iteration of FAME, and we repeat it until the desired level of image elimination is reached.
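To make the two-model step concrete, here is a minimal sketch of one pruning iteration. It is only an illustration under my own assumptions, not our actual implementation: the names train_logreg and fame_iteration, the parameters n_iconic and n_remove, and the plain gradient-descent logistic regression are all hypothetical. The signed score X @ w + b is used in place of the distance to the hyperplane, since the two give the same ranking.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=300, l2=1e-3):
    """Fit a linear logistic-regression model with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = np.clip(X @ w + b, -30, 30)          # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        err = p - y                              # gradient of log-loss w.r.t. the scores
        w -= lr * (X.T @ err / len(y) + l2 * w)
        b -= lr * err.mean()
    return w, b

def fame_iteration(X_cat, X_rand, n_iconic, n_remove):
    """One pruning step: returns kept row indices of X_cat and model-1 training accuracy."""
    # Model 1: category images (+) vs. random images (-).
    X1 = np.vstack([X_cat, X_rand])
    y1 = np.r_[np.ones(len(X_cat)), np.zeros(len(X_rand))]
    w1, b1 = train_logreg(X1, y1)
    acc = float(np.mean(((X1 @ w1 + b1) > 0) == (y1 == 1)))

    # Iconic images: category images farthest on the positive side of model 1.
    scores1 = X_cat @ w1 + b1
    iconic = np.argsort(-scores1)[:n_iconic]
    rest = np.setdiff1d(np.arange(len(X_cat)), iconic)

    # Model 2: iconic images (+) vs. remaining category images (-).
    X2 = np.vstack([X_cat[iconic], X_cat[rest]])
    y2 = np.r_[np.ones(len(iconic)), np.zeros(len(rest))]
    w2, b2 = train_logreg(X2, y2)

    # Remove the remaining images farthest on the negative side of model 2.
    scores2 = X_cat[rest] @ w2 + b2
    removed = rest[np.argsort(scores2)[:n_remove]]
    keep = np.setdiff1d(np.arange(len(X_cat)), removed)
    return keep, acc
```

Calling fame_iteration repeatedly on X_cat[keep] gives the evolving models described above; how many images to pick as iconic and how many to drop per round are free parameters in this sketch.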
I know, “desired level of image elimination” doesn't sound satisfying. So what can be used as a stop sign for the algorithm? We use the training accuracy of the first model (Sandler images vs. random images) as the measure of data quality, because as we iterate, this accuracy keeps increasing up to a certain limit. This is intuitive: as we eliminate spurious Sandler images, the separation between the random images and the remaining Sandler images gets clearer. Hence, when we see a decrease or saturation in the accuracy, we stop the algorithm and use the remaining salient images to train a final face model. Sometimes the accuracy reaches 100% in very few iterations. In that case, we keep going until 0.1 (10%) of the category images have been eliminated. This is the overall elimination level that we observe across all name categories.
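The stopping rule can be written down in a few lines. This is a sketch under my own assumptions: the function name should_stop, its parameters and the tolerance tol used to decide "saturation" are hypothetical, and only the 0.1 fraction comes from the text above.

```python
def should_stop(acc_history, n_removed, n_initial, min_removed_frac=0.1, tol=1e-4):
    """Decide whether to stop pruning.

    acc_history : training accuracies of the category-vs-random model, one per iteration
    n_removed   : number of category images eliminated so far
    n_initial   : number of category images before any pruning
    """
    if not acc_history:
        return False
    if acc_history[-1] >= 1.0:
        # Accuracy already perfect: continue until enough images are eliminated.
        return n_removed / n_initial >= min_removed_frac
    if len(acc_history) < 2:
        return False
    # Stop on a decrease or on saturation (improvement no larger than tol).
    return acc_history[-1] - acc_history[-2] <= tol
```

For example, should_stop([0.90, 0.93], 10, 500) says to continue, while should_stop([0.90, 0.93, 0.93], 15, 500) says to stop because the accuracy has stopped improving.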
Another curious aspect of our method is the data representation. We rely on very high-dimensional representations of face images, since this is needed to tell each category apart from the others despite the variation within a category (view variation, visual differences). In this way, we are able to separate any category with 100% accuracy using an easy-to-train linear model (we tested this by training a linear model among all classes). A linear model also makes FAME iterations faster than more complex classifiers would. Furthermore, we observe better results with simple logistic regression than with a linear SVM. Maybe this is because the SVM's margin loss imposes very restrictive constraints on the instance space, which do not suit FAME's problem.
Face images are represented by 40000-dimensional feature vectors, using a very simple but powerful method proposed by Adam Coates (LINK). It is basically a single layer of K-means quantization of visual words, followed by average spatial pooling over 5 regions (4 quadrants + image center). In their work they use only the 4 quadrants, but face images carry information at the center as well. If you are not able to run the trendy, expensive multi-layer models, this simple K-means alternative works very well with little or no performance loss. You can see example filters learned by the method. Many of the filters are receptive to eyes and mouths. (Seems like magic!)
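The 5-region pooling is easy to sketch. The snippet below assumes the K-means encoding has already produced an (H, W, K) feature map, one K-dimensional code per patch location. The function name pool_five_regions and the size of the center window (half the map in each direction) are my assumptions for illustration; the pooled vector has 5 * K dimensions.

```python
import numpy as np

def pool_five_regions(fmap):
    """Average-pool an (H, W, K) feature map over 4 quadrants plus a center window."""
    H, W, K = fmap.shape
    h2, w2 = H // 2, W // 2
    top, left = H // 4, W // 4                    # center window offset (assumed half-size window)
    regions = [
        fmap[:h2, :w2],                           # top-left quadrant
        fmap[:h2, w2:],                           # top-right quadrant
        fmap[h2:, :w2],                           # bottom-left quadrant
        fmap[h2:, w2:],                           # bottom-right quadrant
        fmap[top:top + h2, left:left + w2],       # image center
    ]
    # Each region becomes one K-dim average; concatenation gives a 5*K vector.
    return np.concatenate([r.mean(axis=(0, 1)) for r in regions])
```

The center window overlaps all four quadrants, so it adds no new pixels, only a pooled view of the area where most of the face sits.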
We train our final face models with an L1-norm SVM over the final pruned image collections, and the models are enhanced by the grafting algorithm proposed by Simon Perkins (LINK). Basically, grafting selects important feature dimensions greedily, based on their gradient information at each iteration. This way, we reduce the number of feature dimensions the final models need and so ease the system requirements.
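To give a feel for grafting, here is a rough sketch of the greedy selection loop. Note the swaps I made to keep it short: it uses an L1-regularized logistic loss instead of our L1-norm SVM, it has no bias term, and it re-optimizes the active weights with simple proximal gradient steps. The names graft, lam, max_features, steps and lr are hypothetical.

```python
import numpy as np

def logloss_grad(X, y, w):
    """Gradient of the mean logistic loss w.r.t. all weights (no bias term)."""
    z = np.clip(X @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    return X.T @ (p - y) / len(y)

def graft(X, y, lam=0.05, max_features=10, steps=200, lr=0.5):
    """Greedy, gradient-based feature selection in the spirit of grafting."""
    d = X.shape[1]
    w = np.zeros(d)
    is_active = np.zeros(d, dtype=bool)
    active = []                                   # selected features, in selection order
    while len(active) < max_features:
        g = logloss_grad(X, y, w)
        g[is_active] = 0.0                        # only inactive features compete
        j = int(np.argmax(np.abs(g)))
        if abs(g[j]) <= lam:
            break                                 # no feature can beat the L1 penalty
        is_active[j] = True
        active.append(j)
        # Re-optimize the active weights: gradient step plus soft-thresholding.
        for _ in range(steps):
            wa = w[is_active] - lr * logloss_grad(X, y, w)[is_active]
            w[is_active] = np.sign(wa) * np.maximum(np.abs(wa) - lr * lam, 0.0)
    return w, active
```

Weights outside the active set stay exactly zero, which is where the saving in feature dimensions comes from.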
We tested our method and classification pipeline on two well-known face datasets, PubFig83 and FAN-large. (For more details, visit the project page (LINK) or refer to the paper.) I won't give the full result table, only the important numbers. Our training pipeline (filter learning + L1 SVM + grafting), without data cleaning and with models trained on the training partition of the dataset, classifies the 83 categories of PubFig83 with 90.75% accuracy. To the best of our knowledge, this is higher than the state of the art of 85.9% reported by Becker et al.
If we train our models on noisy web images after FAME iterations, which is the real problem of this study, we classify PubFig83 with 79.3% accuracy and FAN-large-ALL (a very hard dataset, so to speak) with 67.1% accuracy. Notice that all these models are trained only on the web images filtered by FAME. The real improvement brought by FAME shows up in comparison to the baseline, which we obtain by using all the web images without any data filtering through the same training pipeline. This baseline gives 52.8% for PubFig and 52.7% for FAN-large, so FAME improves accuracy by 26.5 and 14.4 percentage points respectively.
As a final comment, this work was submitted to BMVC14 but was politely rejected, with ironic reviews from 2 of the 3 reviewers. I also believe that this work is not really ready for publication, so your comments are really precious for advancing the idea.