Lade Inhalt...

Face Image Analysis with Convolutional Neural Networks

Doktorarbeit / Dissertation 2008 191 Seiten

Informatik - Angewandte Informatik



1 Introduction
1.1 Context
1.2 Applications
1.3 Difficulties
1.3.1 Illumination
1.3.2 Pose
1.3.3 Facial Expressions
1.3.4 Partial Occlusions
1.3.5 Other types of variations
1.4 Objectives
1.5 Outline

2 Machine Learning Techniques for Object Detection and Recog nition
2.1 Introduction
2.2 Statistical Projection Methods
2.2.1 Principal Component Analysis
2.2.2 Linear Discriminant Analysis
2.2.3 Other Projection Methods
2.3 Active Appearance Models
2.3.1 Modeling shape and appearance
2.3.2 Matching the model
2.4 Hidden Markov Models
2.4.1 Introduction
2.4.2 Finding the most likely state sequence
2.4.3 Training
2.4.4 HMMs for Image Analysis
2.5 Adaboost
2.5.1 Introduction
2.5.2 Training
2.6 Support Vector Machines
2.6.1 Structural Risk Minimization
2.6.2 Linear Support Vector Machines
2.6.3 Non-linear Support Vector Machines
2.6.4 Extension to multiple classes
2.7 Bag of Local Signatures
2.8 Neural Networks
2.8.1 Introduction
2.8.2 Perceptron
2.8.3 Multi-Layer Perceptron
2.8.4 Auto-Associative Neural Networks
2.8.5 Training Neural Networks
2.8.6 Radial Basis Function Networks
2.8.7 Self-Organizing Maps
2.9 Conclusion

3 Convolutional Neural Networks
3.1 Introduction
3.2 Background
3.2.1 Neocognitron
3.2.2 LeCun’s Convolutional Neural Network model
3.3 Training Convolutional Neural Networks
3.3.1 Error Backpropagation with Convolutional Neural Networks
3.3.2 Other training algorithms proposed in the literature .
3.4 Extensions and variants
3.4.1 LeNet-5
3.4.2 Space Displacement Neural Networks
3.4.3 Siamese CNNs
3.4.4 Shunting Inhibitory Convolutional Neural Networks .
3.4.5 Sparse Convolutional Neural Networks
3.5 Some Applications
3.6 Conclusion

4 Face detection and normalization
4.1 Introduction
4.2 Face detection
4.2.1 Introduction
4.2.2 State-of-the-art
4.2.3 Convolutional Face Finder
4.3 Illumination Normalization
4.4 Pose Estimation
4.5 Face Alignment
4.5.1 Introduction
4.5.2 State-of-the-art
4.5.3 Face Alignment with Convolutional Neural Networks .
4.6 Conclusion
5 Facial Feature Detection
5.1 Introduction
5.2 State-of-the-art
5.3 Facial Feature Detection with Convolutional Neural Networks .
5.3.1 Introduction
5.3.2 Architecture of the Facial Feature Detection System .
5.3.3 Training the Facial Feature Detectors
5.3.4 Facial Feature Detection Procedure
5.3.5 Experimental Results
5.4 Conclusion

6 Face and Gender Recognition
6.1 Introduction
6.2 State-of-the-art in Face Recognition
6.3 Face Recognition with Convolutional Neural Networks
6.3.1 Introduction
6.3.2 Neural Network Architecture
6.3.3 Training Procedure
6.3.4 Recognizing Faces
6.3.5 Experimental Results
6.4 Gender Recognition
6.4.1 Introduction
6.4.2 State-of-the-art
6.4.3 Gender Recognition with Convolutional Neural Networks
6.5 Conclusion

7 Conclusion and Perspectives
7.1 Conclusion
7.2 Perspectives
7.2.1 Convolutional Neural Networks
7.2.2 Facial analysis with Convolutional Neural Networks . . .

A Excerpts from the used face databases
A.1 AR
A.2 BioID
A.4 Google Images
A.7 Yale


First of all, I would like to thank Dr. Christophe Garcia for his guidance and support over the last three years. This work would not have been possible without his excellent scientific as well as human qualities and the enormous amount of time he spent for me.

I also want to express my gratitude to my supervisor Prof. Dr. Hans Burkhardt who accompanied me during my thesis, gave me helpful advice and who always welcomed me in Freiburg.

Further, I would like to thank all my colleagues at France Telecom R&D (now Orange Labs), Rennes (France) where I spent three very pleasant years. Notably, Franck Mamalet, Sébastien Roux, Patrick Lechat, Sid-Ahmed Berrani, Zohra Saidane, Muriel Visani, Antoine Lehuger, Grégoire Lefebvre, Manolis Delakis and Christine Barbot.

Finally, I want to say thank you to my parents for their continuing support in every respect.


In this work, we present the problem of automatic appearance-based facial analysis with machine learning techniques and describe common specific subproblems like face detection, facial feature detection and face recognition which are the crucial parts of many applications in the context of indexation, surveillance, access-control or human-computer interaction.

To tackle this problem, we particularly focus on a technique called Convolu- tional Neural Network (CNN) which is inspired by biological evidence found in the visual cortex of mammalian brains and which has already been applied to many different classification problems. Existing CNN-based methods, like the face detection system proposed by Garcia and Delakis, show that this can be a very effective, efficient and robust approach to non-linear image processing tasks such as facial analysis.

An important step in many automatic facial analysis applications, e.g. face recognition, is face alignment which tries to translate, scale and rotate the face image such that specific facial features are roughly at predefined positions in the image. We propose an efficient approach to this problem using CNNs and experimentally show its very good performance on difficult test images.

We further present a CNN-based method for automatic facial feature detec- tion. The proposed system employs a hierarchical procedure which first roughly localizes the eyes, the nose and the mouth and then refines the result by detect- ing 10 different facial feature points. The detection rate of this method is 96% for the AR database and 87% for the BioID database tolerating an error of 10% of the inter-ocular distance.

Finally, we propose a novel face recognition approach based on a specific CNN architecture learning a non-linear mapping of the image space into a lower- dimensional sub-space where the different classes are more easily separable. We applied this method to several public face databases and obtained better recognition rates than with classical face recognition approaches based on PCA or LDA. Moreover, the proposed system is particularly robust to noise and partial occlusions.

We also present a CNN-based method for the binary classification problem of gender recognition with face images and achieve a state-of-the-art accuracy. The results presented in this work show that CNNs perform very well on various facial image processing tasks, such as face alignment, facial feature de-tection and face recognition and clearly demonstrate that the CNN technique is a versatile, efficient and robust approach for facial image analysis.


In dieser Arbeit stellen wir das Problem der automatischen, erscheinungsbasier- ten Gesichts-Analyse dar und beschreiben gängige, spezifische Unterprobleme wie z.B. Gesichts- und Gesichtsmerkmals-Lokalisierung oder Gesichtserkennung, welche grundlegende Bestandteile vieler Anwendungen im Bereich Indexierung, Überwachung, Zugangskontrolle oder Mensch-Maschine-Interaktion sind.

Um dieses Problem anzugehen, konzentrieren wir uns auf einen bestimmten Ansatz, genannt Neuronales Faltungs-Netzwerk, englisch Convolutional Neural Network (CNN), welcher auf biologischen Befunden, die im visuellen Kortex von Säugetierhirnen entdeckt wurden, beruht und welcher bereits auf viele Klassifizierungsprobleme angewandt wurde. Bestehende CNN-basierte Methoden, wie das Gesichts-Lokalisierungs-System von Garcia und Delakis, zeigen, dass dies ein sehr effektiver, effizienter und robuster Ansatz für nicht-lineare Bildverarbeitungs-Aufgaben wie Gesichts-Analyse sein kann.

Ein wichtiger Schritt in vielen Anwendungen der automatischen GesichtsAnalyse, z.B. Gesichtserkennung, ist die Gesichts-Ausrichtung und -Zentrierung. Diese versucht das Gesichts-Bild so zu verschieben, zu drehen und zu vergrößern bzw. verkleinern, dass sich bestimmte Gesichtsmerkmale an vordefinierten BildPositionen befinden. Wir stellen einen effizienten Ansatz für dieses Problem vor, der auf CNNs beruht, und zeigen experimentell und anhand schwieriger Testbilder die sehr gute Leistungsfähigkeit des Systems.

Darüberhinaus stellen wir eine CNN-basierte Methode zur automatischen Gesichtsmerkmals-Lokalisierung vor. Das System bedient sich einem hierarchi- schen Verfahren, das zuerst grob die Augen, die Nase und den Mund lokalisiert, und dann das Ergebnis verfeinert indem es 10 verschiedene Gesichtsmerkmals- Punkte erkennt. Die Erkennungsrate dieser Methode liegt bei 96% für die AR- Datenbank und 87% für die BioID-Datenbank mit einer Fehler-Toleranz von 10% des Augenabstandes.

Schließlich stellen wir einen neuen Gesichtserkennungs-Ansatz vor, welcher auf einer spezifischen CNN-Architektur beruht und welcher eine nicht-lineare Abbildung vom Bildraum in einen niedrig-dimensionalen Unterraum lernt, in dem die verschiedenen Klassen leichter trennbar sind. Diese Methode wurde auf verschiedene öffentliche Gesichts-Datenbanken angewandt und erzielte bes- sere Erkennungsraten als klassische Gesichtserkennungs-Ansätze, die auf PCA oder LDA beruhen. Darüberhinaus ist das System besonders robust bezüglich Rauschen und partiellen Verdeckungen.

Wir stellen ferner eine CNN-basierte Methode zum binären KlassifizierungProblem der Geschlechtserkennung mittels Gesichts-Bildern vor und erzielen eine Genauigkeit, die dem aktuellen Stand der Technik entspricht.

Die Ergebnisse, die in dieser Arbeit dargestellt sind, beweisen, dass CNNs sehr gute Leistungen in verschiedenen Gesichts-Bildverarbeitungs-Aufgaben erzielen, wie z.B. Gesichts-Ausrichtung, Gesichtsmerkmals-Lokalisierung und Gesichtserkennung. Sie zeigen außerdem deutlich, dass CNNs ein vielseitiges, effizientes und robustes Verfahren zur Gesichts-Analyse sind.


Dans cette thèse, nous proposons le problème de l’analyse faciale basée sur l’apparence avec des techniques d’apprentissage automatique et nous décrivons des sous-problèmes spécifiques tels que la détection de visage, la détection de caractéristiques faciales et la reconnaissance de visage qui sont des composants indispensables dans de nombreuses applications dans le contexte de l’indexation, la surveillance, le contrôle d’accès et l’interaction homme-machine.

Afin d’aborder ce problème, nous nous concentrons sur une technique nommée réseau de neurones à convolution, en anglais Convolutional Neural Network (CNN), qui est inspirée des découvertes biologiques dans le cortex visuel des mammifères et qui a déjà été appliquée à de nombreux problèmes de classifica- tion. Des méthodes existantes, comme le système de détection de visage proposé par Garcia et Delakis, montrent que cela peut être une approche très efficace et robuste pour des applications de traitement non-linéaire d’images tel que l’analyse faciale.

Une étape importante dans beaucoup d’applications d’analyse facial, comme la reconnaissance de visage, constitue le recadrage automatique de visage. Cette technique cherche à décaler, tourner et agrandir ou reduire l’image de visage de sorte que des caractéristiques faciales se trouvent environ à des positions définies préalablement dans l’image. Nous proposons une approche efficace pour ce problème en utilisant des CNNs et nous montrons une très bonne performance de cette approche sur des images de test difficiles.

Nous présentons également une méthode basée CNN pour la détection de caractéristiques faciales. Le système proposé utilise une procedure hiérarchique qui localise d’abord les yeux, le nez et la bouche pour ensuite affiner le résultat en détectant 10 points de caractéristiques faciales différentes. Le taux de détection est de 96 % pour la base AR et de 87 % pour la base BioID avec une tolérance d’erreur de 10 % de la distance inter-oculaire.

Enfin, nous proposons une nouvelle approche de reconnaissance de visage basée sur une architecture spécifique de CNN qui apprend une projection non- linéaire de l’espace de l’image dans un espace de dimension réduite où les classes différentes sont séparables plus facilement. Nous appliquons cette méthode à plusieurs bases publiques de visage et nous obtenons des taux de reconnais- sance meilleurs qu’en utilisant des approches classiques basées sur l’Analyse en Composantes Principales (ACP) ou l’Analyse Discriminante Linéaire (ADL). En outre, le système proposé est particulièrement robuste par rapport au bruit et aux occultations partielles.

Nous présentons également une methode basée CNN pour le problème de reconnaissance de genre à partir d’images de visage et nous obtenons un taux comparable à l’état de l’art.

Les résultats présentés dans cette thèse montrent que les CNNs sont très performants dans de nombreuses applications de traitement d’images faciales telles que le recadrage de visage, la détection de caractéristiques faciales et la reconnaissance de visage. Ils démontrent également que la technique de CNN est une approche très variée, efficace et robuste pour l’analyse automatique d’image faciale.

List of Figures

1.1 An example face under a fixed view and varying illumination . .

1.2 An example face under fixed illumination and varying pose . . .

1.3 An example face under fixed illumination and pose but varying facial expression

2.1 Active Appearance Models: annotated training example and cor- responding shape-free patch

2.2 A left-right Hidden Markov Model

2.3 Two simple approaches to image analysis with 1D HMMs

2.4 Illustration of a 2D Pseudo-HMM

2.5 Graphical illustration of a linear SVM

2.6 The histogram creation procedure with the Bag-of-local-signature approach

2.7 The Perceptron

2.8 A Multi-Layer Perceptron

2.9 Different types of activation functions

2.10 Auto-Associative Neural Networks

2.11 Typical evolution of training and validation error

2.12 The two possible cases that can occur when the minimum on the validation set is reached

2.13 A typical evolution of the error criteria on the validation set using the proposed learning algorithm

2.14 The evolution of the validation error on the NIST database using Backpropagation and the proposed algorithm

2.15 The validation error curves of the proposed approach with differ- ent initial global learning rates

2.16 The architecture of a RBF Network

2.17 A two-dimensional SOM with rectangular topology

2.18 Evolution of a two-dimensional SOM during training

3.1 The model of a S-cell used in the Neocognitron

3.2 The topology of the basic Neocognitron

3.3 Some training examples used to train the first two S-layers of Fukushima’s Neocognitron

3.4 The architecture of LeNet-1

3.5 Convolution and sub-sampling

3.6 Error Backpropagation with convolution maps

3.7 Error Backpropagation with sub-sampling maps


3.8 The architecture of LeNet-5

3.9 A Space Displacement Neural Network

3.10 Illustration of a Siamese Convolutional Neural Network

3.11 Example of positive (genuine) and negative (impostor) error func- tions for Siamese CNNs

3.12 The shunting inhibitory neuron model

3.13 The SICoNNet architecture

3.14 The connection scheme of the SCNN proposed by Gepperth . . .

3.15 The sparse, shift-invariant CNN model proposed by Ranzato et al .

4.1 The architecture of the Convolutional Face Finder

4.2 Training examples for the Convolutional Face Finder

4.3 The face localization procedure of the Convolutional Face Finder

4.4 Convolutional Face Finder: ROC curves for different test sets . .

4.5 Some face detection results of the Convolutional Face Finder ob- tained with the CMU test set

4.6 The three rotation axes defined with respect to a frontal head . .

4.7 The face alignment process of the proposed approach

4.8 The Neural Network architecture of the proposed face alignment system

4.9 Training examples for the proposed face alignment system

4.10 The overall face alignment procedure of the proposed system . .

4.11 Correct alignment rate vs. allowed mean corner distance of the proposed approach

4.12 Precision of the proposed alignment approach and the approach based on facial feature detection

4.13 Sensitivity analysis of the proposed alignment approach: Gaus- sian noise

4.14 Sensitivity analysis of the proposed alignment approach: partial occlusion

4.15 Some face alignment results of the proposed approach on the Internet test set

5.1 Principal stages of the feature detection process of the proposed approach

5.2 Some input images and corresponding desired output feature maps

5.3 Architecture of the proposed facial feature detector

5.4 Eye feature detector: example of an input image with desired facial feature points, desired output maps and superposed desired output maps

5.5 Mouth feature detector: example of an input image with desired facial feature points, desired output maps and superposed desired output maps

5.6 Facial feature detector: virtual face images created by applying various geometric transformations

5.7 Facial feature detector: detection rate versus me of the four features

5.8 Facial feature detector: detection rate versus mei of each facial feature (FERET)

5.9 Facial feature detector: detection rate versus mei of each facial feature (Google images)


5.10 Facial feature detector: detection rate versus mei of each facial feature (PIE subset)

5.11 The different types of CNN input features that have been tested

5.12 ROC curves comparing the CNNs trained with different input features (FERET database)

5.13 ROC curves comparing the CNNs trained with different input features (Google images)

5.14 ROC curves comparing the CNNs trained with different input features (PIE subset)

5.15 Sensitivity analysis of the proposed facial feature detector: Gaus- sian noise

5.16 Sensitivity analysis of the proposed facial feature detector: partial occlusion

5.17 Facial feature detection results on different face databases

5.18 Overall detection rate of the proposed facial feature detection method for AR

5.19 Overall detection rate of the proposed facial feature detection method for BioID

5.20 Some results of combined face and facial feature detection with the proposed approach

6.1 The basic schema of our face recognition approach showing two different individuals

6.2 Architecture of the proposed Neural Network for face recognition

6.3 ROC curves of the proposed face recognition algorithm for the ORL and Yale databases

6.4 Examples of image reconstruction of the proposed face recogni- tion approach

6.5 Comparison of the proposed approach with the Eigenfaces and Fisherfaces approach: ORL database

6.6 Comparison of the proposed approach with the Eigenfaces and Fisherfaces approach: Yale database

6.7 Sensitivity analysis of the proposed face recognition approach: Gaussian noise

6.8 Sensitivity analysis of the proposed face recognition approach: partial occlusion

6.9 Examples of training images for gender classification

6.10 ROC curve of the gender recognition CNN applied to the unmixed FERET test set

List of Tables

2.1 Comparison of the proposed learning algorithm with Backpropa- gation and the bold driver method (10 hidden neurons)

2.2 Comparison of the proposed learning algorithm with Backpropa- gation and the bold driver method (40 hidden neurons)

3.1 The connection scheme of layer C3 of Lenet-5

4.1 Detection rate vs. false alarm rate of selected face detection meth- ods on the CMU test set

4.2 The connection scheme of layer C2 of the Convolutional Face Finder

4.3 Comparison of face detection results evaluated on the CMU and MIT test sets

4.4 Execution speed of the CFF on different platforms

5.1 Overview of detection rates of some published facial feature de- tection methods

5.2 Comparison of eye pupil detection rates of some published meth- ods on the BioID database

6.1 Recognition rates of the proposed approach compared to Eigen- faces and Fisherfaces

List of Algorithms

1 The Viterbi algorithm

2 The Adaboost algorithm

3 The standard online Backpropagation algorithm for MLPs

4 The proposed online Backpropagation algorithm with adaptive learning rate

5 The RPROP algorithm

6 The line search algorithm

7 A training algorithm for Self-Organizing Maps

8 The online Backpropagation algorithm for Convolutional Neural Networks

Chapter Introduction

1.1 Context

The automatic processing of images to extract semantic content is a task that has gained a lot of importance during the last years due to the constantly increasing number of digital photographs on the Internet or being stored on personal home computers. The need to organize them automatically in a intel- ligent way using indexing and image retrieval techniques requires effective and efficient image analysis and pattern recognition algorithms that are capable to extract relevant semantic information.

Especially faces contain a great deal of valuable information compared to other objects or visual items in images. For example, recognizing a person on a photograph, in general, tells a lot about the overall content of the picture.

In the context of human-computer interaction (HCI), it might also be im- portant to detect the position of specific facial characteristics or recognize facial expressions, in order to allow, for example, a more intuitive communication be- tween the device and the user or to efficiently encode and transmit facial images coming from a camera. Thus, the automatic analysis of face images is crucial for many applications involving visual content retrieval or extraction.

The principal aim of facial analysis is to extract valuable information from face images, such as its position in the image, facial characteristics, facial expressions, the person’s gender or identity.

We will outline the most important existing approaches to facial image anal- ysis and present novel methods based on Convolutional Neural Networks (CNN) to detect, normalize and recognize faces and facial features. CNNs show to be a powerful and flexible feature extraction and classification technique which has been successfully applied in other contexts, i.e. hand-written character recogni- tion, and which is very appropriate for face analysis problems as we will exper- imentally show in this work.

We will focus on the processing of two-dimensional gray-level images as this is the most widespread form of digital images and thus allows the proposed approaches to be applied in the most extensive and generic way. However, many techniques described in this work could also be extended to color images, 3D data or multi-modal data.

1.2 Applications

There are numerous possible applications for facial image processing algorithms. The most important of them concern face recognition. In this regard, one has to differentiate between closed world and open world settings. In a closed world application, the algorithm is dedicated to a limited group of persons, e.g. to recognize the members of a family. In an open world context the algorithm should be able to deal with images from “unknown” persons, i.e. persons that have not been presented to the system during its design or training. For example, an application indexing large image databases like Google images or television programs should recognize learned persons and respond with “unknown” if the person is not in the database of registered persons.

Concerning face recognition, there further exist two types of problems: face identification and face verification (or authentication). The first problem, face identification, is to determine the identity of a person on an image. The second one only deals with the question: “Is ‘X’ the identity of the person shown on the image?” or “Is the person shown on the image the one he claims to be?”. These questions only require “yes” or “no” as the answer.

Possible applications for face authentication are mainly concerned with ac- cess control, e.g. restricting the physical access to a building, such as a corporate building, a secured zone of an airport, a house etc. Instead of opening a door by a key or a code, the respective person would communicate an identifier, e.g. his/her name, and present his/her face to a camera. The face authentication system would then verify the identity of the person and grant or refuse the access accordingly. This principle could equally be applied to the access to sys- tems, automatic teller machines, mobile phones, Internet sites etc. where one would present his face to a camera instead of entering an identification number or password.

Clearly, also face identification can be used for controlling access. In this case the person only has to present his/her face to the camera without claiming his/her identity. A system recognizing the identity of a person can further be employed to control more specifically the rights of the respective persons stored in its database. For instance, parents could allow their children to watch only certain television programs or web sites, while the television or computer would automatically recognize the persons in front of it.

Video surveillance is another application of face identification. The aim here is to recognize suspects or criminals using video cameras installed at public places, such as banks or airports, in order to increase the overall security of these places. In this context, the database of suspects to recognize is often very large and the images captured by the camera are of low quality, which makes the task rather difficult.

With the vast propagation of digital cameras in the last years the number of digital images stored on servers and personal home computers is rapidly growing. Consequently, there is an increasing need of indexation systems that automatically categorize and annotate this huge amount of images in order to allow effective searching and so-called content-based image retrieval. Here, face detection and recognition methods play a crucial role because a great part of photographs actually contain faces. A similar application is the temporal segmentation and indexation of video sequences, such as TV programs, where different scenes are often characterized by different faces.

illustration not visible in this excerpt

Figure 1.1: An example face under a fixed view and varying illumination

Another field of application is facial image compression, i.e. parts of images containing faces can be coded by a specialized algorithm that incorporates a generic face model and thus leads to very high compression rates compared to universal techniques.

Finally, there are many possible applications in the field of advanced Human- Computer Interaction (HCI), e.g. the control and animation of avatars, i.e. com- puter synthesized characters. Such systems capture the position and movement of the face and facial features and accordingly animate a virtual avatar, which can be seen by the interlocutor. Another example would be the facilitation of the interaction of disabled persons with computers or other machines or the automatic recognition of facial expressions in order to detect the reaction of the person(s) sitting in front of a camera (e.g. smiling, laughing, yawning, sleeping).

1.3 Difficulties

There are some inherent properties of faces as well as the way the images are captured which make the automatic processing of face images a rather difficult task. In the case of face recognition, this leads to the problem that the intra-class variance, i.e. variations of the face of the same person due to lighting, pose etc., is often higher than the inter-class variance, i.e. variations of facial appearance of different persons, and thus reduces the recognition rate. In many face analysis applications, the appearance variation resulting from these circumstances can also be considered as noise as it makes the desired information, i.e. the identity of the person, harder to extract and reduces the overall performance of the respective systems.

In the following, we will outline the most important difficulties encountered in common real-world applications.

1.3.1 Illumination

Changes in illumination can entail considerable variations of the appearance of faces and thus face images. Two main types of light sources influence the overall illumination: ambient light and point light (or directed light). The former is somehow easier to handle because it only affects the overall brightness of the resulting image. The latter however is far more difficult to analyze, as face images taken under varying light source directions follow a highly non-linear function. Additionally, the face can cast shadows on itself. Figure 1.1 illustrates the impact of different illumination on face images.

illustration not visible in this excerpt

Figure 1.2: An example face under fixed illumination and varying pose

Figure 1.3: An example face under fixed illumination and pose but varying facial expression

Many approaches have been proposed to deal with this problem. Some face detection or recognition methods try to be invariant to illumination changes by implicitly modeling them or extracting invariant features. Others propose a separate processing step, a kind of normalization, in order to reduce the effect of illumination changes. In section 4.3 some of these illumination normalization methods will be outlined.

1.3.2 Pose

The variation of head pose or, in other words, the viewing angle from which the image of the face was taken is another difficulty and essentially impacts the performance of automatic face analysis methods. For this reason, many applications limit themselves to more or less frontal face images or otherwise perform a pose-specific processing that requires a preceding estimation of the pose, like in multi-view face recognition approaches. Section 4.4 outlines some 2D pose estimation approaches that have been presented in the literature.

If the rotation of the head coincides with the image plane the pose can be normalized by estimating the rotation angle and turning the image such that the face is in an upright position. This type of normalization is part of a procedure called face alignment or face registration and is described in more detail in section 4.5.

illustration not visible in this excerpt

Figure 1.2 shows some example face images with varying head pose.

1.3.3 Facial Expressions

The appearance of a face with different facial expressions varies considerably (see Fig. 1.3). Depending on the application, this can be of more or less importance. For example, for access control systems the subjects are often required to show a neutral expression. Thus, invariance to facial expression might not be an issue in this case. On the contrary, in an image or video indexation system, for example, this would be more important as the persons are shown in every-day situations and might speak, smile, laugh etc.

In general, the mouth is subject to the largest variation. The respective person on an image can have an open or closed mouth, can be speaking, smiling, laughing or even making grimaces.

Eyes and eyebrows are also changing subject to varying facial expressions, e.g. when the respective person blinks, sleeps or widely opens his/her eyes.

1.3.4 Partial Occlusions

Partial occlusions occur quite frequently in real-world face images. They can be caused by a hand occluding a part of the face, e.g. the mouth, by long hear, glasses, sun glasses or other objects or persons.

In most of the cases, however, the face occludes parts of itself. For example, in a view from the side the other side of the face is hidden. Also, a part of the cheek can be occluded by the nose or an eye can be covered by its orbit for example.

1.3.5 Other types of variations

Appearance variations a also caused by varying make-up, varying hair-cut and the presence of facial hear (beard, mustache etc.).

Varying age is also an important factor influencing the performance of many face analysis methods. This is the case for example in face recognition when the reference face image has been taken some years before the image to recognize.

Finally, there are also variations across the subjects’ identities, such as race, skin color or, more generally, ethnic origin. The respective differences in the appearance of the face images can cause difficulties in applications like face or facial feature detection or gender recognition.

1.4 Objectives

The goals pursued in this work principally concern the evaluation of Convolutional Neural Networks (CNN) in the context of facial analysis applications. More specifically, we will focus on the following objectives:

- evaluate the performance of CNNs w.r.t. appearance-based facial analysis
- investigate the robustness of CNNs against classical sources of noise in the context of facial analysis
- propose different CNN architectures designed for specific facial analysis problems such as face alignment, facial feature detection, face recognition and gender classification
- improve upon the state-of-the-art in appearance-based facial feature detec- tion, face alignment as well as face recognition under real-world conditions
- investigate different solutions improving the performance of automatic face recognition systems

1.5 Outline

In the following chapter we will outline some of the most important machine learning techniques used for object detection and recognition in images, such as statistical projection methods, Hidden Markov Models, Support Vector Machines and Neural Networks.

In chapter 3, we will then focus on one particular approach, called Convolutional Neural Networks (CNN), which is the foundation for the methods proposed in this work.

Having described, among other aspects, the principle architecture and training methods for CNNs, in chapter 4 we will outline the problem of face detection and normalization and how CNNs can tackle these types of problems. Using an existing CNN-based face detection system, called Convolutional Face Finder (CFF), we will further present an effective approach for face alignment which is an important step in many facial analysis applications.

In chapter 5, we will describe the problem of facial feature detection which shows to be crucial for any facial image processing task. We will propose an approach based on a specific type of CNN to solve this problem and experimentally show its performance in terms of precision and robustness to noise.

Chapter 6 outlines two further facial analysis problems, namely automatic face recognition and gender recognition. We will also present CNN-based ap- proaches to these problems and experimentally show their effectiveness com- pared to other machine learning techniques proposed in the literature.

Finally, chapter 7 will conclude this work with a short summary and some perspectives for future research.

Chapter 2 Machine Learning Techniques for Object Detection and Recognition

2.1 Introduction

In this chapter we will outline some of the most common machine learning approaches to object detection and recognition. Machine Learning techniques automatically learn from a set of examples how to classify new instances of the same type of data. The capacity to generalize, i.e. the ability to success- fully classify unknown data and possibly infer generic rules or functions, is an important property of these approaches and is sought to be maximized.

Usually, one distinguishes between three types of learning:

Supervised learning A training set and the corresponding desired outputs of the function to learn are available. Thus, during training the algorithm iteratively presents examples to the system and adapts its parameters according to the distance between the produced and the desired outputs.

Unsupervised learning The underlying structure of the training data, i.e. the desired output, is unknown and is to be determined by the training algorithm. For example, for a classification method this means that the class information is not available and has to be approximated by grouping the training examples using some distance measure, a technique called clustering.

Reinforcement learning Here, the exact output of the function to learn is unknown, and training consists in a parameter adjustment based on only two concepts, reward and penalty. That is, if the system does not perform well (enough) it is “penalized” and the parameters are adapted accord- ingly. Otherwise, it is “rewarded”, i.e. some positive reinforcement takes place.

Most of the algorithms described in the following are supervised, but they are employed for rather different purposes: some of them are used to extract features from the input data, some are used to classify the extracted features, and others perform both tasks.

The application context varies also largely, i.e. some of the approaches can be used for detection of features and/or objects, some only for recognition and others for both. Further, in many systems a combination of several of the techniques described in this chapter is used. Thus, in a sense, they could be considered as some kind of building blocks for effective object detection and recognition systems.

Let us begin with some of the most universal techniques used in machine learning which are based on a statistical analysis of the data allowing to significantly reduce its dimensionality and extract valuable information.

2.2 Statistical Projection Methods

In order to be able to automatically analyze images, they are often resized to have a certain width w and height h. Then, the respective image rows or columns of each image are concatenated to build a vector of dimension n = w × h. The resulting vector space is called image space, denoted I in the following.

In signal processing tasks there is often a lot of redundancy in the respective images/vectors because, firstly, images of the same class of objects are likely to be similar and, secondly, neighboring pixels in an image are highly correlated.

Thus, it seems obvious to represent the images in a more compact form, i.e. to project the vectors into a subspace S of I by means of a statistical projection method. In the literature, the terms dimensionality reduction or feature selection are often employed in the context of these techniques. These methods aim at computing S which, in general, is of lower dimension than I, such that the transformed image vectors are statistically less correlated. There are two main groups of projections: linear and non-linear projections.

Linear projection techniques transform an image vector x = (x1, . . . , xn)T , of dimension n into a vector s = (s1, . . . ,sk)T of dimension k, by a linear k × n transformation matrix W :

illustration not visible in this excerpt

In general, one eliminates those basis vectors that are supposed to contain the least important information for a given application using a predefined criteria. Thus, the dimension k of the resulting subspace S can be chosen after calculating the basis vectors spanning the entire subspace.

The most common and fundamental projection methods are the Principal Component Analysis (PCA) and the Linear Discriminant Analysis (LDA) which will be described in the following sections.

Non-linear approaches are applied when a linear projection does not suffice to represent the data in a way that allows the extraction of discriminant features. This is the case for more complex distributions where mere hyperplanes fail to separate the classes to distinguish. As most of these approaches are iterative, they require an a priori choice of the dimension k of the resulting subspace S.

2.2.1 Principal Component Analysis

Principal Component Analysis (PCA), also known as the discrete Karhunen- Loève Transform (KLT) or Hotelling Transform as it is due to Hotelling[101], is a linear orthogonal projection into the subspace where the first dimension (or axis) corresponds to the direction of I having the greatest variance, the second dimension to the direction with the second greatest variance and so on.

Thus, the resulting orthogonal subspace S, called principal subspace, de- scribes best the distribution of the input space I. It finds the directions of greatest variance, which are supposed to reflect the most “important” aspects of the data.

Given a certain number N of input vectors [Abbildung in dieser Leseprobe nicht enthalten]that are assumed to have a multi-normal distribution and to be centered, i.e. 1 [Abbildung in dieser Leseprobe nicht enthalten]

illustration not visible in this excerpt

where si ∈ Rk . Now let Σ be the covariance matrix of the input vectors

illustration not visible in this excerpt

Hence, the covariance matrix of the projected vectors si is defined as

illustration not visible in this excerpt

Finally, the projection matrix W is supposed to maximize the variance of the projected vectors. Thus,

illustration not visible in this excerpt

The k columns of W , i.e. the basis vectors of S, are called the principal components and represent the eigenvectors corresponding to the largest eigenvalues of the covariance matrix Σ.

An important characteristic of PCA is that if k < n the reconstruction error e in terms of the Euclidean distance is minimal,

illustration not visible in this excerpt

Thus, the first k eigenvectors form a subspace that optimally encodes or represents the input space I. This fact is exploited for example in compression algorithms and template matching techniques.

The choice of k depends largely upon the actual application. Additionally, for some applications it might not even be optimal to select the eigenvectors corresponding to the largest eigenvalues.

Kirby et al .[122] introduced a classical selection criteria which they call energy dimension. Let λj be the eigenvalue associated with the jth eigenvector. Then, the energy dimension of the ith eigenvector is:

illustration not visible in this excerpt

One can show that the Mean Squared Error (MSE) produced by the last n−i rejected eigenvectors [Abbildung in dieser Leseprobe nicht enthalten] . The selection of k now consists in determining a threshold τ such that Ek−1 > τ and Ek < τ .

Apart from image compression and template matching, PCA is often applied to classification tasks, e.g. the Eigenfaces approach[243] in face recognition. Here, the projected vectors si are the signatures to be classified. To this end, the signatures of the N input images are each associated with a class label and used to build a classifier. The most simple classifier would be a nearest neighbor classifier using an Euclidean distance measure.

To sum up, PCA calculates the linear orthogonal subspace having its axes oriented with the directions of greatest variances. It thus optimally represents the input data. However, in a classification context it is not guaranteed that in the subspace calculated by PCA the separability of the data is improved. In this regard, the Linear Discriminant Analysis (LDA) described in the following section is more suitable.

2.2.2 Linear Discriminant Analysis

The Linear Discriminant Analysis (LDA) has been introduced by Fisher[69] in 1936 but generalized later on to the so-called Fisher’s Linear Discriminant (FLD). It is, in contrast to the PCA, not only concerned with the best representation of the data but also with its separability in the projected subspace with regard to the different classes.

Let Ω = {x1, . . .xN} be the training set partitioned into c annotated classes denoted Ωi (i ∈ 1..c). We are now searching the subspace S that maximizes the inter-class variability while minimizing the intra-class variability, thus improving the separability of the respective classes. To this end, one maximizes the socalled Fisher’s criterion [69, 16]:

illustration not visible in this excerpt

the between-class variance. Nj is the number of examples in Ωj (i.e. of class j)∑ and xj are the respective means, i.e. [Abbildung in dieser Leseprobe nicht enthalten] is the overall mean of the data which is assumed to be centered, i.e. x = 0.

The projection matrix W is obtained by calculating the eigenvectors associated with the largest eigenvalues of the matrix Σ−[1] w Σb. These eigenvectors form the columns of W .

A problem occurs when the number of examples N is smaller than the size of the input vectors, i.e. for images the number of pixels n. Then, Σw is singular since its rank is at most N −c. The calculation of Σ−[1] w isthusimpossible.Several approaches have been proposed to overcome this problem. One is to produce additional examples by adding noise to the images of the training database. Another approach consists in first applying PCA to reduce the input vector space to the dimension N − c and then perform LDA as described above.

2.2.3 Other Projection Methods

There are many other projection techniques proposed in the literature and which can possibly be applied to object detection and recognition. For example, Independent Component Analysis (ICA) [17, 2, 35, 109, 108] is a technique often used for blind source separation[120], i.e. to find the different independent sources a given signal is composed of. ICA seeks a linear sub-space where the data is not only uncorrelated but statistically independent. In its most simple form, the model is the following:

illustration not visible in this excerpt

where x is the observed data, s are the independent sources and A is the so- called mixing matrix. ICA consists in optimizing an objective function, denoted contrast function, that can be based on different criteria. The contrast function has to ensure that the projected data is independent and non-Gaussian. Note that ICA does not reduce the dimensionality of the input data. Hence, it is often employed in combination with PCA or any other dimensionality reduction technique. Numerous implementations of ICA exist, e.g. INFOMAX[17], JADE [35] or FastICA[109].

Yang et al .[263] introduced the so-called two-dimensional PCA, which does not require the input image to be transformed into a one-dimensional vector beforehand. Instead, a generalized covariance matrix is directly estimated using the image matrices. Then, the eigenvectors are determined in a similar manner than for 1D-PCA by minimizing a special criterion based on this covariance matrix. Finally, in order to perform classification a distance measure between matrix signatures has to be defined. It has been shown that this method outper- forms one-dimensional PCA in terms of classification rate[263] and robustness [254].

Visani et al .[252] presented a similar approach based on LDA: the two- dimensional oriented LDA. The procedure is analogical to the 2D-PCA method where the projection is directly performed on the image matrices, either column- wise or row-wise. A generalized Fisher’s criterion is defined and minimized in order to obtain the projection matrix. Further, the authors showed that in contrast to LDA, the two-dimensional oriented LDA can implicitly circumvent the singularity problem. In a later work[253], they generalized this approach to the Bilinear Discriminant Analysis (BDA) where column-wise and row-wise 2D-LDA is iteratively applied to estimate the pair of projection matrices min- imizing an expression similar to the Fisher’s criterion which combines the two projections.

Note that the projection methods presented so far are all linear projection techniques. However, in some cases the different classes cannot be correctly separated in a linear sub-space. Then, non-linear projection methods can help to improve the classification rate. Most of the linear projection methods can be made non-linear by projecting the input data into a higher-dimensional space where the classes are more likely to be linearly separable. That means, the separating hyperplane in this sub-space represents a non-linear sub-space of the input vector space. Fortunately, it is not necessary to explicitly describe this higher-dimensional space and the respective projection function if we find a so- called kernel function that implements a simple dot-product in this vector space and satisfies the Mercer’s condition (see Theorem 1 on p. 22). For a more formal explanation see section 2.6.3 on non-linear SVMs. The kernel function allows to perform a dot-product in the target vector space and can be used to construct non-linear versions of the previously described projection techniques e.g. PCA [219, 264], LDA[161] or ICA[6].

The projection approaches that have been outlined in this section can in principal be applied to any type of data in order to perform a statistical anal- ysis on the respective examples. A technique called Active Appearance Model (AAM)[41] can also be classified as a statistical projection approach but it is much more specialized to model images of deformable objects under varying external conditions. Thus, in contrast to methods like PCA or LDA, where the input image is treated as a “static” vector, small local deformations are taken into account. AAMs have been especially applied to face analysis, and we will therefore describe this technique in more detail in the following section.

2.3 Active Appearance Models

Active Appearance Models (AAM), introduced by Cootes et al .[41] as an exten- sion to Active Shape Models (ASM)[43], represent an approach that statistically describes not only the texture of an object but also its shape. Given a new im- age of the class of objects to analyze, the idea is here to interpret the object by synthesizing an image of the respective object while approximating as good as possible its appearance in the real image. It has mainly been applied to face analysis problems [60, 41]. Therefore, face images we will used in the following to illustrated this technique. Modeling the shape of faces appears to be helpful in most face analysis applications where the face images are subject to changes in pose and facial expressions.

2.3.1 Modeling shape and appearance

The basis of the algorithm is a set of training images with a certain number of an- notated feature points, so-called landmark points, i.e. two-dimensional vectors. Each set of landmarks is represented as a single vector x, and PCA is applied to the whole set of vectors. Thus any shape example can be approximated by the equation:

illustration not visible in this excerpt

where x is the mean shape, and Ps is the linear subspace representing the possible variations of shape parameterized by the vector bs. Then, the annotated control points of each training example are matched to the mean shape while warping the pixel intensities using a triangulation algorithm. This leads to a so-called shape-free face patch for each example.

illustration not visible in this excerpt

Figure 2.1: Active Appearance Models: annotated training example and corresponding shape-free patch

Figure 2.1 illustrates this with an example face image. Subsequently, a PCA is performed on the gray values g of the shape-free images forming a statistical model of texture:

illustration not visible in this excerpt

where g represents the mean texture, and the matrix Pg linearly describes the texture variations parameterized by the vector bg .

Since shape and texture are correlated, another PCA is applied on the concatenated vectors of bs and bg leading to the combined model:

illustration not visible in this excerpt

where c is a parameter controlling the overall appearance, i.e. both shape and texture, and Qs and Qg represent the combined linear shape-texture subspace. Given a parameter vector c, the respective face can be synthesized by first building the shape-free image, i.e. the texture, using equation 2.16 and then warping the face image by applying equation 2.15 and the triangulation algo-rithm used to build the shape-free patches.

2.3.2 Matching the model

Having built the statistical shape and texture models, the objective is to match the model to an image by synthesizing the approximate appearance of the object in the real image. Thus, we want to minimize:

illustration not visible in this excerpt

where Ii is the vector of gray-values of the real image and Im is the one of the synthesized image.

The approach assumes that the object is roughly localized in the input image, i.e. during the matching process, the model with its landmark points must not be too far away from the resulting locations.

Now, the decisive question is how to change the model parameters c in order to minimize Δ. A good approximation appears to be a linear model:

illustration not visible in this excerpt


where A is determined by a multi-variate linear regression on the training data augmented by examples with manually added perturbations. To calculate Ii − Im, the respective real and synthesized images are trans-formed to be shape-free using a preliminary estimate of the shape model. Thus, we compute:

illustration not visible in this excerpt

This linear approximation shows to perform well over a limited range of the model parameters, i.e. about 0.5 standard deviations of the training data. Finally, this estimation is put into an iterative framework, i.e. at each iter-ation we calculate:

illustration not visible in this excerpt

until convergence, where the matrix A is scaled such that it minimizes |δg|.

The final result can then be used, for example, to localize specific feature points, to estimate the 3D orientation of the object, to generate a compressed representation of the image or, in the context of face analysis, to identify the respective person, gender or facial expression.

Clearly, AAMs can cope with small local image transformations and ele- gantly model shape and texture of an object based on a preceding statistical analysis of the training examples. However, the resulting projection space can be rather large, and the search in this space, i.e. the matching process, can be slow. A fundamentally different approach to take into account local transfor- mations of a signal are Hidden Markov Models (HMM). This is a probabilistic method that represents a signal, e.g. an image, as a sequence of observations. The following section outlines this approach.

2.4 Hidden Markov Models

2.4.1 Introduction

Hidden Markov Models (HMM), introduced by Rabiner et al . [190, 191], are commonly used to model the sequential aspect of data. In the signal process- ing context for example, they have been frequently applied to speech recog- nition problems modeling the temporal sequence of states and observations, e.g. phonemes. An image can also be seen as a sequence of observations, e.g. image subregions, and here the image either has to be linearized into a one- dimensional structure or special types of HMMs have to be used, for example two-dimensional Pseudo HMMs or Markov Random Fields.

Being the most common approaches in image analysis, we will focus on 1D and Pseudo 2D HMMs in the following. The major disadvantage of “real” 2D HMMs is their relatively high complexity in terms of computation time.

A HMM is characterized by a finite number of states, and it can be in only one state at a time (as a finite state machine). The initial state probabilities define, for every state, the probability of the HMM being in that state at time t = 1. For each following time step t = 2..T it can either change the state or stay in the same state with a certain probability defined by the so-called transition probabilities. Further, in any state it creates an output from a pre-defined

illustration not visible in this excerpt

Figure 2.2: A left-right Hidden Markov Model

vocabulary with a certain probability, determined by the output probabilities. At time t = T the HMM will have produced a certain sequence of outputs, called observations; O = {o1, . . .,oT }. The sequence of states Q = {q1, . . . ,qT } it has traversed, however, is unknown (hidden) and has to be estimated by analyzing the observation whereas a single observation could be produced by different state sequences.

Fig. 2.2 illustrates a simple example of a HMM with 3 states. This type of HMM is called left-right model or Bakis model.

More formally we can describe a HMM as follows:

Definition 1 A Hidden Markov Model is defined as λ = {S, V, A, B, Π}, where

- S = {s1,...,sN} is the set of N possible states,
- V = {v1,...,vL} is the set of L possible outputs constituting the vocabu- lary,
- A = {aij}i,j=1..N is the set transition probabilities from state i to state j,
- B = {bi(l)}i=1..N,l=1..L define the output probabilities of output l in state i,
- Π = {π1,...,πN} is the set of initial state probabilities.

Note that

illustration not visible in this excerpt

Given a HMM λ, the goal is to determine the probability of a new observation sequence O = {o1, . . . ,oT }, i.e. P[O|λ]. For this purpose, there are several algorithms, the most simple one being explained in the following section.

2.4.2 Finding the most likely state sequence

There are many algorithms for estimating P [O|λ] and the most likely state sequence Q∗ = {q∗[1], ...,q∗T} having generated O. The most well known of these are called Viterbi algorithm and Baum-Welsh algorithm. Algorithm 1 describes the former which is a kind of simplification of the latter. Note that δti denotes

illustration not visible in this excerpt


Algorithm 1 The Viterbi algorithm for i = 1 to N do

the probability of being in state si at time t, and φti denotes the most probable preceding state being in si at time t. Thus, the φti store the most probable state sequence. The last loop allows to retrieve the final most likely state sequence Q∗ by recursively traversing φti.

When applying a HMM to a given observation sequence O it suffices for most applications to calculate P [O|λ] as stated above. The actual state sequence Q∗ however is necessary for the training process explained in the following section.

2.4.3 Training

In order to automatically determine and re-adjust the parameters of λ a set of training observations Otr = {ot1, . . . , otM } is used, and a training algorithm, for example algorithm 1, is applied to estimate the probabilities: P [Otr |λ] and P [Otr , qt = si|λ] for every state si at every time step t.

Then each parameter can be re-estimated by re-generating the observation sequences Otr and “counting” the number of events determining the respective parameter. For example, to adjust aij one calculates:

illustration not visible in this excerpt

The output probabilities B and the initial state probabilities Π are estimated in an analogical way. However, the number and topology of states S has to be determined experimentally in most cases.

2.4.4 HMMs for Image Analysis

HMMs are one-dimensional models and have initially been applied to the processing of audio data[190]. However, there are several approaches to adapt this technique to 2D data like images.

illustration not visible in this excerpt

Figure 2.3: Two simple approaches to image analysis with 1D HMMs

illustration not visible in this excerpt

Figure 2.4: Illustration of a 2D Pseudo-HMM

One of them[214] is to consider an image as a sequence of horizontal bands, possibly overlapping and spreading from top to bottom. Fig. 2.3(a) illustrates this. The HMM consequently has a left-right topology. Visual features of the image bands, e.g. pixel intensities, then correspond to the outputs of the HMM.

A similar approach is to partition the image into a set of blocks of predefined size. A one-dimensional sequence is then formed by concatenating the lines (or columns) of blocks. Fig. 2.3(b) illustrates this procedure. Additional constraints can be added in order to ensure that certain states correspond to the end of lines in the image.

Finally, an approach called 2D Pseudo-HMM uses a hierarchical concept of super-states, illustrated in Fig. 2.4. The super-states form a vertical 1D sequence corresponding to the lines (or bands) of the image. Each super-state in turn contains a 1D HMM modeling the sequence of horizontal observations (pixels or blocks) in a line. Thus, determining the hidden state sequence Q of an observation O implies a two-level procedure, i.e. first, to calculate the most likely sequence of super-states using the lines or bands of the image and, secondly, to determine the most likely sequence of sub-states corresponding to each line independently.

Obviously, HMMs are very suitable for modeling sequential data, and thus they are principally used in signal processing tasks. Let us now consider some more general machine learning techniques which do not explicitely model this sequential aspect but, on the other hand, can more easily and efficiently be applied to higher dimensional data such as images. Adaptive Boosting is one such approach and will be explained in the following section.

2.5 Adaboost

2.5.1 Introduction

Adaptive Boosting, short Adaboost, is a classification technique introduced by Freund and Schapire[70]. The basic idea here is to combine several “weak” classifiers into a single “strong” classifier, where the weak classifiers perform only slightly better than just random guessing.

The principle of the algorithm is to learn a global binary decision function by iteratively adding and training weak classifiers, e.g. wavelets networks or Neural Networks, while focusing on more and more difficult examples. It has been applied to many classification problems and has become a widely used machine learning technique due to its simplicity and performance in terms of classification rate and computation time.

2.5.2 Training

Let {(x1,y1), . . . ,(xm,ym)} be the training set where the xi ∈ X are the training examples and yi ∈ Y the respective class labels. We will focus here on the basic Adaboost algorithm where Y = {−1,+1} but extensions to multi-class classification have been proposed in the literature [71, 216].

The procedure is as follows: at each iteration t = 1..T a weak classifier ht : X → {−1,+1} is trained using the training examples weighted by a set of weights Dt(i), i = 1..m. Then, the weights corresponding to misclassified ex- amples are increased and weights corresponding to correctly classified examples are decreased. Thus, the algorithm focuses more and more on harder examples. The final decision H(x) calculated by the strong classifier is then a weighted sum of the weak decisions ht(x) where the weights αt are chosen to be inversely proportional to the error ǫt of the classifier ht, i.e. if the error is large the respec- tive classifier will have less influence on the final decision. Algorithm 2 describes the basic Adaboost algorithm. The variable Zt is a normalization constant in order to make Dt+1 a distribution.

illustration not visible in this excerpt

Thus, if γt > 0, i.e. each hypothesis is only slightly better than random, the training error drops exponentially fast.

Schapire et al .[215] also conducted theoretical studies in terms of the gen- eralization error. To this end, they define the margin of the training examples

Algorithm 2 The Adaboost algorithm 1: D1(i) = 1/m ∀i = 1..m

illustration not visible in this excerpt


rate, or structural risk R(α), is upper bounded by the training error rate, or empirical risk Remp and an additional term called VC-confidence which depends on the so-called Vapnik-Chervonenkis (VC)-dimension h of the classification function. More precisely, with the probability 1 − η, the following holds[246]:

illustration not visible in this excerpt

where α are the parameters of the function to learn and l is the number of training examples. The VC-dimension h of a class of functions describes its “capacity” to classify a set of training data points. For example, in the case of a two-class classification problem, if a function f has a VC-dimension of h there exists at least one set of h data points that can be correctly classified by f, i.e. assigned the label −1 or +1 to it. If the VC-dimension is too high the learning machine will overfit and show poor generalization. If it is too low, the function will not sufficiently approximate the distribution of the data and the empirical error will be too high. Thus, the goal of SRM is to find a h that minimizes the structural risk R(α), which is supposed to lead to maximum generalization.

2.6.2 Linear Support Vector Machines

Vapnik[246] showed that for linear hyperplane decision functions:

illustration not visible in this excerpt

the VC-dimension is determined by the norm of the weight vector w.

Let {(xi,yi), . . . ,(xl,yl)} (xi ∈ Rn, yi ∈ {−1,+1}) be the training set. Then, for a linearly separable training set we have:

illustration not visible in this excerpt

The margin between the positive and negative points is defined by two hyper- planes x · w + b = ±1 where the above term actually is zero. Fig. 2.5 illustrates this. Further, no points lie between these hyperplanes and the width of the mar- gin is 2/||w||. The support vector algorithm now tries to maximize the margin by minimizing ||w||, which is supposed to be an optimal solution, i.e. where generalization is maximal. Once the maximum margin is obtained, data points lying on one of the separating hyperplanes, i.e. for which equation 2.31 yields zero, are called support vectors (illustrated by double circles in Fig. 2.5).

To simplify the calculation, the problem is formulated in a Lagrangian frame- work (see[246] for details). This leads to the maximization of the Lagrangians:

illustration not visible in this excerpt

Figure 2.5: Graphical illustration of a linear SVM

illustration not visible in this excerpt

where αi (i = 1..l) are the Lagrangian multipliers that are to be determined. Further, the solutions to αi and condition 2.31 imply a value for b. Note that all αi are zero except those corresponding to the support vectors.

Finally, new examples can simply be classified using the decision function 2.30.

In many cases, however, the training data cannot be completely separated because of some “outliers”. Then, we might simply loosen the constraint 2.31 by introducing the constants ξi > 0 in the following way:

illustration not visible in this excerpt

and condition 2.35 becomes

illustration not visible in this excerpt

2.6.3 Non-linear Support Vector Machines

In order to use a non-linear decision function, the above formulas can quite easily be generalized. Boser et al.[23] proposed a simple method based on the so-called kernel trick. That is, before applying the dot product xi·xj in equation

2.32 the d-dimensional data is projected into a higher dimensional space where it is supposed to be linearly separable. Thus, a function Φ : Rd → H is defined and xi · xj becomes Φ(xi ) · Φ(xj ). Now, instead of calculating Φ each time we use a kernel function K(xi, xj) = Φ(xi ) · Φ(xj ), i.e. each occurrence of the dot product is replaced by K(·, ·). Thus, if we want to classify a new data point s the decision function

illustration not visible in this excerpt


illustration not visible in this excerpt

With the kernel function K we don’t need to calculate Φ or H but we must know if for a given K there exists a mapping Φ and some space H in which K is the dot product K(xi, xj) = Φ(xi) · Φ(xj ). This property is ensured by the Mercer’s condition[246]:

Theorem 1 There exists a mapping Φ and an expansion ∑

illustration not visible in this excerpt

2.6.4 Extension to multiple classes

Up to this point, we only considered two-class problems. However, there are simple ways to extend the SVM method to several classes. One approach, called one-against-all, consists in training one classifier for each class that distinguishes between the examples of that class and the examples of all other classes. Thus, the number of SVMs equals the number of classes n.

Another approach trains a SVM for each possible pair of classes. To classify an example, it is input to each SVM and the class label corresponding to the maximal number of “winning” SVMs represents the final answer. The number of classifiers needed by this approach is n(n−1)/2, which is a drawback in terms of complexity compared to the first approach.

2.7 Bag of Local Signatures

As opposed to SVMs, being a very general classification technique, an approach called Bag-of-Local-Signatures (BOLS) has recently been introduced by Csurka et al .[51] for image classification problems, particularly object detection and

illustration not visible in this excerpt

Figure 2.6: The histogram creation procedure with the Bag-of-local-signature approach: a) input image I, b) detected salient points, c) extracted local signatures, d) quantized vectors (dictionary entries), e) histogram h(I). recognition. It was motivated by the bag-of-words approach for text categorization which simply counts the number of pre-defined key words in a document in order to classify it into one of several categories.

In the first step of the BOLS method, n salient points pi = (xi , yi) of the input image are detected using an interest point detection algorithm, e.g. the Harris affine detector[162]. The small image region around each detected point is then represented by some local descriptors, such as the Scale-Invariant Feature Transform (SIFT) descriptors[148], leading to a local signature si for each salient point.

In the following step, the extracted signatures are classified applying any kind of vector quantization method. To this end, a dictionary of k representative signatures dj (j = 1..k) is calculated from the training set using a clustering algorithm. For example, Csurka et al .[51] used the k-means clustering algorithm and Ros et al .[199] used a Self-Organizing Map (SOM).

Thus, for an image I to classify a bag of local signatures vi, i.e. entries of the dictionary, is obtained representing the appearance of the object in the image. However, for two different images of the same object the respective representations might differ due to the varying appearance in different views or partial occlusions making an efficient comparison difficult.

Therefore, discrete histograms h(I) of the bag of local signatures vi are calcu- lated by simply counting the number of occurrences of the respective signatures. Finally, the histograms can be classified by using classical histogram distance measures, such as χ[2] or the Earth Mover’s Distance (EMD) or by training a classifier on the vectors obtained from the histogram values, such as a Bayes classifier or SVMs[51].

Figure 2.6 illustrates the overall procedure for generating the Bag-of-Local- Signatures representation. A major advantage of this approach compared to statistical projection methods, for example, is its robustness to partial occlusions and to changing pose of the object to recognize. This is due to the purely local representation and the rotation- and scale-invariant description of the local image patches.

As this technique is a relatively new approach in the field of machine learning and very specific to image classification we won’t describe it here in more detail. We will rather concentrate on a very versatile and powerful machine learning technique constituting the basis for all of the face analysis approaches proposed in this work, namely Artificial Neural Networks.

illustration not visible in this excerpt

Figure 2.7: The Perceptron

2.8 Neural Networks

2.8.1 Introduction

Artificial Neural Networks (ANN), short Neural Networks (NN), denote a ma- chine learning technique that has been inspired by the human brain and its capacity to perform complex tasks by means of inter-connected neurons per- forming each a very simple operation. Likewise, a NN is a trainable structure consisting of a set of inter-connected units, each implementing a very simple function, and together eventually performing a complex classification function or approximation task.

2.8.2 Perceptron

The most well known type of neural unit is called Perceptron and has been introduced by Rosenblatt[200]. Its basic structure is illustrated in Fig. 2.7. It has n inputs and one output where the output is a simple function of the sum of the input signals x weighted by w and an additional bias b. Thus,

illustration not visible in this excerpt

Often, the bias is put inside the weight vector w such that w0 = b and the input vector x is extended correspondingly to have x0 = 1. Equation 2.46 then becomes:

illustration not visible in this excerpt

where φ is the Heavyside step function: φ:R→R

illustration not visible in this excerpt

The Perceptron thus implements a very simple two-class classifier where w is the separating hyperplane such that w · x ≥ 0 for examples from one class and w · x < 0 for examples from the other.

In 1962, Rosenblatt introduced the perceptron convergence theorem[201], a supervised training algorithm capable of learning arbitrary two-class classi- fication problems. However, Minsky and Papert[163] pointed out that there are very simple classification problems where the perceptron fails, namely when the two classes are not linearly separable like in the XOR-problem, where the

illustration not visible in this excerpt

Figure 2.8: A Multi-Layer Perceptron

pattern (0, 0) and (1, 1) belong to one class and (0, 1) and (1, 0) to the other. This motivated the use of several interconnected perceptrons which are able to form more complex decision boundaries by combining several hyperplanes. The most common type of these NNs is the Multi-Layer Perceptron described in the following section.

2.8.3 Multi-Layer Perceptron

Multi-Layer Perceptrons (MLP) are capable of approximating arbitrarily complex decision functions. With the advent of a practicable training algorithm in the 1980’s, the so-called Backpropagation algorithm[208], they became the most widely used form of NNs.

Fig. 2.8 illustrates the structure of a MLP. There is an input layer, one or more hidden layer(s) and an output layer of neurons, where each neuron except the input neurons implements a perceptron as described in the previous section. Moreover, the neurons of one layer are only connected to the following layer. We call this type of network: feed-forward network, i.e. the activation of the neurons is propagated layer-wise from the input to the output layer. And if there is a connection from each neuron to every neuron in the following layer, as in Fig. 2.8, the network is called fully connected. Further, the neurons’ activation function has to be differentiable in order to adjust the weights by the Backpropagation algorithm. Commonly used activation functions are for example:

illustration not visible in this excerpt

for any θ > 0, where Pr[·] denotes the empirical probability on the training set and d the VC-dimension of the weak classifiers.

Adaboost is a very powerful machine learning technique as it can turn any weak classifier into a strong one by linearly combining several instances of it. A completely different classification approach called Support Vector Machine (SVM) is based on the principal of Structural Risk Minimization which not only tries to minimize the classification error on the training examples but also takes into account the ability of the classifier to generalize to new data. The following section explains this approach in more detail.

2.6 Support Vector Machines

2.6.1 Structural Risk Minimization

The classification technique called Support Vector Machine (SVM) [23, 246, 44] is based on the principle of Structural Risk Minimization (SRM) formulated by Vapnik et al .[245]. One of the basic ideas of this theory is that the test error

illustration not visible in this excerpt



ISBN (eBook)
ISBN (Buch)
6.8 MB
Institution / Hochschule
Albert-Ludwigs-Universität Freiburg – Lehrstuhl für Mustererkennung und Bildverarbeitung
Convolutional Neural Networks Face Detection Face Recognition Gender Recognition Face Alignment Neural Networks Machine Learning Object Recognition Computer Vision Deep Learning




Titel: Face Image Analysis with Convolutional Neural Networks