Wearable Face Detection

Computer Vision Final Project
Bradley A. Singletary


I.    Motivation

Face detection is a tough problem. It requires that a computer be able to decide whether a certain mildly deformable albedo and mildly deformable 3D surface structure are present in a 2D image. Lighting, noisy sensor response, poor dynamic range, target motion, and target occlusion are the primary sources of detection failure. Many solutions to this problem have emerged in the past 10 years of computer vision literature. The most recent detectors are starting to look functional enough to be deployed, with useful results, in building security applications and indoor smart environments [1,2,3,4].

These detectors are typically developed to function only in the environments where they are trained and installed. The minute something challenging comes along (the lights are switched off, the face is occluded during an important event, or you want to recognize faces during any part of your daily routine no matter where you are), these environmental smart services will likely break. Furthermore, putting expert hidden observers or cameras into the environment can violate the network of trust that already exists between humans during personal interactions.

I view giving someone the ability to recognize faces on demand over a wide range of lighting conditions as a very powerful thing. Someone using this technology would suddenly be able to never forget a face and, in some cases, to remember one for the first time (cf. face blindness). This problem also attacks the poorly developed area of computer vision where both target and camera are in motion. Moving-camera, moving-target problems occur repeatedly for intelligent mobile and wearable computational devices, which ask, with the goal of better serving a user: Where am I? What am I doing? Who am I interacting with? What is relevant? What did I do before?

So, for my project, I've worked towards better defining the wearable face recognition/detection problem, and applied some computer vision and pattern recognition techniques to try to make anytime face recognition possible. I have two minor results from the class project that I will detail: 1) the front of the face has a characteristic frequency response under perspective projection, and 2) face detection can be avoided by engineering, detecting, and understanding human behavior.

To do these experiments I needed training and test images of people's faces captured from a head-mounted camera. I used the capture vest (shown in the images section below) to grab images of people's faces during the course of a conference. The conference provided a dense base of humans for extended interaction and some very challenging lighting conditions. Don't miss the AVIs below. They help you understand what meeting someone is like, and how a day of heavy interaction might be summarized.
 

II.    Approach

I originally set out to make a broad comparison between various existing face databases. I wrote some 3D histogram comparison utilities for comparing the color content of the databases by various metrics. While working closely with the data, I came across an idea that could make face detection either highly constrained or maybe even moot. Why not learn when a wearer is trying to remember someone's identity or when they are about to query someone for their identity? Assuming the user has some sort of behavior that may be recovered simply from head/body motion and 2D scene content, this may be possible. Wearables offer a unique perspective: the wearer not only limits the scope of the recognition task but also helps with the detection process directly (as long as it's not too much bother).

If you watch the AVI below, you will see the wearer approach an individual standing in the middle of an open space. The scene background is cluttered with many different sized and shaped objects. As the wearer approaches, intent upon re-introduction to the subject, the face expands to fill a sizeable portion of the screen. Furthermore, the background is now out of focus with respect to the face. This apparent difference in focus denotes a difference between the frequencies spanned by the target face and those spanned by the background. So we draw the logical conclusion, for this simple example, that we could just filter the image and keep only the frequencies that comprise the face. So, how do we know which frequencies to filter for in general? I haven't tried this yet, though I assume the problem is ill-posed due to lighting, motion blur, and camera pixel defects. I do believe, though, that it can be solved for special-case recognition tasks where the camera is static.

To bandpass the face I used the 2D-DFFT in the FFTW library to derive the frequency content of the image (example 1D FFT image borrowed from FFTW's documentation). Simple 2D FFTs and linear systems theory are covered in [5]. I then zeroed out the power and phase portions of the transform wherever the Euclidean distance from the pole (the zero-frequency origin) of the transform was above the high watermark or below the low watermark of the bandpass. This band was picked to remove several very low and many very high frequencies. Bandpass filters look like donuts in 2D (see the animated FFT GIF below). Next, I picked off stragglers left in the band that had power below a certain threshold (assuming they were fine-grain noise). Then the image was reconstructed using the inverse 2D-DFFT. This leaves you with an image that appears blurred overall but keeps fine detail in the regions where the face and other matching-frequency regions are. So a difference between the now-blurred image and the original image will reveal the regions that were least affected by the bandpass. The animated GIFs below show two things occurring: the face moves from obscurity into good detail, and the FFT compresses and assumes a specific structure as the smooth face dominates the screen. As predicted, this did not generalize well, and needed to be tuned for different lighting environments.
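The filtering steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses NumPy's FFT rather than FFTW, and the function name, the parameter names (low, high, power_floor), and any values passed to them are hypothetical, not the settings used in the experiments.

```python
import numpy as np


def bandpass_difference(image, low, high, power_floor):
    """Squared difference between an image and its band-passed copy.

    low, high   -- inner and outer radius of the passband, in frequency bins
                   (hypothetical names for the two watermarks)
    power_floor -- coefficients left in the band with power below this
                   value are treated as noise and zeroed
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape

    # Forward 2D FFT, shifted so the zero-frequency term sits at the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Euclidean distance of every coefficient from the zero-frequency term.
    y, x = np.ogrid[:rows, :cols]
    radius = np.hypot(y - rows // 2, x - cols // 2)

    # Keep only the ring (the "donut") between the two watermarks.
    in_band = (radius >= low) & (radius <= high)
    spectrum = np.where(in_band, spectrum, 0)

    # Pick off weak stragglers that survived inside the band.
    power = np.abs(spectrum) ** 2
    spectrum = np.where(power >= power_floor, spectrum, 0)

    # Inverse transform. The input is real and both masks are symmetric,
    # so the imaginary part is only rounding noise and can be dropped.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum)).real

    return (image - filtered) ** 2
```

A call such as bandpass_difference(frame, low=2, high=8, power_floor=0.0) on a 2D grayscale array returns an array of the same shape. Small values mark regions whose content lies mostly inside the band.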

So even though lighting varies significantly from environment to environment, we might be able to recognize when frequency content varies in a particular way, and to discover that people drive said variation in a learnable fashion. If this is true, then face detection could be learned as a social behavior and performed by feeding an inexpensive local FFT output into some time-segmenting/classifying mechanism like the HMM, which could be far cheaper than many methods listed in the bibliography.
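To make the classification idea concrete, here is a minimal sketch of scoring a feature sequence against one HMM per behavior class and picking the more likely one. Everything in it is an assumption made for illustration: the left-to-right topology with a fixed stay probability, the diagonal Gaussian emissions, the hand-set parameters, and all function names. A toolkit such as HTK would estimate the parameters from training data instead, and this sketch does not force the path to end in the final state.

```python
import numpy as np


def left_to_right_transitions(n_states, stay=0.6):
    """Transition matrix in which a state either repeats or advances by one."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay
        A[i, i + 1] = 1.0 - stay
    A[-1, -1] = 1.0  # the last state can only repeat
    return A


def log_gaussian(obs, means, variances):
    """Log density of each frame under each state's diagonal Gaussian.

    obs: (T, D) frames; means, variances: (N, D); returns (T, N).
    """
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2.0 * np.pi * variances), axis=2)


def log_likelihood(obs, A, means, variances):
    """Forward algorithm in the log domain; the model starts in state 0."""
    log_b = log_gaussian(obs, means, variances)
    with np.errstate(divide="ignore"):
        log_A = np.log(A)  # forbidden transitions become -inf
    alpha = np.full(A.shape[0], -np.inf)
    alpha[0] = log_b[0, 0]
    for t in range(1, len(obs)):
        # Sum over predecessor states, then add the emission term.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)


def classify(obs, models):
    """Name of the model that gives the sequence the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

Here models is a dictionary mapping a class name to a tuple (A, means, variances), and obs is a (T, D) array holding one feature vector per video frame.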

To try this idea out, I created two classes: APPROACH (which represents the social behavior of approaching someone to obtain their visual identity) and OTHER. I used 6-state left-to-right (L-R) HMMs to discriminate between the APPROACH and OTHER gesture classes. This example HMM looks something like mine, and was borrowed from Microsoft HTK's documentation:

These models are great because HTK handles many of the details of learning to recognize your data, if this is possible, though HTK is awfully cranky about how it acquires data and how much data it needs to converge. The features I used to train the APPROACH and OTHER gestures were the bins of a power histogram with respect to frequency (I collapsed the 2D power spectra onto the Euclidean distance from the pole). This feature set should show frequency response, and connote change in frequency response, though it is slightly biased because of the L2 metric used for distance from the pole.
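The feature described above, a power histogram over radial frequency, might look roughly like the following. This is a sketch under assumptions: it uses NumPy's FFT, and the number of bins, the equal-width rings, and the normalization to unit sum are illustrative choices that the text does not specify.

```python
import numpy as np


def radial_power_histogram(image, n_bins=16):
    """Collapse the 2D power spectrum onto distance from zero frequency.

    Returns n_bins values that sum to 1 (or all zeros for a blank image).
    n_bins and the normalisation are assumptions made for this sketch.
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape

    # Power spectrum with the zero-frequency term shifted to the centre.
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2

    # Euclidean (L2) distance of every coefficient from the centre.
    y, x = np.ogrid[:rows, :cols]
    radius = np.hypot(y - rows // 2, x - cols // 2)

    # Sum the power falling into equal-width rings of radius.
    hist, _ = np.histogram(radius, bins=n_bins,
                           range=(0.0, radius.max()), weights=power)

    total = hist.sum()
    return hist / total if total > 0 else hist
```

Computing this vector for each frame of a clip gives a sequence of feature vectors of the kind a sequence classifier could consume. Because the rings have equal width, outer rings cover more coefficients than inner ones, which is one way the L2 metric can bias the bins.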
 

III.    Results

It turns out that the OTHER garbage class is too complicated to be fit by one HMM and perform with any style. My training error was 1% for the APPROACH class and 30% for the OTHER class. My test error was 12% for the APPROACH class and 50% for the OTHER class. If I had had time to generate enough quality results, I would have used the following equations to evaluate the performance of my APPROACH gesture detector:

N=number of examples
D=deletion errors
I=insertion errors
S=substitution errors

Percent_Correct=100% * (N-D-S)/N
Percent_Accuracy=100% * (N-D-S-I)/N
 

And I would have provided confusion matrices.  Unfortunately, untimely complications with HTK barred pushing the approach gesture detector further by the deadline.
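The two scores defined above are simple to compute once the error counts are known. The sketch below only illustrates the arithmetic; the function names are hypothetical and the counts in the example are made up, not results from this project.

```python
def percent_correct(n, deletions, substitutions):
    """Percent_Correct = 100% * (N - D - S) / N."""
    return 100.0 * (n - deletions - substitutions) / n


def percent_accuracy(n, deletions, substitutions, insertions):
    """Percent_Accuracy = 100% * (N - D - S - I) / N."""
    return 100.0 * (n - deletions - substitutions - insertions) / n


# Made-up counts, for illustration only.
print(percent_correct(100, 5, 10))      # 85.0
print(percent_accuracy(100, 5, 10, 8))  # 77.0
```

Percent_Accuracy is the stricter of the two because it also penalizes insertions, so it can drop below zero when a detector fires far more often than it should.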

IV.    Future Work

The first thing I would like to try next is adding my work with RGB histograms into the mix. They should integrate easily as a second stream in the HMM, or as part of a joint feature with power.
Also, I would distill more compact features (for instance, compute the entropy of the phase portion of the FFT, or use simple optical flow parameters) and cluster more HMMs to represent APPROACH and OTHER.
If a solid technique for this method can be found, I can represent other objects and their approaches by their human gesture.

The following vest was used to capture all data used in the experiments. Details on the vest may be found at the following page:





The following pair of images is a frequency study of the face in a dynamic environment. The face image is segmented appropriately when the camera moves within a certain distance. These are animated GIFs; if you missed the animation, reload your page by centering the animations on the page, then holding shift and clicking reload in your browser. Hopefully, the static image gets the point across if you can't view the animation.


[Original - IFFT(Thresh(Bandpass(FFT(Original))))]^2 power spectrum filter.

AVI Files
 AVI movie of lineup gesture
 AVI movie of faces extracted from 1 hour of video



Bibliography
 
  1. Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), January 1998.
  2. Henry Schneiderman and Takeo Kanade. A statistical method for 3D object detection applied to faces and cars. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2000.
  3. Kah-Kay Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), January 1998.
  4. M. Lew and N. Huijsmans. Information theory and face detection. Proceedings of the International Conference on Pattern Recognition, August 1996.
  5. Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision, Second Edition. ITP, 1999.