These detectors are typically developed to function only in the environments where they are trained and installed. The minute something challenging comes along (the lights are switched off, an occlusion happens during an important event, or you want to recognize faces during any part of your daily routine no matter where you are), these environmental smart services will likely break. Furthermore, putting hidden expert observers or cameras into the environment can violate the network of trust that already exists between humans during personal interactions.
I view giving someone the ability to recognize faces on demand over a wide range of lighting conditions as a very powerful thing. Someone using this technology would suddenly be able to never forget a face and, in some cases, to remember one for the first time (cf. face blindness). This problem also attacks the poorly developed area of computer vision where both target and camera are in motion. Moving-camera, moving-target problems occur repeatedly for intelligent mobile and wearable computational devices, which ask: Where am I? What am I doing? Who am I interacting with? What is relevant? What did I do before? All of these questions are asked with the goal of better serving a user.
So, for my project, I have worked towards better defining the wearable face recognition/detection problem, and applied some computer vision and pattern recognition techniques to try to make anytime face recognition possible. I have two minor results from the class project that I will detail: 1) the front of the face has a characteristic frequency response under perspective projection; 2) face detection can be avoided by engineering, detecting, and understanding human behavior.
To do these experiments I needed training and test images of people's faces that were captured from a head-mounted camera. I used the capture vest (shown in the images section below) to grab images of people's faces during the course of a conference. The conference provided a dense base of humans for extended interaction and some very challenging lighting conditions.
Don't miss the AVIs below. They help convey what meeting someone is like, and how a day of heavy interaction might be summarized.
If you watch the AVI below, you will see the wearer approach an individual standing in the middle of an open space. The scene background is cluttered with objects of many different sizes and shapes. As the wearer approaches, intent upon re-introduction to the subject, the face expands to fill a sizeable portion of the screen. Furthermore, the background is now out of focus with respect to the face. This apparent difference in focus denotes a difference between the frequencies spanned by the target face and those spanned by the background. So we draw the logical conclusion, for this simple example, that we could just filter the image and keep only the frequencies that comprise the face. So, how do we know which frequencies to filter for in general? I haven't tried this yet, though I assume the problem is ill-posed due to lighting, motion blur, and camera pixel defects. I do believe, however, that it can be solved for special-case recognition tasks where the camera is static.
To band-pass the face I used the 2D FFT in the FFTW library (example 1D FFT image borrowed from FFTW's documentation).
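As a rough illustration of the band-pass step, here is a minimal Python sketch. It uses NumPy's FFT routines rather than FFTW, and the function name `bandpass_2d` and the cutoff parameters `low` and `high` are assumptions made for the example, not values taken from the experiments.

```python
import numpy as np


def bandpass_2d(image, low, high):
    """Keep only spatial frequencies whose radius lies in [low, high].

    Frequencies are in cycles per pixel, so they range from 0 (DC) to
    about 0.5 along each axis. `low` and `high` are hypothetical cutoffs.
    """
    rows, cols = image.shape
    fy = np.fft.fftfreq(rows)[:, None]   # vertical frequency of each bin
    fx = np.fft.fftfreq(cols)[None, :]   # horizontal frequency of each bin
    radius = np.sqrt(fx ** 2 + fy ** 2)  # distance from DC in the frequency plane

    # Binary ring mask: 1 inside the pass band, 0 elsewhere.
    mask = (radius >= low) & (radius <= high)

    spectrum = np.fft.fft2(image)
    filtered = np.fft.ifft2(spectrum * mask)

    # The mask is symmetric, so a real image yields a real result
    # up to rounding error; drop the tiny imaginary part.
    return np.real(filtered)
```

Calling `bandpass_2d(frame, 0.05, 0.25)` on a grayscale frame would suppress both the slowly varying illumination and the finest detail. Which band actually isolates a face at a given distance is exactly the open question raised above.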

So, even though lighting varies significantly from environment to environment, we might be able to recognize when frequency content varies in a particular way, and to discover that people drive this variation in a learnable fashion. If this is true, then face detection could be learned as a social behavior and performed by feeding an inexpensive local FFT output into some time-segmenting/classifying mechanism such as an HMM, which could be far cheaper than many of the methods listed in the bibliography.
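One way to picture the "inexpensive local FFT output" is as a short feature vector per video frame. The sketch below is only an assumption about what such a feature could look like: it sums spectral power in a few radial frequency bands and takes the log. The function names, the number of bands, and the use of log energy are all illustrative choices, not the features used in the project.

```python
import numpy as np


def band_energy_features(frame, n_bands=4):
    """Log spectral energy of a grayscale frame in n_bands radial bands."""
    rows, cols = frame.shape
    # Remove the mean so the DC term does not dominate the lowest band.
    power = np.abs(np.fft.fft2(frame - frame.mean())) ** 2

    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Equal-width bands between 0 and 0.5 cycles per pixel; the corner
    # bins with radius above 0.5 are clipped into the last band.
    edges = np.linspace(0.0, 0.5, n_bands + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, n_bands - 1)

    energy = np.bincount(band.ravel(), weights=power.ravel(), minlength=n_bands)
    return np.log(energy + 1e-12)  # small offset avoids log(0)


def video_to_features(frames, n_bands=4):
    """Stack per-frame features into a (T, n_bands) observation sequence."""
    return np.stack([band_energy_features(f, n_bands) for f in frames])
```

The resulting `(T, n_bands)` array is the kind of time series that a sequence classifier could segment and label.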
To try this idea out, I created two classes: APPROACH (which represents the social behavior of approaching someone to obtain their visual identity) and OTHER. I used 6-state left-to-right (L-R) HMMs to discriminate between the APPROACH and OTHER gesture classes. This example HMM looks something like mine, and was borrowed from Microsoft HTK's documentation:

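For readers without HTK at hand, the left-to-right topology and the two-class decision rule can be sketched in plain NumPy. This is a stand-in under stated assumptions, not the HTK configuration used here: the `stay` probability, the diagonal Gaussian emissions, and all function names are hypothetical, and training (Baum-Welch) is omitted, so model parameters must be supplied by hand.

```python
import numpy as np


def left_to_right_transitions(n_states=6, stay=0.6):
    """Transition matrix in which a state either repeats or advances by one."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay
        A[i, i + 1] = 1.0 - stay
    A[-1, -1] = 1.0  # the final state can only repeat
    return A


def log_gaussian(obs, means, variances):
    """Log density of obs (T, D) under each state's diagonal Gaussian (S, D)."""
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)


def log_likelihood(obs, A, means, variances):
    """Forward algorithm in the log domain; the model starts in state 0."""
    log_b = log_gaussian(obs, means, variances)  # (T, S) emission scores
    with np.errstate(divide="ignore"):
        log_A = np.log(A)                        # forbidden moves become -inf
    alpha = np.full(A.shape[0], -np.inf)
    alpha[0] = log_b[0, 0]
    for t in range(1, len(obs)):
        # Sum over predecessor states, then add the emission score.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)


def classify(obs, models):
    """Return the label whose HMM assigns obs the highest log-likelihood.

    `models` maps a label to an (A, means, variances) tuple.
    """
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))
```

With one model per class, `classify(features, {"APPROACH": approach_model, "OTHER": other_model})` picks the better-scoring class for an already segmented sequence. Segmenting a continuous stream would need a decoder on top of this.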
N=number of examples
D=deletion errors
I=insertion errors
S=substitution errors
Percent_Correct=100% * (N-D-S)/N
Percent_Accuracy=100% * (N-D-S-I)/N
And I would have provided confusion matrices. Unfortunately, untimely complications with HTK prevented me from pushing the approach gesture detector further before the deadline.
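The two measures defined above are simple to compute once the error counts are known. The helper below is a small sketch; the function name is hypothetical, and any counts passed to it are the caller's own, since no results were obtained here.

```python
def recognition_scores(n, deletions, substitutions, insertions):
    """Return (percent_correct, percent_accuracy) from error counts.

    percent_correct  = 100 * (N - D - S) / N
    percent_accuracy = 100 * (N - D - S - I) / N
    """
    if n <= 0:
        raise ValueError("n must be a positive number of examples")
    hits = n - deletions - substitutions
    percent_correct = 100.0 * hits / n
    percent_accuracy = 100.0 * (hits - insertions) / n
    return percent_correct, percent_accuracy
```

Because insertions are subtracted only in the accuracy figure, accuracy can never exceed correctness, and it can go negative when a detector fires far too often.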
The following vest was used to capture all data used in the experiments. Details on the vest may be found at the following page:


The following pair of images is a frequency study of the face in a dynamic environment. The face image is segmented appropriately when the camera moves within a certain distance. These are animated GIFs. If you missed the animation, reload the page: center the animations on the page, then hold Shift and click Reload in your browser. Hopefully, the static image gets the point across if you can't view the animation.

[Original - IFFT(Thresh(Bandpass(FFT(Original))))]^2 power spectrum filter.
AVI Files
AVI Movie of lineup gesture
AVI Movie of faces extracted from 1 hour of video