FLEXIBLE IMAGING SYSTEMS
Report 5/2004
DARPA HID program contract number N00014-00-1-0929
Principal Investigators:
Terrence E. Boult
Department of Computer Science
University of Colorado at Colorado Springs
1420 Austin Bluffs Parkway
Colorado Springs CO 80933-7150
Fax 719 262 3900
email
Shree K. Nayar
1214 Amsterdam Avenue
Department of Computer Science, Columbia University
New York, NY 10027.
Phone: 212-939-7092
Fax: 212-939-7172
Email:
2. Omni-directional and Super wide field-of-view Sensors for HID
4. Synthetic Sensor Evaluations
Typical pan-tilt-zoom sensors have a five to fifty degree field of view
and track a single target. Locating and retaining a facial target over a
large range is not only difficult but also limited by sensor parameters.
To address these limitations, this project designed, developed, and
implemented new catadioptric omnidirectional sensors capable of a large
effective optical zoom. The fundamental feature underlying the sensors is
the ability to transition between an omnidirectional image and a fine
resolution image for use in an HID system.
The first sensor, dubbed the Zoomnicam, uses a combination of
catadioptrics and a physical zoom. By physically zooming into different
parts of an omnidirectional image, the effective Zoomnicam field of view
ranges from 360 down to 15 degrees.
The second alternative is to use Mega-Pixel digital still cameras. Our
hypothesis is that if the resolution on the face is approximately the
same, these sensors will have the same facial recognition performance as
one another, and that this performance will be comparable to that of
traditional cameras.
Field of view, however, is not the only sensor parameter that impacts
recognition. The project evaluated a number of potential sensor
parameters (blur, gamma, compression, and dynamic range) for their impact
on face recognition. These synthetic sensor evaluations helped direct
other aspects of the project and also answered important questions about
the impact of blur, gamma, dynamic range and compression on the resulting
images.
Our analysis and experimentation made it very clear that dynamic range
was a significant issue, and it is probably a major factor in why outdoor
face recognition is so much weaker than indoor. This led us to the design
and development of sensors for adaptive dynamic range, and to the study of
other approaches for improving dynamic range for HID.
Finally, there is the need for a method to evaluate the efficacy of the
new sensors. Previous HID system evaluation has focused on algorithm
evaluation. In algorithm evaluation, the only parameters varied are the
algorithms themselves. Sensor evaluation introduces significantly more
degrees of freedom in HID system input and, more importantly, makes it
impossible to test with identical inputs. To address these issues, we
developed a new evaluation paradigm defined with statistical confidence
measures. In addition, we developed techniques to predict when
face-recognition systems were going to fail, then found ways, at the
system level, to help reduce those failures.
We briefly review the project results in each of these
areas. For details see the papers in
the reference sections, which also relate our work to others in the field.
2. Omni-directional and Super wide field-of-view Sensors for HID
For flexible imaging we have undertaken 3 data collections
(omni-directional, zoomnicam and long-distance), with the resulting data
submitted as part of the DARPA HBASE and also available directly from
Dr. Boult. These are in addition to the much larger photo-head dataset,
which is used both in this project and in the Columbia-led Vision in Bad
Weather project. Combined, these were almost a Terabyte of test data
exploring sensor design, resolution, lighting, distance and weather
effects. We briefly review those sensors, the experiments and the results.

The first of our non-traditional sensor projects sought to address the
question of how well an omni-directional camera could be used for face
recognition. To allow it to operate over a wider range of distances we
chose to use a 3.1 Megapixel camera, the Nikon 990. The omni-directional
images were obtained using Remote Reality’s OneShot lens attached to the
Nikon 990. Dr. Boult and students developed Linux software that
controlled the Nikon 990 camera and combined video-rate person tracking
(using its analog TV output) with high-resolution image capture to provide
face images suitable for recognition. While the camera supported
uncompressed TIFFs, the project always used the high-resolution JPEG
format. This software simplified our data collection process and was used
in four different data collections at NIST. In all but one collection the
subject stood at fixed distances to support repeatable measurements.

In the collections there are
multiple images of each subject at each setting with over 8000 images in
total. Variations included view angle,
lighting (artificial and natural), distance and time. Standard camera images of
each subject were also available.

Two examples of the analysis are shown above, with the example subject's
gallery image shown in the middle. The left graph shows face recognition
performance at an off-angle view of 10 degrees as additional lights are
added. The images were taken with a fixed aperture and shutter speed so
that the brightness variations are not masked by automatic gain controls.
The three graphs are for variations of the matching algorithms in the
FaceIt SDK, with F13 being the algorithm using full template matching and
the other two using smaller (faster) templates. The error bars show 95%
confidence intervals computed using the BRR technique we presented in
[Micheal-Boult-01]. With two lights, consistent with what is used in the
standard images and gallery, the recognition rate for the unwarped
omni-directional images is approximately 90%, consistent with off-axis
regular images of comparable resolution. The second example, the graph on
the right, shows the impact of distance on recognition from
omni-directional cameras in ambient lighting conditions. The images are
clearly quite dark, yet the recognition rates are reasonable. It is
important to remember that these are omni-directional images, with the
face image cropped from them, so one might interpret the results as
suggesting the system could recognize 60% of all the people within 12ft of
the camera, and 70% of those within 6ft. Combining these distance results
with the lighting results, it is clear that in a well-lit area the system
can increase that to near 90% recognition. The results show that
omni-directional sensors have potential for human identification at a
distance.

The second flexible imaging sensor explored was an omni-directional
system capable of zooming, or a zoomnicamera. Design and implementation of
the zoomnicam prototype was done at Columbia. The unit uses a Sony
DFW-V500 color zoom camera and a relay lens to make the imaging
approximately telecentric and allow it to focus on the nearby parabolic
mirror. The mirror was mounted on an xy-stage with 4” of motion, though
most motions were much smaller and hence much faster than those of a
traditional Pan/Tilt system. Dr. Nayar and the team at Columbia developed
a controller for the stage and calibration software that allowed unwarping
the resulting image to a perspectively correct image. The unit was
transferred to Dr. Boult and students at Lehigh, where the control
software was extended to support facial image collections and experiments.
Again, real-time tracking was possible from the video output, but the
focus of this project was the facial recognition. The Zoomnicam data was
collected on two different dates, imaging stationary subjects at 4 or 5
distances and producing a total collection of over 2000 images from 85
subjects, with 72 subjects overlapping between the collections. In
addition to the zoomnicamera images, all subjects had same-day and
different-day traditional camera images taken as part of parallel
collections at NIST, and most were also imaged by the mega-pixel
omni-directional system. This figure shows a standard camera image (upper
left) from 3 feet and then a range of zoomnicamera images at distances
from 6 to 15 feet. Note how the zoomnicam images also have strong
directional lighting effects, effects that would impact the recognition
rate of standard face recognition.

Using the results of these data collections we have been addressing the
facial recognition quality of these sensors. The experimental analysis
tested the hypothesis that the zoomnicamera, when zoomed to provide a
resolution similar to that of the MegaPixel omni-camera, would have
essentially the same recognition rate. Since the first experiments showed
the omnicamera in a wide range of settings, a smaller set of experiments
was done with the zoomnicameras. The next graph summarizes these results.
The error bars show the results of our STRAT/BRR technique for performance
analysis, allowing us to draw conclusions across different data sets.
Clearly the two curves are not statistically different. The upper curve
shows the results when the zoomnicamera data was used as both probe and
gallery images. Multiple images were taken, so the probe and gallery
contain different images, but they have similar lighting. This suggests
that lighting is a much stronger factor than the difference in sensors.
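The idea of comparing two recognition curves through their error bars can
be illustrated with a simple stand-in. The sketch below uses a
normal-approximation (Wald) binomial interval, which is NOT the STRAT/BRR
technique used in this project; the function names and the example counts
are hypothetical and are not data from the report.

```python
import math

def wald_interval(correct, total, z=1.96):
    """Approximate 95% confidence interval for a recognition rate.

    Stand-in for illustration only; this is not STRAT/BRR.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    rate = correct / total
    half_width = z * math.sqrt(rate * (1.0 - rate) / total)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts, not results from the report.
sensor_a = wald_interval(correct=170, total=200)
sensor_b = wald_interval(correct=176, total=200)
print(sensor_a, sensor_b, intervals_overlap(sensor_a, sensor_b))
```

Overlapping intervals give no evidence that the two sensors differ,
although interval overlap is only a rough check and not a formal test.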
Ultra-wide field of view is not the only issue for sensors for HID. We
can think of the omni-directional sensor as a special case of defining a
mapping from the scene to an image. Dr. Nayar and students at Columbia
developed a general approach to catadioptric system design that allows
them to solve for the mirror shape for any given image-to-scene mapping.
Using such a flexible system, one might desire an anamorphic imaging
system where the mirror is such that faces at expected distances all have
the same size. Such an example is shown here, where the faces in the
bottom row of people are shown for a perspective image and for an
anamorphic system. This is only one of infinitely many constraints that
might be imposed on the scene-to-image mapping and then realized using the
general imaging theory developed.

In the final year of the Flexible imaging project Dr. Boult and students
instituted a new direction: building on our earlier work in visual
surveillance, we began looking at issues in the recognition of particular
human activities. The work builds on our geo-spatial detection and
tracking work, which began as part of the DARPA VSAM program. With regard
to wide-area sensing and detailed assessment, Dr. Boult, as part of the
Army SmartSensorWeb (SSW) program, extended his system for tracking using
a 360-degree FOV camera. Not only did it detect and track targets, it
geo-located them and used that position to hand off targets to a PTZ
camera that could then be used for detailed assessment. The
omni-directional video supports tracking multiple targets simultaneously
and is very useful for crowded areas.
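The hand-off from a geo-located track to a PTZ camera can be sketched as a
small geometry calculation. The sketch below is an illustration under
assumptions that do not come from the report: a flat local ground plane
with east/north coordinates in meters, a PTZ camera at a known position and
mounting height, pan measured clockwise from north, and targets on the
ground. The function name is hypothetical.

```python
import math

def ptz_angles(target_east, target_north, cam_east, cam_north, cam_height):
    """Return (pan, tilt) in degrees to aim a PTZ camera at a ground target.

    Assumed conventions: pan is clockwise from north in [0, 360);
    tilt is negative when the camera looks down.
    """
    d_east = target_east - cam_east
    d_north = target_north - cam_north
    ground_range = math.hypot(d_east, d_north)
    pan = math.degrees(math.atan2(d_east, d_north)) % 360.0
    tilt = -math.degrees(math.atan2(cam_height, ground_range))
    return pan, tilt

# A target 10 m due east of a camera mounted 10 m above the ground.
print(ptz_angles(10.0, 0.0, 0.0, 0.0, 10.0))
```

A real system would also need the calibration between the omni-directional
camera and the PTZ unit, which this sketch leaves out.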
While detection and tracking still need research, especially for
following individuals within crowds, we postulate that they are not the
most significant problem. More important is developing an approach that
allows one to specify what activity is of interest and to recognize
complex activities. Well-known models used for event and anomaly
detection, such as Hidden Markov Models (HMMs) or stochastic grammars,
both require lots of training data and are very difficult to use, even for
expert computer users. In a recent study, we had graduate students in CS
and EE learn to use an HMM system, then use it to specify/learn simple
events. After training, those who could figure out how to specify given
events took more than 25 min, on average, to specify a single event of
interest. Furthermore, 13 of the 20 students were unable to develop an HMM
for the relatively simple events within an hour.

We proposed [Yu-Boult-2003] a radically new approach, UI-GUI: Image
Understanding of Graphical User Interfaces, which offers a new solution and
significant promise for human activity recognition. The approach ties the
“event recognition” to the GUI display of targets and sensory data. The
definition of an event is done using what the end-user sees in the GUI (so
it still depends on good low-level processing), and combines different
icons in spatio-temporal patterns. This figure is an example from our
video tracking, with target type and location displayed graphically. The
rule (the box) looks for someone being dropped off. Furthermore, the
IU-GUI supports effective in-the-field sensor integration: if the results
from a new sensor can be displayed on the user's screen, they can be used
as part of an IU-GUI rule.
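To make the idea of a spatio-temporal icon rule concrete, the sketch below
checks a simple "drop-off" pattern: a person icon appearing inside a
user-drawn box shortly after a vehicle icon was seen in the same box. The
data layout, names, and rule logic are hypothetical illustrations and are
not taken from [Yu-Boult-2003].

```python
from dataclasses import dataclass

@dataclass
class Icon:
    kind: str   # icon type as drawn on the display, e.g. "vehicle"
    x: float    # display coordinates of the icon
    y: float
    t: float    # time stamp in seconds

def in_box(icon, box):
    """box is (x_min, y_min, x_max, y_max) in display coordinates."""
    x_min, y_min, x_max, y_max = box
    return x_min <= icon.x <= x_max and y_min <= icon.y <= y_max

def drop_off_detected(icons, box, max_gap=10.0):
    """True if a person icon appears in the box within max_gap seconds
    after a vehicle icon was seen in the same box (hypothetical rule)."""
    vehicles = [i for i in icons if i.kind == "vehicle" and in_box(i, box)]
    people = [i for i in icons if i.kind == "person" and in_box(i, box)]
    return any(0.0 <= p.t - v.t <= max_gap for v in vehicles for p in people)

# Hypothetical display events: a vehicle stops, then a person appears nearby.
events = [Icon("vehicle", 50, 40, t=0.0), Icon("person", 55, 42, t=4.0)]
print(drop_off_detected(events, box=(30, 30, 80, 60)))
```

Because the rule only looks at icon types, positions, and times, any sensor
whose output can be drawn as an icon could feed the same check.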
Using the IU-GUI approach, the same 20 students as above were trained to
use the system in under 5 min and specified each of their activities of
interest in, on average, 11 seconds. That is more than 100 times faster
than the 25-minute average with HMMs. Here we show an ROC curve of
performance, and clearly the UI-GUI was also significantly better than the
HMM. While preliminary, these early experiments show the approach has very
significant potential for activity recognition.
4. Synthetic Sensor Evaluations
We also pursued a collection of “simulation” experiments where simulated
sensor effects were applied to a subset of the FERET data. The subset
consists of 256 subjects with 4 images per person, to permit the use of
STRAT to estimate confidence intervals. The first three of these synthetic
experiments examined the impact of blur, gamma, and compression. Spatial
blurring was done using 7x7 windows with an approximate Gaussian of a
given standard deviation (sigma) in pixels. The results for blur were not
surprising, showing the expected impact of blur, which was statistically
significant even for a sigma of a single pixel. While there have been
other informal reports of blur improving results, our results did not show
such a pattern. Again, each bar in the graph is a 95% confidence interval
from STRAT/BRR.
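As a rough illustration of the blur simulation described above, the sketch
below builds a 7x7 sampled Gaussian kernel for a given sigma (in pixels)
and applies it to a grayscale image stored as a list of rows. The function
names, the kernel normalization, and the border handling (edge
replication) are assumptions; the report does not specify them.

```python
import math

def gaussian_kernel(sigma, size=7):
    """Sampled 2-D Gaussian on a size x size window, normalized to sum to 1."""
    half = size // 2
    weights = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
                for x in range(-half, half + 1)]
               for y in range(-half, half + 1)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def blur(image, sigma, size=7):
    """Filter a grayscale image with the Gaussian window.

    Pixels outside the image replicate the nearest edge pixel (assumed).
    """
    kernel = gaussian_kernel(sigma, size)
    half = size // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for ky in range(size):
                rr = min(max(r + ky - half, 0), rows - 1)
                for kx in range(size):
                    cc = min(max(c + kx - half, 0), cols - 1)
                    acc += kernel[ky][kx] * image[rr][cc]
            out[r][c] = acc
    return out
```

Note that a 7x7 window truncates the Gaussian noticeably once sigma grows
beyond roughly one to two pixels, which is one reason the text calls it an
approximate Gaussian.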

In the analysis for gamma, the simulations cannot change the gamma at
capture time; rather, they reprocess good-quality images with moderate
dynamic range but unknown gamma (the original FERET images used were
digitized from film). Thus it is not a measure of the impact of improved
dynamic range, but only of the results of brightness variations. The
images were gamma corrected, with gamma=1 equal to the original image.
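A minimal sketch of this kind of reprocessing is given below, assuming
8-bit intensities and the common power-law convention
out = 255 * (in / 255) ** (1 / gamma). The convention and the function name
are assumptions; the report states only that gamma=1 leaves the original
image unchanged, which this form satisfies.

```python
def gamma_correct(image, gamma):
    """Apply out = 255 * (in / 255) ** (1 / gamma) to 8-bit pixel values.

    The exponent convention is assumed, not taken from the report.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Precompute a lookup table for the 256 possible input values.
    lut = [round(255.0 * (v / 255.0) ** (1.0 / gamma)) for v in range(256)]
    return [[lut[v] for v in row] for row in image]

# gamma > 1 brightens mid-tones under this convention; gamma = 1 is identity.
sample = [[0, 64, 128, 255]]
print(gamma_correct(sample, 1.0), gamma_correct(sample, 2.0))
```

Because the operation only remaps existing pixel values, it changes
brightness but cannot recover detail lost to limited dynamic range at
capture time, in line with the caveat above.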
The analysis for JPEG compression did present some unexpected results.
The images used were uncompressed FERET images that were converted from
film. The results considered compression of the gallery (first number in
the pair), the probe (second number), or both gallery and probe.