Real-time Motion Templates using Intel CVL

Real-time Motion Template Gradients using Intel CVLib

James Davis	Gary Bradski
MIT Media Lab	Intel Corporation, SC12-303
E15-390, 20 Ames St.	2200 Mission College Blvd.
Cambridge, MA 02139 USA	Santa Clara, CA 95052 USA
davis@media.mit.edu	gary.bradski@intel.com

Abstract In this paper, we present an extension to the real-time motion template research for computer vision as previously developed in (Davis 1997). The underlying representation is a Motion History Image (MHI) that temporally layers consecutive image silhouettes (or motion properties) of a moving person into a single template form. Originally a global, label-based method was used for recognition. In this work, we construct a more localized motion characterization for the MHI that extracts motion orientations in real-time. Basically, directional motion information can be recovered directly from the intensity gradients within the MHI. In addition, we provide a few simple motion features using these orientations. The approach presented is implemented in real-time on a standard PC platform employing optimized routines, developed in part for this research, from the Intel® Computer Vision Library (CVLib). We conclude with an overview of this library and also a performance evaluation in terms of this research. pdf version of this paper (186K)

AVI Example Movie

1. Introduction

In earlier work (Davis 1997), a real-time computer vision representation of human movement known as a Motion History Image (MHI) was presented. The MHI is a compact template representation of movement originally based on the layering of successive image motions. The recognition method presented for these motion templates used a global moment feature vector constructed from image intensities, resulting in a token-based (label-based) matching scheme. Though this recognition method showed promising results using a large database of human movements, no method has yet been proposed to compute the raw motion information directly from the template without the necessity of labeling the entire motion pattern. Raw motion information may be favored for situations when a precisely labeled action is not possible or required. For example, a system may be designed to respond to leftward motion, but may not care if it was a person, hand, or car moving.

In this paper, we present an extension to the original motion template approach to now enable the computation of local motion information directly from the template. In this extension, silhouettes of the moving person are layered over time within the MHI motion template. The idea is that the motion template itself implicitly encodes directional motion information along the layered silhouette contours (similar to normal flow). Motion orientations can then be extracted by convolving image gradient masks throughout the MHI. With the resulting motion field, various forms of motion analysis and recognition are possible (e.g. matching histograms of orientations). To ensure fast and stable computation of this approach, we exploit the Intel Computer Vision Library (CVLib) procedures designed in part for this research. This software library has been optimized to take advantage of StrongArm and Pentium® MMX™ instructions. We offer the research presented in this paper both as a useful computer vision algorithm and as a demonstration of the Intel CVLib.

The remainder of this paper first examines the previous motion template work (Section 2). Next, we present the approach of computing motion orientations from a silhouette-based motion template representation (Section 3), along with describing potential motion features and generalities of the approach. We then present a brief overview of the Intel CVLib, the advantages resulting from the CVLib enhancement in terms of vision processing, and its comparative computational performance for this work (Section 4). We then summarize results and conclusions (Section 5). Appendix A details the calculation of moments and Hu moments used for the pose recognition approach for the templates (Section 3.1.1).

2. Previous Motion Template Research In previous work, Davis and Bobick (Davis 1997) presented a real-time computer vision approach for representing and recognizing simple human movements. The motivation for the approach was based on how easily people can recognize common human movements (like sitting or push-ups) from low-resolution (blurred) imagery. Here there is an almost total lack of recognizable features, thus showing the role of motion for recognition. Accordingly, the method given relies on "patterns of motion" rather than on structural features as the representation for human motion. In that method, the space-time image volume containing the motion is collapsed into a single 2-D template while still perceptually capturing the essence of the movement and its temporal structure. The template is generated by layering successive image differences of a moving person, and is referred to as a Motion History Image (MHI) (Similar to (Jain 1979)). An example is shown in Figure 1 for the movement of a person stretching her arm over her head.

Frame-0	Frame-35	Frame-70	MHI
Figure 1. Motion History Image for arm-stretching movement generated from image differences.

For recognition of the templates, seven higher-order moments (Hu 1962) are initially extracted from the MHI and also from a binarized version of the MHI. These fourteen moments are used as global shape descriptors and temporal recency localizers for the MHI. The moments are then statistically matched to stored examples of different movements. This method of recognition has shown promising results using a large database of movements.

Already, interactive systems have been successfully constructed using the underlying motion template technology as a primary sensing mechanism. One example includes a virtual aerobics trainer (Davis 1998) that watches and responds to the user as he/she performs the workout (See Figure 2(a)). In this system, motion templates are used to watch and recognize the various exercise movements of the person, which in turn affects the response of the virtual instructor. Another application using the motion template approach is The KidsRoom (Bobick 1997). At one point in this interactive, narrative play space for children, virtual monsters appear on large video projection screens and teach the children how to do a dance. The monsters then dance with the children, complementing the kids whenever they perform the dance movements (See Figure 2(b)). The dancing movements of the children are recognized using motion templates. Thirdly, a simple interactive art demonstration can be constructed from the motion templates. By mapping different colors to the various time-stamps (or graylevels) within the MHI and displaying the result on a large projection screen, a person can have fun "body-painting" over the screen (See Figure 2(c)), reminiscent of Krueger-style interactive installations (Krueger 1991). Other applications in general that must be "aware" of the movements of the person (or people) could also benefit from using the motion template approach. For example, gesture-based computer video games (Freeman 1998) or immersive experiences (e.g. Hawaii flight simulator, as shown in Figure 2(d)) designed to respond to user gesture control could use the previous template approach or employ our new method as an additional sensing measurement or qualification of the person’s movements. Even simpler gesture-based systems which rely on detecting characteristic motion (e.g. hand swipes to change a channel or a slide in a presentation) could profit from the fast and simple motion orientation features to be described.

(a)	(b)
(c)	(d)
Figure 2. Interactive systems employing motion templates. (a) Virtual aerobics trainer (photo © Sam Ogden). (b) The KidsRoom. (c) Interactive body-painting. (d) Hawaii flight simulator.

The main limitation of single templates is that characterization of the template is token (label) based (e.g. "crouching" or "sitting"), where it cannot yield much information other than recognition matches (i.e. it cannot determine that, say, a lot of "up" motion is currently happening). We therefore wish to develop additional methods for analysis and recognition that are more descriptive of the local motion information. In the next section, we develop a method of analysis which extracts directional motion information from the MHI. 3. Motion Templates and Orientations

presence

accumulation

3.1 Input

in front of

Given the silhouettes extracted for constructing the MHI, we can also use the pose of the silhouette to give additional information about the action of the person. For example, we could use the pose to recognize the body configuration, and then use the motion information to qualify how the person moved to that position.

(a)	(b)	(c)
Figure 3. Motion template generation. (a) Input video frame. (b) Silhouette in extracted from video. (c) MHI motion template (red shows where valid motion was detected, white line shows global direction of motion).

3.2 Pose Recognition from Silhouettes

1.1

However, to indicate the discriminatory power of these moment features for the silhouette poses, we need only a few examples of each pose (at least sufficient for matrix inversion) to relate the distances between the classes.

For this example, the training set consisted of 5 repetitions of 3 gestural poses ("Y", "T", and "Left Arm") shown in Figure 4 done by each of five people. A sixth person who had not practiced the gestures was brought in to perform the gestures. Table 1 shows the Mahalanobis distances for these new test poses T_i matched against the stored training models P_i . The table correctly shows that the true matches for the test poses (along the diagonal) have distances considerably smaller than to the other model poses in each row even though the first two poses, "Y" and "T", are fairly close to one another. This is a typical result showing that thresholds tend to be easy to set with distances between even fairly similar gestures ("Y" and "T") still about an order of magnitude apart. We next describe the MHI motion template representation (generated from successive silhouettes) used to extract the directional motion information.

P₁	P₂	P₃
Figure 4. Three example poses used for Mahalanobis distance measurements in Table 1. P₁= "Y", P₂= "T", and P₃= "Left Arm".

	P₁	P₂	P₃
T₁	14	204	2167
T₂	411	11	11085
T₃	2807	257	28

Table 1. Discrimination results. Table entries report the Mahalanobis distance between the test poses (T_i) and model poses (P_i). P₁= "Y", P₂= "T", and P₃= "Left Arm". Diagonal entries are the true correspondences. Entries in bold show the minimum distance for each test input to the models.

3.3 MHI Representation

1.2

explicitly

3.4 Motion Gradient

direction

As shown in Figure 5(a) and (b), the local intensity gradients of the MHI directly yield the orientation of the silhouette contour motion. Hence, we can convolve classic intensity gradient masks with the MHI to compute the motion orientations. For the convolution, we used Sobel gradient masks:

. 1.3

. 1.4

interior

(a)	(b)	(c)
Figure 5. Motion gradients. (a) MHI for left arm movement up. (b) Sample motion orientations using Sobel convolution masks. (c) Radial histogram of the motion orientations with global direction identified.

global

leftward

upward

1.5

(a)	(b)	(c)
Figure 6. Global motion orientation measures. (a) Kneeling, viewed from the side: . (b) Walking left, viewed from the side: . (c) Lifting arms, viewed from the front: .

3.5 Generalities of This Approach

4. Intel CVLib and Performance Evaluation

4.1 CVLib Description

The Intel computer vision library (CVLib) is intended to form a broad substrate of computer vision algorithms, highly optimized across Intel microprocessor products from StrongArm to Pentium, and is intended to run on most popular operating systems: Windows, BeOS, and Linux. CVLib will join other existing, supported high performance libraries in Image Processing, JPEG, MPEG, Signal Processing, Math (matrix algebra), and Pattern Recognition freely available on the web at http://developer.intel.com/vtune/perflibst/. As of this writing, CVLib is in alpha testing and will be released on the web towards the last quarter of 1999.

It is intended that CVLib will eventually include functionality in the following areas:

Image Statistical Measures: Mean, median, moments, Hu moments.

Linear Algebra: SVD, eigenvalues , eigenvectors, cross-product, transpose product, matrix math.

Optimization routines: Newton’s, L-M.

Image Pyramids: Gaussian and Laplacian.

Clouds of points: Delaunay triangulations, voronoi diagrams, convex hulls.

Drawing support: Lines, conic sections, flood fills.

Segmentation: Skeletonization, distance transform, shape descriptors, morphological operators, background subtraction support, histogram backprojection, fast anisotropic diffusion.

Feature finding: Canny edge detector, Hough transform, face features, corner, ridge, line detectors, general and separable convolution, fast fixed convolution kernels, image derivatives to third order.

Pattern Recognition: Multi-dimensional histogram collection and comparison, MLP, HMM, DTW, Bayesian, Mahalanobis, K-NN, Eigen objects.

Motion: Optical flow, normal optical flow, motion segmentation.

Camera Calibration: Intrinsic, Extrinsic, radiometric, geometric, sub-pixel accuracy corner and line finders.

3D, Stereo: 8 point algorithm, matching on epipolar lines, dense stereo, single image model based pose in 6-DOF, finding the Homography matrix from vanishing points.

Tracking: Color based, Snakes, Kalman filter, condensation filter, correlation.

Transforms: FFT, DCT, wavelets.

Utilities: Fast pixel access, cordic algorithm, random number generator, fast square root, inverse square root, image data shuffling.

Works with all routines in Intel’s Image Processing Library.

The low-level motion template routines are included in this library and we used the optimized C code versions of these routines to compare performance against the optimized assembly MMX™ instruction versions in the section below.

4.2 CVLib Performance Evaluation

Un-optimized C	Optimized C	Optimized MMX Assembly
22 frames/second	27 frames/second	30+ frames/second

Table 2. Frame rates for different implementations.

A video was recorded of a hand waving back and forth across the scene with a frequency of about one hertz. During the wave, the hand size fluctuated between taking up one to three quarters of the image. The video was two and a half minutes in duration. Intel’s VTUNE™ Performance Analyzer code was used to record the performance of the motion template gradient algorithm on this video sequence using both the optimized C

(PX) and assembly (A6) versions of CVLib. VTUNE™ rapidly samples the application and produces a report as to the percent of time spent in each module. Since optimized code runs faster, it might get called more often,

so we can’t use total percent of time spent in PX vs. A6 as a comparison. Instead we have the situation as shown in Table 3.

	PX	A6
Motion Template Time	A	A/K
Time in rest of code	1-A	1-A

Table 3. "A" is the percentage of time spent in the optimized C code PX. The MMX assembly optimized code A6 has a speed-up factor "K" as compared to PX. The total amount of time spent in the outside code, 1-A is the same for both PX and A6.

. 16

Figure 7. Motion Template Gradient code A6 was active B = 35.89% of the time.

Figure 8. Motion Template Gradient Code PX was active A = 51% of the time.

4.3 CVLib Performance Discussion

In terms of the motion template research described in this paper, there are two main benefits of using the CVLib. First, faster processing rates enabled by using CVLib offer a more stable and consistent MHI representation with silhouettes. Unlike standard optical flow routines which calculate motion usually between successive frames (motion over milliseconds), our silhouette layering approach is based on a larger time scale (motion over seconds). Since the position of the person (silhouette) over time is layered into the template, faster frame rates only reduce the contour disparities, corresponding to a finer and more dense motion resolution. Thus faster frame rates increase the granularity of the motion template. The second advantage of using the CVLib is that it may allow more views to be processed on a single machine. Currently, only one view of the person is processed. As previously stated in Sect. 3.1, silhouettes have the problem of masking "internal" motion within the silhouette region, but using multiple views of the person is one suggestion to overcome this hindrance. Therefore, if the motion template method is fast enough while not occupying most of the CPU, we could simultaneously process multiple views of the moving person on a single machine to account for the possible silhouette occlusions (as similarly performed in (Davis 1997) for robust matching). Having a single machine processing these multiple views of the person (via multi-video inputs or video multiplexer), rather than multiple machines each processing a single view, is obviously advantageous in terms of cost and synchronization.

Given the methods described in this work and the overall functionality of the CVLib, many applications could benefit from the proposed methodology and implementation. Systems incorporating human movements for input must recognize and respond quickly to the user without noticeable lag to give a sense of immersion and control. The quickness of the characterization and its response is paramount. We believe that the motion template research, as optimized in CVLib, offers such a system.

5. Summary

global

6. Appendix A: Moments and Hu Moments

This appendix is included to make this paper self-contained for implementation purposes, and may be omitted otherwise. The appendix describes central moments to the 3^rd order, and then their combination into rotationally invariant Hu moments (Hu 1962).

A.1 Moments, Their Use up to The 3^rd Order

Since we’re dealing with images, we need to work with two dimensional moments. Moments give us an indication of the center, spread, skewness etc. of what we are measuring. In our case, our measures are pixel values and locations. In this case, moments can give us a highly compressed indication of the shape of the object being measured in our image. The two-dimensional moment,

, of order

, of a density distribution function,

, is defined as

(A1)

The two-dimensional moment for a discretized image,

, is

(A2)

A complete moment set of order n consists of all moments,

, such that

. The use of moments for image analysis and object representation was introduced in 1962 by Hu in his seminal paper (Hu 1962). Hu’s uniqueness theorem states that if

is piecewise continuous and has nonzero values only in a finite region, then the moments of all orders exist. Moreover, the theorem proves that the moment set

is uniquely determined by

and conversely,

is uniquely determined by

. Since an image segment has a finite area and, in the worst case, is piece-wise continuous, moments of all orders exist, and a moment set can be computed that will uniquely describe the information contained in the image segment. To characterize all of the information contained in an image segment requires a potentially infinite number of moments. The challenge is to select a meaningful subset of moments that contain sufficient information to accurately characterize the image. Since noise effects are large in video images and we are looking for a highly compressed indication of shape, we only consider moments up to the third order.

To gain a better intuition of how to reason about moments and get an insight into how to derive invariant features based on moments, we next consider several low-order moments and describe their physical meaning. The definition of the zero-th order moment, , of the image is

(A3)

This moment represents the total mass of the given image. When computed for a silhouette image, the zero-th moment represents the total object area.

The two first order moments, , are used to locate the center of mass of the object. The center of mass defines a unique location with respect to the object that may be used as a reference point to describe the position of the object within the field of view. The coordinates of the center of mass can be defined through moments as shown below

(A4)

A.2 Hu Moments

To center a visual object such that its center of mass is at the origin,

and

, of the moments calculation, we define central moments as follows:

(A5)

We note that central moments are translation invariant. To achieve scale invariance we normalize central moments as follows

(A6)

For arbitrary shapes, a potentially infinite number of moments

can uniquely describe that shape. These image moments are invariant with respect to translation and scale operations, but an infinite number of moments are obviously impractical, and in any case, we want a compressed representation of shape. Since objects in video images contain a large amount of noise, only moments up to the 3^rd order are generally practicable.

There are ten moments up to the 3^rd order, but scale normalization and translation invariance fix three of these moments at constant values. Rotation invariance takes away one more degree of freedom leaving us with six independent dimensions. Hu uses seven variables however: six to span the six degrees of freedom, and a final seventh variable who’s sign removes reflection invariance. Only the first six of the Hu variables below give reflection invariance.

The seven moment-based features proposed by Hu that are functions of normalized moments up to the third order are:

(A7)

In the first two equations,

and

, provide scale and translation independence, equations

ensure rotation with reflection invariance, and

provides reflection discrimination in it’s sign.

7. Bibliography Aggarwal, J. and Q. Cai. Human motion analysis: a review. IEEE Wkshp. Nonrigid and Articulated Motion Workshop, Pages 90-102, 1997.

Akita, K. Image sequence analysis of real world human motion. Pattern Recognition, 17, 1984.

Bergen, J., Anandan, P., Hanna, K., and R. Hingorani. Hierarchical model-based motion estimation. In Proc. European Conf. on Comp. Vis., pages 237-252, 1992.

Black, M. and P. Anandan. A frame-work for robust estimation of optical flow. In Proc. Int. Conf. Comp. Vis., pages 231-236, 1993.

Bobick, A., Davis, J., and S. Intille. The KidsRoom: an example application using a deep perceptual interface. In Proc. Perceptual user Interfaces, pages 1-4, October 1997.

Bradski, G., Yeo, B-L. and M. Yeung. Gesture for video content navigation. In SPIE’99, 3656-24 S6, 1999.

Bradski, G. Computer Vision Face Tracking For Use in a Perceptual User Interface. In Intel Technology Journal, http://developer.intel.com/technology/itj/q21998/articles/art_2.htm, Q2 1998.

Braun, M. Picturing time: The work of Etienne-Jules Marey (1830-1904). University of Chicago Press, 1992.

Bregler, C. Learning and recognizing human dynamics in video sequences. In Proc. Comp. Vis. And Pattern Rec., pages 568-574, June 1997.

Capelli, R. Fast approximation to the arctangent. Graphics Gems II, Academic Press, James Arvo Editor, pages 389-391, 1991.

Cedras, C. and M. Shah. Motion-based recognition: a survey. Image and Vision Computing, Vol. 13, Num 2, pages 129-155, March 1995.

Cham, T. and J. Rehg. A multiple hypothesis approach to figure tracking. In Proc. Perceptual User Interfaces, pages 19-24, November 1998.

Cuttler, R. and M. Turk. View-based interpretation of real-time optical flow for gesture recognition. Int. Conf. On Automatic Face and Gesture Recognition, page 416-421, 1998.

Darrell, T., Maes, P., Blumberg, B., and A. Pentland. A novel environment for situated vision and behavior. In IEEE Wkshp. For Visual Behaviors (CVPR), 1994.

Davis, J. Recognizing movement using motion histograms. MIT Media lab Technical Report #487, March 1999.

Davis, J. and A. Bobick. Virtual PAT: a virtual personal aerobics trainer. In Proc. Perceptual User Interfaces, pages 13-18, November 1998.

Davis, J. and A. Bobick. A robust human-silhouette extraction technique for interactive virtual environments. In Proc. Modelling and Motion capture Techniques for Virtual Environments, pages 12-25, 1998.

Davis, J. and A. Bobick. The representation and recognition of human movement using temporal templates. In Proc. Comp. Vis. and Pattern Rec., pages 928-934, June 1997.

Edgerton, H. and J. Killian. Moments of vision: the stroboscopic revolution in photography. MIT Press, 1979.

Freeman, W., Anderson, D., Beardsley, P., et al. Computer vision for interactive computer graphics. IEEE Computer Graphics and Applications, Vol. 18, Num 3, pages 42-53, May-June 1998.

Freeman, W., and M. Roth. Orientation histograms for hand gesture recognition. In Int’l Wkshp. On Automatic Face- and Gesture-Recognition, 1995.

Gavrila, D. The visual analysis of human movement: a survey. Computer Vision and Image Understanding, Vol. 73, Num 1, pages 82-98, January, 1999.

Haritaoglu, I. Harwood, D., and L. Davis. W4S: A real-time system for detecting and tracking people in 2 ½ D. European. Conf. On Comp. Vis., pages 877-892, 1998.

Hogg, D. Model-based vision: a paradigm to see a walking person. Image and Vision Computing, Vol. 1, Num. 1, 1983.

Horn, B. Robot Vision. MIT Press, 1986.

Horprasert, T. Haritaoglu, I., Harwood, D., et al. Real-time 3D motion capture. In Proc. Perceptual User Interfaces, pages 87-90, November 1998.

Hu, M. Visual pattern recognition by moment invariants. IRE Trans. Information Theory, Vol. IT-8, Num. 2, 1962.

Jain, R. and H. Nagel. On the analysis of accumulative difference pictures from image sequences of real world scenes. IEEE Trans. On Pattern Analysis and Machine Intelligence, Vol. 1, Num. 2, April 1979.

Ju, S., Black, M., and Y. Yacoob. Cardboard people: a parameterized model of articulated image motion. Int. Conference on Automatic Face and Gesture Recognition, pages 38-44, 1996.

Krueger, M. Artificial reality II, Addison-Wesley, 1991.

Little, J., and J. Boyd. Describing motion for recognition. In International Symposium on Computer Vision, pages 235-240, November 1995.

Moeslund, T. Summaries of 107 computer vision-based human motion capture papers. University of Aalborg Technical Report LIA 99-01, March 1999.

Moeslund, T. Computer vision-based human motion capture – a survey. University of Aalborg Technical Report LIA 99-02, March 1999.

Pinhanez, C. Representation and recognition of action in interactive spaces. MIT Media Lab Ph.D. Thesis, June 1999.

Pinhanez, C., Mase, K., and A. Bobick. Interval scripts: a design paradigm for story-based interactive systems. Conference on Human Factors in Computing Systems, pages 287-294, 1997.

Polana, R. and R. Nelson. Low level recognition of human motion. In IEEE Wkshp. On Nonrigid and Articulated Motion, 1994.

Rohr, K. Towards model-based recognition of human movements in image sequences. CVGIP, Image Understanding, Vol. 59, Num 1, 1994.

Shah, M. and R. Jain. Motion-Based Recognition. Kluwer Academic , 1997.

Therrien, C. Decision Estimation and Classification. John Wiley and Sons, Inc., 1989.

Wren, C., Azarbayejani, A., Darrell, T., and A. Pentland. Pfinder: real-time tracking of the human body. SPIE Conference on Integration Issues in Large Commercial Media Delivery Systems, 1995.

Yamato, J., J. Ohya, and K. Ishii. Recognizing human action in time sequential images using hidden markov models. In Proc. Comp. Vis. and Pattern Rec., 1992.

Assembly optimized performance libraries in Image Processing, Signal Processing, JPEG, Pattern Recognition and Matrix math are at http://developer.intel.com/vtune/perflibst/.