Institute of Information Science, Academia Sinica





Recognizing Human Activities from Low-Resolution Video

  • Lecturer: Mr. Chia-Chih Chen (PhD Candidate, ECE, U. of Texas at Austin)
    Host: Dr. Tyng-Luh Liu
  • Time: 2011-07-29 (Fri.) 10:30 – 12:00
  • Location: Auditorium 106 at new IIS Building

Recognition of human activities from low-resolution video is a challenging problem in computer vision. It is of significant interest in security, automated surveillance, and sports and aerial video analysis. Due to limited object resolution, the blurring effects of air turbulence, and a constantly moving camera platform, object detection, object tracking, and the recognition of human activities all become far less reliable. In this talk, we introduce two novel descriptors for activity recognition and a general framework for human-vehicle interaction recognition in low-resolution settings.

The first activity descriptor [1] is a composition of discriminative human pose and motion features, selected under the guidance of training data. Our space-time joint feature descriptor is computationally efficient and shows superior results on datasets of different resolutions. The second descriptor [2] models an activity sequence as simultaneous temporal signals emitted from active body parts. This is achieved by boosting action-associated space-time interest point detectors from video interest points. Compared with the use of visual features alone, we show that our method improves recognition accuracy by employing features computed from both local video content and the spectral properties of body parts' movements.
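To make the idea of a joint pose-and-motion descriptor concrete, here is a minimal, hypothetical sketch: a gradient-orientation histogram from the latest frame stands in for the pose cue, and a histogram of temporal frame differences stands in for the motion cue. The function name, histogram choices, and normalization are illustrative assumptions, not the actual descriptor of [1].

```python
import numpy as np

def joint_descriptor(frames, n_bins=8):
    """Toy space-time descriptor (illustrative, not the method of [1]):
    concatenates a gradient-orientation histogram (pose cue) from the
    last frame with a motion-energy histogram from frame differences."""
    frames = np.asarray(frames, dtype=float)
    last = frames[-1]
    # Pose cue: histogram of gradient orientations in the last frame.
    gy, gx = np.gradient(last)
    angles = np.arctan2(gy, gx)  # orientations in [-pi, pi]
    pose_hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    # Motion cue: histogram of absolute temporal differences.
    diffs = np.abs(np.diff(frames, axis=0))
    motion_hist, _ = np.histogram(diffs, bins=n_bins)
    # Concatenate both cues into one joint descriptor and L1-normalize.
    desc = np.concatenate([pose_hist, motion_hist]).astype(float)
    return desc / (desc.sum() + 1e-9)

# Usage: a random 5-frame, 16x16 low-resolution clip.
rng = np.random.default_rng(0)
clip = rng.random((5, 16, 16))
d = joint_descriptor(clip)
print(d.shape)  # (16,)
```

In a real pipeline, a feature-selection step (e.g. boosting) would then keep only the most discriminative histogram bins, which is where the training data guides the composition.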


For the recognition of human-vehicle interactions, we introduce a temporal-logic-based approach [3] which does not require training from event examples. At the low level, we employ dynamic programming to perform fast model fitting between the tracked vehicle and rendered 3-D vehicle models. At the semantic level, given the localized event region of interest (ROI), we verify the time series of human-vehicle relationships against pre-specified event definitions in a piecewise fashion. Our results on the VIRAT Aerial Video dataset [4] demonstrate the effectiveness and generality of this framework.
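The piecewise verification step can be sketched in a few lines. The sketch below assumes a per-frame sequence of symbolic human-vehicle relationship labels (here invented as "far", "near", "inside") and checks that the phases of an event definition occur in order, each for a minimum duration; the label set and event grammar are hypothetical, not the definitions used in [3].

```python
# Toy piecewise verification of an event definition against a time
# series of human-vehicle relationship labels (illustrative only).

def verify_event(timeline, event_def, min_len=1):
    """Return True if `timeline` (per-frame relationship labels)
    contains the phases of `event_def` in order, each lasting at
    least `min_len` consecutive frames."""
    i = 0
    for phase in event_def:
        # Advance to the start of this phase.
        while i < len(timeline) and timeline[i] != phase:
            i += 1
        # Count contiguous frames matching the phase.
        run = 0
        while i < len(timeline) and timeline[i] == phase:
            run += 1
            i += 1
        if run < min_len:
            return False
    return True

# Hypothetical "getting into a vehicle" event: approach, pause by
# the door, then disappear inside.
getting_in = ["far", "near", "inside"]
timeline = ["far"] * 10 + ["near"] * 4 + ["inside"] * 6
print(verify_event(timeline, getting_in))        # True
print(verify_event(timeline[::-1], getting_in))  # wrong order -> False
```

Because the event is specified declaratively rather than learned, no training examples of the interaction are needed, which is the point of the temporal-logic formulation.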

In each case, we demonstrate our approach on real-world benchmark datasets and compare our results with those published by other researchers. In most of the cases presented, our results are superior or comparable to the state of the art.

[1] C.-C. Chen and J. K. Aggarwal, “Recognizing Human Action from a Far Field of View”, IEEE Workshop on Motion and Video Computing (WMVC), Utah, USA, Dec. 2009.

[2] C.-C. Chen and J. K. Aggarwal, “Modeling Human Activities as Speech”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.

[3] J. T. Lee*, C.-C. Chen*, and J. K. Aggarwal, “Recognizing Human-Vehicle Interactions from Aerial Video without Training”, Workshop of Aerial Video Processing in conjunction with CVPR (WAVP), Colorado Springs, CO, June 2011. *Equal contribution authorship.

[4] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Roy-Chowdhury, and M. Desai, “A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.