Page 26 - 2017 Brochure
Research Laboratories

Multimedia Technology Lab

Multimedia technology is considered to be one of the three most promising industries of the twenty-first century, along with biotechnology and nanotechnology. Over the past two decades, we have witnessed how multimedia technology can influence various aspects of our daily life. Its wide spectrum of applications presents a constant challenge to advance a broad range of multimedia techniques, including those related to music, video, images, text, and 3D animation.

The main research interests of individual Multimedia Technology Group members include multimedia signal processing, computer vision, and machine learning. Beyond these individual interests, the group's joint research activity is best characterized by its two ongoing major projects: (A) Integration of Video and Audio Intelligence for Multimedia Applications and (B) Deep Learning for Multimedia Information Processing.

1. Integration of Video and Audio Intelligence for Multimedia Applications

This project explores new multimedia techniques and applications that require the fusion of video and audio intelligence. With the prevalence of mobile devices, people can now easily film a live concert and create video clips of the performance, and popular websites such as YouTube and Vimeo have amplified this phenomenon by making sharing easy. Videos of this kind, recorded by audience members at different locations in the venue, give those who could not attend the event an opportunity to enjoy the performance. However, the viewing experience is usually degraded because the multiple source videos are captured without coordination, which inevitably leads to incompleteness or redundancy. To improve the viewing and listening experience, it would be highly desirable to fuse the videos, with a smooth "decoration" process, into a single, near-professional audio/visual stream.

Video mashup, an emerging research topic in multimedia, can satisfy these requirements. A successful mashup process should consider all videos captured at different locations and convert them into a complete, non-overlapping, seamless, and high-quality product. To conduct a concert video mashup successfully, we propose to address the following issues: (1) to make the mashup outcome as professional as possible, we integrate the rules defined in the language of film; (2) to arrange the visual order of the different video clips, we solve the causality problem of the visual component; and (3) beyond the visual component, the alignment of multiple audio sequences must also be carefully addressed.

For example, automatic soundtrack suggestion for user-generated video (UGV) is a challenging but desirable task. A major reason for the challenge is that the distance between video and music cannot be measured directly. Motivated by recent developments in the affective computing of multimedia signals, we map low-level acoustic and visual features into an emotional space and match the two modalities there. Our research simultaneously tackles video editing and soundtrack recommendation; the correlations among music, video, and semantic annotations such as emotion will be actively explored and modeled. A music-accompanied video composed in this way is attractive, since the perception of emotion occurs naturally while watching video and listening to music.

2. Deep Learning for Multimedia Information Processing

Owing to its effectiveness in solving various challenging tasks, deep learning has attracted great attention in recent years. In the field of multimedia information processing, it has revealed new opportunities for solving both conventional and modern research problems. During the upcoming few years, we aim to rigorously re-formulate, and better solve, emergent multimedia-related problems using deep learning. These efforts are highlighted by the following three collaborative projects.

1. Visual information processing: With the recent progress in GPUs and large-scale databases, the deep convolutional neural network (CNN) has received extensive attention. Deep CNNs can learn rich feature representations from images and perform impressively on image classification and object detection tasks. However, because most deep learning methods are designed for classification, they are not necessarily suitable for searching or retrieving relevant images. Our study focuses on developing new deep learning methods for image and video retrieval, addressing two main issues: retrieval efficiency and retrieval accuracy. To improve the retrieval efficiency of deep networks, we proposed a simple yet effective supervised deep hashing approach that constructs binary hash codes from labeled data for large-scale image search. Our approach, dubbed supervised semantics-preserving deep hashing (SSDH), constructs the hash functions as a latent layer in a deep network, and the binary codes are learned by minimizing an objective function defined over the classification error and other desirable hash-code properties. This work was accepted by IEEE TPAMI in 2017 (with a preliminary version in CVPRW, 2015). To improve retrieval accuracy, we introduce
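The mashup assembly step in project (A) can be illustrated with a minimal sketch: given source clips that are already time-aligned on a common concert timeline (e.g. by audio fingerprinting), greedily pick one clip per interval to produce a complete, non-overlapping result. The clip names, quality scores, and the minimum-shot-length rule standing in for film-language constraints are all hypothetical, not the group's actual algorithm.

```python
# Hypothetical sketch of mashup segment selection: choose the best
# available clip at each time step, but keep the current shot for at
# least `min_shot` seconds to avoid jarring cuts (a simple stand-in
# for film-language editing rules).
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    start: float    # seconds on the common concert timeline
    end: float
    quality: float  # e.g. stability/resolution score, higher is better

def build_timeline(clips, step=1.0, min_shot=3.0):
    """Greedily assemble a complete, non-overlapping timeline."""
    t0 = min(c.start for c in clips)
    t1 = max(c.end for c in clips)
    timeline, current, shot_start = [], None, t0
    t = t0
    while t < t1:
        # Clips that fully cover the interval [t, t + step).
        avail = [c for c in clips if c.start <= t and c.end >= t + step]
        if not avail:
            t += step
            continue
        best = max(avail, key=lambda c: c.quality)
        # Keep the current shot if it is still usable and too short to cut.
        keep_current = current in avail and t - shot_start < min_shot
        chosen = current if keep_current else best
        if chosen is not current:
            current, shot_start = chosen, t
        if timeline and timeline[-1][0] is chosen:
            timeline[-1] = (chosen, timeline[-1][1], t + step)  # extend shot
        else:
            timeline.append((chosen, t, t + step))              # new shot
        t += step
    return [(c.name, s, e) for c, s, e in timeline]

clips = [Clip("cam_A", 0, 10, 0.9), Clip("cam_B", 0, 6, 0.5),
         Clip("cam_C", 4, 10, 0.95)]
print(build_timeline(clips))
```

Here cam_C wins once it becomes available at t = 4, because cam_A's shot has already lasted the minimum length; the output covers the full timeline without overlaps.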
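The emotion-space matching used for soundtrack recommendation in project (A) can be sketched as follows: both modalities are projected to (valence, arousal) coordinates, where a video-music distance becomes directly measurable. The linear projections, feature values, and track names below are made-up placeholders; in practice such mappings would be learned from affective-computing corpora.

```python
# Toy sketch: map low-level features of each modality into a shared
# 2-D emotion space (valence, arousal) and rank music by distance.
import math

def project(features, weights, bias):
    """Map features to a (valence, arousal) point with a placeholder
    linear model: one weight row per emotion dimension."""
    return tuple(sum(w * f for w, f in zip(row, features)) + b
                 for row, b in zip(weights, bias))

def recommend(video_feats, tracks, w_video, b_video, w_audio, b_audio):
    """Return music tracks ranked by emotional distance to the video."""
    v = project(video_feats, w_video, b_video)
    def dist(item):
        a = project(item[1], w_audio, b_audio)
        return math.dist(v, a)  # Euclidean distance in emotion space
    return sorted(tracks, key=dist)

# Hypothetical numbers: 3 visual features, 2 acoustic features per track.
w_video, b_video = [[0.5, 0.2, 0.0], [0.0, 0.3, 0.6]], [0.0, 0.0]
w_audio, b_audio = [[0.8, 0.1], [0.2, 0.7]], [0.0, 0.0]
tracks = [("calm_piano", [0.2, 0.1]), ("upbeat_rock", [0.9, 0.8])]
ranking = recommend([0.8, 0.6, 0.7], tracks,
                    w_video, b_video, w_audio, b_audio)
print([name for name, _ in ranking])
```

The point of the construction is exactly the one made above: video and music have no directly comparable features, but once both are mapped into the emotional space, an ordinary distance suffices for matching.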
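The retrieval side of the SSDH idea can also be illustrated with a toy sketch: the latent layer's sigmoid activations are thresholded into binary codes, and database items are ranked by Hamming distance to the query's code. The random weights below merely stand in for a network trained with the classification-plus-code-quality objective described above; the training step is omitted entirely.

```python
# Minimal sketch of hash-based retrieval with a latent binary layer.
# W stands in for a trained latent layer; here it is random, so the
# codes are meaningless but the retrieval mechanics are the same.
import random
from math import exp

random.seed(0)
DIM, BITS = 8, 12  # feature dimension and hash-code length (illustrative)
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(BITS)]

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def hash_code(features):
    """Binarize the latent layer: bit k = 1 iff sigmoid(w_k . x) >= 0.5,
    i.e. iff the pre-activation is non-negative."""
    return tuple(int(sigmoid(sum(w * f for w, f in zip(row, features))) >= 0.5)
                 for row in W)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_feats, database):
    """Rank database items by Hamming distance to the query's code."""
    q = hash_code(query_feats)
    return sorted(database, key=lambda item: hamming(q, hash_code(item[1])))

db = [("img_%d" % i, [random.uniform(0, 1) for _ in range(DIM)])
      for i in range(5)]
query = db[2][1]  # query with an image already in the database
print(retrieve(query, db)[0][0])
```

Because identical features yield identical codes, the query image itself sits at Hamming distance 0, and comparing short binary codes instead of raw deep features is what makes the search efficient at scale.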
