Brochure 2020

Figure 1: Illustration of the proposed probabilistic-based ensemble model for translation
of music to visual storytelling using shots. Here, STFT stands for short-time Fourier
transform, and VQ stands for vector quantization.

of data-dependent equations and constraints. Applying the un-rectifying procedure layer by layer in a non-linear network can generate a representation comprising data-dependent equations over those constraints. As a result, optimization problems on networks can be re-framed as data-dependent and constrained optimization problems in which the number of constraints is finite. By applying the technique to L-layer networks comprised of ReLU and Max-pooling activation functions, we show that: 1) the polytope domains of the affine mappings induced by representations partition the input space, and that partitioning is refined with increasing numbers of layers; 2) the linear parts of the affine mappings allow atomic decompositions as the matrix products W_L C W_1, where data-dependent matrix C is induced from the un-rectifying process, and W_L and W_1 are the weight matrices; and 3) a Lipschitz bound can be estimated for the affine linear transforms of a network and used to characterize its stability, rendering a network asymptotically stable if the Lipschitz constant is a bounded function of the network depth L. Accordingly, there are connections between the stability of the network and the sparse or compressible weight distributions among its layers. Currently, our research is focused on developing optimization algorithms for learning representation induced from un-rectifying a network, studying invertible DNN networks, and analyzing dynamic behaviors of networks containing loops.

2. Lifelong learning of deep models: Continual lifelong learning is essential to many applications. We are proposing a simple but effective approach to continual deep learning. Our approach leverages the principles of deep model compression, critical weights selection, and progressive networks expansion. By enforcing their integration in an iterative manner, we are establishing an incremental learning method that is scalable to the number of sequential tasks in a continual learning process. Our approach is easy to implement and exhibits several favorable characteristics. First, it can avoid forgetting (i.e., it learns new tasks while remembering all previous tasks). Second, it allows model expansion but maintains model compactness when handling sequential tasks.

3. Crowd behavior analysis: a crowd can stay still or move with respect to time. Moreover, when a crowd moves, it moves in a non-rigid manner or may form sub-groups of people. Therefore, effectively analyzing crowd behavior is not a trivial task. In addition, in real-world scenarios, real-time detection of crowd behavior is often required. In those cases, it is necessary to employ edge computing with a lightweight neural network architecture. We aim to propose heuristic definitions to characterize a crowd. Based on these criteria, we hope to generate various deep-learning modules to construct a complete framework for analyzing crowd behaviors, including sub-group detection, sub-group merging, and abnormal behavior detection, amongst other attributes.

4. Music information retrieval: most descriptions of music apply at multiple granularities, since music signals are constructed with multiple instruments, hierarchical meter structures and mixed genres. The problem of music information retrieval can therefore be considered in the context of deep neural networks (DNNs) with multi-task learning (MTL). MTL-based DNNs are applicable in musical chord recognition (i.e., recognizing chord and root note in parallel). In addition, we are introducing the concept of chord segmentation. First, we train the neural network to identify the position of the chord unit and then infer harmony progression at the chord level. This approach effectively improves the neural network's expressive power for chord identification and functional harmony analysis.
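
The un-rectifying idea described above for ReLU networks can be illustrated numerically. The sketch below is not the group's implementation. It assumes a bias-free fully connected ReLU network with at least two layers and no max-pooling, and the function names (relu_net, unrectify, spectral_norm_product) are hypothetical. For a given input x, each ReLU is replaced by a 0/1 diagonal matrix that depends on x, so the network output equals W_L C W_1 x with a data-dependent C. Because every such diagonal matrix has spectral norm at most 1, the product of the layers' spectral norms gives a simple, input-independent Lipschitz bound.

```python
import numpy as np

def relu_net(x, weights):
    """Plain forward pass: ReLU after every layer except the last (no biases)."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def unrectify(x, weights):
    """Return (W_1, C, W_L) such that relu_net(x, weights) == W_L @ C @ W_1 @ x.

    C depends on x: it collects the 0/1 diagonal matrices that replace each
    ReLU, interleaved with the intermediate weight matrices.
    """
    W1, WL = weights[0], weights[-1]
    z = W1 @ x
    D = np.diag((z > 0).astype(float))   # relu(z) == D @ z for this z
    C = D
    h = D @ z
    for W in weights[1:-1]:
        z = W @ h
        D = np.diag((z > 0).astype(float))
        C = D @ W @ C                    # accumulate D_k W_k ... D_1
        h = D @ z
    return W1, C, WL

def spectral_norm_product(weights):
    """Input-independent Lipschitz bound: product of layer spectral norms."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

# Small random 3-layer example (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)),
           rng.standard_normal((6, 8)),
           rng.standard_normal((3, 6))]
x = rng.standard_normal(4)

W1, C, WL = unrectify(x, weights)
A = WL @ C @ W1                          # linear map valid on x's polytope

assert np.allclose(relu_net(x, weights), A @ x)
assert np.linalg.norm(A, 2) <= spectral_norm_product(weights) + 1e-9
```

The matrix A changes only when some activation pattern changes, which is one way to see why the input space splits into polytopes on each of which the network acts as a single linear (or, with biases, affine) map.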

Other ongoing research by our group includes multi-modal deep learning for audio-visual speech enhancement and user identification. In
the future, we will develop effective approaches that can discover characteristics of a movie and mine their relationships by combining image
and natural language information.
