Computer System Laboratory
Research Laboratories

Research Faculty
Jan-Jan Wu, Research Fellow (Chair)
Chien-Min Wang, Associate Research Fellow
PeiZong Lee, Research Fellow
Yuan-Hao Chang, Research Fellow

The Computer Systems Lab was established in 2009. Its primary research areas include binary translation, compiler and parallel architecture, deep learning with non-volatile memories, and one-memory computer systems.

I. Auto-parallelism with Dynamic Binary Translation

Parallelization is critical for multicore computing and cloud computing. Hardware manufacturers have adopted many distinct strategies to improve parallelism in microprocessor design, including multi-cores, many-cores, GPUs, GPGPUs, and SIMD (single instruction, multiple data) units, among others. However, these parallel architectures have very different parallel execution models, and several issues arise when migrating applications from one system to another: (1) application developers have to rewrite programs for the target parallel model, increasing time to market; (2) legacy applications become under-optimized due to under-utilization of parallelism in the target hardware, significantly diminishing potential performance gains; and (3) execution migration among heterogeneous architectures is difficult. To overcome these problems, we have developed an efficient and retargetable dynamic binary translator (DBT) that transparently transforms application binaries among different parallel execution models. In its current iteration, the DBT dynamically transforms binaries of short-SIMD loops into equivalent long-SIMD ones, thereby exploiting the wider SIMD lanes of the host. We have shown that SIMD transformation from ARM NEON to x86 AVX2 can improve performance by 45% for a collection of applications while doubling the parallelism factor. We plan to extend the DBT system to support more parallel architectures and execution models.
Figure 1: Dynamic binary translation and parallelism optimization system.
II. AI Compiler for Servers and Embedded Systems

In recent years, deep learning has become a rapidly growing trend in big data analysis and has been successfully applied to various fields. Many deep learning models (e.g., CNN, RNN, LSTM, and GAN) have proven to work very well for recognition of images, natural language, and other data. For this AI domain, we need a solution to efficiently run such deep learning models on a wide diversity of computing architectures. High-end servers may be equipped with powerful computing devices, e.g., a combination of high-end CPUs, GPUs, FPGAs, and AI accelerators, whereas small embedded systems may have only a low-end CPU or DSP and a small memory capacity. Different compilation strategies are required to achieve optimal performance for any configuration of computing device and deep learning model.