Institute of Information Science Academia Sinica
Recent Research Results

Current Research Results

"We Like, We Post: A Joint User-Post Approach for Facebook Post Stance Labeling," IEEE Transactions on Knowledge and Data Engineering, To Appear.
Authors: Wei-Fan Chen and Lun-Wei Ku
Abstract: Web post and user stance labeling is challenging not only because of the informality and variation in language on the Web but also because of the lack of labeled data on fast-emerging new topics; even the labeled data we do have are usually heavily skewed. In this paper, we propose a joint user-post approach for stance labeling to mitigate the latter two difficulties. In labeling post stances, the proposed approach considers post content as well as posting and liking behavior, which involves users. Sentiment analysis is applied to posts to acquire their initial stance, and then the post and user stances are updated iteratively with correlated posting-related actions. The whole process works with few labeled data, which solves the first problem. We use the real interactions between authors and readers for stance labeling. Experimental results show that the proposed approach not only substantially improves content-based post stance labeling, but also achieves better performance for the minor stance class, which solves the second problem.

"Minimizing Write Amplification to Enhance Lifetime of Large-page Flash-Memory Storage Devices," ACM/IEEE Design Automation Conference (DAC), June 2018.
Authors: Wei-Lin Wang, Tseng-Yi Chen, Yuan-Hao Chang, Hsin-Wen Wei, and Wei-Kuan Shih
Abstract: Due to the decreasing endurance of flash chips, the lifetime of flash drives has become a critical issue. To resolve this issue, various techniques such as wear-leveling and error correction codes have been proposed to reduce the bit error rates of a flash drive.
In contrast to these techniques, we observe that minimizing write amplification (or reducing the amount of extra writes to flash chips) is another promising direction to enhance the lifetime of a flash drive. In this work, we propose a partial update strategy to support partial updates to the data in flash pages. It can thus minimize write amplification by updating only the modified part of the data in flash pages, with the support of data reduction techniques. This strategy is orthogonal to wear-leveling and error correction techniques, and thus can cooperate with them to further enhance the lifetime of a flash drive. The results of a series of experiments demonstrate that the proposed strategy can effectively improve the lifetime of a flash drive by reducing write amplification.

"Proactive Channel Adjustment to Improve Polar Code Capability for Flash Storage Devices," ACM/IEEE Design Automation Conference (DAC), June 2018.
Authors: Kun-Cheng Hsu, Che-Wei Tsao, Yuan-Hao Chang, Tei-Wei Kuo, and Yu-Ming Huang
Abstract: Low-density parity-check (LDPC) codes have been very successful at correcting errors in flash storage devices, but their hardware cost and error correction time keep increasing as the error rate of flash memory keeps increasing. In addition to improving the lifetime of devices, researchers are seeking alternative methods. Fortunately, with its low encoding/decoding complexity and high error correction capability, polar code with the support of list-decoding and cyclic redundancy check can outperform LDPC code in the area of data communication. Thus, how to adopt and enable polar codes in storage applications has also drawn considerable attention. However, the code construction and encoding length limitation issues obstruct the adoption of polar codes in flash storage devices.
To enable polar codes in flash storage devices, we propose a proactive channel adjustment design that extends the effective time of a code construction to improve the error correction capability of polar codes. This design proactively tunes the quality of the critical flash cells to maintain the correctness of the code construction and relax the constraint of the encoding length limitation, so that polar codes can be enabled in flash storage devices. A series of experiments was conducted to evaluate the efficacy of the proposed design. The results show that the proposed design can effectively improve the error correction capability of polar codes in flash storage devices.

"Improving SIMD Parallelism via Dynamic Binary Translation," ACM Transactions on Embedded Computing Systems (TECS), February 2018.
Authors: Ding-Yong Hong, Yu-Ping Liu, Sheng-Yu Fu, Jan-Jan Wu, Wei-Chung Hsu
Abstract: Recent trends in SIMD architecture have tended toward longer vector lengths, and more enhanced SIMD features have been introduced in newer vector instruction sets. However, legacy or proprietary applications compiled with a short-SIMD ISA cannot benefit from the long-SIMD architecture that supports improved parallelism and enhanced vector primitives, and so achieve only a small fraction of potential peak performance. This article presents a dynamic binary translation technique that enables short-SIMD binaries to exploit the benefits of new SIMD architectures by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to machine-independent IR, conducts SIMD loop transformation/optimization at this IR level, and finally translates to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment, one for the problem caused by the binary translator's internal translation condition and one general approach using dynamic loop peeling optimization.
Benchmark results show that average speedups of 1.51× and 2.48× are achieved for an ARM NEON to x86 AVX2 and x86 AVX-512 loop transformation, respectively.

"MONPA: Multi-objective Named-entity and Part-of-speech Annotator for Chinese using Recurrent Neural Network," The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), November 2017.
Authors: Yu-Lun Hsieh, Yung-Chun Chang, Yi-Jie Huang, Shu-Hao Yeh, Chun-Hung Chen and Wen-Lian Hsu
Abstract: Part-of-speech (POS) tagging and named entity recognition (NER) are crucial steps in natural language processing. In addition, the difficulty of word segmentation places an extra burden on those who deal with languages such as Chinese, and pipelined systems often suffer from error propagation. This work proposes an end-to-end model using a character-based recurrent neural network (RNN) to jointly accomplish segmentation, POS tagging and NER of a Chinese sentence. Experiments on previous word segmentation and NER competition data sets show that a single joint model using the proposed architecture is comparable to those trained specifically for each task, and outperforms freely available software. Moreover, we provide a web-based interface for the public to easily access this resource.

"Achieving Defect-Free Multilevel 3D Flash Memory with One-Shot Program Design," ACM/IEEE Design Automation Conference (DAC), June 2018.
Authors: Chien-Chung Ho, Yung-Chun Li, Yuan-Hao Chang, and Yu-Ming Chang
Abstract: The rapid growth of data volume for various applications demands a high memory capacity, and multi-level-cell technology, which stores multiple bits in a single cell, is a very popular way to satisfy this requirement, as in multi-level-cell (MLC) and triple-level-cell (TLC) flash memories.
To store the desired data on MLC and TLC flash memories, the conventional programming strategies need to divide a fixed range of threshold voltage ($V_{t}$) window into several parts. The narrowly partitioned $V_{t}$ window in turn limits the design of the programming strategy and becomes the main cause of flash-memory defects, i.e., longer read/write latency and worse data reliability. This motivates us to explore an innovative programming design to resolve these flash-memory defects. Thus, to achieve defect-free 3D NAND flash memory, this paper presents and realizes a one-shot program design to largely eliminate the negative impacts caused by conventional programming strategies. The proposed one-shot program design includes two strategies, i.e., prophetic and classification programming, for MLC flash memories, and the idea is extended to TLC flash memories. The measurement results show that it can accelerate programming speed by 31x and reduce RBER by 1000x for the MLC flash memory, and it can broaden the available window of threshold voltage up to 5.1x for the TLC flash memory.

"Improving Runtime Performance of Deduplication System with Host-Managed SMR Storage Drives," ACM/IEEE Design Automation Conference (DAC), June 2018.
Authors: Chun-Feng Wu, Ming-Chang Yang, and Yuan-Hao Chang
Abstract: Due to cost considerations for data storage, high-areal-density shingled-magnetic-recording (SMR) drives and data deduplication techniques are becoming popular in many data storage services to improve the profit per storage unit. However, naively applying deduplication techniques on SMR drives may dramatically degrade the runtime performance of data storage services, because of the time-consuming SMR space reclamation processes.
This work advocates a vertical integration solution that jointly manages the host-managed SMR drives with the deduplication system, in order to fundamentally relieve the time-consuming SMR space reclamation issue. The proposed design was evaluated by a series of realistic deduplication workloads with encouraging results.

"Enabling Union Page Cache to Boost File Access Performance of NVRAM-Based Storage Devices," ACM/IEEE Design Automation Conference (DAC), June 2018.
Authors: Shuo-Han Chen, Tseng-Yi Chen, Yuan-Hao Chang, Hsin-Wen Wei, and Wei-Kuan Shih
Abstract: Due to its fast access performance, byte-addressability, and non-volatility, phase-change memory (PCM) is becoming a popular candidate in the design of memory/storage systems of embedded systems. When it is considered as both main memory and storage in an embedded system, existing page cache mechanisms, which were designed to hide the performance gap between main memory and secondary storage, turn out to introduce too many unnecessary data movements between main memory and storage. To resolve this issue, we propose the concept of "union page cache," which jointly manages data of the page cache in both main memory and storage. To realize this concept, a partial page cache strategy is designed to consider both main memory and storage as its management space. By utilizing the fact that both main memory and storage residing in the same PCM device share the same address space, this strategy can minimize unnecessary data movement between main memory and storage without sacrificing the data consistency of file systems. A series of experiments was conducted on an embedded system evaluation board. The results show that the proposed strategy can improve the file access performance over the conventional page cache mechanism by 77.68

"DART: a fast and accurate RNA-seq mapper with a partitioning strategy," Bioinformatics, January 2018.
Authors: Hsin-Nan Lin and Wen-Lian Hsu
Abstract: Motivation: In recent years, massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool that provides high-resolution measurement of expression and high sensitivity in detecting low-abundance transcripts. However, RNA-seq data requires a huge amount of computational effort. The most fundamental and critical step is to align each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Though these aligners can handle spliced alignment and detect splice junctions, some challenges remain to be solved. With the advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are in high demand. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches for sequence alignment. However, the extension is much more time consuming than the seeding step. Results: We proposed a novel RNA-seq de novo mapping algorithm, called DART, which adopts a partitioning strategy to avoid the extension step. The experimental results on synthetic datasets and real NGS datasets showed that DART is a highly efficient aligner that yields the highest or comparable sensitivity and accuracy compared to most state-of-the-art aligners, and more importantly, it spends the least amount of time among the selected aligners.

"Functional Characteristics of the Flying Squirrel's Cecal Microbiota under a Leaf-Based Diet, Based on Multiple Meta-Omic Profiling," Frontiers in Microbiology, January 2018.
Authors: Hsiao-Pei Lu, Po-Yu Liu, Yu-bin Wang, Ji-Fan Hsieh, Han-Chen Ho, Shiao-Wei Huang, Chung-Yen Lin, Chih-hao Hsieh, Hon-Tsen Yu
Abstract: Mammalian herbivores rely on microbial activities in an expanded gut chamber to convert plant biomass into absorbable nutrients.
Distinct from ruminants, small herbivores typically have a simple stomach but an enlarged cecum to harbor symbiotic microbes; however, knowledge of this specialized gut structure and characteristics of its microbial contents is limited. Here, we used leaf-eating flying squirrels as a model to explore functional characteristics of the cecal microbiota adapted to a high-fiber, toxin-rich diet. Specifically, environmental conditions across gut regions were evaluated by measuring mass, pH, feed particle size, and metabolomes. Then, parallel metagenomes and metatranscriptomes were used to detect microbial functions corresponding to the cecal environment. Based on metabolomic profiles, >600 phytochemical compounds were detected, although many were present only in the foregut and probably degraded or transformed by gut microbes in the hindgut. Based on metagenomic (DNA) and metatranscriptomic (RNA) profiles, taxonomic compositions of the cecal microbiota were dominated by bacteria of the Firmicutes taxa; they contained major gene functions related to degradation and fermentation of leaf-derived compounds. Based on functional compositions, genes related to multidrug exporters were rich in microbial genomes, whereas genes involved in nutrient importers were rich in microbial transcriptomes. In addition, genes encoding chemotaxis-associated components and glycoside hydrolases specific for plant beta-glycosidic linkages were abundant in both DNA and RNA. This exploratory study provides findings which may help to form molecular-based hypotheses regarding functional contributions of symbiotic gut microbiota in small herbivores with folivorous dietary habits.

"Singing voice correction using canonical time warping," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018.
Authors: Yin-Jyun Luo, Ming-Tso Chen, Tai-Shih Chi, and Li Su
Abstract: Expressive singing voice correction is an appealing but challenging problem.
A robust time-warping algorithm which synchronizes two singing recordings can provide a promising solution. We therefore propose to address the problem with canonical time warping (CTW), which aligns amateur singing recordings to professional ones. A new pitch contour is generated given the alignment information, and a pitch-corrected singing voice is synthesized back through the vocoder. The objective evaluation shows that CTW is robust against pitch-shifting and time-stretching effects, and the subjective test demonstrates that CTW prevails over the other methods, including DTW and commercial auto-tuning software. Finally, we demonstrate the applicability of the proposed method in a practical, real-world scenario.

"Automatic music transcription leveraging generalized cepstral features and deep learning," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018.
Authors: Yu-Te Wu, Berlin Chen, and Li Su
Abstract: Spectral features are limited in modeling musical signals with multiple concurrent pitches due to the challenge of suppressing the interference over the harmonic peaks from one pitch to another. In this paper, we show that using multiple features represented in both the frequency and time domains with deep learning modeling can reduce such interference. These features are derived systematically from conventional pitch detection functions that relate to one another through the Fourier transform and a nonlinear scaling function. Neural networks modeled with these features outperform state-of-the-art methods while using less training data.

"Vocal melody extraction using patch-based CNN," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018.
Authors: Li Su
Abstract: The patch-based convolutional neural network (CNN) model presented in this paper for vocal melody extraction in polyphonic music is inspired by object detection in image processing.
The input of the model is a novel time-frequency representation which enhances the pitch contours and suppresses the harmonic components of a signal. This succinct data representation and the patch-based CNN model enable an efficient training process with limited labeled data. Experiments on various datasets show excellent speed and competitive accuracy compared to other deep learning approaches.

"How sampling rate affects cross-domain transfer learning for video description," IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2018.
Authors: Y. S. Chou, P. H. Hsiao, S. D. Lin, and H. Y. Mark Liao
Abstract: Translating video to language is very challenging due to diverse video content originating from multiple activities and the complicated integration of spatio-temporal information. There are two urgent issues associated with the video-to-language translation problem. First, how to transfer knowledge learned from a more general dataset to a specific application domain dataset? Second, how to generate stable video captioning (or description) results under different sampling rates? In this paper, we propose a novel temporal embedding method to better retain temporal representation under different video sampling rates. We present a transfer learning method that combines a stacked LSTM encoder-decoder structure and a temporal embedding learning with soft-attention (TELSA) mechanism. We evaluate the proposed approach on two public datasets, MSR-VTT and MSVD. The promising experimental results confirm the effectiveness of the proposed approach.

"Low precision deep learning training on mobile heterogeneous platform," 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2018), March 2018.
Authors: Olivier Valery, Pangfeng Liu, Jan-Jan Wu
Abstract: Recent advances in System-on-Chip architectures have made the use of deep learning suitable for a number of applications on mobile devices. Unfortunately, due to the computational cost of neural network training, it is often limited to inference tasks, e.g., prediction, on mobile devices. In this paper, we propose a deep learning framework that enables both deep learning training and inference tasks on mobile devices. While accommodating the heterogeneity of computing device technology on mobile devices, it also uses OpenCL to efficiently leverage modern SoC capabilities, e.g., multi-core CPU, integrated GPU and shared memory architecture, and accelerate deep learning computation. In addition, our system encodes the arithmetic operations of deep networks down to 8-bit fixed-point on mobile devices. As a proof of concept, we trained three well-known neural networks on mobile devices and observed significant performance gains, reduced energy consumption, and memory savings.

"Workload Prediction and Balance for Distributed Reachability Processing for Attribute Graphs," Concurrency and Computation: Practice and Experience, To Appear.
Authors: Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu
Abstract: Reachability query with label constraint in an attribute graph is one of the most fundamental and important operations in semantic network analysis. However, ever-growing graph size has resulted in intractable reachability problems on single machines. This work aims to devise efficient solutions for the reachability with label constraint problem in an attribute graph in a distributed environment. We focus on two issues in distributed processing, data locality and workload balancing, since data locality reduces communication overhead and workload balancing improves the efficiency of cluster use.
We propose three novel techniques to address the two issues: (1) a partition replication method that improves data locality while conserving community property, (2) a workload-prediction method that accurately predicts machine workloads for a given query, and (3) a workload balancing method that uses these predictions to shift partial workloads among machines to produce a balanced workload. Experimental results suggest that these techniques significantly improve performance and reduce total execution time by 40%.

"Automatic Image Cropping for Visual Aesthetic Enhancement Using Deep Neural Networks and Cascaded Regression," IEEE Transactions on Multimedia, To Appear.
Authors: Guanjun Guo, Hanzi Wang, Chunhua Shen, Yan Yan, and Hong-Yuan Mark Liao
Abstract: Despite recent progress, computational visual aesthetics is still challenging. Image cropping, which refers to the removal of unwanted scene areas, is an important step to improve the aesthetic quality of an image. However, it is challenging to evaluate whether cropping leads to aesthetically pleasing results because the assessment is typically subjective. In this paper, we propose a novel cascaded cropping regression (CCR) method to perform image cropping by learning the knowledge from professional photographers. The proposed CCR method improves the convergence speed of the cascaded method, which directly uses random-ferns regressors. In addition, a two-step learning strategy is proposed and used in the CCR method to address the lack of labelled cropping data. Specifically, a deep convolutional neural network (CNN) classifier is first trained on large-scale visual aesthetic datasets. The deep CNN model is then designed to extract features from several image cropping datasets, upon which the cropping bounding boxes are predicted by the proposed CCR method.
Experimental results on public image cropping datasets demonstrate that the proposed method significantly outperforms several state-of-the-art image cropping methods.

"Leveraging Linguistic Structures for Named Entity Recognition with Bidirectional Recursive Neural Networks," International Conference on EMNLP, September 2017.
Authors: Peng-Hsuan Li, Ruo-Ping Dong, Yu-Siang Wang, Ju-Chieh Chou, Wei-Yun Ma
Abstract: In this paper, we utilize the linguistic structures of texts to improve named entity recognition by BRNN-CNN, a special bidirectional recursive network attached with a convolutional network. Motivated by the observation that named entities are highly related to linguistic constituents, we propose a constituent-based BRNN-CNN for named entity recognition. In contrast to classical sequential labeling methods, the system first identifies which text chunks are possible named entities by whether they are linguistic constituents. Then it classifies these chunks with a constituency tree structure by recursively propagating syntactic and semantic information to each constituent node. This method surpasses the current state of the art on OntoNotes 5.0 with automatically generated parses.

"Forecasting Participants of Information Diffusion on Social Networks with Its Applications," Information Sciences, January 2018.
Authors: Cheng-Te Li, Yu-Jen Lin, and Mi-Yen Yeh
Abstract: Social networking services allow users to adopt and spread information via diffusion actions, e.g., share, retweet, and reply. Real applications such as viral marketing and trending topic detection rely on information diffusion. Given past items with diffusion records on a social network, this paper aims at forecasting who will participate in the diffusion of a new item c (we use hashtags in the paper) with its k earliest adopters, without using content and profile information, i.e., finding which users will adopt c in the future.
We define the Diffusion Participation Forecasting (DPF) problem, which is challenging since all users except for early adopters can be candidates, compared to existing studies that predict which one-layer followers will adopt a new hashtag given past diffusion observations with content and profile information. To solve the DPF problem, we propose an Adoption-based Participation Ranking (APR) model, which aims to rank the actual participants at higher positions. The model has two components: the first estimates the adoption probability of a new hashtag for each user, while the second is a random walk-based model that incorporates nodes with higher adoption probability values and early adopters to generate the forecasted participants. Experiments conducted on Twitter show that our model can significantly outperform several competing methods in terms of Precision and Recall. Moreover, we demonstrate that an accurate DPF can be applied for effective targeted marketing using influence maximization and for boosting the accuracy of popularity prediction in social media.

"Non-overlapping Subsequence Matching of Stream Synopses," IEEE Trans. on Knowledge and Data Mining, January 2018.
Authors: Su-Chen Lin, Mi-Yen Yeh, and Ming-Syan Chen
Abstract: In this paper, we propose the SUbsequence Matching framework with cell MERgence (SUMMER) for online subsequence matching between histogram-based stream synopsis structures under the dynamic time warping distance. Given a query synopsis pattern, SUMMER continuously identifies all the matching subsequences for a stream as the bins are generated. To effectively reduce the computation time, we design a Weighted Dynamic Time Warping (WDTW) algorithm, which computes the warping distance directly between two histogram-based synopses. Furthermore, a Stack-based Overlapping Filter Algorithm (SOFA) is provided to remove the overlapping subsequences and avoid redundant information.
Finally, we design an optional refinement module to relax the subsequence range limit and improve the matching accuracy. Our experiments on real datasets show that the proposed method significantly speeds up the pattern matching without compromising the accuracy required when compared with other approaches.

"PRUNE: Preserving Proximity and Global Ranking for Node Embedding," The 31st Annual Conference on Neural Information Processing Systems (NIPS-2017), December 2017.
Authors: Yi-An Lai, Chin-Chi Hsu, Wen-Hao Chen, Mi-Yen Yeh, and Shou-De Lin
Abstract: We investigate an unsupervised generative approach for network embedding. A multi-task Siamese neural network structure is formulated to connect embedding vectors and our objective to preserve the global node ranking and local proximity of nodes. We provide deeper analysis to connect the proposed proximity objective to link prediction and community detection in the network. We show our model can satisfy the following design properties: scalability, asymmetry, unity and simplicity. Experimental results not only verify the above design properties but also demonstrate superior performance in learning-to-rank, classification, regression, and link prediction tasks.

"Decoding the effect of isobaric substitutions on identifying missing proteins and variant peptides in human proteome," Journal of Proteome Research, December 2017.
Authors: Wai-Kok Choong, T. Mamie Lih, Yu-Ju Chen, Ting-Yi Sung
Abstract: To confirm the existence of missing proteins, we need to identify at least two unique peptides with a length of 9–40 amino acids of a missing protein in bottom-up mass-spectrometry-based proteomic experiments.
However, an identified unique peptide of the missing protein, even if identified with a high level of confidence, could possibly coincide with a peptide of a commonly observed protein due to isobaric substitutions, mass modifications, alternative splice isoforms, or single amino acid variants (SAAVs). Besides unique peptides of missing proteins, identified variant peptides (SAAV-containing peptides) could also alternatively map to peptides of other proteins due to the aforementioned issues. Therefore, we conducted a thorough comparative analysis on data sets in PeptideAtlas Tiered Human Integrated Search Proteome (THISP, 2017-03 release), including neXtProt (2017-01 release), to systematically investigate the possibility of unique peptides in missing proteins (PE2–4), unique peptides in dubious proteins, and variant peptides affected by isobaric substitutions, causing doubtful identification results. In this study, we considered 11 isobaric substitutions. From our analysis, we found <5% of the unique peptides of missing proteins and >6% of variant peptides became shared with peptides of PE1 proteins after isobaric substitutions.

"iTop-Q: an intelligent tool for top-down proteomics quantitation using DYAMOND algorithm," Analytical Chemistry, December 2017.
Authors: Hui-Yin Chang, Ching-Tai Chen, Chu-Ling Ko, Yi-Ju Chen, Yu-Ju Chen, Wen-Lian Hsu, Chiun-Gung Juo, Ting-Yi Sung
Abstract: Top-down proteomics using liquid chromatography coupled with mass spectrometry has been increasingly applied for analyzing intact proteins to study genetic variation, alternative splicing, and post-translational modifications (PTMs) of the proteins (proteoforms). However, only a few tools have been developed for charge state deconvolution, monoisotopic/average molecular weight determination and quantitation of proteoforms from LC-MS1 spectra.
Though Decon2LS and MASH Suite Pro have been available to provide intra-spectrum charge state deconvolution and quantitation, manual processing is still required to quantify proteoforms across multiple MS1 spectra. An automated tool for inter-spectrum quantitation is a pressing need. Thus, in this paper, we present a user-friendly tool, called iTop-Q (intelligent Top-down Proteomics Quantitation), that automatically performs large-scale proteoform quantitation based on inter-spectrum abundance in top-down proteomics. Instead of utilizing a single spectrum for proteoform quantitation, iTop-Q constructs extracted ion chromatograms (XICs) of possible proteoform peaks across adjacent MS1 spectra to calculate abundances for accurate quantitation. Notably, iTop-Q is implemented with a newly proposed algorithm, called DYAMOND, using dynamic programming for charge state deconvolution. In addition, iTop-Q performs proteoform alignment to support quantitation analysis across replicates/samples. The performance evaluations on an in-house standard data set and a public large-scale yeast lysate data set show that iTop-Q achieves highly accurate quantitation that is more consistent than intra-spectrum quantitation. Furthermore, the DYAMOND algorithm is suitable for high charge state deconvolution and can distinguish shared peaks in co-eluting proteoforms. iTop-Q is publicly available for download at http://ms.iis.sinica.edu.tw/COmics/Software_iTop-Q.

"Aesthetic Critiques Generation for Photos," International Conference on Computer Vision, ICCV 2017, October 2017.
Authors: Kuang-Yu Chang, Kung-Hung Lu, and Chu-Song Chen
Abstract: It is said that a picture is worth a thousand words. Thus, there are various ways to describe an image, especially in aesthetic quality analysis. Although aesthetic quality assessment has generated a great deal of interest in the last decade, most studies focus on providing a quality rating of good or bad for an image.
In this work, we extend the task to produce captions related to photo aesthetics and/or photography skills. To the best of our knowledge, this is the first study that deals with aesthetics captioning instead of aesthetic quality (AQ) scoring. In contrast to common image captioning tasks that depict the objects or their relations in a picture, our approach can select a particular aesthetics aspect and generate captions with respect to the aspect chosen. Meanwhile, the proposed aspect-fusion method further uses an attention mechanism to generate more abundant aesthetics captions. We also introduce a new dataset for aesthetics captioning called the Photo Critique Captioning Dataset (PCCD), which contains pair-wise image-comment data from professional photographers. The results of experiments on PCCD demonstrate that our approaches outperform existing methods for generating aesthetic-oriented captions for images. Current Research Results "IsoPlot: a database for comparison of mRNA isoform variations in the fruit fly and mosquitoes," Database, August 2017. Authors: Ng, I.M., Huang, J.H., Tsai, S.C., and Tsai, H.K.* Abstract: Alternative splicing (AS), a mechanism by which different forms of mature messenger RNAs (mRNAs) are generated from the same gene, occurs widely in metazoan genomes. Knowledge about isoform variants and abundance is crucial for understanding the functional context in the molecular diversity of the species. With increasing transcriptome data of model and non-model species, a database for visualization and comparison of AS events with up-to-date information is needed for further research. IsoPlot is a publicly available database with visualization tools for exploration of AS events, including three major species of mosquitoes, Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus, and fruit fly Drosophila melanogaster, the model insect species. 
IsoPlot includes not only 88,663 annotated transcripts but also 17,037 newly predicted transcripts from massive transcriptome data at different developmental stages of mosquitoes. The web interface enables users to explore the patterns and abundance of isoforms in different experimental conditions as well as cross-species sequence comparison of orthologous transcripts. IsoPlot provides a platform for researchers to access comprehensive information about AS events in mosquitoes and fruit fly. Our database is available on the web via an interactive user interface with an intuitive graphical design, which is applicable for the comparison of complex isoforms within or between species. Database URL: http://isoplot.iis.sinica.edu.tw/ Current Research Results "Kart: a divide-and-conquer algorithm for NGS read alignment," Bioinformatics, August 2017. Authors: Hsin-Nan Lin and Wen-Lian Hsu Abstract: Motivation: Next-generation sequencing (NGS) provides a great opportunity to investigate genome-wide variation at nucleotide resolution. Due to the huge amount of data, NGS applications require very fast and accurate alignment algorithms. Most existing algorithms for read mapping adopt a seed-and-extend strategy, which is sequential in nature and takes much longer on longer reads. Results: We develop a divide-and-conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently. Our experimental results indicate that the average size of fragments requiring the more time-consuming gapped alignment is around 20 bp regardless of the original read length. Furthermore, it can tolerate much higher error rates. The experiments show that Kart spends much less time on longer reads than other aligners and still produces reliable alignments even when the error rate is as high as 15%. 
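The divide-and-conquer idea in the Kart abstract can be illustrated with a small sketch. This is not Kart's implementation: it assumes the read has already been placed against a candidate reference window of similar length, uses Python's `difflib.SequenceMatcher` to find exact matches where a real aligner would use an index-based seeding step, and scores the leftover gaps with a plain Levenshtein distance. The names `split_read`, `align_cost`, and `min_anchor` are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming (stand-in for gapped alignment)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete a base from the read
                            curr[j - 1] + 1,            # insert a base into the read
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def split_read(read: str, window: str, min_anchor: int = 5):
    """Split a read into exact-match fragments and the gap fragments between them.

    Returns (kind, read_fragment, window_fragment) tuples, kind being
    'exact' or 'gap'. Exact matches shorter than min_anchor (an assumed
    threshold) are not trusted and are absorbed into the neighbouring gap.
    """
    matcher = SequenceMatcher(None, read, window, autojunk=False)
    fragments = []
    r_pos = w_pos = 0
    for block in matcher.get_matching_blocks():
        is_sentinel = block.size == 0  # final dummy block marks the sequence ends
        if block.size < min_anchor and not is_sentinel:
            continue
        if block.a > r_pos or block.b > w_pos:
            fragments.append(("gap", read[r_pos:block.a], window[w_pos:block.b]))
        if not is_sentinel:
            fragments.append(("exact",
                              read[block.a:block.a + block.size],
                              window[block.b:block.b + block.size]))
        r_pos, w_pos = block.a + block.size, block.b + block.size
    return fragments


def align_cost(read: str, window: str, min_anchor: int = 5) -> int:
    """Sum the alignment cost of each gap fragment, each handled independently."""
    return sum(edit_distance(r, w)
               for kind, r, w in split_read(read, window, min_anchor)
               if kind == "gap")


if __name__ == "__main__":
    window = "ACGTTGCAAGCTTAGGCATC"
    read = window[:10] + "A" + window[11:]  # one substituted base
    for kind, r, w in split_read(read, window):
        print(kind, r, w)
    print("cost:", align_cost(read, window))
```

Only the gap fragments go through the quadratic-time dynamic program, so in this toy setting the expensive work depends on the size of the gaps and not on the length of the read. Because each gap is scored on its own, the gaps could in principle be processed in parallel.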
Current Research Results "Phosphoproteomics reveals HMGA1, a CK2 substrate, as a drug-resistant target in non-small cell lung cancer," Scientific Reports, March 2017. Authors: Yi-Ting Wang, Szu-Hua Pan, Chia-Feng Tsai, Ting-Chun Kuo, Yuan-Ling Hsu, Hsin-Yung Yen, Wai-Kok Choong, Hsin-Yi Wu, Yen-Chen Liao, Tse-Ming Hong, Ting-Yi Sung, Pan-Chyr Yang, and Yu-Ju Chen Abstract: Although EGFR tyrosine kinase inhibitors (TKIs) have demonstrated good efficacy in non-small-cell lung cancer (NSCLC) patients harboring EGFR mutations, most patients develop intrinsic and acquired resistance. We quantitatively profiled the phosphoproteome and proteome of drug-sensitive and drug-resistant NSCLC cells under gefitinib treatment. The construction of a dose-dependent responsive kinase-substrate network of 1548 phosphoproteins and 3834 proteins revealed CK2-centric modules as the dominant core network for the potential gefitinib resistance-associated proteins. CK2 knockdown decreased cell survival in gefitinib-resistant NSCLCs. Using motif analysis to identify the CK2 core sub-network, we verified that the elevated phosphorylation level of a CK2 substrate, HMGA1, was a critical node contributing to EGFR-TKI resistance in NSCLC cells. Both HMGA1 knockdown and mutation of the CK2 phosphorylation site, S102, of HMGA1 reinforced the efficacy of gefitinib in resistant NSCLC cells through reactivation of the downstream signaling of EGFR. Our results delineate the TKI resistance-associated kinase-substrate network, suggesting a potential therapeutic strategy for overcoming TKI-induced resistance in NSCLC. Current Research Results "ADF: an Anomaly Detection Framework for Large-scale PM2.5 Sensing Systems," IEEE Internet of Things Journal, To Appear. Authors: Ling-Jyh Chen, Yao-Hua Ho, Hsin-Hung Hsieh, Shih-Ting Huang, Hu-Cheng Lee, and Sachit Mahajan Abstract: As population density continues to grow in urban settings, air quality is degrading and becoming a serious issue. 
Air pollution, especially fine particulate matter (PM2.5), has raised a series of concerns for public health. As a result, a number of large-scale, low-cost PM2.5 monitoring systems have been deployed in several international smart city projects. One of the major challenges for such environmental sensing systems is ensuring the data quality. In this paper, we propose an Anomaly Detection Framework (ADF) for large-scale, real-world environmental sensing systems. The framework is comprised of four modules: 1) Time-Sliced Anomaly Detection (TSAD), which detects Spatial, Temporal, and Spatio-temporal anomalies in the real-time sensor measurement data stream; 2) Real-time Emission Detection (RED), which detects potential regional emission sources; 3) Device Ranking (DR), which provides a ranking for each sensing device; and 4) Malfunction Detection (MD), which identifies malfunctioning devices. Using real-world measurement data from the AirBox project, we demonstrate that the proposed framework can effectively identify outliers in the raw measurement data as well as infer anomalous events that are perceivable by the general public and government authorities. Because of its simple design, ADF is highly extensible to other advanced applications, and it can be exploited to support various large-scale environmental sensing systems. Current Research Results "A Gene Profiling Deconvolution Approach to Estimate Immune Cell Composition from Complex Tissues," BMC Bioinformatics, To Appear. Authors: Shu-Hwa Chen, Wen-Yu Kuo, Sheng-Yao Su, Wei-Chun Chung, Jen-Ming Ho, Henry Horng-Shing Lu, Chung-Yen Lin Abstract: A newly emerged cancer treatment utilizes the intrinsic immune surveillance mechanism that is silenced by malignant cells. Hence, studies of tumor infiltrating lymphocyte populations (TILs) are key to the success of advanced treatments. 
In addition to laboratory methods such as immunohistochemistry and flow cytometry, in silico gene expression deconvolution methods are available for analyses of relative proportions of immune cell types. Herein, we used microarray data from the public domain to profile gene expression patterns of twenty-two immune cell types. Initially, outliers were detected based on the consistency of gene profiling clustering results and the original cell phenotype notation. Subsequently, we filtered out genes that are expressed in non-hematopoietic normal tissues and cancer cells. For every pair of immune cell types, we ran t-tests for each gene, and defined differentially expressed genes (DEGs) from this comparison. Equal numbers of DEGs were then collected as candidate lists and numbers of conditions and minimal values for building signature matrices were calculated. Finally, we used ν-Support Vector Regression to construct a deconvolution model. The performance of our system was evaluated using blood biopsies from 20 adults, in which 9 immune cell types were identified using flow cytometry. The present computations performed better than current state-of-the-art deconvolution methods. We also implemented the proposed method in R and tested extensibility and usability on Windows, MacOS, and Linux operating systems. The method, MySort, is wrapped as a pluggable tool for the Galaxy platform, and usage details are available at https://testtoolshed.g2.bx.psu.edu/view/moneycat/mysort/e3afe097e80a. Current Research Results "Identifying Protein-protein Interactions in Biomedical Literature using Recurrent Neural Networks with Long Short-Term Memory," The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), November 2017. Authors: Yu-Lun Hsieh, Yung-Chun Chang, Nai-Wen Chang and Wen-Lian Hsu Abstract: Accurate identification of protein-protein interaction (PPI) helps biomedical researchers to quickly capture crucial information in the literature. 
This work proposes a recurrent neural network (RNN) model to identify PPIs. Experiments on the two largest public benchmark datasets, AIMed and BioInfer, demonstrate that RNN outperforms state-of-the-art methods with relative improvements of 10% and 18%, respectively. Cross-corpus evaluation also indicates that RNN is robust even when trained on data from different domains. These results suggest that RNN effectively captures semantic relationships among proteins without any feature engineering. Current Research Results "Comparative genomic analyses highlight the contribution of pseudogenized protein-coding genes to human lincRNAs," BMC Genomics, October 2017. Authors: Liu, W.H., Tsai, Z. T., and Tsai, H. K.* Abstract: Background The regulatory roles of long intergenic noncoding RNAs (lincRNAs) in humans have been revealed through the use of advanced sequencing technology. Recently, three possible scenarios of lincRNA origins have been proposed: de novo origination from intergenic regions, duplication from other long noncoding RNAs, and pseudogenization from protein-coding genes. The first two scenarios are largely studied and supported, yet few studies have focused on the evolution from pseudogenized protein-coding sequence to lincRNA. Due to the non-mutually exclusive nature of these three scenarios and the need for systematic investigation of lincRNA origination, we conducted a comparative genomics study to investigate the evolution of human lincRNAs. Results Combining syntenic analysis with a stringent Blastn e-value cutoff, we found that the majority of lincRNAs are aligned to intergenic regions of other species. Interestingly, 193 human lincRNAs could have protein-coding orthologs in at least two of nine vertebrates. Transposable elements in these conserved regions of the human genome are far fewer than expected. Moreover, 19% of these lincRNAs have overlaps with or are close to pseudogenes in the human genome. 
Conclusions We suggest that a notable portion of lincRNAs could be derived from pseudogenized protein-coding genes. Furthermore, based on our computational analysis, we hypothesize that a subset of these lincRNAs could have the potential to regulate their paralogs by functioning as competing endogenous RNAs. Our results provide evolutionary evidence of the relationship between human lincRNAs and protein-coding genes.