Institute of Information Science
Data Management and Information Discovery Laboratory
Principal Investigators:

Meng-Chang Chen (Chair), Ming-Syan Chen, Hong-Yuan Liao, Mi-Yen Yeh, Yuan-Hao Chang, Chun-Nan Hsu, De-Nian Yang

[ Group Profile ]
In the era of data explosion, data of various types (e.g., sensor data, trajectory data, transaction data, multimedia data, and Web browsing data) are generated at an increasing rate. With hardware and network resources now abundant and inexpensive, there has never been a better time to explore emerging opportunities to use these data to enhance existing applications or create new ones. The Data Management and Information Discovery Group was therefore formed with the main objectives of initiating innovative research and strengthening scientific and technological excellence in (1) the effective collection, representation, storage, and processing of massive data, and (2) data mining technologies that discover valuable knowledge efficiently and effectively from various types of data. Currently, the research of this group focuses on the following categories: (1) Time Series Data Analysis and Mining, (2) Social Network Analysis and Query Processing, (3) Location-based Data Collection and Application Deployment Platform, and (4) Data Centric Storage System Designs. The research project descriptions are as follows.
1. Time Series Data Analysis and Mining
A time series is a sequence of data points observed at consecutive time instants, spaced at uniform or non-uniform intervals. Examples include hourly readings from many sensors, daily stock trading data in financial markets, and GPS traces of moving objects. By analyzing and mining time series data, we aim to capture the characteristics of the data and discover interesting knowledge for developing further services and applications. The technical challenge is that the data are growing, high-dimensional, and huge in volume, and arrive as streams; the main task is therefore to develop algorithms with high processing efficiency that still provide high-quality results. Because many types of data can be modeled as time series, our techniques apply to many applications. For example, co-evolving trends mined from stock data can be provided to program traders as decision support, and the movement behavior learned from huge collections of GPS trajectories of people and vehicles is useful for developing location-based services or for urban planning. We have designed offline and online clustering algorithms for multiple streams, as well as similarity search algorithms that work within one time series stream or across multiple streams under constraints such as distributed streams, data with uncertain noise, and various distance measures. We have also designed trajectory mining and search algorithms to acquire knowledge from huge collections of historical trajectories.
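To make the idea of similarity search under a chosen distance measure concrete, the following is a minimal Python sketch. It is not the group's algorithm: it uses dynamic time warping (DTW), a widely used time series distance that the text above does not name, and a naive sliding-window scan. The function names `dtw_distance` and `best_match` are hypothetical and chosen only for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Illustrative choice of distance measure (an assumption, not taken
    from the project description).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(stream, query):
    """Return (offset, distance) of the window in `stream` closest to `query`.

    Naive scan over every window of len(query); returns (None, inf) when
    the stream is shorter than the query.
    """
    w = len(query)
    best = (None, inf)
    for start in range(len(stream) - w + 1):
        d = dtw_distance(stream[start:start + w], query)
        if d < best[1]:
            best = (start, d)
    return best
```

The scan recomputes a full DTW table for every window, which is exactly the cost that efficient stream algorithms try to avoid through pruning, indexing, or incremental updates.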
2. Social Network Analysis and Query Processing
Analyzing a large social network is a challenging problem because enumerating all possible graph patterns is expensive and intractable. Many existing graph analysis methods are designed for homogeneous social networks. In contrast, the major challenge in analyzing heterogeneous social networks comes from the multiple types of roles associated with the nodes, while the link relationships are also allowed to differ. Query processing and optimization in social networks, meanwhile, are still in their infancy. Finding a solution that satisfies multiple constraints in a huge social network within limited time is difficult because of the complicated network structure and the parameters associated with nodes and links. Observing that graph patterns are essential for social services and applications, we have identified unique characteristics of heterogeneous networks, such as node/link type distributions, and studied how well existing sampling algorithms, such as random-based and exploration-based ones, capture these characteristics. Our goal is to design adaptive sampling algorithms that efficiently identify heterogeneous graph patterns and network characteristics; role-based information, such as role-based community detection, will also be examined. Noting the growing importance of social queries, which have the potential to be very useful in various social applications, we have proposed a new social query that automatically identifies a group of mutually familiar individuals and finds a time slot in which they are all available. The query is issued by an initiator who specifies the group size, the activity length, and an acquaintance parameter that can be set appropriately for different kinds of activities. We will continue to formulate new query problems and to design efficient query optimization algorithms and techniques that find optimal or approximate solutions in a short time.
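The social query described above can be illustrated with a brute-force Python sketch. Several details are assumptions made only for this example: the acquaintance parameter `k` is interpreted as the maximum number of other group members that any member may be unacquainted with, availability is modeled as a set of integer time slots, and the function name `find_group` and the data layout are hypothetical. The sketch returns the first feasible group it finds and does not attempt any optimization.

```python
from itertools import combinations


def find_group(friends, free, initiator, size, length, k):
    """Return (members, start_slot) for the first feasible group, or None.

    friends: dict mapping person -> set of acquaintances (assumed symmetric)
    free:    dict mapping person -> set of integer slots the person is available
    size:    group size, including the initiator
    length:  number of consecutive slots the activity needs
    k:       assumed meaning: each member may be unacquainted with at most
             k other members of the group
    """
    others = sorted(p for p in friends if p != initiator)
    for rest in combinations(others, size - 1):
        group = (initiator,) + rest
        # Acquaintance constraint: count strangers inside the group per member.
        too_unfamiliar = any(
            sum(1 for q in group if q != p and q not in friends[p]) > k
            for p in group
        )
        if too_unfamiliar:
            continue
        # Temporal constraint: `length` consecutive slots shared by everyone.
        common = set.intersection(*(free[p] for p in group))
        for start in sorted(common):
            if all(start + i in common for i in range(length)):
                return group, start
    return None
```

Enumerating every candidate group grows combinatorially with the network size, which is why dedicated query optimization techniques are needed for large social networks.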
3. Location-based Data Collection and Application Deployment Platform
Location-based data contains useful information that can be mined to support or enhance various applications, or to solve difficult location-based problems. However, it is difficult to collect a large volume of such data from ordinary users. In this research project, we proposed the PLASH platform, which is designed to help location-based service (LBS) providers deploy their applications conveniently, so that users contribute their efforts and location-related data simply by using the services. This is the main difference from traditional location-aware services (LAS). The PLASH system provides a GUI that allows users to construct their own LAS applications and to generate programs for both the smartphone and the server, while taking scalability and compatibility into account. It also allows users to contribute software components that can be mashed up into an integrated LBS application, which unavoidably brings inherent security problems as well as other system risks. The data collected by PLASH can be analyzed further to enhance existing applications or to solve difficult tasks.
4. Data Centric Storage System Designs
Flash-based storage plays an important role in mobile storage systems. In recent years, the flash-based solid-state drive (SSD) has become a popular candidate for replacing hard disk drives. Enterprises are also designing new storage systems that use flash memory as the cache or as the main storage medium, in order to reduce energy consumption and improve the performance and reliability of their data centers. However, as manufacturing technologies advance, reliability and performance have become critical issues for flash-based storage systems. Meanwhile, emerging storage media such as phase-change memory provide alternatives in storage system design, but the key issue is how to improve the performance, reliability, and energy efficiency of storage systems that integrate these new media. Our research focuses on solving the performance, reliability, and energy-efficiency issues of storage systems. We explore both the file-system designs in operating systems and the management firmware in storage devices. For example, we have developed new designs for native flash file systems that improve the performance and reliability of data stored on flash-based storage systems. For enterprise data centers, where energy consumption and data volume are both very large, we are investigating the indexing problem for huge amounts of data (so-called big data) with fast-growing capacity, and are studying how to overcome the inherent issues of hard disk drives by adopting new storage media. In addition, technologies such as Bloom filters and data deduplication will be studied and redesigned to fully utilize the capability of data centers that adopt new storage media alongside, or in place of, traditional hard drives.
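For readers unfamiliar with the two techniques named above, the following Python sketch shows a textbook Bloom filter used as a fast pre-check in front of an exact deduplication index. It is a generic illustration, not the redesigned version the project refers to. The class and function names, the default sizes, and the use of SHA-256 with a one-byte seed to derive bit positions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Illustrative defaults; num_hashes must stay below 256 because
        # the seed is encoded as a single byte.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(chunks):
    """Keep the first copy of each chunk, in order of first appearance.

    The Bloom filter answers "definitely new" cheaply; only a "maybe seen"
    answer falls through to the exact index (a Python set standing in for
    an on-disk fingerprint index).
    """
    seen_filter = BloomFilter()
    exact_index = set()
    unique = []
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).digest()
        if seen_filter.might_contain(fingerprint) and fingerprint in exact_index:
            continue  # confirmed duplicate, skip storing it again
        seen_filter.add(fingerprint)
        exact_index.add(fingerprint)
        unique.append(chunk)
    return unique
```

The design point is that most lookups for new data are answered from the small in-memory filter, so the slower exact index is consulted only when a duplicate is plausible.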


Institute of Information Science, Academia Sinica