Institute of Information Science
Data Management and Information Discovery Laboratory
Principal Investigators:

:::Meng-Chang Chen(Chair) :::Ming-Syan Chen :::Hong-Yuan Liao :::Mi-Yen Yeh
:::Yuan-Hao Chang :::De-Nian Yang

[ Group Profile ]
In the data explosion era, data of various types (e.g., sensor, trajectory, transaction, multimedia, social network, Web browsing log, etc.) are generated at an increasing rate. Due to the abundant and inexpensive trend of hardware and networks, the timing has never been better to explore all possible means of utilizing such data to enhance existing applications or to investigate new technologies to solve difficult problems. The Data Management and Information Discovery Group was formed with the main objectives of initiatinginnovative research and strengthening scientific and technological excellence in: (1) effective collection, representation, storage, processing, and analyzing of massive data; and (2) exploring data mining technologies to efficiently and effectively discover valuable knowledge within various types of data. Currently, the research of this group focuses on the following areas: (1) Time Series Data Analysis and Mining; (2) Social Network Analysis and Query Processing; (3) Designs for Embedded Databases on Mobile Devices and Sensors; and (4) Location-based Data Collection Platform and Applications. Within these research areas, we are conducting the following projects.

1. Time Series Data Analysis and Mining
A time series is a sequence of data of consecutive time instances spaced at uniform/nonuniform time intervals. Time series data are widely seen in daily life, including periodical readings of sensors in a machine-to-machine environment, stock trading records in financial markets, user activity monitoring in social networks, GPS traces of mobile objects, and so on. By analyzing and mining time series data, we can capture the characteristics of monitored objects over a period of time and obtain useful knowledge for developing further applications. For example, (i) the co-evolving trend mined from the stock trading time series can be provided to stock program traders as decision support, and (ii) the moving behavior derived from huge GPS trajectories of humans and vehicles are useful for developing location-based services or urban planning. Through this line of research, we aim to develop effective and efficient mining and analytical algorithms to discover interesting and useful patterns embedded in a single time series and between multiple ones, while considering various constraints. The designed algorithms should be able to handle time series data series that are rapidly-growing, high-dimensional, and huge-volume, while providing high-quality results. With relation to the aforementioned applications, we have designed offline/online clustering algorithms for multiple time series streams, similarity queries on distributed time series using both the Euclidean and dynamic time warping distances, random error reduction for similarity searches of time series, and GPS trajectory mining and search algorithms. In the future, we will seek more useful applications for time series analysis and mining, and design related algorithms.
2. Social network analysis and query processing
With the emergence of various social applications, social networks have become increasingly important in information sharing and analysis. An article published in Nature in 2013 demonstrated that social influence via online social networks (such as Facebook) impacts people’s decisions, and many new companies in Silicon Valley (such as Klout) now provide effective mechanisms to quantify social influence and diffusion for business marketing. Nevertheless, the literature mostly focuses on passive applications of social influence. By contrast, we propose social influence for active friending, where each user actively specifies a friending target, and the search engine returns a group of intermediate nodes for the user to friend the target systematically with social influence. We prove that given even the intermediate nodes, quantifying the social influence is NP-Hard. We have also proven that finding the best intermediate nodes is NP-Hard, but the problem can be solved in polynomial time in a simplified renowned social graph model in the literature. In addition, we have explored the effect of social influence on product sales. With information on user preference of products, we propose a new optimization problem to select suitable products to configure a bundle, such that the number of users adopting this bundle is maximized. We will continue to explore the effect of social influence, and propose efficient algorithms for real applications. 排版插圖 Most of the literature on social graph query focuses on only the social dimension, and ignores the real social applications that are necessary to examine multiple dimensions. It is expected that searching for people based on social, temporal, spatial, and preference dimensions will be very challenging, but has the potential to deliver useful applications in everyday life. Nowadays, most social group activities are still initiated manually via phone, e-mail, messenger, etc., even with the availability of ripe web collaboration tools (such as Google Calendar) and socio-spatial information websites (such as Facebook Places). To address this issue, we propose a group query to return a group of familiar individuals and their common available time slots. Different social familiarities can be flexibly specified for various types of social activities. The proposed algorithm can find the optimal solutions at least 20 times (100 times on average) faster than the commercial parallel CPLEX optimizer in the same server. Moreover, for location-based social networks, such as Loopt, Buddy Beacon, FindMe, and Facebook Place, we formulate a social spatial group query with a new index structure to efficiently find spatially close-by friends with tight social relations. We have also explored the group query to consider the user preference dimension. We propose a randomized algorithm based on the concept of Optimal Computational Budget Allocation (OCBA) and Cross-Entropy (CE) to efficiently obtain results. Our algorithms are implemented in Facebook to validate the effectiveness of query results. We will continue to formulate new query problems and design efficient query optimization algorithms and techniques for finding the optimal or approximate solutions in a small time.
3. Index Designs for Embedded Databases
Embedded database systems are widely adopted in cyber-physical systems (CPSes) to maintain and monitor the status of entities in deployed operation environments. Sensors or devices in CPSes periodically publish sensor measurements as updates to create new versions installed into an embedded database system. Various types of application queries are submitted to retrieve multiple versions of data items maintained in the embedded database. Thus, maintaining multi-versions of data are important to many CPSes for detecting the operation environment, detecting critical events, or collecting statistics about the dynamics in the operation environment. Therefore, it is critical to facilitate access performance to the current and the previous data versions maintained in an embedded database. At the same time, embedded systems usually adopt non-volatile memory, instead of hard disks, as their storage devices. Thus, the characteristics of the embedded storage media/systems should be also considered. Our research focuses on proposing index designs to facilitate access of multiversion data items in embedded database systems. Due to the dynamic nature of the entities in CPSes, the proposed multiversion index designs can support a large number of update transactions, and can efficiently support key-range query and version-range query transactions, which are critical functions of multiversion database systems. We also explore the impacts of using different non-volatile memories (e.g., flash memory and phasechange memory) as the storage media in embedded systems. In particular, we are exploring potential solutions to remove inherent problems with hard disks, and to take advantage of the special characteristics of these non-volatile memories. Various technologies, such as mining algorithms and data deduplication, will be studied and redesigned to fully utilize the capability of non-volatile memories. The inherent endurance and reliability issues of these non-volatile memories will be also investigated to reinforce their usefulness.
4. Location-based Data Collection and Application Deployment Platform
Location-based data embeds useful information to be mined to support or enhance various applications (such as route selection), or solve difficult location-based problems (such as city profiling). However, it is challenging to collect large volumes of data from users for various reasons. In this project, we propose the use of the PLASH platform to help location-based service (LBS) providers deploy their applications conveniently; users can contribute their efforts and location-related data by using the services, which is the main difference from other location-aware services. The PLASH system provides a GUI to allow users to construct their LAS applications, in such a way that the process generates programs on both smartphone and server. It also allows users to donate software components to be mashed up by other users in their applications. However, it has unavoidable and inherent security problems, as well as other system risks. PLASH considers scalability and compatibility, and we have been using road network route selection application to exemplify the performance characteristics and design goals of PLASH.


Academia Sinica Institue of Information Science Academia Sinica