您的瀏覽器不支援JavaScript語法,網站的部份功能在JavaScript沒有啟用的狀態下無法正常使用。

中央研究院 資訊科學研究所

活動訊息

友善列印

列印可使用瀏覽器提供的(Ctrl+P)功能

學術演講

:::

TIGP (SNHCC) –Efficient Page-Level Data Extraction Via Schema Induction & Verification

  • 講者張嘉惠 教授 (國立中央大學資訊工程系)
    邀請人:TIGP SNHCC Program
  • 時間2016-05-04 (Wed.) 14:00 ~ 14:00
  • 地點資訊所新館106演講廳
摘要

Page-level data extraction provides a complete solution for all kinds of information requirement, however very few researches focus on this task because of the difficulties and complexities in the problem. On the other hands, previous page-level systems focus on how to achieve unsupervised data extraction and pay less attention on schema/wrapper generation and verification. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large amount of web pages for data extraction, the system uses part of the input pages for training the schema without supervision, and then extracts data from the rest of the input pages through schema verification. To speed up the processing, we utilize leaf nodes of the DOM trees as the processing units and dynamically adjust the encoding for better alignment. The proposed system works better than other page-level extraction systems in terms of schema correctness and extraction efficiency. Overall, the extraction efficiency is 2.7 times faster than state-of-the-art unsupervised approaches that extract data page by page without schema verification.