TR-IIS-06-005

A Semantic Approach to Internet Tabular Information Extraction

Shih-Hung Wu, Huei-Long Wang, I-Chi Wang, Cheng-Lung Sung, Wei-Kuan Shih and Wen-Lian Hsu*


Extracting information from tables is essential for Internet information extraction. However, most web tables are laid out in HTML, so a system that wants to decipher their semantics must cope with a wide variety of layouts, which is cumbersome. Previous work follows two major approaches: the layout enumeration approach and the wrapper approach. The first matches the table against presorted layouts; because it does not use table semantics, it normally does not integrate the extracted information appropriately. The second handles different tables laboriously, case by case. We present a semantic approach that automatically recognizes tables with various layouts in specified knowledge domains. Our system first tags table cells using domain knowledge, then resolves multiple-tagging ambiguities, and finally applies layout-syntax rules to transform the tables into database format. Experimental results show that precision and recall are higher than 93% and 82%, respectively, in four different domains. Because of its high precision, our approach is suitable as a front end for wrapper construction.
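The three stages named above (tagging cells, resolving ambiguous tags, applying layout rules) can be illustrated with a minimal sketch. It is not the authors' system: the lexicon, the names `ATTRIBUTE_NAMES`, `KNOWN_CITIES`, `tag_cell`, `is_header` and `to_records`, and the two layout rules are all hypothetical and chosen only for illustration. Ambiguity resolution is reduced here to its simplest form: a line counts as a header only if every cell in it can be read as an attribute name.

```python
import re

# Hypothetical domain knowledge (an assumption, not taken from the report):
# surface forms of attribute names, plus simple value recognizers.
ATTRIBUTE_NAMES = {"city": "city", "town": "city", "price": "price", "fare": "price"}
KNOWN_CITIES = {"taipei", "tokyo", "paris"}


def tag_cell(text):
    """Return every candidate (kind, name) tag for one cell; may be ambiguous or empty."""
    t = text.strip().lower()
    tags = set()
    if t in ATTRIBUTE_NAMES:
        tags.add(("attr", ATTRIBUTE_NAMES[t]))
    if t in KNOWN_CITIES:
        tags.add(("value", "city"))
    if re.fullmatch(r"\$?\d+(\.\d+)?", t):
        tags.add(("value", "price"))
    return tags


def is_header(cells):
    """A line is a header if every cell has an attribute-name reading."""
    return all(any(kind == "attr" for kind, _ in tag_cell(c)) for c in cells)


def to_records(table):
    """Apply two illustrative layout rules and return database-style records."""
    if is_header(table[0]):
        # Rule 1: attribute names run along the first row.
        rows = table
    elif is_header([row[0] for row in table]):
        # Rule 2: attribute names run down the first column, so transpose.
        rows = [list(col) for col in zip(*table)]
    else:
        raise ValueError("no layout rule matched")
    header = [
        next(name for kind, name in tag_cell(c) if kind == "attr")
        for c in rows[0]
    ]
    return [dict(zip(header, row)) for row in rows[1:]]


# Two layouts of the same data yield the same records.
table_a = [["City", "Price"], ["Taipei", "100"], ["Tokyo", "250"]]
table_b = [["Town", "Taipei", "Tokyo"], ["Fare", "100", "250"]]
records_a = to_records(table_a)
records_b = to_records(table_b)
```

Because the header cells are normalized through the lexicon ("Town" and "Fare" map to `city` and `price`), both layouts end up in the same schema, which is the kind of integration a purely layout-based matcher would not provide.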


Keywords: Table Understanding; Data Integration; Tabular Information Extraction; Domain Knowledge; Table Layout Syntax