Abstract
A table is a facility for presenting relational information structurally and concisely. As a prerequisite for extracting information from the Web, this paper presents an efficient method for extracting logical structures from HTML tables and transforming them into XML documents. The proposed method consists of two phases: area segmentation and structure analysis. The area segmentation phase cleans up the table and segments the normalized table into attribute and value areas by checking visual and semantic coherency. Heuristic rules are also proposed to handle complex tables. In the structure analysis phase, the hierarchical structure between attribute and value areas is analyzed and transformed into an XML representation using the proposed table model. Experimental results with a large number of HTML tables show that the proposed method performs better than the conventional method.
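The two phases can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names `segment_areas` and `to_xml` are hypothetical, the table is assumed to be already normalized into a list of rows, and the attribute area is simply assumed to be the first row, whereas the paper decides this by checking visual and semantic coherency and applies heuristic rules for complex tables.

```python
import xml.etree.ElementTree as ET


def segment_areas(rows, header_rows=1):
    """Split a normalized table into attribute and value areas.

    Assumption for illustration only: the first `header_rows` rows form
    the attribute area. The paper uses coherency checks instead.
    """
    return rows[:header_rows], rows[header_rows:]


def to_xml(attributes, values, root_tag="table", row_tag="row"):
    """Map each value row to an XML element whose children are named
    after the attribute cells (a flat, one-level hierarchy)."""
    root = ET.Element(root_tag)
    # Turn attribute labels into usable XML element names.
    names = [a.strip().replace(" ", "_") for a in attributes[0]]
    for record in values:
        row = ET.SubElement(root, row_tag)
        for name, cell in zip(names, record):
            ET.SubElement(row, name).text = cell
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample table, already cleaned and normalized.
table = [["Name", "Unit Price"], ["Pen", "1.20"], ["Ink", "3.50"]]
attrs, vals = segment_areas(table)
xml_text = to_xml(attrs, vals)
print(xml_text)
```

A real structure analysis step would also have to handle nested attribute areas (for example, spanning header cells), which this flat mapping does not attempt.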
Original language | English |
---|---|
Pages | 605-610 |
Number of pages | 6 |
Publication status | Published - 2006 |
Event | 2006 International Conference on Hybrid Information Technology, ICHIT 2006 - Cheju Island, Korea, Republic of; Duration: 2006 Nov 9 → 2006 Nov 11 |
Other
Other | 2006 International Conference on Hybrid Information Technology, ICHIT 2006 |
---|---|
Country/Territory | Korea, Republic of |
City | Cheju Island |
Period | 2006 Nov 9 → 2006 Nov 11 |
All Science Journal Classification (ASJC) codes
- Media Technology