Generating structured documents from HTML tables

Yeon Seok Kim, Kyong Ho Lee

Research output: Contribution to conference › Paper › peer-review


A table is a facility for presenting relational information structurally and concisely. As a prerequisite for extracting information from the Web, this paper presents an efficient method for extracting logical structures from HTML tables and transforming them into XML documents. The proposed method consists of two phases: area segmentation and structure analysis. The area segmentation phase cleans up the table and segments the normalized table into attribute and value areas by checking visual and semantic coherency. Heuristic rules are also proposed to handle complex tables. In the structure analysis phase, the hierarchical structure between the attribute and value areas is analyzed and transformed into an XML representation using the proposed table model. Experimental results with a large number of HTML tables show that the proposed method performs better than the conventional method.
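
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple table whose first row consists only of `<th>` cells, treats that row as the attribute area and the remaining rows as the value area, and ignores `rowspan`/`colspan`, the visual and semantic coherency checks, and the heuristic rules for complex tables. The names `TableGrid`, `to_name`, `table_to_xml`, and the `record` element are hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect table cells as rows of [tag, text] pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag in ("th", "td") and self.rows:
            self.rows[-1].append([tag, ""])
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1][1] += data


def to_name(text, index):
    """Turn header text into a usable XML element name."""
    name = re.sub(r"\W+", "_", text.strip()).strip("_")
    return name if name and not name[0].isdigit() else f"col{index + 1}"


def table_to_xml(html):
    parser = TableGrid()
    parser.feed(html)
    rows = [row for row in parser.rows if row]
    root = ET.Element("table")
    if not rows:
        return root
    # Crude stand-in for area segmentation (assumption): an all-<th> first
    # row is the attribute area; everything else is the value area.
    if all(tag == "th" for tag, _ in rows[0]):
        names = [to_name(text, i) for i, (_, text) in enumerate(rows[0])]
        value_rows = rows[1:]
    else:
        names = [f"col{i + 1}" for i in range(len(rows[0]))]
        value_rows = rows
    # Crude stand-in for structure analysis: pair each value with its attribute.
    for row in value_rows:
        record = ET.SubElement(root, "record")
        for name, (_, text) in zip(names, row):
            ET.SubElement(record, name).text = text.strip()
    return root


html = ("<table><tr><th>Name</th><th>Unit price</th></tr>"
        "<tr><td>Pen</td><td>2</td></tr></table>")
print(ET.tostring(table_to_xml(html), encoding="unicode"))
```

Each value row becomes one `record` element whose children are named after the header cells. A flat mapping like this only covers the simplest case; the hierarchical attribute structures that the abstract mentions would need a richer table model.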

Original language: English
Number of pages: 6
Publication status: Published - 2006
Event: 2006 International Conference on Hybrid Information Technology, ICHIT 2006 - Cheju Island, Korea, Republic of
Duration: 2006 Nov 9 → 2006 Nov 11


Other: 2006 International Conference on Hybrid Information Technology, ICHIT 2006
Country/Territory: Korea, Republic of
City: Cheju Island

All Science Journal Classification (ASJC) codes

  • Media Technology


