Generating structured documents from HTML tables

Yeon Seok Kim, Kyong Ho Lee

Research output: Contribution to conference › Paper › peer-review

Abstract

A table is a facility for presenting relational information structurally and concisely. As a prerequisite for extracting information from the Web, this paper presents an efficient method for extracting logical structures from HTML tables and transforming them into XML documents. The proposed method consists of two phases: area segmentation and structure analysis. The area segmentation phase cleans up the table and segments the normalized table into attribute and value areas by checking visual and semantic coherency. In particular, heuristic rules are proposed to handle complex tables. In the structure analysis phase, the hierarchical structure between attribute and value areas is analyzed and transformed into an XML representation using the proposed table model. Experimental results with a large number of HTML tables show that the proposed method performs better than the conventional method.
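The abstract does not spell out the segmentation rules or the table model, so the following Python sketch is only an illustration of the general two-phase idea, not the authors' method. It assumes a very simple rule in place of the visual and semantic coherency checks: leading rows made up entirely of `<th>` cells form the attribute area, and the remaining rows form the value area. All names (`TableGrid`, `segment`, `xml_name`, `table_to_xml`) and the `<table>`/`<record>` output layout are hypothetical.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class TableGrid(HTMLParser):
    """Collect table cells as rows of (tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [tag, text parts] while inside a th/td

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Clean-up: collapse runs of whitespace inside the cell.
            text = " ".join("".join(self._cell[1]).split())
            self.rows[-1].append((self._cell[0], text))
            self._cell = None


def segment(rows):
    """Split rows into an attribute area and a value area.

    Assumed rule (not from the paper): leading rows that contain only
    <th> cells are attributes; if there are none, the first row is used.
    """
    rows = [r for r in rows if r]  # drop empty rows
    n = 0
    while n < len(rows) and all(tag == "th" for tag, _ in rows[n]):
        n += 1
    n = n or 1
    return rows[:n], rows[n:]


def xml_name(text):
    """Turn a header label into a usable XML element name."""
    name = re.sub(r"\W+", "_", text).strip("_")
    return name if name and not name[0].isdigit() else "field_" + name


def table_to_xml(html):
    """Map each value row to a <record> whose children are named by attribute."""
    parser = TableGrid()
    parser.feed(html)
    attributes, values = segment(parser.rows)
    names = [xml_name(text) for _, text in attributes[-1]] if attributes else []
    root = ET.Element("table")
    for row in values:
        record = ET.SubElement(root, "record")
        for name, (_, text) in zip(names, row):
            ET.SubElement(record, name).text = text
    return ET.tostring(root, encoding="unicode")


html = (
    "<table><tr><th>Name</th><th>Unit price</th></tr>"
    "<tr><td>Pen</td><td> 1.20 </td></tr></table>"
)
print(table_to_xml(html))
```

This sketch ignores `rowspan`/`colspan`, nested tables, and multi-level headers, which are the kinds of complex cases the abstract says the heuristic rules are meant to handle.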

Original language: English
Pages: 605-610
Number of pages: 6
Publication status: Published - 2006
Event: 2006 International Conference on Hybrid Information Technology, ICHIT 2006 - Cheju Island, Korea, Republic of
Duration: 2006 Nov 9 to 2006 Nov 11

All Science Journal Classification (ASJC) codes

  • Media Technology
