KPSpotter: A flexible information gain-based keyphrase extraction system

Min Song, Il Yeol Song, Xiaohua Hu

Research output: Contribution to conference, Paper, peer-review

45 Citations (Scopus)


To tackle the issue of information overload, we present KPSpotter, an Information Gain-based KeyPhrase Extraction System. KPSpotter is a flexible, web-enabled keyphrase extraction system capable of processing various input formats, including web data, and of generating both the extraction model and the list of keyphrases in XML. Two features were selected for training and extracting keyphrases: 1) TF*IDF and 2) Distance from First Occurrence. Input training and testing collections were processed in three stages: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. To measure system performance, the keyphrases extracted by KPSpotter were compared with those assigned by the authors. Our experiments show that KPSpotter performs on par with KEA, a well-known keyphrase extraction system. KPSpotter, however, differs from other extraction systems in the following ways. First, KPSpotter employs a new keyphrase extraction technique that combines the Information Gain data mining measure with several Natural Language Processing techniques, such as stemming and case-folding. Second, KPSpotter can process various types of input data, such as XML, HTML, and unstructured text, and generates XML output. Third, the user can provide input data and execute KPSpotter over the Internet. Fourth, for efficiency and performance reasons, KPSpotter stores candidate keyphrases and their related information, such as frequency and stemmed form, in an embedded database management system.
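The abstract names the two features and the Information Gain measure but gives no formulas. As a rough illustration only, the sketch below uses common textbook definitions: TF*IDF as relative term frequency times log inverse document frequency, first-occurrence distance as the fraction of the document preceding the phrase, and Information Gain as the entropy reduction from splitting on a discretized feature. All function names are hypothetical, candidates are limited to single tokens for brevity, and none of this should be read as KPSpotter's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Case-fold and split into alphanumeric tokens (a stand-in for cleaning/tokenizing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(term, doc_tokens, corpus):
    """Relative frequency of term in the document times log(N / document frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0


def first_occurrence(term, doc_tokens):
    """Fraction of the document that precedes the first appearance of term."""
    return doc_tokens.index(term) / len(doc_tokens)


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(bins, labels):
    """Entropy of labels minus the weighted entropy after grouping by discretized bin."""
    n = len(labels)
    remainder = 0.0
    for b in set(bins):
        subset = [lab for x, lab in zip(bins, labels) if x == b]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Assumed toy corpus; the paper's collections are not reproduced here.
docs = [
    "Keyphrase extraction helps readers",
    "Readers skim long documents",
    "Extraction systems rank candidate phrases",
]
corpus = [tokenize(d) for d in docs]
doc0 = corpus[0]

rare = tf_idf("keyphrase", doc0, corpus)    # appears in 1 of 3 documents
common = tf_idf("readers", doc0, corpus)    # appears in 2 of 3 documents
position = first_occurrence("extraction", doc0)

# A discretized feature that separates keyphrases from non-keyphrases perfectly
# yields the maximum gain; one that is independent of the label yields none.
perfect = information_gain(["low", "low", "high", "high"], [True, True, False, False])
useless = information_gain(["a", "b", "a", "b"], [True, True, False, False])
```

In this toy setting, a term confined to fewer documents scores higher on TF*IDF than one spread across the corpus, and a discretized feature earns higher gain the more cleanly its bins separate author-assigned keyphrases from other candidates.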

Original language: English
Number of pages: 4
Publication status: Published - 2003
Event: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management - New Orleans, LA, United States
Duration: 2003 Nov 7 to 2003 Nov 8


Other: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management
Country/Territory: United States
City: New Orleans, LA

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems

