Abstract
To tackle the issue of information overload, we present an Information Gain-based KeyPhrase Extraction System, called KPSpotter. KPSpotter is a flexible web-enabled keyphrase extraction system, capable of processing various formats of input data, including web data, and generating the extraction model as well as the list of keyphrases in XML. In KPSpotter, the following two features were selected for training and extracting keyphrases: 1) TF*IDF and 2) Distance from First Occurrence. Input training and testing collections were processed in three stages: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. To measure the system performance, the keyphrases extracted by KPSpotter are compared with the ones that the authors assigned. Our experiments show that the performance of KPSpotter was evaluated to be equivalent to KEA, a well-known keyphrase extraction system. KPSpotter, however, is differentiated from other extraction systems in the followings: First, KPSpotter employs a new keyphrase extraction technique that combines the Information Gain data mining measure and several Natural Language Processing techniques such as stemming and case-folding. Second, KPSpotter is able to process various types of input data such as XML, HTML, and unstructured text data and generate XML output. Third, the user can provide input data and execute KPSpotter through the Internet. Fourth, for efficiency and performance reason, KPSpotter stores candidate keyphrases and its related information such as frequency and stemmed form into an embedded database management system.
Original language | English |
---|---|
Pages | 50-53 |
Number of pages | 4 |
Publication status | Published - 2003 |
Event | WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management - New Orleans, LA, United States Duration: 2003 Nov 7 → 2003 Nov 8 |
Other
Other | WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management |
---|---|
Country/Territory | United States |
City | New Orleans, LA |
Period | 03/11/7 → 03/11/8 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems