Data-driven feature word selection for clustering online news comments

Heeryon Cho, Jong Seok Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Popular news articles attract thousands of online comments, making manual review tedious and time-consuming. Automatically clustering similar comments can reduce the burden of manual analysis, but appropriate feature words must be selected for the clustering to succeed. In this paper, we present a data-driven feature word selection method that yields structurally superior clusters of online comments. The 1,000 most frequent nouns across the entire corpus of 7.44 million Korean online comments are selected to construct an overall noun set. Frequent nouns in the online comments of each news article are selected to construct that article's local noun set. The intersection of the local and overall noun sets forms the global noun set. Removing the global noun set from the corresponding local noun set yields the distinct noun set. The 250 most frequent nouns from each of the local, global, and distinct noun sets are then used as features for K-means clustering. The clustering results are evaluated using three internal cluster validation indices: Dunn, PBM, and Silhouette. Comments clustered using distinct nouns produced structurally superior clusters compared with those clustered using local or global nouns.
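
The noun set construction described in the abstract can be sketched with plain set operations. This is a minimal illustration, not the authors' implementation. It assumes nouns have already been extracted from each comment (the extraction step is not shown), and the function names, the `min_local_count` threshold for "frequent" local nouns, and the choice to rank each set by per-article frequency are all assumptions introduced here, since the abstract does not specify them.

```python
from collections import Counter


def top_nouns(comments, k):
    """Return the k most frequent nouns across comments (each a list of nouns)."""
    counts = Counter(noun for nouns in comments for noun in nouns)
    return [noun for noun, _ in counts.most_common(k)]


def build_noun_sets(all_comments, article_comments,
                    overall_k=1000, min_local_count=2, feature_k=250):
    """Build local, global, and distinct noun sets for one news article.

    overall_k and feature_k follow the abstract (1,000 and 250).
    min_local_count is a hypothetical cutoff for "frequent" local nouns.
    """
    # Overall noun set: most frequent nouns across the whole corpus.
    overall = set(top_nouns(all_comments, overall_k))

    # Local noun set: frequent nouns in this article's comments,
    # kept in descending frequency order (assumed ranking).
    local_counts = Counter(n for nouns in article_comments for n in nouns)
    local = [n for n, c in local_counts.most_common() if c >= min_local_count]

    # Global noun set: intersection of the local and overall noun sets.
    global_ = [n for n in local if n in overall]

    # Distinct noun set: local noun set with the global nouns removed.
    distinct = [n for n in local if n not in overall]

    # Keep the most frequent feature_k nouns of each set as clustering features.
    return {
        "local": local[:feature_k],
        "global": global_[:feature_k],
        "distinct": distinct[:feature_k],
    }


if __name__ == "__main__":
    # Toy data: each comment is a list of already-extracted nouns.
    corpus = [["economy", "tax"], ["economy", "policy"],
              ["tax", "economy"], ["stadium", "ticket"]]
    article = [["stadium", "ticket", "economy"],
               ["stadium", "economy"], ["ticket", "stadium"]]
    print(build_noun_sets(corpus, article, overall_k=2))
```

Each resulting noun list would then serve as the vocabulary for vectorizing that article's comments before K-means clustering. In the toy data, "economy" is common across the corpus and lands in the global set, while "stadium" and "ticket" are specific to the article and land in the distinct set.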

Original language: English
Title of host publication: 2016 International Conference on Big Data and Smart Computing, BigComp 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 494-497
Number of pages: 4
ISBN (Electronic): 9781467387965
Publication status: Published - 2016 Mar 3
Event: International Conference on Big Data and Smart Computing, BigComp 2016 - Hong Kong, China
Duration: 2016 Jan 18 to 2016 Jan 20

Publication series

Name: 2016 International Conference on Big Data and Smart Computing, BigComp 2016

Other

Other: International Conference on Big Data and Smart Computing, BigComp 2016
Country/Territory: China
City: Hong Kong
Period: 16/1/18 to 16/1/20

Bibliographical note

Funding Information:
This work was in part supported by the Convergence Research Center (CRC) Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (MSIP), Korea (NRF-2015R1A5A7037615), and in part by the Ministry of Science, ICT and Future Planning (MSIP), Korea, under the "IT Consilience Creative Program" (IITP-2015-R0346-15-1008) supervised by the Institute for Information & Communications Technology Promotion (IITP).

Publisher Copyright:
© 2016 IEEE.

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management
