ADC: Advanced document clustering using contextualized representations

J. Park, Chanhee Park, Jeongwoo Kim, Minsoo Cho, Sanghyun Park

Research output: Contribution to journal › Article › peer-review

20 Citations (Scopus)

Abstract

Document representation is central to modern natural language processing (NLP) systems, including document clustering. Empirical experiments in recent studies provide strong evidence that unsupervised language models can learn context-aware representations of the given documents and advance several NLP benchmark results. However, existing clustering approaches focus on dimensionality reduction and do not exploit these informative representations. In this paper, we propose a conceptually simple but experimentally effective clustering framework called Advanced Document Clustering (ADC). In contrast to previous clustering methods, ADC is designed to leverage syntactically and semantically meaningful features through the feature-extraction and clustering modules of the framework. We first extract features from pre-trained language models and initialize the cluster centroids so that they are spread out uniformly. In the clustering module of ADC, semantic similarity is measured using cosine similarity, and the centroids are updated as they are assigned to each mini-batch of inputs. We also apply the cross-entropy loss only partially, because the self-training scheme can be biased when the parameters of the model are inaccurate. As a result, ADC can take advantage of contextualized representations while mitigating the limitations introduced by high-dimensional vectors. In numerous experiments on four datasets, the proposed ADC outperforms other existing approaches. In particular, experiments on categorizing a news corpus that contains fake news demonstrate the effectiveness of our method for contextualized representations.
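
The clustering step described in the abstract (cosine-similarity assignment with centroid updates per mini-batch) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the features are synthetic stand-ins for language-model embeddings, the 1/count learning rate is borrowed from standard mini-batch k-means, and the uniform centroid initialization and partial cross-entropy loss are not modeled.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def assign_by_cosine(features, centroids):
    """Return the index of the most cosine-similar centroid for each row."""
    sims = l2_normalize(features) @ l2_normalize(centroids).T
    return sims.argmax(axis=1)

def minibatch_update(centroids, counts, batch):
    """One mini-batch step: assign each vector, then move its centroid toward it.

    The per-centroid learning rate 1/count comes from standard mini-batch
    k-means; it is an assumption, not the update rule from the paper.
    """
    labels = assign_by_cosine(batch, centroids)
    for x, k in zip(l2_normalize(batch), labels):
        counts[k] += 1
        centroids[k] += (x - centroids[k]) / counts[k]
    return labels

rng = np.random.default_rng(0)
# Synthetic stand-in for language-model features: two well-separated groups.
group_a = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(50, 3))
group_b = rng.normal(loc=[0.0, 5.0, 0.0], scale=0.1, size=(50, 3))
features = np.vstack([group_a, group_b])

# Hand-picked initial centroids (assumption; the paper spreads them uniformly).
centroids = l2_normalize(np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.0]]))
counts = np.zeros(len(centroids))
for start in range(0, len(features), 10):
    minibatch_update(centroids, counts, features[start:start + 10])

labels = assign_by_cosine(features, centroids)
```

Normalizing both features and centroids before the dot product makes the assignment depend only on direction, which is why cosine similarity is a common choice for high-dimensional embedding vectors.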

Original language: English
Pages (from-to): 157-166
Number of pages: 10
Journal: Expert Systems with Applications
Volume: 137
Publication status: Published - 2019 Dec 15

Bibliographical note

Publisher Copyright:
© 2019 Elsevier Ltd

All Science Journal Classification (ASJC) codes

  • Engineering(all)
  • Computer Science Applications
  • Artificial Intelligence
