Dynamic building defect categorization through enhanced unsupervised text classification with domain-specific corpus embedding methods

Kahyun Jeon, Ghang Lee, Seongmin Yang, Yonghan Kim, Seungah Suh

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Large amounts of data are often categorized using different classification systems. In such cases, few-shot and unsupervised text classification are the two main approaches for dynamically classifying text into a single classification system. Unsupervised text classification typically performs worse but requires significantly less data preparation effort and fewer computing resources than the few-shot approach. This study proposes two methods to enhance unsupervised text classification for domain-specific non-English text using improved domain corpus embedding: 1) weighted embedding-based anchor word clustering (wean-Clustering), and 2) cosine-similarity-based classification using a defect corpus that is vectorized by fine-tuned pretrained language models (sim-Classification-ftPLM). The proposed methods were tested on 40,765 Korean building defect complaints and achieved F1 scores of 89.12% and 84.66%, respectively, outperforming the state-of-the-art zero-shot (53.79%) and few-shot (72.63%) text classification methods with minimal data preparation effort and computing resources.
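
The general idea behind cosine-similarity-based classification can be illustrated with a minimal sketch. This is not the authors' sim-Classification-ftPLM implementation: the three-dimensional vectors, the category names, and the helper functions `cosine_similarity_matrix` and `classify_by_similarity` are hypothetical stand-ins, and in practice the vectors would come from a (fine-tuned) pretrained language model rather than being written by hand. Each text is simply assigned to the category whose vector it is most similar to.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return pairwise cosine similarities between rows of a and rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def classify_by_similarity(text_vectors, category_vectors, category_names):
    """Assign each text to the category with the highest cosine similarity."""
    sims = cosine_similarity_matrix(text_vectors, category_vectors)
    best = sims.argmax(axis=1)
    return [category_names[i] for i in best]


# Hypothetical toy data: real embeddings would be produced by a language model.
category_names = ["leak", "crack", "noise"]
category_vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
])
complaint_vectors = np.array([
    [0.9, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])

predicted = classify_by_similarity(
    complaint_vectors, category_vectors, category_names
)
print(predicted)
```

Because the category vectors are the only category-specific input, switching to a different classification system in a sketch like this only requires embedding the new category descriptions; no labeled training examples are needed.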

Original language: English
Article number: 105182
Journal: Automation in Construction
Volume: 157
Publication status: Published - 2024 Jan

Bibliographical note

Publisher Copyright:
© 2023 Elsevier B.V.

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Civil and Structural Engineering
  • Building and Construction
