TY - JOUR
T1 - Dynamic building defect categorization through enhanced unsupervised text classification with domain-specific corpus embedding methods
AU - Jeon, Kahyun
AU - Lee, Ghang
AU - Yang, Seongmin
AU - Kim, Yonghan
AU - Suh, Seungah
N1 - Publisher Copyright: © 2023 Elsevier B.V.
PY - 2024/1
Y1 - 2024/1
N2 - Large amounts of data are often categorized using different systems. In such cases, few-shot and unsupervised text classification are the two main approaches for dynamically classifying text into a single classification. Unsupervised text classification typically exhibits lower performance but requires significantly less data preparation effort and computing resources than the few-shot approach. This study proposes two methods to enhance unsupervised text classification for domain-specific non-English text using improved domain corpus embedding: 1) weighted embedding-based anchor word clustering (wean-Clustering), and 2) cosine-similarity-based classification using a defect corpus that is vectorized by fine-tuned pretrained language models (sim-Classification-ftPLM). The proposed methods were tested on 40,765 Korean building defect complaints and achieved F1 scores of 89.12% and 84.66%, respectively, outperforming the state-of-the-art zero-shot (53.79%) and few-shot (72.63%) text classification methods, with minimal data preparation effort and computing resources.
AB - Large amounts of data are often categorized using different systems. In such cases, few-shot and unsupervised text classification are the two main approaches for dynamically classifying text into a single classification. Unsupervised text classification typically exhibits lower performance but requires significantly less data preparation effort and computing resources than the few-shot approach. This study proposes two methods to enhance unsupervised text classification for domain-specific non-English text using improved domain corpus embedding: 1) weighted embedding-based anchor word clustering (wean-Clustering), and 2) cosine-similarity-based classification using a defect corpus that is vectorized by fine-tuned pretrained language models (sim-Classification-ftPLM). The proposed methods were tested on 40,765 Korean building defect complaints and achieved F1 scores of 89.12% and 84.66%, respectively, outperforming the state-of-the-art zero-shot (53.79%) and few-shot (72.63%) text classification methods, with minimal data preparation effort and computing resources.
KW - Building defect management
KW - Clustering
KW - Domain corpus embedding
KW - Dynamic text classification
KW - Few-shot learning
KW - Text similarity
KW - Unsupervised text classification
KW - Zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85176127140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85176127140&partnerID=8YFLogxK
U2 - 10.1016/j.autcon.2023.105182
DO - 10.1016/j.autcon.2023.105182
M3 - Article
AN - SCOPUS:85176127140
SN - 0926-5805
VL - 157
JO - Automation in Construction
JF - Automation in Construction
M1 - 105182
ER -