TY - JOUR
T1 - CASS
T2 - A distributed network clustering algorithm based on structure similarity for large-scale network
AU - Kim, Jungrim
AU - Shin, Mincheol
AU - Kim, Jeongwoo
AU - Park, Chihyun
AU - Lee, Sujin
AU - Woo, Jaemin
AU - Kim, Hyerim
AU - Seo, Dongmin
AU - Yu, Seokjong
AU - Park, Sanghyun
N1 - Publisher Copyright: © 2018 EMH Swiss Medical Publishers Ltd. All rights reserved.
PY - 2018/10
Y1 - 2018/10
AB - As networks grow, analyzing large-scale network data is becoming increasingly important, and network clustering algorithms are a useful tool for this analysis. Most research on network clustering algorithms targets a single-machine environment rather than a parallel one. However, such algorithms cannot analyze large-scale network data because of memory size limitations. As a solution, we propose a network clustering algorithm for large-scale network data analysis on Apache Spark, changing the paradigm of the conventional clustering algorithm to improve its efficiency in the Apache Spark environment. We also apply optimization approaches such as a Bloom filter and shuffle selection to reduce memory usage and execution time. By evaluating the proposed algorithm with the average normalized cut, we confirmed that it can analyze diverse large-scale network datasets, including biological, co-authorship, internet topology, and social networks. Experimental results show that the proposed algorithm produces more accurate clusters than the compared algorithms while using less memory. Furthermore, we confirm the effectiveness of the proposed optimization approaches and the scalability of the proposed algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85054737561&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054737561&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0203670
DO - 10.1371/journal.pone.0203670
M3 - Article
C2 - 30303961
AN - SCOPUS:85054737561
SN - 1932-6203
VL - 13
JO - PLoS One
JF - PLoS One
IS - 10
M1 - e0203670
ER -