TY - JOUR
T1 - Clustering high dimensional massive scientific datasets
AU - Otoo, Ekow J.
AU - Shoshani, Arie
AU - Hwang, Seung Won
PY - 2001
Y1 - 2001
N2 - Many scientific applications can benefit from efficient algorithms for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for very large datasets in high-dimensional space should run in O(N) time at best and O(N log N) time in the worst case, using no more than O(N M) storage, for it to be practical. A parallelized version of the same algorithm should achieve a linear speed-up in processing time as the number of processors increases. We introduce a hybrid algorithm called HyCeltyc as an approach for clustering massively large, high-dimensional datasets. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, and selection of significant features on which to cluster the data.
AB - Many scientific applications can benefit from efficient algorithms for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for very large datasets in high-dimensional space should run in O(N) time at best and O(N log N) time in the worst case, using no more than O(N M) storage, for it to be practical. A parallelized version of the same algorithm should achieve a linear speed-up in processing time as the number of processors increases. We introduce a hybrid algorithm called HyCeltyc as an approach for clustering massively large, high-dimensional datasets. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, and selection of significant features on which to cluster the data.
UR - http://www.scopus.com/inward/record.url?scp=53949100051&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=53949100051&partnerID=8YFLogxK
U2 - 10.1109/SSDM.2001.938547
DO - 10.1109/SSDM.2001.938547
M3 - Article
AN - SCOPUS:53949100051
SN - 1099-3371
SP - 147
EP - 157
JO - Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM
JF - Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM
ER -