TY - GEN
T1 - Scalable entity matching computation with materialization
AU - Lee, Sanghoon
AU - Lee, Jongwuk
AU - Hwang, Seung Won
PY - 2011
Y1 - 2011
N2 - Entity matching (EM) is the task of identifying records that refer to the same real-world entity from different data sources. While EM is widely used in data integration and data cleaning applications, the naive method for EM incurs quadratic cost with respect to the size of the datasets. To address this problem, this paper proposes a scalable EM algorithm that employs a pre-materialized structure. Specifically, once the structure is built, our proposed algorithm can identify the EM results with sub-linear cost. In addition, as the rules evolve, our algorithm can efficiently adapt to new rules by selectively accessing records using the materialized structure. Our evaluation results show that our proposed EM algorithm is significantly faster than the state-of-the-art method for extensive real-life datasets.
AB - Entity matching (EM) is the task of identifying records that refer to the same real-world entity from different data sources. While EM is widely used in data integration and data cleaning applications, the naive method for EM incurs quadratic cost with respect to the size of the datasets. To address this problem, this paper proposes a scalable EM algorithm that employs a pre-materialized structure. Specifically, once the structure is built, our proposed algorithm can identify the EM results with sub-linear cost. In addition, as the rules evolve, our algorithm can efficiently adapt to new rules by selectively accessing records using the materialized structure. Our evaluation results show that our proposed EM algorithm is significantly faster than the state-of-the-art method for extensive real-life datasets.
UR - http://www.scopus.com/inward/record.url?scp=83055197116&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83055197116&partnerID=8YFLogxK
U2 - 10.1145/2063576.2063965
DO - 10.1145/2063576.2063965
M3 - Conference contribution
AN - SCOPUS:83055197116
SN - 9781450307178
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 2353
EP - 2356
BT - CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management
T2 - 20th ACM Conference on Information and Knowledge Management, CIKM'11
Y2 - 24 October 2011 through 28 October 2011
ER -