Large ontologies and taxonomies are automatically harvested from web-scale data. These taxonomies tend to be huge, noisy, and con-tains little context. As a result, cleansing and enriching those large-scale taxonomies becomes a great challenge. A natural way to en-rich a taxonomy is to map the taxonomy to existing datasets that contain rich information. In this paper, we study the problem of matching two web scale taxonomies. Besides the scale of the prob-lem, we address the challenge that the taxonomies may not contain enough context (such as attribute values). As existing entity reso-lution techniques are based directly or indirectly on attribute values as context, we must explore external evidence for entity resolution. Specifically, we explore positive and negative evidence in external data sources such as the web and in other taxonomies. To integrate positive and negative evidence, we formulate the entity resolution problem as a problem of finding optimal multi-way cuts in a graph. We analyze the complexity of the problem, and propose a Monte Carlo algorithm for finding greedy cuts. We conduct extensive ex-periments and compare our approach with three existing methods to demonstrate the advantage of our approach.
|Number of pages
|Proceedings of the VLDB Endowment
|Published - 2011 Aug
Bibliographical noteFunding Information:
This research was supported by Microsoft Research and the MKE (The Ministry of Knowledge Economy), Korea, under IT/SW Cre-ative research program supervised by the NIPA (National IT Indus-try Promotion Agency) (NIPA-2010-C1810-1002-0003).
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- General Computer Science