Blocking Strategies to Accelerate Record Matching for Big Data Integration
| Content Provider | Semantic Scholar |
|---|---|
| Author | Kadochnikov, I. S.; Papoyan, Vl. V. |
| Copyright Year | 2019 |
| Abstract | Record matching is a key step in many Big Data analysis problems, especially those that combine disparate large data sources. Probabilistic record linkage methods provide a good framework for finding and interpreting partial record matches. However, they require computing and combining string distances for every pair of records being compared; applied directly, probabilistic record linkage must therefore process the Cartesian product of the record sets. To avoid this, a “blocking” step is used, in which candidate record pairs are grouped by a categorical field, significantly limiting the number of record comparisons and the computational cost. This method, however, requires high data quality and agreement between sources on the categorical blocking field. We propose a more flexible approach in which blocking does not rely on a categorical column. The key idea is to cluster records based on string field values. In practice, we mapped the string field into a latent vector space with TF-IDF and then used Locality Sensitive Hashing to cluster records in this vector space. Apache Spark libraries were used to show the effectiveness of this approach for linking British open company registration datasets. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://ceur-ws.org/Vol-2507/219-224-paper-38.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
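
The blocking idea summarized in the abstract (TF-IDF vectors for a string field, then Locality Sensitive Hashing to group records) can be sketched in a few lines. This is a minimal, single-machine illustration and not the authors' Apache Spark pipeline. The character trigram tokenization, the smoothed IDF formula, the random-hyperplane (sign projection) LSH family, and all function names and sample company names are assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations


def char_ngrams(text, n=3):
    """Split a lowercased, whitespace-normalized string into character n-grams."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalized sparse TF-IDF vector (n-gram -> weight) per name."""
    grams = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(names)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
               for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def lsh_blocks(vectors, num_planes=8, seed=0):
    """Bucket vectors by the sign pattern of their random-hyperplane projections."""
    rng = random.Random(seed)
    vocab = sorted({g for vec in vectors for g in vec})
    planes = [{g: rng.gauss(0.0, 1.0) for g in vocab} for _ in range(num_planes)]
    blocks = defaultdict(list)
    for idx, vec in enumerate(vectors):
        key = tuple(sum(w * plane[g] for g, w in vec.items()) >= 0.0
                    for plane in planes)
        blocks[key].append(idx)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records that share a block are compared, not the full Cartesian product."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical company names, for illustration only.
    sample = ["ACME TRADING LIMITED", "Acme Trading Ltd", "Northern Bakery Ltd",
              "NORTHERN BAKERY LIMITED", "Quantum Plumbing Services"]
    blocks = lsh_blocks(tfidf_vectors(sample))
    pairs = candidate_pairs(blocks)
    print(len(pairs), "candidate pairs out of", len(sample) * (len(sample) - 1) // 2)
```

With a single hash table, similar names are only likely, not guaranteed, to share a block. A practical setup would typically combine several hash tables and tune `num_planes` to trade recall against the number of comparisons.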