Web Data Commons Phones Dataset, Augmented Version, Fixed Splits

Version
V0
Resource Type
Dataset
Creator
  • Primpeli, Anna (University of Mannheim (Germany))
  • Bizer, Christian (University of Mannheim (Germany))
Publication Date
2020-11-23
Description
  • Abstract

    Motivation:
    Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, and of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

    Dataset Description:
    An augmented version of the WDC Phones dataset for benchmarking entity matching/record linkage methods. The original dataset is available at:
    http://webdatacommons.org/productcorpus/index.html#toc4

    The augmented version adds fixed splits for training, validation, and testing, as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics.
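
    As an illustration of what a data-type-specific feature vector can look like, the following minimal sketch computes one similarity score per attribute of a record pair. The concrete metrics (token Jaccard for strings, relative difference for numbers), the -1.0 placeholder for missing values, and the attribute names `title`, `memory_gb`, and `color` are assumptions made for this example; they are not taken from the dataset or its documentation.

```python
# Hypothetical sketch: one similarity score per attribute, chosen by data type.
# Metrics, placeholder value, and attribute names are illustrative assumptions.

MISSING = -1.0  # assumed placeholder when either record lacks the attribute


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity for string attributes."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def numeric_sim(a: float, b: float) -> float:
    """Similarity derived from the relative difference, clamped to [0, 1]."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def feature_vector(left: dict, right: dict, schema: dict) -> list:
    """Build one score per schema attribute for a pair of records."""
    vec = []
    for attr, dtype in schema.items():
        l, r = left.get(attr), right.get(attr)
        if l is None or r is None:
            vec.append(MISSING)
        elif dtype == "num":
            vec.append(numeric_sim(float(l), float(r)))
        else:
            vec.append(jaccard(str(l), str(r)))
    return vec


schema = {"title": "str", "memory_gb": "num", "color": "str"}
offer = {"title": "Apple iPhone 6 16GB", "memory_gb": 16}
catalog_entry = {"title": "Apple iPhone 6", "memory_gb": 16, "color": "grey"}

vec = feature_vector(offer, catalog_entry, schema)
```

    In this toy pair, the titles share three of four distinct tokens, the memory sizes are equal, and the offer has no color, so the third position holds the placeholder. With an attribute density as low as the one reported for this dataset, many positions of a real feature vector would be placeholders of this kind.
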
    The dataset contains 447 product records derived from 17 e-shops, which are matched against a product catalog of 50 products. The gold standard has manual annotations for 258 matching and 22,092 non-matching pairs. A total of 26 attributes are used to describe the product records, and the attribute density is 0.25.
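
    The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this description; reading the attribute density as the share of non-missing attribute values is an assumption made for the last line.

```python
# Numbers taken from the dataset description.
records, catalog_products = 447, 50
matches, non_matches = 258, 22_092
attributes, density = 26, 0.25

labeled_pairs = matches + non_matches  # 22,350 annotated pairs
# The annotated pairs equal the full cross product of records and catalog.
covers_cross_product = labeled_pairs == records * catalog_products

match_share = matches / labeled_pairs      # about 1.15% of pairs are matches
imbalance_ratio = non_matches / matches    # about 85.6 non-matches per match

# Assumption: density is the share of non-missing attribute values,
# which would mean roughly 6.5 filled attributes per record on average.
avg_filled_attributes = attributes * density
```

    The strong class imbalance suggests reporting precision, recall, and F1 on the matching class rather than accuracy when evaluating on the fixed test split.
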

    The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results.
    The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download:
    http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
Availability
Download
This study is freely available to the general public via web download.
Relations
  • Has version
    DOI: 10.3886/E127243V1
Publications
  • Petrovski, Petar, Anna Primpeli, Robert Meusel, and Christian Bizer. “The WDC Gold Standards for Product Feature Extraction and Product Matching.” In Lecture Notes in Business Information Processing, 73–86. Cham: Springer International Publishing, 2017. https://doi.org/10.1007/978-3-319-53676-7_6.
    • ID: 10.1007/978-3-319-53676-7_6 (DOI)
  • Primpeli, Anna, and Christian Bizer. “Profiling Entity Matching Benchmark Tasks.” Proceedings of the 29th ACM International Conference on Information & Knowledge Management. New York, NY, USA: ACM, October 19, 2020. https://doi.org/10.1145/3340531.3412781.
    • ID: 10.1145/3340531.3412781 (DOI)

Update Metadata: 2020-11-23 | Issue Number: 1 | Registration Date: 2020-11-23