Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1)

  • Peeters, Ralph (University of Mannheim (Germany))
  • Primpeli, Anna (University of Mannheim (Germany))
  • Bizer, Christian (University of Mannheim (Germany))
Free Keywords: product matching; entity matching; identity resolution; record linkage; e-commerce
    The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants contain product offers from different e-shops in the form of product pairs from the product category computers, each carrying a binary label (“match” or “no match”). The data is available as training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used to evaluate the participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets: it contains more very hard matching cases, and a subset of its pairs covers specific matching challenges, e.g. products that have no training data in the training set or products whose offers have had typos introduced. These pairs can be used to measure the performance of methods on each kind of matching challenge. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with the schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
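    As a rough illustration of how labeled product pairs of this kind might be used, the following minimal sketch scores a naive token-overlap baseline against gold labels. The field names (`title_left`, `title_right`, `label`), the toy records, the 0.5 threshold, and the choice of precision, recall, and F1 on the “match” class are all assumptions made for illustration; they are not taken from the dataset files or the challenge's official evaluation.

```python
from typing import Dict, List, Tuple

# Toy pairs with HYPOTHETICAL field names; label 1 = "match", 0 = "no match".
PAIRS: List[Dict] = [
    {"title_left": "Corsair Vengeance 16GB DDR4 RAM",
     "title_right": "corsair vengeance ddr4 16gb ram kit", "label": 1},
    {"title_left": "Intel Core i7 9700K processor",
     "title_right": "AMD Ryzen 7 3700X processor", "label": 0},
    {"title_left": "Samsung 970 EVO 1TB SSD",
     "title_right": "Samsung 970 EVO 500GB SSD", "label": 0},
    {"title_left": "WD Blue 1TB hard drive",
     "title_right": "Western Digital WD10EZEX desktop HDD", "label": 1},
]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercased whitespace tokens of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def predict(pair: Dict, threshold: float = 0.5) -> int:
    """Predict "match" (1) when title overlap reaches the assumed threshold."""
    return 1 if jaccard(pair["title_left"], pair["title_right"]) >= threshold else 0


def precision_recall_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 computed on the positive ("match") class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [pair["label"] for pair in PAIRS]
pred = [predict(pair) for pair in PAIRS]
scores = precision_recall_f1(gold, pred)
print(pred, scores)
```

    The toy records are chosen so that the baseline makes both kinds of error: near-identical titles that differ in one decisive token (the two SSD capacities) are wrongly predicted as a match, while a true match with no shared tokens is missed. Pairs like these are a small-scale analogue of the hard cases the description mentions.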
    DOI: 10.3886/E127482
  • Zhang, Ziqi, Christian Bizer, Ralph Peeters, and Anna Primpeli. “MWPD2020: Semantic Web Challenge on Mining the Web of HTML-Embedded Product Data.” Proceedings of the Semantic Web Challenge on Mining the Web of HTML-Embedded Product Data Co-Located with the 19th International Semantic Web Conference (ISWC 2020). CEUR Workshop Proceedings, November 5, 2020.
