Code for: How Well Do Automated Linking Methods Perform

Version
V0
Resource Type
Dataset: program source code
Creator
  • Bailey, Martha (University of California-Los Angeles, National Bureau of Economic Research)
  • Cole, Connor (University of Michigan)
  • Henderson, Morgan (University of Michigan)
  • Massey, Catherine (University of Michigan)
Publication Date
2020-12-08
Funding Reference
  • National Science Foundation
    • Award Number: 1539228
  • United States Department of Health and Human Services. National Institutes of Health. National Institute on Aging
    • Award Number: R21AG05691201
  • University of Michigan Population Studies Center
    • Award Number: R24HD041028
  • Michigan Center for the Demography of Aging
    • Award Number: MiCDAP30AG012846-21
  • National Institutes of Health NICHD Center Grant
    • Award Number: P2C HD041028
  • Population Studies Center Grant
    • Award Number: R24 HD041028
  • United States Department of Health and Human Services. National Institutes of Health. Eunice Kennedy Shriver National Institute of Child Health and Human Development
    • Award Number: T32 HD0007339
Free Keywords
Linking; Automated Linking Methods; Machine Learning; U.S. Historical Data; Methodology; intergenerational mobility
Description
  • Abstract

    This paper reviews the literature on historical record linkage in the U.S. and examines the performance of widely used record linking algorithms and common variations in their assumptions. We use two high-quality, hand-linked datasets and one synthetic ground truth to examine the direct effects of linking algorithms on data quality. We find that (1) no algorithm (including hand-linking) consistently produces representative samples; (2) 15 to 37 percent of links chosen by widely used algorithms are classified as errors by trained human reviewers; and (3) false links are systematically related to baseline sample characteristics, showing that some algorithms may introduce systematic measurement error into analyses. A case study shows that the combined effects of (1)-(3) attenuate estimates of the intergenerational income elasticity by up to 20 percent, and common variations in algorithm assumptions result in greater attenuation. As current practice moves to automate linking and increase link rates, these results highlight the important potential consequences of linking errors for inferences with linked data. We conclude with constructive suggestions for reducing linking errors and directions for future research.
  • Methods

    Response Rates: Not Applicable
Temporal Coverage
  • 1860-01-01 / 1940-12-31
    Time Period: 1860-01-01 to 1940-12-31
Geographic Coverage
  • United States
Sampling
Not applicable
Availability
Download
This study is freely available to the general public via web download.
Relations
  • Has version
    DOI: 10.3886/E119932V1

Update Metadata: 2020-12-08 | Issue Number: 1 | Registration Date: 2020-12-08