Linkage methodologies constitute one core of LinkageLibrary, including program files and documentation of code, test data, and output from linkages using particular datasets. We will especially target studies that have an explicit methodology and have evaluated their linkage results. Metadata will include information about the mechanics of linkage: Was it performed by computer or by humans? Was the methodology deterministic or probabilistic? What pieces of information were used, and were they transformed? What blocking method was used? If probabilistic, was a similarity score cutoff used? If deterministic, what rules were used? We will also strongly encourage a full suite of evaluation metrics: What were the rates of sensitivity (recognition of true matches, also called recall), specificity (recognition of non-matches), and positive predictive value (proportion of matches that are true, also called precision)? How were these ascertained (what is the ‘gold standard’ to which they were compared: manual review, or linkage programs run on a file with known true matches)? To promote replicability, whenever possible, we will make the gold standard used available for evaluating other methods, and we will post alternative gold standards if multiple efforts produce differing results, with the ultimate goal of cleaning up the gold standard for a given data set. Both computers and humans make mistakes when developing the gold standard for evaluation. As more data become available in certain projects, both over time and through additional fields, those errors can be corrected. What was the inter-linker reliability if manually linked? (National Academies of Sciences, Engineering, and Medicine 2017, pp. 53-4). The project team will work together to identify metadata fields.
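To make the evaluation metrics named above concrete, the following is a minimal Python sketch of how they might be computed against a gold standard. It is an illustration only, not LinkageLibrary code: the function names (evaluate_linkage, cohen_kappa), the toy record identifiers, and the choice of Cohen's kappa as the measure of inter-linker reliability are all assumptions made for this example.

```python
from itertools import product


def evaluate_linkage(predicted, gold, candidates):
    """Compare predicted links with gold-standard links over all candidate pairs.

    Hypothetical helper: every argument is a collection of (id_a, id_b) pairs.
    """
    predicted, gold, candidates = set(predicted), set(gold), set(candidates)
    tp = len(predicted & gold)               # true matches that were linked
    fp = len(predicted - gold)               # links that are not true matches
    fn = len(gold - predicted)               # true matches that were missed
    tn = len(candidates - predicted - gold)  # non-matches correctly left unlinked
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,  # recall
        "specificity": tn / (tn + fp) if tn + fp else None,
        "precision": tp / (tp + fp) if tp + fp else None,    # positive predictive value
    }


def cohen_kappa(linker_1, linker_2):
    """Chance-corrected agreement between two manual linkers.

    Each argument is a list of 0/1 decisions (1 = "match") on the same pairs.
    Cohen's kappa is an assumed choice of reliability measure for this sketch.
    """
    n = len(linker_1)
    observed = sum(a == b for a, b in zip(linker_1, linker_2)) / n
    p1 = sum(linker_1) / n  # share of pairs linker 1 called a match
    p2 = sum(linker_2) / n  # share of pairs linker 2 called a match
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    if expected == 1:
        return None  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Toy data: two files of three records each, so nine candidate pairs.
file_a = ["a1", "a2", "a3"]
file_b = ["b1", "b2", "b3"]
candidates = set(product(file_a, file_b))
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b1")}

metrics = evaluate_linkage(predicted, gold, candidates)
kappa = cohen_kappa([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0])
print(metrics, kappa)
```

In this toy case two of the three true matches are recovered and one predicted link is false, so sensitivity and precision are both 2/3 and specificity is 5/6. Note that specificity depends on which candidate pairs are counted as non-matches; under blocking, that set is much smaller than the full cross product, which is one reason the blocking method belongs in the metadata.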

Results from projects that seek to enhance the use of integrated data will also be included: data (if any) and the programs developed, linked to publications where possible. This could include programs written for data sets that are deposited elsewhere for secondary use (e.g., IPUMS, NAPP).

Data for experimentation and testing, particularly data with personal identifying information, are the other core of the repository. Having real datasets available for comparing linkage methodologies will be very valuable for refining methods and assessing the tradeoffs among quality, usefulness, and privacy.

The bibliography of data linkage articles will encourage researchers to include related publications and will support the addition of related publications by third-party users. These entries will be included in ICPSR’s Bibliography of Related Literature.

Engagement will improve methodology by bringing together disciplines, categories of data, and linkage strategies so that researchers and students can learn from each other. The functionality for this will be built by ICPSR’s technology team, leveraging the Archonnex platform that supports OpenICPSR. Participants will be able to add comments and ask questions, allowing for engagement between data custodians, providers, producers, holders, and data users, as well as between various groups of data users. In addition, data users will be able to upload and share code snippets related to the data, allowing for knowledge sharing and better reproducibility. Data users will also be able to link related publications and citations via DOI import or by manually entering citation information, providing feedback to data providers and other data users on how these data were used. To participate in the repository community, researchers will register their ICPSR user ID (known as a MyData account) as a verified participant in this linkage repository. Registration will allow them to contribute their own study materials to LinkageLibrary, if they wish, as well as to contribute to the conversation on other linkage studies (commenting, sharing documents, citations, and data) and to share their code for other studies. Verified participation will encourage trust and responsibility in the management of functionality such as crowdsourced comments and code improvements.


Funding Source: NSF

Susan Hautaniemi Leonard