Creating a Data Quality Control Framework for Producing New Personnel-Based Science & Engineering Indicators
A prerequisite for person-level indicators is that individual researchers who appear in multiple bibliographic datasets are correctly identified and linked. Identifying and linking authors by name is difficult because names are often ambiguous. This is particularly the case for Asian names, a significant problem given that Asian researchers play an increasingly important role in many fields of research. This project addresses the challenge of systematically and routinely disambiguating names in large bibliographic datasets using a new Automated and Stratified Entity Disambiguation framework. Core datasets for this effort are derived using a new method that relies on multiple data fields and an iterative process to automatically create disambiguated datasets, which can then be used to train artificial intelligence tools for robust person-level analysis. To improve disambiguation accuracy, name instances are stratified into two groups according to name-ethnicity and disambiguated separately, with the model for each group learned on the automatically generated truth data. Based on the disambiguated data, this project develops new person-level S&E indicators that characterize the landscape and trends of the international S&E research workforce across all science and engineering fields. The new big data tools for automatic disambiguation at scale will be documented and released publicly to enable expansion, validation, and reuse by the science community as well as science of science policy researchers.
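As a rough illustration of the stratify-then-disambiguate idea described above, the following minimal Python sketch splits name instances into two strata, blocks them by surname and first initial, and merges instances within a block using a stratum-specific rule. It is not the project's implementation: the surname list, the thresholds, the field names, and every function name here are hypothetical, and simple hand-written rules stand in for the learned models the abstract mentions.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class NameInstance:
    """One author mention on one paper (all fields are illustrative)."""
    record_id: str
    surname: str
    given: str
    email: str = ""
    coauthors: frozenset = frozenset()


# Hypothetical surname list, used only to split instances into two strata.
HIGH_AMBIGUITY_SURNAMES = {"wang", "li", "zhang", "kim", "lee"}

# Hypothetical per-stratum rule: minimum shared coauthors needed to merge.
MIN_SHARED_COAUTHORS = {"high_ambiguity": 2, "other": 1}


def stratum_of(inst):
    """Assign a name instance to one of two strata (assumed lookup rule)."""
    if inst.surname.lower() in HIGH_AMBIGUITY_SURNAMES:
        return "high_ambiguity"
    return "other"


def block_key(inst):
    """Only instances sharing surname and first initial are compared."""
    return (inst.surname.lower(), inst.given[:1].lower())


def same_person(a, b, min_shared):
    """Combine several fields: an exact email match, or enough shared coauthors."""
    if a.email and a.email == b.email:
        return True
    return len(a.coauthors & b.coauthors) >= min_shared


def disambiguate(instances):
    """Return a dict mapping record_id to a cluster label (a representative id)."""
    parent = {inst.record_id: inst.record_id for inst in instances}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group instances by (stratum, block) so each stratum is handled separately.
    blocks = defaultdict(list)
    for inst in instances:
        blocks[(stratum_of(inst), block_key(inst))].append(inst)

    for (stratum, _), members in blocks.items():
        min_shared = MIN_SHARED_COAUTHORS[stratum]
        for a, b in combinations(members, 2):
            if same_person(a, b, min_shared):
                parent[find(a.record_id)] = find(b.record_id)

    return {rid: find(rid) for rid in parent}


if __name__ == "__main__":
    demo = [
        NameInstance("r1", "Wang", "Wei", coauthors=frozenset({"a", "b"})),
        NameInstance("r2", "Wang", "W.", coauthors=frozenset({"a", "b"})),
        NameInstance("r3", "Wang", "Wen", coauthors=frozenset({"a"})),
        NameInstance("r4", "Smith", "John", coauthors=frozenset({"x"})),
        NameInstance("r5", "Smith", "J", coauthors=frozenset({"x"})),
    ]
    print(disambiguate(demo))
```

In this sketch the stricter threshold for the high-ambiguity stratum reflects the intuition that a single shared coauthor is weaker evidence when many distinct people share the same surname and initial. In a learned setting, each stratum would instead get its own model trained on the automatically generated truth data.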
This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.
Funding Source: National Science Foundation (NCSES)