PyData NYC 2022

Customizable probabilistic record linkage with Name Match
11-09, 11:45–12:30 (America/New_York), Winter Garden (5th floor)

Linking individuals across records or datasets is often a critical prerequisite for building useful data tools and answering interesting research or business questions. But doing it right is difficult and time-consuming, in part because current off-the-shelf tools do not provide a measure of linking accuracy and are too rigid to incorporate the user’s domain knowledge. In this talk, we’ll 1) define high-quality record linkage and discuss why it matters, 2) show how record linkage can be boiled down to a simple prediction problem, and 3) introduce Name Match, a new open source tool for customizable probabilistic record linkage.


A common task when working with data is identifying which records belong to which people. In the best case, the data already includes a unique identifier for individuals (e.g., a social security number). However, in many instances a unique identifier either does not exist for all records or is not shared across the datasets that need to be linked. In this scenario, data analysts must turn to the field of record linkage.

In many cases, the performance and fairness of a data tool, or the correctness of a statistic or research finding, are only as good as the record linkage that came before them. But linking data accurately, which requires both the ability to measure accuracy and the ability to take steps to improve it, is not a simple task. For example, basic linking techniques like exact matching or fuzzy matching are easy to implement but error-prone and hard to evaluate. More sophisticated techniques may reduce linking error, but often at the expense of the user’s control.
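
To make the trade-off between exact and fuzzy matching concrete, here is a minimal sketch using only the Python standard library. The three records, the field names, and the 0.9 threshold are hypothetical and chosen purely for illustration; this is not how Name Match itself links records.

```python
from difflib import SequenceMatcher

# Hypothetical records, invented for illustration only.
rec_a = {"first": "Jonathan", "last": "Smith", "dob": "1990-04-12"}
rec_b = {"first": "Jonathon", "last": "Smith", "dob": "1990-04-12"}  # same person, typo in first name
rec_c = {"first": "Jonathan", "last": "Smyth", "dob": "1985-11-30"}  # a different person


def full_name(record):
    """Lowercased 'first last' string used for comparison."""
    return f"{record['first']} {record['last']}".lower()


def exact_match(r1, r2):
    """Link only when name and date of birth agree character for character."""
    return full_name(r1) == full_name(r2) and r1["dob"] == r2["dob"]


def name_similarity(r1, r2):
    """String similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, full_name(r1), full_name(r2)).ratio()


def fuzzy_match(r1, r2, threshold=0.9):
    """Link when the names are similar enough; the threshold is an arbitrary assumption."""
    return name_similarity(r1, r2) >= threshold


# Exact matching misses the true link because of a one-letter typo.
print(exact_match(rec_a, rec_b))   # False
# Name-only fuzzy matching recovers it...
print(fuzzy_match(rec_a, rec_b))   # True
# ...but also links two different people whose names are just as similar.
print(fuzzy_match(rec_a, rec_c))   # True
```

Both fuzzy comparisons score the same here, so no single threshold on name similarity separates the true link from the false one. Without labeled examples, there is also no way to tell how often each kind of error occurs.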

In this talk, we show how reframing record linkage as a simple prediction problem – do these two records refer to the same person or not? – allows us to apply standard machine learning solutions in the field of record linkage. Name Match is a new open source record linkage tool that takes advantage of this reframing by using supervised learning to train (and evaluate) a probabilistic record linkage process that can deduplicate or link records across any number of datasets. Unlike other tools, Name Match also offers users the flexibility to customize the record linkage process to enforce or prohibit certain links based on their domain knowledge.
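
The reframing can be sketched in a few lines: turn each pair of records into similarity features, attach a label (same person or not), and fit a classifier that outputs a match probability. The example below is a toy illustration under stated assumptions, not Name Match's API or algorithm. The hand-labeled pairs, the three features, and the small pure-Python logistic regression are all hypothetical stand-ins chosen so the code runs with the standard library alone.

```python
import math
from difflib import SequenceMatcher


def rec(first, last, dob):
    """Build a toy record; the field names are an assumption for this sketch."""
    return {"first": first, "last": last, "dob": dob}


# Hypothetical hand-labeled pairs: (record_1, record_2, 1 if same person else 0).
labeled_pairs = [
    (rec("Jonathan", "Smith", "1990-04-12"), rec("Jonathon", "Smith", "1990-04-12"), 1),
    (rec("Maria", "Garcia", "1988-01-05"), rec("Maria", "Garcia", "1988-01-05"), 1),
    (rec("Li", "Wei", "1975-09-21"), rec("Lee", "Wei", "1975-09-21"), 1),
    (rec("Jonathan", "Smith", "1990-04-12"), rec("Jonathan", "Smyth", "1985-11-30"), 0),
    (rec("Maria", "Garcia", "1988-01-05"), rec("Mario", "Garcia", "1962-07-19"), 0),
    (rec("Li", "Wei", "1975-09-21"), rec("Ana", "Lopez", "1975-09-21"), 0),
]


def similarity(x, y):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def features(r1, r2):
    """Describe a record pair as [bias, first-name sim, last-name sim, dob agreement]."""
    return [
        1.0,
        similarity(r1["first"], r2["first"]),
        similarity(r1["last"], r2["last"]),
        1.0 if r1["dob"] == r2["dob"] else 0.0,
    ]


def predict_proba(weights, x):
    """Logistic model: estimated probability that the pair is the same person."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=20000, learning_rate=0.5):
    """Fit logistic regression with plain full-batch gradient descent."""
    data = [(features(r1, r2), label) for r1, r2, label in pairs]
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, label in data:
            error = predict_proba(weights, x) - label
            for j, xj in enumerate(x):
                gradient[j] += error * xj
        weights = [w - learning_rate * g / len(data) for w, g in zip(weights, gradient)]
    return weights


weights = train(labeled_pairs)

# Score a new, unlabeled candidate pair.
candidate = (rec("Katherine", "Johnson", "1992-03-03"),
             rec("Katharine", "Johnson", "1992-03-03"))
match_probability = predict_proba(weights, features(*candidate))
print(f"Estimated match probability: {match_probability:.2f}")
```

Because the output is a probability learned from labeled pairs, standard evaluation tools such as held-out accuracy, precision, and recall apply directly, which is what makes linking accuracy measurable. A real workflow would use far more labeled pairs, richer features, and a held-out test set.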

This talk is geared towards data practitioners in any field who find themselves needing to gather information about people across rows or datasets. We’ll briefly cover why record linkage is important and how it can be framed as a prediction problem. We’ll then spend most of the time introducing the Name Match tool and walking through an example. No previous experience in record linkage is required.


Prior Knowledge Expected

No previous knowledge expected

Melissa McNeill is a senior data scientist at the University of Chicago Crime Lab working to build and evaluate prediction models that are accurate, fair, and useful in the real world. She is a core contributor to Name Match, an open source probabilistic record linkage tool. Melissa holds a B.S. in Computer Science from Texas A&M and an M.S. in Analytics from Northwestern.