PyData NYC 2022

Fairness for Scikit-Learn Pipelines with Lale
11-09, 10:15–11:00 (America/New_York), Radio City (6th floor)

We would like machine-learning pipelines to be fair, i.e., to avoid bias based on race, gender, age, or other attributes. This talk gives an introduction to algorithmic fairness concepts and shows how to put them into practice with scikit-learn pipelines. You will learn about metrics to measure fairness and about mitigators to reduce bias. Furthermore, you will learn about fairness in the presence of data preprocessing, ensemble learning, hyperparameter tuning, etc. This talk includes code examples based on the Lale open-source library, which provides scikit-learn compatibility for fairness algorithms.


The talk will have four sections of roughly equal length, to discuss data, metrics, pipelines, and AutoML.

The section on fairness and data introduces basic dataset fairness information, namely protected attributes and favorable outcomes. You will learn how this information induces intersectional subgroups in the data, and how to quantify them. One challenge with algorithmic fairness is that results are often not stable across train-test splits; this section shows how to use stratification to ameliorate this problem. Finally, you will learn about correlations between protected attributes, other features, and outcomes, and why naively redacting the protected attributes is insufficient.
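As a taste of what this looks like in code, here is a minimal sketch based on Lale's aif360 module; the helper names fetch_creditg_df and fair_stratified_train_test_split and the structure of the fairness_info dictionary are taken from the Lale documentation and are assumptions that may differ across versions:

    # Hypothetical sketch: dataset fairness information and a fair stratified split.
    # Assumes Lale's aif360 helpers fetch_creditg_df and fair_stratified_train_test_split.
    import pandas as pd
    from lale.lib.aif360 import fetch_creditg_df, fair_stratified_train_test_split

    # fairness_info names the favorable outcome labels and the protected attributes
    X, y, fairness_info = fetch_creditg_df()
    print(fairness_info)

    # tabulate intersectional subgroup sizes for protected attributes and outcomes
    prot_attrs = [pa["feature"] for pa in fairness_info["protected_attributes"]]
    print(pd.concat([X[prot_attrs], y], axis=1).value_counts())

    # stratify on protected attributes and labels so metrics are more stable across splits
    train_X, test_X, train_y, test_y = fair_stratified_train_test_split(
        X, y, **fairness_info, test_size=0.33)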

The section on fairness and metrics describes different fairness metrics and their relative merits. You will learn how they compare to various accuracy metrics, against which they are often traded off. You will learn the distinction between metrics on data and scorers on models or pipelines, and how these are used in various scikit-learn APIs. Finally, you will learn about baseline values for these metrics on a dataset, based on dummy models as well as common off-the-shelf estimators.
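Continuing the sketch above, the following illustrates a fairness scorer next to an accuracy scorer and a dummy baseline; the disparate_impact factory is assumed to come from Lale's aif360 module and to follow the usual scikit-learn scorer signature:

    # Hypothetical sketch: a fairness scorer versus accuracy, with a dummy baseline.
    # Assumes lale.lib.aif360.disparate_impact; continues variables from the sketch above.
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import get_scorer
    from lale.lib.aif360 import disparate_impact

    # both follow the usual sklearn scorer signature: scorer(model, X, y)
    di_scorer = disparate_impact(**fairness_info)
    accuracy_scorer = get_scorer("accuracy")

    # baseline: a dummy model that always predicts the majority class; its disparate
    # impact can be 1.0 or even undefined, which is a useful sanity check
    baseline = DummyClassifier(strategy="most_frequent").fit(train_X, train_y)
    print("accuracy:        ", accuracy_scorer(baseline, test_X, test_y))
    print("disparate impact:", di_scorer(baseline, test_X, test_y))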

The section on fairness and pipelines starts by covering the behavior of fairness-agnostic scikit-learn estimators. Next, it introduces a variety of bias mitigators, which are operators that either enhance or replace an estimator to make it fairer. You will learn how mitigators interact with preprocessing, and how to use both together effectively. Finally, scikit-learn offers several kinds of ensembles, including bagging, boosting, voting, and stacking; you will learn how to combine mitigators with these ensembles.
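Here is a sketch of a mitigated pipeline, continuing from the splits and scorers above. It assumes Lale's DisparateImpactRemover operator, the >> and & pipeline combinators, and an optional preparation argument that lets the mitigator apply preprocessing internally while still seeing the raw protected attributes; these names and arguments may differ across Lale versions:

    # Hypothetical sketch: preprocessing plus a pre-estimator mitigator plus an estimator.
    # Assumes lale.lib.aif360.DisparateImpactRemover and its `preparation` argument.
    from lale.lib.aif360 import DisparateImpactRemover
    from lale.lib.lale import Project, ConcatFeatures
    from lale.lib.sklearn import OneHotEncoder, LogisticRegression

    # preprocessing: one-hot encode string columns and pass numeric columns through
    prep = (
        (Project(columns={"type": "string"}) >> OneHotEncoder(handle_unknown="ignore"))
        & Project(columns={"type": "number"})
    ) >> ConcatFeatures

    # the mitigator runs the preparation internally, so it can still find the raw
    # protected attributes; the estimator then trains on the repaired features
    pipeline = (
        DisparateImpactRemover(**fairness_info, preparation=prep)
        >> LogisticRegression(max_iter=1000)
    )
    trained = pipeline.fit(train_X, train_y)
    print("accuracy:        ", accuracy_scorer(trained, test_X, test_y))
    print("disparate impact:", di_scorer(trained, test_X, test_y))

In the same spirit, a mitigated pipeline can also serve as the base estimator of bagging, boosting, voting, or stacking ensembles.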

The section on fairness and AutoML starts by illustrating Pareto frontiers as applied to algorithmic fairness. Next, it outlines difficulties that arise when applying AutoML in a fairness context, in particular unstable results and undefined metric values. Then, you will learn how to tune hyperparameters for fairness, including hyperparameters of estimators, preprocessing, mitigators, and ensembles. Finally, you will learn how to also tune the choice of algorithms in all of these parts of a pipeline for fairness.
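The following sketch shows what fairness-aware AutoML can look like, continuing from the pipeline above. It assumes Lale's Hyperopt wrapper, the | choice combinator, and the accuracy_and_disparate_impact and FairStratifiedKFold helpers from the aif360 module; all of these names are taken from the Lale documentation and may vary by version:

    # Hypothetical sketch: tuning hyperparameters and algorithm choices for fairness.
    # Assumes lale.lib.lale.Hyperopt plus Lale's blended scorer and fair CV splitter.
    from lale.lib.aif360 import accuracy_and_disparate_impact, FairStratifiedKFold
    from lale.lib.lale import Hyperopt
    from lale.lib.sklearn import RandomForestClassifier

    # search space: mitigated preprocessing followed by a choice of estimators,
    # with unbound hyperparameters of every step left free for the optimizer
    planned = DisparateImpactRemover(**fairness_info, preparation=prep) >> (
        LogisticRegression | RandomForestClassifier
    )

    # a blended scorer trades accuracy off against disparate impact, and fair
    # stratified cross-validation keeps fold-to-fold results more stable
    optimizer = Hyperopt(
        estimator=planned,
        scoring=accuracy_and_disparate_impact(**fairness_info),
        cv=FairStratifiedKFold(**fairness_info, n_splits=3),
        max_evals=10,
    )
    best_found = optimizer.fit(train_X, train_y)
    print(best_found.get_pipeline().pretty_print())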


Prior Knowledge Expected

No previous knowledge expected

Martin Hirzel is a researcher and the manager of the AI Programming Models team at IBM Research AI. Martin received his PhD from the University of Colorado at Boulder in 2004; his thesis adviser was Amer Diwan. At IBM, Martin works on tools and languages for artificial intelligence and streaming systems. Martin's papers have won awards at several conferences, and he is an ACM Distinguished Scientist.