PyData NYC 2022

Introduction to Causal Inference
11-11, 15:30–17:00 (America/New_York), Central Park West (6th floor)

Causal data analysis is very common in many academic domains and has been surging in popularity in the data industry over the last few years. In this tutorial I'll give attendees a gentle introduction to applying causal thinking and causal inference to data using Python.

Attendees don't need any prior experience with causal inference or causal thinking. To make the most of the hands-on portion of the tutorial, attendees should have moderate experience with the modern Python data stack: NumPy, pandas, and scikit-learn. You will be able to walk away from this tutorial with a foundational understanding of causal inference and the ability to carry out your own causal analyses.

During the tutorial you will be split up into groups to work through two different exercise notebooks. All of the materials for the tutorial can be found here: https://github.com/ronikobrosly/pydata_nyc_2022


This tutorial session is intended to give attendees a gentle introduction to applying causal thinking and causal inference to data using Python. Causal data analysis is very common in many academic domains (e.g., social psychology, epidemiology, macroeconomics, public policy research, sociology, and more) as well as in industry (all of the largest Silicon Valley tech companies employ teams of scientists who answer business questions purely with causal inference methods).

The tutorial will combine a presentation with open Q&A and group exercises contained in Jupyter notebooks. This session will cover the difference between correlation and causation, the pitfalls of conducting an analysis using observational data, how causal inference can help get around these pitfalls, and two examples of common, modern modeling approaches used to conduct causal inference (g-computation and estimating causal curves). After the tutorial, attendees should have a good foundational understanding of causality and the ability to confidently explore the topic on their own. Causal inference can be a very theory-heavy topic, which can make it impenetrable to novices. In this tutorial, I'll aim to take a more practical perspective on causal inference, while still occasionally touching on the theory.
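To give a flavor of g-computation before the session, here is a minimal sketch on simulated data. It is not taken from the tutorial notebooks: the data-generating process, the function names (`simulate`, `naive_difference`, `g_computation`), and the true effect of 2.0 are all assumptions made for illustration, and it uses only the standard library rather than the pandas/scikit-learn stack used in the exercises. With one binary confounder, g-computation reduces to averaging the within-stratum treated-vs-untreated differences, weighted by how common each stratum is in the whole population.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Simulate (z, a, y) rows: z confounds both treatment a and outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0                   # binary confounder
        a = 1 if rng.random() < (0.8 if z else 0.2) else 0   # z makes treatment more likely
        y = 2.0 * a + 3.0 * z + rng.gauss(0, 1)              # assumed true effect of a is 2.0
        rows.append((z, a, y))
    return rows


def naive_difference(rows):
    """Unadjusted difference in mean outcome between treated and untreated."""
    treated = [y for _, a, y in rows if a == 1]
    untreated = [y for _, a, y in rows if a == 0]
    return mean(treated) - mean(untreated)


def g_computation(rows):
    """Standardize stratum-specific effects to the marginal distribution of z."""
    n = len(rows)
    effect = 0.0
    for z_val in (0, 1):
        stratum = [(a, y) for z, a, y in rows if z == z_val]
        p_z = len(stratum) / n                               # P(Z = z)
        mean_treated = mean(y for a, y in stratum if a == 1)
        mean_untreated = mean(y for a, y in stratum if a == 0)
        effect += p_z * (mean_treated - mean_untreated)
    return effect


rows = simulate()
naive = naive_difference(rows)
adjusted = g_computation(rows)
print(f"naive difference:       {naive:.2f}")
print(f"g-computation estimate: {adjusted:.2f}")
```

In this simulation the naive difference overstates the effect because treated units are more likely to have `z = 1`, while the standardized estimate should land near the simulated value of 2.0. With continuous or many confounders, the stratum means would typically be replaced by predictions from a fitted outcome model.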

Tutorial Outline:

  • Introduction (15 min):
    • "By the end of the tutorial you should be able to..."
    • Motivating problem: vitamin D and COVID severity
    • How causal inference questions differ from standard machine learning questions
    • Experiments vs causal inference
  • Causal graphs and the four types of relationships to know (30 min):
    • What is a "confounder"
    • What is a "collider"
    • What is a "mediator"
    • What are "unrelated predictors"
  • Hands-on exercise 1: G-computation (20 min)
  • Hands-on exercise 2: Causal curves (20 min)
  • Closing thoughts (5 min):
    • Tips for troubleshooting your own analyses
    • Avoid multiple testing!
    • Be humble. It is likely your research or business idea doesn’t work 🤷🏻
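
The "collider" item in the outline above can be illustrated with a small simulation. This is a sketch under assumed conditions, not material from the tutorial: two independent variables `x` and `y` both feed into a collider `x + y`, and the selection threshold of 1.0 and the helper `pearson` are illustrative choices. Restricting the data to rows where the collider is large induces a correlation between `x` and `y` that does not exist in the full sample.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(20000)]   # independent of ys by construction
ys = [rng.gauss(0, 1) for _ in range(20000)]

# Condition on the collider: keep only rows where x + y exceeds the threshold.
selected = [(x, y) for x, y in zip(xs, ys) if x + y > 1.0]

corr_all = pearson(xs, ys)
corr_selected = pearson([x for x, _ in selected], [y for _, y in selected])
print(f"correlation, full sample:     {corr_all:.2f}")
print(f"correlation, selected sample: {corr_selected:.2f}")
```

The full-sample correlation should be close to zero, while the selected sample shows a clearly negative one. This is why adjusting for (or selecting on) a collider can create bias rather than remove it.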

Prior Knowledge Expected

No previous knowledge expected

I am a former epidemiology researcher who has spent approximately a decade employing causal modeling and inference. The bulk of my academic career was spent conducting data analyses to estimate the population-level effects of harmful environmental exposures, when traditional randomized experiments were infeasible or unethical.

Since leaving the academic world, I've been loving my second life in the tech industry as a data scientist, ML engineer, and more recently as the Head of Data Science at a medium-sized health tech company based in Washington, DC. I love mentoring junior data folks and explaining the magic of data analysis and modeling to non-technical audiences.

I am also a member of the open-source community, as the author and maintainer of the causal-curve Python package. This package provides a set of tools for estimating the causal impact of continuous/non-binary treatments (e.g., estimating the causal impact of a neighborhood's income inequality on local crime, or understanding the causal effect of increasing a product's price on conversion rates).