PyData NYC 2022

Zeno Does Data Science: The Paradoxical Quest for Reproducibility
11-09, 14:45–15:30 (America/New_York), Central Park East (6th floor)

How do you make your data science processes reproducible? There are a lot of really hard problems hidden in that question. This talk describes a multi-year journey towards reproducibility at the Tutte Institute, and exposes the tradeoffs that we have made to get close to that goal.


It sounds so easy on paper: "Let's make our data science processes reproducible!" And while parts of the journey feel pretty good, we soon find ourselves confronting the hard problems that we initially pushed aside. Looking closely, these problems seem harder than the problem we started with, and so on, ad infinitum.

Getting to reproducibility feels like Zeno's paradox: to get to our finish line (reproducibility!), we must first go halfway (through the hard parts); before we can go halfway, we have to go a quarter of the way (through the REALLY hard parts), and so on, until we find ourselves confronting an infinitude of really, really hard problems.

This talk describes our journey towards reproducibility: the tools, techniques, workflows, and brutal hacks that have gotten us ever closer to the holy grail of reproducible data science. Along the way, we dig into some of the hard problems we have faced, the even harder sub-problems that underlie them, and the compromises we have made to draw a line that is "close enough" to our finish line: reproducible data science for heterogeneous workgroups.


Prior Knowledge Expected

No previous knowledge expected

Kjell is a computer engineer and mathematician who splits his time between Big Data and Little Learners. By day he is the Supervisor of Data Science and Machine Learning research at the Tutte Institute for Mathematics and Computing. By night he's the co-founder of Learn Leap Fly, an educational software company using AI and machine learning to help teach the world to learn.