PyData NYC 2022

Scalable Feature Engineering with Hamilton
11-10, 11:00–11:45 (America/New_York), Winter Garden (5th floor)

In this talk we present Hamilton, a novel open-source framework for developing and maintaining scalable feature engineering dataflows. Hamilton was initially built to solve the problem of managing a codebase of transforms on pandas dataframes, enabling a data science team to scale the capabilities they offer with the complexity of their business. Since then, it has grown into a general-purpose tool for writing and maintaining dataflows in Python. We introduce the framework, discuss its motivations and initial successes at Stitch Fix, and share recent extensions that seamlessly integrate it with distributed compute offerings such as Dask, Ray, and Spark.


At Stitch Fix, a data science team’s feature generation process was causing iteration and operational frustrations as they delivered time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open-source Python framework that solved their pain points by changing their working paradigm.

Specifically, Hamilton enables a simpler approach for data science & data engineering teams to create, maintain, execute, and scale both the human and computational sides of feature/data transforms.

At a high level, we will cover:
- What Hamilton is and why it was created
- How to use it for feature engineering
- The software engineering best practices Hamilton prescribes that make pipelines more sustainable
- How Hamilton enables out-of-the-box scaling with common distributed compute frameworks

At a low level, through code in the slides and a quick demo, you will learn:
- How a data science team at Stitch Fix scaled their team and code base with Hamilton to enable documentation-friendly, unit-testable code
- What Hamilton is and how the declarative paradigm it prescribes offers advantages over more traditional approaches
- How you can easily add runtime data quality checks to ensure the robustness of your pipeline
- How the Ray/Dask/Spark integrations with Hamilton work and how they can help you scale your data processing
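To give a flavor of the declarative paradigm discussed above, here is a minimal, self-contained sketch of the core idea: each function declares one output (named after the function), and its parameter names declare the upstream inputs it depends on. Note this toy resolver and its function/column names (`spend`, `signups`, etc.) are illustrative assumptions for this summary, not Hamilton's actual implementation or API:

```python
import inspect

# Each function defines one node in the dataflow: the function name is the
# output, and the parameter names are the dependencies.
def spend_mean(spend: list) -> float:
    """Average marketing spend."""
    return sum(spend) / len(spend)

def spend_per_signup(spend: list, signups: list) -> list:
    """Spend divided by signups, element-wise."""
    return [s / n for s, n in zip(spend, signups)]

def spend_zero_mean(spend: list, spend_mean: float) -> list:
    """Spend with its mean subtracted."""
    return [s - spend_mean for s in spend]

def execute(funcs, wanted, inputs):
    """Toy driver: resolve requested outputs by recursively computing
    dependencies, matching parameter names to function names or inputs."""
    registry = {f.__name__: f for f in funcs}
    cache = dict(inputs)

    def resolve(name):
        if name not in cache:
            fn = registry[name]
            args = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**args)
        return cache[name]

    return {name: resolve(name) for name in wanted}

result = execute(
    [spend_mean, spend_per_signup, spend_zero_mean],
    ["spend_per_signup", "spend_zero_mean"],
    {"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 3]},
)
# result["spend_per_signup"] == [10.0, 10.0, 10.0]
# result["spend_zero_mean"] == [-10.0, 0.0, 10.0]
```

Because transforms are plain, named Python functions with explicit dependencies, they are straightforward to unit test and document in isolation — the property the Stitch Fix team relied on to scale their codebase.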


Prior Knowledge Expected

No previous knowledge expected

Elijah has always enjoyed working at the intersection of math and engineering. More recently, he has focused his career on building tools to make data scientists more productive. At Two Sigma, he built infrastructure to help quantitative researchers efficiently turn ideas into production trading models. At Stitch Fix he leads the Model Lifecycle team — a team that focuses on streamlining the experience for data scientists to create and ship machine learning models. In his spare time, he enjoys geeking out about fractals, poring over antique maps, and playing jazz piano.