PyData NYC 2022

Apache Beam on Dask: Portable, Scalable, Scientific Python (AKA Data Engineering for the Climate)
11-10, 11:00–11:45 (America/New_York), Central Park East (6th floor)

In this introduction to Apache Beam, we’ll discuss ongoing efforts to support Dask – and why you should care as a scientist or data practitioner. We'll talk about how both portability and scalability in data engineering are essential to help address the climate crisis. Beam offers a high-level way to represent the flow of data, replacing the need for an up-front understanding of low-level infrastructure to compute at scale. With a standard programming model that spans execution engines – and even programming languages – we’ll demonstrate how one can pivot from HPC to Big Data systems with minimal code changes.

To illustrate the motivation behind this work, we’ll dissect the needs of the Pangeo Forge project, an open ecosystem for engineering climate datasets, and discuss why they adopted Beam as a Big Data compiler instead of becoming one. From here, we’ll explore how the generality of the Dask-and-Beam approach enables composable, scalable, and portable science in Python.
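
To make the "same pipeline, different engine" idea concrete, here is a minimal sketch of a Beam pipeline in Python. The DirectRunner run uses the standard Beam SDK; the commented-out lines show how the same pipeline definition could be handed to the experimental Dask runner discussed in this talk, assuming it is available in your Beam installation (the import path reflects work in progress and may change).

```python
import apache_beam as beam


def build_pipeline(p):
    """Define the dataflow once; the runner decides where it executes."""
    return (
        p
        | "Create" >> beam.Create([1, 2, 3, 4])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Sum" >> beam.CombineGlobally(sum)
        | "Print" >> beam.Map(print)
    )


# Run locally on the default DirectRunner.
with beam.Pipeline() as p:
    build_pipeline(p)

# The identical pipeline on a Dask cluster (assumes the experimental
# DaskRunner ships with your version of Beam):
#
# from apache_beam.runners.dask.dask_runner import DaskRunner
#
# with beam.Pipeline(runner=DaskRunner()) as p:
#     build_pipeline(p)
```

The point of the sketch is that `build_pipeline` is untouched when the execution engine changes; only the runner passed to `beam.Pipeline` differs.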


Prior Knowledge Expected

No previous knowledge expected

Alex is a senior software engineer at Google Research focused on democratizing climate & weather data.