PyData NYC 2022

Matthew Rocklin

Matthew is an open source software developer in the PyData ecosystem. He primarily works on Dask, a library for parallel computing in Python. Matthew worked for Anaconda and NVIDIA before starting Coiled, a company with a mission to enable scalable computing for the Python community.

Sessions

11-09
15:30
45min
Deploying Dask
Matthew Rocklin

Dask is a framework for parallel computing in Python.
It's great, until you need to set it up.

Kubernetes? Cloud? HPC? SSH? YARN/Hadoop even?
What's the right deployment technology to choose?

After you set it up, a new set of problems arises:

  • How do you install software across the cluster?
  • How do you secure network access?
  • How do you access secure data that needs credentials?
  • How do you track who uses it and constrain costs?
  • When things break, how do you track them down?

Solutions to these problems exist in open source packages like dask-kubernetes, Helm charts, dask-cloudprovider, and dask-gateway, as well as in commercially supported products like Coiled, Saturn, QHub, AWS EMR, and GCP Dataproc. How do we choose?

This talk describes the problem faced by people trying to deploy any distributed computing system, and tries to construct a framework to help them make decisions on how to deploy.

Central Park West (6th floor)
11-11
09:00
90min
Dask
Natalia Clementi, David Chudzicki, Matthew Rocklin

Learn how to parallelize your Python code, first on your laptop and then on a distributed cluster.

This tutorial shows how to use Dask, a popular open source framework for parallel computing, to parallelize Python code. We start with parallelizing simple for loops, and move on to scaling out pandas code.

Along the way we will learn about concepts like partitioning data, parallel performance tracking, and managing exceptions and debugging on remote machines.

This will be a hands-on tutorial with Jupyter notebooks.

Central Park West (6th floor)