PyData NYC 2022

Testing Big Data Applications (Spark, Dask, and Ray)
11-10, 13:30–14:15 (America/New_York), Central Park West (6th floor)

Data practitioners use distributed computing frameworks such as Spark, Dask, and Ray to work with big data. One of the major pain points of these frameworks is testability. For testing simple code changes, users have to spin up local clusters, which have a high overhead. In some cases, code dependencies force testing against a cluster. Because testing on big data is hard, it becomes easy for practitioners to avoid testing entirely. In this talk, we’ll show best practices for testing big data applications. By using Fugue to decouple logic and execution, we can bring more tests locally and make it easier for data practitioners to test with low overhead.


Distributed computing engines such as Spark, Dask, and Ray allow data practitioners to scale their data processing over a cluster of machines. The problem is that debugging and testing distributed computing code is notoriously hard, both when iterating and when writing unit tests. Stack traces can often become cryptic because distributed computing uses futures or async code under the hood.

More importantly, using these frameworks can lock testing into depending on a cluster. Some libraries, such as databricks-connect, make it convenient to run code on a Spark cluster, but all testing code ends up running on the cluster as well. On a code level, distributed computing code is often a combination of logic and execution behavior, which makes it hard to unit test the logic without also testing the execution code.

Ideally, we want to test as much as possible locally. This speeds up iteration time and decreases computing expenses. Keeping code closer to native Python or Pandas also gives more intuitive tracebacks when errors arise. When we’re production-ready, we can run the full test suite on the cluster.
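As a minimal sketch of what local testing looks like (the function and column names here are hypothetical), a unit test of business logic needs nothing but Pandas:

```python
import pandas as pd

# Hypothetical business logic: fill missing prices with zero.
# Kept free of any Spark/Dask/Ray imports so it can be tested locally.
def fill_missing_prices(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["price"] = out["price"].fillna(0.0)
    return out

# A plain local test on a small DataFrame -- no cluster required
def test_fill_missing_prices():
    df = pd.DataFrame({"price": [1.0, None, 3.0]})
    result = fill_missing_prices(df)
    assert result["price"].tolist() == [1.0, 0.0, 3.0]
```

A test runner like pytest would collect and run `test_fill_missing_prices` in milliseconds, with a plain Python traceback on failure.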

Fugue is an abstraction layer for distributed computing that ports Python and Pandas code to Spark, Dask, and Ray. By using an abstraction layer, we can code in native Python or Pandas rather than against big data frameworks. Decoupling logic and execution dramatically reduces the overhead of running tests because they can run locally on Pandas or Python. Unit tests become easier to write because they focus on business logic over smaller data. When production-ready, Fugue allows users to toggle the execution engine and run the same code on a cluster.

Outline:

  • Introduction to testing (4 mins)
    • Testing code during iterating
    • Unit tests
    • Unit tests for data
  • Why is it hard to test PySpark code? (6 mins)
    • There is a lot of boilerplate code
    • Testing can require a cluster depending on the setup
    • Execution behavior can be baked into functions
    • Stack traces can be hard to read
  • Comparisons to Dask and Ray (3 mins)
  • What is the ideal setup? (4 mins)
    • Test as much as possible locally
    • Bring to the cluster when production-ready
    • Run the unit test suite without a cluster
  • Fugue abstraction layer (4 mins)
    • Decoupling of logic and execution
  • Fugue transform function (6 mins)
    • Scaling functions to Spark/Dask/Ray
    • Type hint conversion
  • Testing code in native Python and Pandas (6 mins)
    • Unit tests become easier to write
    • Stack traces for debugging become easier to read
  • Running full tests on a cluster (3 mins)
    • Simply swap the execution engine for running tests
  • Conclusion and questions (4 mins)

Prior Knowledge Expected: Previous knowledge expected

Han Wang is the lead of Lyft's Machine Learning Platform, focusing on distributed computing and training. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon, and Quantlab. Han is the founder of the Fugue project, which aims to democratize distributed computing and machine learning.