PyData NYC 2022

Fast and Scalable Timeseries Modelling with Fugue and Nixtla
11-11, 13:30–15:00 (America/New_York), Winter Garden (5th floor)

Time series modeling has been one of the weak points of the Python ecosystem compared to R. Statistical modeling libraries such as pmdarima and statsmodels are orders of magnitude slower than their R counterparts, and state-of-the-art algorithms remain challenging to implement. In this tutorial, we introduce a set of open-source libraries that allow for fast and scalable time series modeling in Python. Using StatsForecast and NeuralForecast on distributed computing backends such as Dask, Ray, and Spark, we will show participants how to forecast at scale and even how to outperform current benchmarks in the R ecosystem.

We’ll walk through general best practices for working with time series data and explore the main families of time series modeling techniques: statistical, hierarchical, and deep-learning-based approaches.

Using the Fugue abstraction layer, we’ll learn how to port Python and pandas code to distributed compute clusters with a few lines of code and leverage the power of Dask, Spark, and Ray. Participants will learn how to train millions of time series models in a few minutes.
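
To give a flavor of this pattern, here is a minimal, hedged sketch (not taken from the tutorial materials): a plain pandas function is written and tested locally, then handed to Fugue's transform function, where an engine argument of "spark", "dask", or "ray" would run the same logic on a cluster. Column names are illustrative.

    import pandas as pd
    from fugue import transform

    def add_lag_feature(df: pd.DataFrame) -> pd.DataFrame:
        # plain pandas logic, written and tested locally
        df["y_lag1"] = df["y"].shift(1)
        return df

    local = pd.DataFrame({
        "ds": pd.date_range("2022-01-01", periods=5, freq="D"),
        "y": [1.0, 2.0, 3.0, 4.0, 5.0],
    })

    # runs on pandas when no engine is given; engine="spark", "dask", or "ray"
    # would execute the same function on a distributed backend instead
    result = transform(local, add_lag_feature, schema="*,y_lag1:double")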


This tutorial will give a detailed overview of the classical and machine learning techniques available for handling time series data. Time series have unique characteristics that require different handling compared to other regression problems. We will dive deep into problems such as the following (a short code sketch follows the list):

  • Evaluation of models using rolling-window or cross-validation strategies
  • Overfitting and data leakage
  • Different error metrics
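
As a hedged illustration of the first point, the sketch below runs rolling-window cross-validation with StatsForecast; the column names (unique_id, ds, y) follow the Nixtla long-format convention, and exact arguments may differ between library versions.

    import pandas as pd
    from statsforecast import StatsForecast
    from statsforecast.models import Naive

    # Nixtla libraries expect long-format data: unique_id, ds (timestamp), y (target)
    df = pd.DataFrame({
        "unique_id": ["series_1"] * 36,
        "ds": pd.date_range("2019-01-31", periods=36, freq="M"),
        "y": [float(i) for i in range(36)],
    })

    sf = StatsForecast(models=[Naive()], freq="M")

    # 3 rolling windows, each forecasting 6 steps ahead and sliding forward 6 steps;
    # evaluating only on each window's held-out future avoids data leakage
    cv_df = sf.cross_validation(df=df, h=6, step_size=6, n_windows=3)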

We’ll then give an overview of classical and deep learning time series modeling techniques. Statistical models such as ARIMA and ETS are classical, lightweight, econometrics-based models that were the field standard for a long time. We will explore how StatsForecast uses numba and highly efficient code to achieve scalable, accurate, and reproducible results.
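
To make this concrete, here is a hedged sketch (assuming the same long data format and a recent StatsForecast API) of fitting automatic ARIMA and ETS models and producing forecasts; the example data is purely illustrative.

    import pandas as pd
    from statsforecast import StatsForecast
    from statsforecast.models import AutoARIMA, AutoETS

    df = pd.DataFrame({
        "unique_id": ["store_1"] * 48,
        "ds": pd.date_range("2018-01-31", periods=48, freq="M"),
        "y": [float(i % 12 + i / 10) for i in range(48)],
    })

    sf = StatsForecast(
        models=[AutoARIMA(season_length=12), AutoETS(season_length=12)],
        freq="M",
        n_jobs=-1,  # fit different series in parallel on a single machine
    )

    # forecast 12 months ahead; the result has one forecast column per model
    forecasts = sf.forecast(df=df, h=12)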

More cutting-edge models will then be discussed. We will dive deep into the challenges of using neural networks for forecasting and explore how to reconcile forecasts across different levels of aggregation with hierarchical methods.
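
As one hedged illustration of the deep learning side (assuming NeuralForecast's high-level API; model arguments and defaults may differ across versions), an N-BEATS model can be trained on the same long-format data:

    import pandas as pd
    from neuralforecast import NeuralForecast
    from neuralforecast.models import NBEATS

    df = pd.DataFrame({
        "unique_id": ["sensor_1"] * 60,
        "ds": pd.date_range("2017-01-31", periods=60, freq="M"),
        "y": [float(i % 12) for i in range(60)],
    })

    nf = NeuralForecast(
        models=[NBEATS(h=12, input_size=24, max_steps=100)],
        freq="M",
    )
    nf.fit(df=df)
    forecasts = nf.predict()  # 12-step-ahead forecasts per series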

Lastly, we'll introduce Fugue, an abstraction layer for distributed computing. Fugue can take Python and pandas code and bring it to Spark, Dask, or Ray. We’ll show how to use Fugue to achieve time series modeling at scale. There are best practices to consider, such as data loading and avoiding multi-level parallelism. Fugue can also handle spinning up temporary clusters to execute modeling and then spinning them down. One of the challenges of big data is effective iteration, and we’ll show best practices to avoid costly mistakes.
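
As a hedged sketch of this scaling pattern (the DataFrame big_df, the output schema, and the Spark engine are assumptions for illustration), a per-series forecasting function can be partitioned by unique_id and shipped to a cluster with Fugue's transform:

    import pandas as pd
    from fugue import transform
    from statsforecast import StatsForecast
    from statsforecast.models import AutoARIMA

    def forecast_group(df: pd.DataFrame) -> pd.DataFrame:
        # each partition holds the full history of one series (or a few),
        # so we keep n_jobs=1 here to avoid multi-level parallelism
        sf = StatsForecast(models=[AutoARIMA(season_length=12)], freq="M", n_jobs=1)
        out = sf.forecast(df=df, h=12)
        if "unique_id" not in out.columns:  # some versions return it as the index
            out = out.reset_index()
        out["AutoARIMA"] = out["AutoARIMA"].astype("float64")
        return out[["unique_id", "ds", "AutoARIMA"]]

    # big_df is a hypothetical long-format DataFrame with many unique_id values
    result = transform(
        big_df,
        forecast_group,
        schema="unique_id:str,ds:datetime,AutoARIMA:double",
        partition={"by": "unique_id"},
        engine="spark",  # swap for "dask" or "ray", or drop to run locally on pandas
    )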

Outline:

  • Introduction and Environment Spin-up (5 mins)
  • Evolution of time series modeling in Python (5 mins)
  • Timeseries basics and best practices (10 mins)
    • Exogenous variables
    • Cross-validation
    • Metrics for evaluation
  • Statistical forecasting (10 mins)
    • ARIMA
    • ETS
    • Comparison to pmdarima, StatsForecast, and statsmodels
    • Intermittent time series
  • Hierarchical forecasting (10 mins)
    • Motivation
    • Reconciliation methods
  • Neural forecasting (10 mins)
    • Overview of deep learning models
    • Datasets and Dataloaders
    • Production and deployment
  • Fugue abstraction layer (5 mins)
    • Introduction to Spark/Dask/Ray
  • Fugue transform function (10 mins)
    • Introduction to partitions
    • Distributed computing basics
    • Scaling functions to Spark/Dask/Ray
  • Running time series models on Spark/Dask/Ray (10 mins)
    • Scaling StatsForecast on Spark/Dask/Ray
    • Scaling HierarchicalForecast and NeuralForecast (conceptual)
  • Postprocessing and deployment at scale (10 mins)
  • Wrap-up and questions (5 mins)

Prior Knowledge Expected

No previous knowledge expected

Kevin Kho is a maintainer of the Fugue project, an abstraction layer for distributed computing. Previously, he was an Open Source Community Engineer at Prefect, a workflow orchestration system. Before working on data tooling, he was a data scientist for four years.