PyData NYC 2022

Data and Model Version Control: Applications in ML Drug Discovery pipelines
11-10, 13:30–14:15 (America/New_York), Music Box (5th floor)

Developing Machine Learning (ML) pipelines for drug discovery poses challenges that differ from those of traditional software development. Beyond the distinct challenges of the data engineering stage, drug discovery pipelines require not only standard Git tracking for source code but also versioning of data and ML models. In this talk, we will discuss some of the main challenges of working with biological data and how Data Version Control (DVC) tools facilitate data and model tracking during the development of ML drug discovery pipelines.


Biological data is inherently complex and marked by heterogeneity, conditionality, dimensionality, and research bias. ML models in drug discovery are built on large biological datasets from diverse sources that vary in format and content, which makes data engineering tasks challenging. This heterogeneity requires significant data sanitization, standardization, and selection steps that can dramatically influence model performance. Additionally, after a model has been built and assessed over multiple iterations, the deployed model enters a feedback loop that often calls for either retraining it with updated data or implementing a new feature to maintain or improve performance. As a result, the ML development cycle generates a large number of files for both model and data versions.

The inherent complexity of biological data preprocessing adds a new layer of experimentation parameters associated with developing optimal models. How can we keep track of dataset versions, experiments, and models in an automated way? How can we organize and store different ML model versions with their data? Data Version Control is the answer. By combining Git and DVC tools, ML teams can track and manage large datasets and ML models.
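The idea behind pairing Git with a data versioning tool can be illustrated with a toy sketch: the large file is stored in a content-addressed cache keyed by its hash, and only a small pointer file goes into Git. This is a simplified illustration of the concept, not DVC's actual implementation; the function names `track` and `restore`, the `.pointer.json` suffix, and the file `assays.csv` are hypothetical and chosen only for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path, cache_dir):
    """Copy `path` into a content-addressed cache and write a pointer file.

    The small pointer file is what would be committed to Git (hypothetical
    format); the cache holds the actual bytes of every tracked version.
    """
    checksum = file_md5(path)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(path, cached)
    pointer = path.with_name(path.name + ".pointer.json")
    meta = {"path": path.name, "md5": checksum, "size": path.stat().st_size}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer, cache_dir):
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    checksum = meta["md5"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target


def demo():
    """Track two versions of a dataset, then roll back to the first."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        cache = workspace / "cache"
        data = workspace / "assays.csv"

        data.write_text("compound,activity\nA,0.1\n")
        pointer_v1 = track(data, cache).read_text()

        data.write_text("compound,activity\nA,0.1\nB,0.7\n")
        pointer_path = track(data, cache)
        pointer_v2 = pointer_path.read_text()

        # Checking out an older Git commit would bring back the old pointer;
        # here that step is simulated by rewriting the pointer file by hand.
        pointer_path.write_text(pointer_v1)
        restored = restore(pointer_path, cache).read_text()
        return pointer_v1, pointer_v2, restored
```

Calling `demo()` returns the two pointer files and the restored first version of the data. Because each pointer is a few lines of text, Git can version it alongside the source code while the cache keeps every dataset or model file exactly once per distinct content.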

This talk has two parts. In the first part, we will offer an overview of some of the main challenges and limitations encountered in biological datasets. In the second part, we will present a use case of an ML workflow using DVC, a free and open-source tool for data and model versioning.

Outline

Part I: Biological data in ML pipelines (20 min)

We will discuss some of the main challenges experts face when working with biological data. We will cover the following topics:

  • What makes biological data different?
  • Overview of ML drug discovery pipelines.
  • Data feedback loops: optimizing ML models.
  • Challenges.

Part II: Implementing DVC in drug discovery ML pipelines (20 min)

In the second part of this talk, we will present a use case showing how to implement DVC for data and model versioning in drug discovery pipelines:

  • Introduction to DVC.
  • Tracking data and models in drug discovery.
  • Highlights.

A 5-minute Q&A session will follow at the end of the talk.

To get the most out of this talk, attendees should already be familiar with Git.


Prior Knowledge Expected

Previous knowledge expected

Estefania Barreto-Ojeda is a computational scientist at Cyclica Inc., where she develops and maintains machine learning pipelines for drug discovery. A physicist by training, she has a PhD in Biophysical Chemistry from the University of Calgary, where she developed open-source tools to analyze molecular dynamics (MD) simulations. Estefania is an occasional open-source contributor, full-time data visualization fan, and seasonal bicycle lover.