PyData NYC 2022

Coldstart: A library for automatic data curation and feature engineering
11-10, 10:15–11:00 (America/New_York), Central Park East (6th floor)

Coldstart is a package for automatic data curation and feature engineering. I was motivated to create this package when I witnessed fellow data scientists struggling to begin the modeling process because of unfamiliarity with the vast amount of data in our warehouse, contextual query syntax gotchas, and concurrency. This is essentially a type of cold start problem, especially for new and junior employees.

Coldstart aims to solve this problem by encapsulating best practices and abstracting away lower-level details associated with dynamic query templating, query optimization, concurrent execution, memory management, data leakage, and pipeline deployment. Some of these best practices are made possible by leveraging libraries like PyArrow, Pandas, SQLAlchemy, and Dask under the hood. Coldstart is meant to be a “Goldilocks” solution of sorts that sits somewhere between a collection of version-controlled queries and a full-fledged feature store. If you’re making batch predictions that do not require ultra-low latency guarantees or if you’re not taking full advantage of the warehouse’s available computing resources, then this package might be perfect for you.

Coldstart embraces a code-that-writes-code mindset by exposing a high-level convenience function (feature_factory) that retrieves data from various user-defined domains. It does so by establishing 1:1 or 1:M relationships with peer-reviewed queries, which are templated at runtime based on user parameters and executed concurrently in one or many batches. The output is a single wide dataframe that can be held in memory (i.e., Pandas) or on disk (i.e., Dask) and fed directly into a feature engineering/modeling pipeline. Row-level observations are identified by two-part composite indexes, made up of an entity component and a temporal component, which satisfy most tabular supervised ML use cases. When it’s time to move from development to production, a user can “freeze” the queries that they will be using in their prediction pipeline.
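
The flow described above (template queries at runtime, run them concurrently, assemble one wide result keyed by entity and time) can be illustrated with a minimal, dependency-free sketch. This is not Coldstart's implementation or API: the function names (render, run_query, build_features, make_demo_db), the table and column names, and the use of SQLite in place of a data warehouse are all assumptions made for illustration, and a plain dict keyed by (member_id, as_of_date) stands in for the wide dataframe.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from datetime import date
from string import Template

# Hypothetical query templates, one per "domain". Table and column names
# are invented for this sketch; $start and $end are filled in at runtime.
TEMPLATES = {
    "claims": Template(
        "SELECT member_id, as_of_date, SUM(amount) AS claim_total "
        "FROM claims WHERE as_of_date BETWEEN '$start' AND '$end' "
        "GROUP BY member_id, as_of_date"
    ),
    "visits": Template(
        "SELECT member_id, as_of_date, COUNT(*) AS visit_count "
        "FROM visits WHERE as_of_date BETWEEN '$start' AND '$end' "
        "GROUP BY member_id, as_of_date"
    ),
}


def render(domain, start, end):
    """Fill a domain's template with user parameters."""
    # Templating is plain text substitution, so validate the inputs first;
    # fromisoformat raises ValueError on anything that is not a date.
    date.fromisoformat(start)
    date.fromisoformat(end)
    return TEMPLATES[domain].substitute(start=start, end=end)


def run_query(db_path, sql):
    """Run one query on its own connection and return rows as dicts."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(sql)
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]
    finally:
        conn.close()


def build_features(db_path, domains, start, end):
    """Render and run one query per domain concurrently, then merge the
    results into a single wide mapping keyed by (entity, time)."""
    queries = [render(d, start, end) for d in domains]
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(lambda q: run_query(db_path, q), queries))

    wide = {}
    for rows in results:
        for row in rows:
            # Composite index: entity component plus temporal component.
            key = (row.pop("member_id"), row.pop("as_of_date"))
            wide.setdefault(key, {}).update(row)
    return wide


def make_demo_db(db_path):
    """Create a tiny demo database so the sketch can be run end to end."""
    conn = sqlite3.connect(db_path)
    conn.executescript(
        """
        CREATE TABLE claims (member_id TEXT, as_of_date TEXT, amount REAL);
        CREATE TABLE visits (member_id TEXT, as_of_date TEXT);
        INSERT INTO claims VALUES ('m1', '2022-01-31', 100.0);
        INSERT INTO claims VALUES ('m1', '2022-01-31', 50.0);
        INSERT INTO claims VALUES ('m2', '2022-01-31', 20.0);
        INSERT INTO visits VALUES ('m1', '2022-01-31');
        INSERT INTO visits VALUES ('m1', '2022-01-31');
        INSERT INTO visits VALUES ('m1', '2022-01-31');
        """
    )
    conn.commit()
    conn.close()
```

Calling make_demo_db on a temporary file path and then build_features(path, ["claims", "visits"], "2022-01-01", "2022-12-31") returns one entry per (member_id, as_of_date) pair, with a column from each domain that had matching rows. An entity missing from a domain simply lacks that column, which loosely mirrors the missing values an outer join would produce in a dataframe.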

Coldstart should be used by new and seasoned data scientists alike who want to spend less time QA’ing queries, waiting for queries to run sequentially, and refactoring queries for production. Ultimately, this package helps accelerate the fun stuff, AKA the model development process.


Data curation and cleaning are undoubtedly among the most time-consuming steps in the data science lifecycle, so why are we constantly rewriting the same queries and/or data pipelines across projects and teams when the features are the same? Fortunately, there is a better way. Coldstart is a package for automatic data curation that exposes a high-level convenience function for retrieving data from various user-defined domains by establishing 1:1 or 1:M relationships with queries that are templated at runtime based on parameters and executed concurrently.


Prior Knowledge Expected

No previous knowledge expected

Piero Ferrante is a Senior Principal Data Scientist at CVS Health, a Fortune 4 health solutions company, where he and his team are focused on building scalable machine learning systems and developing tools to enhance the productivity and efficacy of hundreds of fellow data scientists and engineers.

Piero has nearly 15 years of applied experience in healthcare, telecom, insurance, mobile advertising, and fintech at companies ranging in size from unicorn startups to Fortune 500s. He holds an M.S. in Predictive Analytics from Northwestern University and a B.S. in Finance and Management Information Systems from the University of Delaware, and he has served as an adjunct at New York University, the University of Kansas, and Rockhurst University. Piero also advises Play-it Health, a digital health startup, on algorithms and data strategy.