PyData NYC 2022

Why do I need to know Python? I'm a pandas user…
11-09, 15:30–16:15 (America/New_York), Radio City (6th floor)

You use pandas every day. You know every keyword argument on every function, even .melt! You even know whether it's .rename, .rename_axis, or .set_axis that you want—and you get it right on the first try! So why bother learning Python? Sure, pandas is written in it, but outside of assembling parts of the pandas API, what's there that has any value in your life?


It's common for data scientists to narrowly focus on the APIs of the tools they use every day—pandas, matplotlib, pymc, dask, &c.—to the detriment of any focus on the surrounding programming language. In the case of tools like matplotlib, the total amount of Python we need to know is limited to what existed when matplotlib was first developed. (Did you know that matplotlib predates @property? That explains a lot…) In the case of newer tools like dask or pymc or even pandas, we may encounter some newer parts of Python—e.g., context managers or descriptors—as part of these tools' API design, but it's very easy to accept these as mere “syntax.”

In this talk, we will discuss where a deeper understanding of pure Python has direct and immediate consequences for your work as a data scientist. We will discuss where the parts of Python you may have skimmed over show up in analytical code, outside of the mere “syntax” of an API.

This talk will be organised around answering the following questions:
- why do generators even matter (and who cares about coroutines)?
- the itertools module would be great… if I were writing scripts, but where does it show up in data analysis?
- object orientation seems like a bunch of bureaucracy—can it really simplify my analytical code?
- why should I bother with data types in builtins and collections; is the pandas.DataFrame not enough?
- knowledge of Python internals would probably be useful, if I were a programmer writing scripts, but why do they matter for a data scientist?


Prior Knowledge Expected

Previous knowledge expected

James Powell is the founder and lead instructor at Don’t Use This Code. A professional Python programmer and enthusiast, James got his start with the language by building reporting and analysis systems for proprietary trading offices; now, he uses his experience as a consultant for those building data engineering and scientific computing platforms for a wide range of clients using cutting-edge open source tools like Python and React.

He also currently serves as a Board Director, Chair, and Vice President at NumFOCUS, the 501(c)(3) non-profit that supports all the major tools in the Python data analysis ecosystem (e.g., pandas, numpy, jupyter, matplotlib). At NumFOCUS, he helps build global open source communities for data scientists, data engineers, and business analysts. He helps NumFOCUS run the PyData conference series and has sat on speaker selection and organizing committees for 18 conferences. James is also a prolific speaker: since 2013, he has given over seventy (70) conference talks at over fifty (50) Python events worldwide.