PyData NYC 2022

Bagging to BERT: A tour of applied NLP
11-11, 09:00–10:30 (America/New_York), Winter Garden (5th floor)

We will work through a series of Natural Language Processing techniques of increasing complexity, using sentiment analysis as the running use case. We will evaluate each approach and identify its pros and cons.


Research suggests that 80-90% of data within any particular organization is unstructured, and much of that data is text. To make use of this wealth of unstructured data, organizations have been turning to Natural Language Processing techniques. IBM’s 2021 Global AI Adoption Index showed NLP is at the forefront of AI adoption with one in four businesses reporting adopting this type of technology within a year. This is being enabled by a wide array of open-source NLP libraries such as spaCy and HuggingFace’s Transformers.

In this workshop we will explore some of the popular NLP techniques that have broad applicability. From the basics of bagging and word vectors to the creation of contextualized representations of words and sentences, the workshop will equip participants with the tools they need to turn messy text data into useful insights.
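As a minimal sketch of the simplest end of that range, and assuming "bagging" here refers to the bag-of-words representation, the snippet below counts word occurrences with the Python standard library. The function name `bag_of_words`, the regex tokenizer, and the example sentences are illustrative assumptions, not material from the workshop.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical review text, made up for illustration.
counts = bag_of_words("Great cast, great script, not a great ending.")
print(counts["great"])  # 3

# Word order is discarded, so these two phrases get identical representations.
same = bag_of_words("good, not bad") == bag_of_words("bad, not good")
print(same)  # True
```

The second check hints at why the later steps matter: a plain bag of words cannot distinguish phrases that differ only in word order, which is one motivation for contextualized representations.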

The focus of the workshop will be building an NLP approach with increasing complexity. Each step in the progression will build on the previous ones, and the approaches will be evaluated against one another. The progression will be motivated by a sentiment analysis use case using movie reviews. The intention is to show the utility of each method in a reasonably well-behaved sample dataset. We will use a combination of scikit-learn’s performant text feature extraction and spaCy’s powerful NLP pipelines to create scalable solutions that participants can apply to their own use cases. We will conclude by comparing the results of the different approaches and discussing the pros and cons of each.
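One possible baseline for such a progression can be sketched as a scikit-learn pipeline that turns text into TF-IDF features and fits a linear classifier. This is an assumption about what an early step might look like, not the workshop's actual code; the six toy reviews and their labels are invented for illustration and are far too few for a real evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews (1 = positive, 0 = negative); not the workshop's data.
train_texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "I loved every minute",
    "terrible plot and awful acting",
    "boring and far too long",
    "I hated this movie",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Score two unseen phrases built from words seen during training.
predictions = model.predict(["great story", "awful and boring"])
print(predictions)
```

Keeping the vectorizer and classifier in one pipeline means a later step could swap in a different feature extractor while the evaluation code stays the same, which makes side-by-side comparison of approaches easier.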

The intended audience will have intermediate-level knowledge of Python and an interest in NLP technology. Participants will leave with an understanding of how to use these techniques and the benefits (and risks!) of each.


Prior Knowledge Expected

Previous knowledge expected

Ben is a Senior Data Scientist at the Institute for Experiential AI. He obtained his Master of Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has been working in data science for government, academia and the private sector. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has spoken at data science conferences, taught courses in Data Science, and helped organize the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.