PyData NYC 2022

Understanding the News around the World with Web Scraping and NLP at Scale
11-09, 14:45–15:30 (America/New_York), Music Box (5th floor)

Every day, media companies around the world publish millions of articles spanning multiple languages, and at Chartbeat we process this data to understand what is driving reader engagement. In this talk we discuss real-world lessons learned in building a production pipeline for scraping and extracting metadata in real time from this multitude of news articles. The pipeline leverages a mix of pre-trained and custom-built machine learning models in Python for content extraction, natural language processing, categorization, translation, and entity linking, making an article's metadata available in just three seconds on average.


In this talk, for NLP practitioners and engineers, we discuss real-world lessons learned in building a production pipeline for streaming processing of web pages at massive scale. We present four core natural language understanding tasks of the pipeline: (1) entity extraction, linking, and importance scoring; (2) term importance scoring; (3) article categorization; and (4) non-English article translation. We’ll discuss how we use spaCy’s pre-trained named entity extraction models paired with Wikipedia hyperlinks and Wikidata entity aliases to disambiguate and link multiple references to an entity within an article. We further discuss how we leveraged transfer learning to train a classification model on a combination of hand-labeled articles and Wikinews data to categorize articles using a media-specific taxonomy. Finally, we discuss the engineering behind our use of LibreTranslate’s open-source translation system to allow us to categorize non-English articles.
We reflect on how encountering wild and unexpected data helped us build a robust pipeline via exception-driven development. We’ll also talk about how a mix of machine learning and heuristics ultimately provided an optimal approach to the content extraction challenge.
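
As a rough illustration of the alias-based linking idea described above, here is a minimal sketch, not the pipeline presented in the talk. It assumes the mentions have already been extracted (for example by a named entity recognizer) and that an alias table maps surface forms to entity IDs. The table contents, the placeholder IDs, and the function names are hypothetical, and counting mentions stands in for a real importance score.

```python
from collections import Counter

# Hypothetical alias table: lowercased surface form -> entity ID.
# A real table might be derived from Wikidata aliases; these IDs are placeholders.
ALIASES = {
    "new york city": "E1",
    "nyc": "E1",
    "united nations": "E2",
    "un": "E2",
}


def link_mentions(mentions):
    """Map each mention to an entity ID and count mentions per entity.

    Mentions with no matching alias are skipped.
    """
    counts = Counter()
    for mention in mentions:
        entity_id = ALIASES.get(mention.strip().lower())
        if entity_id is not None:
            counts[entity_id] += 1
    return counts


def rank_entities(mentions):
    """Return (entity_id, mention_count) pairs, most-mentioned first."""
    return link_mentions(mentions).most_common()


# Mentions as they might come out of an entity recognizer (illustrative input).
mentions = ["New York City", "NYC", "UN", "United Nations", "nyc", "Chartbeat"]
ranked = rank_entities(mentions)
print(ranked)
```

Here several surface forms ("New York City", "NYC", "nyc") collapse to a single entity, and "Chartbeat" is dropped because it has no entry in the assumed table. An exact-match lookup like this cannot resolve ambiguous aliases, which is where additional signals such as hyperlink statistics would come in.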


Prior Knowledge Expected

No previous knowledge expected

Machine Learning Engineer