Understanding the News around the World with Web Scraping and NLP at Scale
Every day, media companies around the world publish millions of articles in many languages, and at Chartbeat we process this data to understand what drives reader engagement. In this talk we discuss real-world lessons learned from building a production pipeline that scrapes news articles and extracts their metadata in real time. The pipeline uses a mix of pre-trained and custom-built machine learning models in Python for content extraction, natural language processing, categorization, translation, and entity linking, and makes an article's metadata available in just three seconds on average.
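To make the shape of such a pipeline concrete, here is a minimal sketch of a staged design in which each article passes through a sequence of enrichment steps and the end-to-end latency is recorded. It is not Chartbeat's implementation: the names `Article`, `extract_content`, `categorize`, `STAGES`, and `run_pipeline` are hypothetical, the paragraph-based extractor and keyword rule are simple stand-ins for the machine learning models, and only the Python standard library is used.

```python
import time
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Article:
    """Hypothetical container for one scraped page and its derived metadata."""
    url: str
    html: str
    metadata: dict = field(default_factory=dict)


class _ParagraphText(HTMLParser):
    """Collects text inside <p> tags (a stand-in for a real content extractor)."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.chunks.append(data.strip())


def extract_content(article):
    """Stage 1: pull the body text out of the raw HTML."""
    parser = _ParagraphText()
    parser.feed(article.html)
    article.metadata["text"] = " ".join(parser.chunks)
    return article


def categorize(article):
    """Stage 2: assign a category.

    Assumption: a placeholder keyword rule; a production system would
    call a trained classifier here instead.
    """
    text = article.metadata.get("text", "").lower()
    article.metadata["category"] = "sports" if "match" in text else "general"
    return article


# Further stages (translation, entity linking, ...) would be appended here.
STAGES = [extract_content, categorize]


def run_pipeline(article, stages=STAGES):
    """Run every stage in order and record the end-to-end latency."""
    start = time.perf_counter()
    for stage in stages:
        article = stage(article)
    article.metadata["latency_seconds"] = time.perf_counter() - start
    return article
```

Calling `run_pipeline(Article(url=..., html=...))` returns the same article with `text`, `category`, and `latency_seconds` filled in. Keeping each stage as a plain function that takes and returns an article makes it straightforward to add, reorder, or time stages individually.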