PyData NYC 2022

High-Dimensional Data Visualizations with MDS, t-SNE, and UMAP
11-11, 15:30–17:00 (America/New_York), Winter Garden (5th floor)

High-dimensional visualization techniques are machine learning methods that project your dataset with all its variables into a two-dimensional map that will help you interactively explore the properties of your data, its clusters, and outliers. Through real-world examples, we demonstrate their potential and show how to apply them and interpret their often unintuitive results.

The materials from the talk can be downloaded from:
https://github.com/mihalis/HDVisualizations


As the number of variables (dimensions) in your data grows, the data becomes harder to visualize. High-dimensional data visualization methods project your entire dataset, with all its variables, into a two-dimensional map that helps you interactively explore the properties of your data, its clusters, and its outliers. The oldest methods, principal component analysis (PCA) and multidimensional scaling (MDS), are still the most frequently used and best understood, but they will likely fail to uncover your data's local structure, including clusters. More recent algorithms, such as t-SNE and UMAP, are much better at revealing which data points are similar to others (neighbors) and which are very dissimilar (outliers). Their interpretation is not intuitive, however, and many data scientists are unaware of their capabilities.

In this tutorial, we will start by projecting cities from different countries and continents, the equivalent of points on the surface of a sphere, to a 2D map using all the algorithms, old and new. How effectively the data points are projected onto the map to form smaller clusters (countries) within larger clusters (continents) will help us understand when to apply each method and how to interpret it. To test the algorithms on a real-world problem, we will visualize antibiotic effectiveness (Burtin's dataset) using MDS and t-SNE. We will also use Plotly to generate an interactive two-dimensional map of the bacteria and uncover further insights. In our last and more advanced application, we will use UMAP to discover cluster structure and outliers in the NYC taxi dataset. Finally, we will demonstrate the compatibility of clustering methods such as spectral clustering with some of the algorithms.
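
As a rough illustration of the first exercise, the sketch below projects a handful of cities onto a 2D map using classical (Torgerson) MDS implemented directly in NumPy. It is not taken from the talk materials: the city list, the approximate coordinates, and the helper names `great_circle_km` and `classical_mds` are assumptions made for this example, and the tutorial itself may use a different MDS variant (for instance, an iterative stress-minimizing one).

```python
import numpy as np

# Approximate (latitude, longitude) in degrees; illustrative values only.
CITIES = {
    "New York": (40.71, -74.01),
    "Boston": (42.36, -71.06),
    "Paris": (48.86, 2.35),
    "Berlin": (52.52, 13.40),
    "Tokyo": (35.68, 139.69),
    "Osaka": (34.69, 135.50),
}


def great_circle_km(coords_deg, radius_km=6371.0):
    """Pairwise haversine distances between (lat, lon) points given in degrees."""
    lat, lon = np.radians(coords_deg).T
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: embed points from a pairwise distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram-like matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    # Distances on a sphere are not Euclidean, so clip any negative eigenvalues.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


names = list(CITIES)
coords = np.array([CITIES[name] for name in names])
dist_km = great_circle_km(coords)
embedding = classical_mds(dist_km)

for name, (x, y) in zip(names, embedding):
    print(f"{name:>8}: ({x:9.1f}, {y:9.1f})")
```

In the resulting map, cities on the same continent should land close together while the continents sit far apart. The axes have no intrinsic meaning: the layout is determined only up to rotation and reflection, which is part of what makes these maps unintuitive to read.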


Prior Knowledge Expected

Previous knowledge expected

I am an Assistant Professor of Practice at Rutgers, where I teach data science-related topics. I am also a principal at a private consulting company, Cambridge Systematics, where I build data products based on location-based services data using Apache Spark. I have given talks at conferences before but never at a Python conference. I have attended several PyCon conferences since 2009 (I believe) and a few PyData conferences in NYC.