PyData NYC 2022

Building a Semantic Search Engine
11-11, 11:00–12:30 (America/New_York), Central Park East (6th floor)

Most production information retrieval systems are built on top of Lucene, which uses tf-idf and BM25. Current state-of-the-art techniques use embeddings for retrieval instead. This workshop will cover common information retrieval concepts, what companies have used in the past, and how new systems use embeddings.

Outline:
- Overview of search retrieval
- Non-deep-learning-based retrieval
- Embeddings and Vector Similarity Overview
- Serving Vector Similarity using Approximate Nearest Neighbors (ANN)

By the end of the session, participants will be able to build a production information retrieval system that leverages embeddings and vector similarity served with ANN, and to apply these state-of-the-art techniques on top of traditional information retrieval systems.


Most companies need a search engine to help serve relevant results to their users. This workshop aims to demystify what is involved in building such a system.

The workshop will cover four main themes:
- Core concepts common to any search retrieval system
- Non-deep-learning-based retrieval systems
- Overview of embedding-based retrieval systems
- Putting an embedding-based retrieval system with ANN into production

The full outline is shared below.

Intro (10 mins)
- Search retrieval concepts: approaches, evaluation metrics, etc. (a toy recall@k sketch follows this list)
- Overview of common production retrieval stack
- Walk through the notebooks and environment setup
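
For concreteness, here is a minimal sketch (not taken from the workshop materials) of one such evaluation metric, recall@k, with toy data:

```python
# Minimal recall@k sketch: the fraction of relevant documents that
# appear in the top-k retrieved results for a single query.
def recall_at_k(retrieved, relevant, k=10):
    """retrieved: ranked list of doc ids; relevant: set of relevant doc ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Toy example: 2 of the 3 relevant documents show up in the top 5.
print(recall_at_k(["d1", "d7", "d3", "d9", "d2"], {"d1", "d2", "d8"}, k=5))  # ~0.67
```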

Non-deep-learning-based retrieval (15 min)
- Overview of tf-idf and BM25
- How production systems use Elasticsearch / Solr
- Hands-on lab experience: Reviewing retrieval results from tf-idf (a minimal sketch follows this list)
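
As a rough sketch of the kind of tf-idf retrieval this lab reviews (the corpus and query are toy data, and the workshop notebooks may use a different stack such as Elasticsearch):

```python
# tf-idf retrieval sketch with scikit-learn: vectorize the corpus, then
# rank documents by cosine similarity to the tf-idf vector of the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "red running shoes for women",
    "men's waterproof hiking boots",
    "wireless noise cancelling headphones",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

query_vector = vectorizer.transform(["running shoes"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Print documents from most to least similar to the query.
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

Note that tf-idf only matches on overlapping terms: a query like "sneakers for jogging" would score zero against the shoe document here, which is exactly the gap embeddings address in the next section.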

Embeddings and Vector Similarity Overview (25 min)
- Brief review of common embedding techniques: word2vec, BERT
- A brief look at how to train your own custom embeddings
- Vector Similarity and Evaluation metrics
- Hands-on lab experience: Comparing results from non-deep-learning retrieval and vector similarity (a minimal sketch follows this list)
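
A minimal sketch of embedding-based retrieval for that comparison, using a pretrained sentence encoder (the model name below is one common choice, not necessarily what the workshop uses):

```python
# Embedding retrieval sketch: encode documents and a query with a
# pretrained sentence encoder, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "red running shoes for women",
    "men's waterproof hiking boots",
    "wireless noise cancelling headphones",
]
doc_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(["sneakers for jogging"], normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
scores = (query_emb @ doc_emb.T).ravel()
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

Unlike the tf-idf example, the query shares no terms with the shoe document, yet a sentence encoder should typically still rank it closest.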

Serving Vector Similarity using Approximate Nearest Neighbors (25 min)
- Why Vector Similarity needs ANN
- Review common Approximate Nearest Neighbors techniques in FAISS
- Overview of managed services: Vertex AI, Pinecone, Milvus
- Hands-on lab experience: Building a FAISS index and comparing results (a minimal sketch follows this list)
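
A minimal sketch of the kind of FAISS comparison this lab walks through; the vectors here are random and the index parameters are illustrative:

```python
# FAISS sketch: build an approximate (HNSW) index and compare its
# results against an exact brute-force baseline.
import numpy as np
import faiss

rng = np.random.default_rng(0)
d, n = 128, 10_000
doc_vectors = rng.random((n, d), dtype=np.float32)
queries = rng.random((5, d), dtype=np.float32)

# Exact L2 search: the ground truth, but slow at large scale.
exact_index = faiss.IndexFlatL2(d)
exact_index.add(doc_vectors)

# HNSW graph index: approximate, but much faster for large corpora.
ann_index = faiss.IndexHNSWFlat(d, 32)  # 32 links per graph node
ann_index.add(doc_vectors)

k = 10
_, exact_ids = exact_index.search(queries, k)
_, ann_ids = ann_index.search(queries, k)

# Measure how well the ANN results recover the exact top-k.
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(ann_ids, exact_ids)])
print(f"recall@{k} of HNSW vs exact: {recall:.2f}")
```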

By the end of the session, we hope to empower participants with enough information to build a production information retrieval system that leverages embeddings and vector similarity served with ANN.


Prior Knowledge Expected

No previous knowledge expected

Staff Data Scientist

Machine Learning Engineer at Walmart Search

Software Engineer at Walmart Search