11-11, 11:00–12:30 (America/New_York), Central Park East (6th floor)
Most production information retrieval systems are built on top of Lucene, which uses tf-idf and BM25 for ranking. Current state-of-the-art techniques use embeddings for retrieval. This workshop will cover common information retrieval concepts, what companies have used in the past, and how newer systems use embeddings.
Outline:
- Overview of search retrieval
- Non-deep-learning-based retrieval
- Embeddings and Vector Similarity Overview
- Serving Vector Similarity using Approximate Nearest Neighbors (ANN)
By the end of the session, participants will be able to build a production information retrieval system that leverages embeddings and vector similarity using ANN. This will let them apply state-of-the-art techniques on top of traditional information retrieval systems.
Most companies need a search engine to help serve relevant results to their users. This workshop aims to demystify what is involved in building such a system.
The workshop will cover four main themes:
- Core concepts common to all search retrieval systems
- Non-deep-learning-based retrieval systems
- Overview of embedding-based retrieval systems
- Putting an embedding-based retrieval system with ANN into production
The full outline is shared below.
Intro (10 min)
- Search retrieval concepts: approaches, evaluation metrics, etc.
- Overview of a common production retrieval stack
- Walk through the notebooks and environment setup
Non-deep-learning-based retrieval (15 min)
- Overview of tf-idf and BM25
- How production systems use Elasticsearch / Solr
- Hands-on lab experience: Reviewing Retrieval Results from tf-idf
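As a rough illustration of the lexical scoring covered in this part, the sketch below ranks a toy corpus with BM25. It is not taken from the workshop notebooks: the documents, the whitespace tokenizer, and the helper names (`idf`, `bm25`, `search`) are assumptions made for the example. The idf form and the defaults `k1=1.2`, `b=0.75` follow the variant commonly used by Lucene.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
docs = [
    "red running shoes for men",
    "blue running shorts",
    "red wine glasses set of four",
]
tokenized = [d.split() for d in docs]  # naive whitespace tokenizer (assumption)
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N  # average document length

# Document frequency: number of documents that contain each term.
df = Counter(term for d in tokenized for term in set(d))


def idf(term):
    """Smoothed idf that never goes negative."""
    n = df.get(term, 0)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25(query, doc, k1=1.2, b=0.75):
    """BM25 score of one tokenized document for a whitespace-separated query."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        f = tf[term]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        # Term-frequency saturation (k1) with length normalization (b).
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * f * (k1 + 1) / norm
    return score


def search(query):
    """Return indices of matching documents, best score first."""
    scored = [(bm25(query, d), i) for i, d in enumerate(tokenized)]
    return [i for s, i in sorted(scored, key=lambda t: -t[0]) if s > 0]


ranking = search("red running shoes")
print([docs[i] for i in ranking])
```

The document that matches all three query terms ranks first. A query with no shared tokens returns nothing, which is the gap that embedding-based retrieval is meant to address.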
Embeddings and Vector Similarity Overview (25 min)
- Brief review of common embedding techniques: word2vec, BERT
- Brief discussion of how to train your own custom embeddings
- Vector Similarity and Evaluation metrics
- Hands-on lab experience: Comparing results of non-deep-learning retrieval and vector similarity
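To make the idea of vector similarity concrete, here is a minimal sketch of cosine-similarity retrieval. The three-dimensional vectors are made-up stand-ins for real embeddings (which typically have hundreds of dimensions and come from a model such as word2vec or BERT), and the names `cosine` and `top_k` are hypothetical helpers, not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-d "embeddings" (assumption: a real model would produce these).
embeddings = {
    "sneakers": [0.9, 0.1, 0.0],
    "running shoes": [0.8, 0.2, 0.1],
    "wine glasses": [0.0, 0.1, 0.9],
}

# Pretend embedding of the query "trainers", which shares no token
# with any item name above.
query_vec = [0.85, 0.15, 0.05]


def top_k(query, k=2):
    """Return the k item names most similar to the query vector."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


print(top_k(query_vec))
```

Under these assumed vectors, the two footwear items come back even though a purely lexical match on "trainers" would find nothing.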
Serving Vector Similarity using Approximate Nearest Neighbors (25 min)
- Why Vector Similarity needs ANN
- Review common Approximate Nearest Neighbors techniques in FAISS
- Overview of managed services: Vertex AI, Pinecone, Milvus
- Hands-on lab experience: Building FAISS Index and comparing results
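The speed-versus-recall trade-off behind ANN can be sketched in pure Python with a toy inverted-file (IVF) index, the idea behind FAISS index types such as `IndexIVFFlat`. This is not FAISS code and not the workshop lab: the 2-d vectors are invented, the centroids are hand-picked (a real index would learn them, typically with k-means), and `ivf_search`, `exact_search`, and `recall_at_k` are hypothetical helper names.

```python
def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Invented 2-d vectors forming three loose clusters.
vectors = [
    (0.1, 0.2), (0.3, 0.1), (0.0, 0.3),     # near (0, 0)
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),     # near (5, 5)
    (9.9, 0.1), (10.1, 0.3), (10.0, -0.2),  # near (10, 0)
]

# Hand-picked coarse centroids (assumption: normally learned by k-means).
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

# Build step: assign every vector to the cell of its nearest centroid.
cells = {c: [] for c in range(len(centroids))}
for idx, vec in enumerate(vectors):
    nearest = min(range(len(centroids)), key=lambda c: dist2(vec, centroids[c]))
    cells[nearest].append(idx)


def exact_search(q, k):
    """Brute force: compare the query against every vector."""
    return sorted(range(len(vectors)), key=lambda i: dist2(q, vectors[i]))[:k]


def ivf_search(q, k, nprobe=1):
    """Approximate: scan only the nprobe cells whose centroids are closest."""
    probe = sorted(range(len(centroids)), key=lambda c: dist2(q, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    return sorted(candidates, key=lambda i: dist2(q, vectors[i]))[:k]


def recall_at_k(approx, exact):
    """Fraction of the true top-k that the approximate search recovered."""
    return len(set(approx) & set(exact)) / len(exact)


query = (3.0, 3.0)
truth = exact_search(query, 4)
for nprobe in (1, 2):
    found = ivf_search(query, 4, nprobe=nprobe)
    print(nprobe, found, recall_at_k(found, truth))
```

With one probed cell, the search scans a third of the data but misses a true neighbor that sits in an adjacent cell. Probing two cells recovers it at the cost of more distance computations. Tuning this kind of parameter against recall is the sort of comparison the lab refers to.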
By the end of the session, we hope to give participants enough information to build a production information retrieval system that leverages embeddings and vector similarity using ANN.
No previous knowledge expected
Staff Data Scientist
Machine Learning Engineer at Walmart Search
Software Engineer at Walmart Search