Alum @ Alma

CSE IIT Madras

We learn from our alumni in this interaction series, often technically, sometimes semi-technically.

Harsha Vardhan Simhadri

Microsoft Research India

Harsha Simhadri is a Principal Researcher at Microsoft Research. He enjoys developing new algorithms that enable practical systems. Recent examples include algorithms for web-scale nearest-neighbor search deployed in various Microsoft scenarios, and new ML operators and architectures for tiny IoT and edge devices. His PhD thesis at Carnegie Mellon University was on parallel algorithms and runtimes with provable guarantees for multi-core processors.



Vector Search Systems for Web-scale Search and Recommendation

Web-scale search and recommendation scenarios increasingly use vector search, or Approximate Nearest Neighbor Search (ANNS), indices to retrieve objects based on the similarity of their learnt representations in a geometric space. Since these scenarios often span billions or trillions of objects, efficient and scalable ANNS algorithms and systems are critical to making them practical. However, most algorithms studied in the literature either focus on small, million-scale datasets or lack features necessary for practical indices, e.g., external-memory indexing or support for real-time updates. In this talk, we present DiskANN, the first external-memory ANNS algorithm that can index a billion points on a single commodity machine with <64GB DRAM and serve queries at a latency of a few milliseconds. This represents an order of magnitude more points indexed per machine than previous work. In addition, the index allows real-time updates, and its in-memory performance compares well with other state-of-the-art indices. The DiskANN framework can also be extended to support other features critical to deployment, such as hybrid queries that combine vector search with hard matches on attributes such as language or author, and support for out-of-distribution queries. In a recently organized billion-scale ANNS challenge at NeurIPS'21, DiskANN proved to be the state of the art for many billion-scale datasets on both standard and specialized hardware. We will give an overview of the datasets and baselines released in this challenge.
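
For readers new to the area, the short Python sketch below illustrates two ideas the abstract relies on: how the quality of an approximate search is usually measured (recall against an exact linear scan), and why a billion-point index does not fit in DRAM without an external-memory design. It is not DiskANN. The "approximate" search here is a deliberately crude stand-in that scans a random half of the data, and the 128-dimensional float32 vectors used in the memory estimate are an assumption made for illustration, not a figure from the talk. All function names are hypothetical.

```python
import heapq
import math
import random


def exact_knn(query, points, k):
    """Indices of the k points closest to `query` (Euclidean), by full linear scan."""
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: math.dist(query, points[i])
    )


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate result contains."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def raw_vector_bytes(num_points, dim, bytes_per_coord=4):
    """Bytes needed to store uncompressed vectors (4 bytes per coordinate = float32)."""
    return num_points * dim * bytes_per_coord


random.seed(0)
dim, k = 8, 10
points = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(dim)]

# Ground truth: exact k nearest neighbors.
truth = exact_knn(query, points, k)

# Crude stand-in for an approximate index: only look at a random half of the data.
candidates = random.sample(range(len(points)), len(points) // 2)
approx = heapq.nsmallest(k, candidates, key=lambda i: math.dist(query, points[i]))

recall = recall_at_k(approx, truth)
print(f"recall@{k} of the half-scan: {recall:.2f}")

# Assumed workload: 1 billion vectors, 128 dimensions, float32.
raw_gb = raw_vector_bytes(10**9, 128) / 10**9
print(f"raw vectors alone: {raw_gb:.0f} GB")
```

Under the assumed 128-dimensional float32 workload, the raw vectors alone occupy 512 GB, several times the <64GB DRAM budget mentioned above. This is the gap that compression and external-memory (SSD-resident) index designs aim to close.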

Finally, we will highlight some open problems in making DiskANN more robust. These include support for crash recovery and serializability, and systems for extremely large distributed indices.

Joint work with Ravishankar Krishnaswamy, Suhas J Subramanya, Aditi Singh, Rohan Kadekodi, Devvrit, Shikhar Jaiswal, Magdalen Dobson, Siddharth Gollapudi, Neel Karia, Varun Sivasankaran.


Organizers

  • Adityakumar Rajendra Yadav
  • N S Narayanaswamy
  • Rupesh Nasre