HACKER Q&A
📣 alexmolas

What's the SOTA for open source search embeddings?


Hi all,

I'm working on a project that involves hybrid search (neural + keyword search) and I'm wondering what the current state of the art is for open source search embeddings.

Afaik Mixtral is the current SOTA for open generative models, but the embeddings it produces aren't well suited for search. As far as I understand, using the last-layer activations of a generative model as embeddings isn't a good idea: two sentences with very different meanings can end up with similar embeddings simply because the next word to be predicted is the same, e.g. (1) "No Luke, I am your [father]" and (2) "My name is Íñigo Montoya, you killed my [father]". The last-layer representations for (1) and (2) come out very similar because both are about to predict "father", even though the semantics of the two sentences are completely different.
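To make that concrete, here's a rough sketch of the difference between taking the last-token activation of a causal LM and using a model actually trained for sentence similarity. gpt2 and all-MiniLM-L6-v2 are just small stand-ins I picked for illustration, not SOTA models, and this is an intuition check rather than a benchmark:

    import torch
    from transformers import AutoModel, AutoTokenizer
    from sentence_transformers import SentenceTransformer, util

    # Last-token activation of a small causal LM (gpt2 as a stand-in).
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModel.from_pretrained("gpt2")

    def last_token_activation(text):
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden[0, -1]                         # activation at the final token

    s1 = "No Luke, I am your father"
    s2 = "My name is Inigo Montoya, you killed my father"
    a, b = last_token_activation(s1), last_token_activation(s2)
    print("causal-LM last-token cosine:",
          torch.nn.functional.cosine_similarity(a, b, dim=0).item())

    # A model trained for sentence similarity / retrieval (MiniLM as a stand-in).
    st = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    emb = st.encode([s1, s2])
    print("search-embedding cosine:", util.cos_sim(emb[0], emb[1]).item())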

My question is then: what is the current SOTA for search embeddings? That is, embeddings that have been trained specifically for search.

Thank you!


  👤 dmezzetti Accepted Answer ✓
Funny timing on this question. Normally I'd say check out the MTEB leaderboard - https://huggingface.co/spaces/mteb/leaderboard

But the current leader is a Mistral model - https://huggingface.co/intfloat/e5-mistral-7b-instruct

Based on this paper - https://arxiv.org/abs/2401.00368
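In case it helps, a rough usage sketch from my reading of the model card (instruction-prefixed queries, last-token pooling); double-check the card for the exact prompt format and its EOS-token handling, which I've skipped here:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    def last_token_pool(hidden, attention_mask):
        # Embedding = hidden state of the last real token (assumes right-padded batches).
        seq_lengths = attention_mask.sum(dim=1) - 1
        return hidden[torch.arange(hidden.shape[0]), seq_lengths]

    tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
    model = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # make padding possible if unset

    # Queries get an instruction prefix; passages are encoded as-is.
    task = "Given a web search query, retrieve relevant passages that answer the query"
    queries = [f"Instruct: {task}\nQuery: who said 'I am your father'"]
    passages = ["No, I am your father.",
                "My name is Inigo Montoya. You killed my father."]

    batch = tokenizer(queries + passages, padding=True, truncation=True,
                      max_length=4096, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    emb = F.normalize(last_token_pool(out.last_hidden_state, batch["attention_mask"]), dim=1)
    scores = emb[:1] @ emb[1:].T  # query-passage cosine similarities
    print(scores)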

Fun times.