HACKER Q&A
📣 alexmolas

What's the SOTA for open source search embeddings?


Hi all,

I'm working on a project that involves hybrid search (neural + keyword search) and I'm wondering what the current state of the art is for open source search embeddings.

Afaik Mixtral is the current SOTA for open generative models, but the embeddings it produces aren't well suited for search. As far as I understand, using the last-layer activations of a generative model as embeddings isn't a good idea: two sentences with very different meanings can end up with similar embeddings simply because the next word to be predicted is the same, e.g. (1) "No Luke, I am your [father]" and (2) "My name is Íñigo Montoya, you killed my [father]". The last-layer representations for (1) and (2) come out very similar because both are about to predict "father", even though the semantics of the two sentences are completely different.
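To make that concrete, here's a rough sketch of the difference between taking the last-token activation of a causal LM and using a model actually trained for sentence similarity. gpt2 and all-MiniLM-L6-v2 are just small stand-ins I picked for illustration, not SOTA models, and this is an intuition check rather than a benchmark:

    import torch
    from transformers import AutoModel, AutoTokenizer
    from sentence_transformers import SentenceTransformer, util

    # Last-token activation of a small causal LM (gpt2 as a stand-in).
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModel.from_pretrained("gpt2")

    def last_token_activation(text):
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden[0, -1]                         # activation at the final token

    s1 = "No Luke, I am your father"
    s2 = "My name is Inigo Montoya, you killed my father"
    a, b = last_token_activation(s1), last_token_activation(s2)
    print("causal-LM last-token cosine:",
          torch.nn.functional.cosine_similarity(a, b, dim=0).item())

    # A model trained for sentence similarity / retrieval (MiniLM as a stand-in).
    st = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    emb = st.encode([s1, s2])
    print("search-embedding cosine:", util.cos_sim(emb[0], emb[1]).item())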

My question is then: what is the current SOTA for search embeddings? That is, embeddings that have been trained specifically for search.

Thank you!


  👤 dmezzetti Accepted Answer ✓
Funny timing on this question. Normally I'd say check out the MTEB leaderboard - https://huggingface.co/spaces/mteb/leaderboard

But the current leader is a Mistral model - https://huggingface.co/intfloat/e5-mistral-7b-instruct

Based on this paper - https://arxiv.org/abs/2401.00368
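In case it helps, a rough usage sketch from my reading of the model card (instruction-prefixed queries, last-token pooling); double-check the card for the exact prompt format and its EOS-token handling, which I've skipped here:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    def last_token_pool(hidden, attention_mask):
        # Embedding = hidden state of the last real token (assumes right-padded batches).
        seq_lengths = attention_mask.sum(dim=1) - 1
        return hidden[torch.arange(hidden.shape[0]), seq_lengths]

    tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
    model = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # make padding possible if unset

    # Queries get an instruction prefix; passages are encoded as-is.
    task = "Given a web search query, retrieve relevant passages that answer the query"
    queries = [f"Instruct: {task}\nQuery: who said 'I am your father'"]
    passages = ["No, I am your father.",
                "My name is Inigo Montoya. You killed my father."]

    batch = tokenizer(queries + passages, padding=True, truncation=True,
                      max_length=4096, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    emb = F.normalize(last_token_pool(out.last_hidden_state, batch["attention_mask"]), dim=1)
    scores = emb[:1] @ emb[1:].T  # query-passage cosine similarities
    print(scores)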

Fun times.