I've recently spoken with two companies that mentioned the high cost of creating embeddings for their datasets in RAG applications. One of them, a PE firm, shared that generating embeddings for a new data room could cost up to $5K, which limits how often they do it.
I'm having trouble understanding why it's so expensive. Embeddings themselves are relatively affordable (e.g., OpenAI charges around $0.13 per million tokens). As a quick rule of thumb, if a data room averages 25K pages at roughly 1,000 tokens per page, that's ~25M tokens, so the embedding calls alone should cost only a few dollars (~$3 at that rate). My guess is they might be running a vision LLM over the PDFs to convert them to text first, which could be what's driving up the cost. I haven't had a chance to discuss these details with their project team yet.
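To make the back-of-envelope math concrete, here's a quick sketch. The per-page token count and the pricing constant are my assumptions (the $0.13/M figure is the rate mentioned above), not numbers from the firm:

```python
# Rough cost estimate for embedding one data room.
# Assumptions: ~1,000 tokens per page, embedding price ~$0.13 per 1M tokens.

PAGES = 25_000
TOKENS_PER_PAGE = 1_000           # assumed average for text-dense PDF pages
PRICE_PER_MILLION_TOKENS = 0.13   # USD per 1M tokens (assumed embedding rate)

total_tokens = PAGES * TOKENS_PER_PAGE
cost_usd = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"{total_tokens:,} tokens -> ~${cost_usd:.2f}")
# 25,000,000 tokens -> ~$3.25
```

Even if the real token-per-page average is a few times higher, that still lands well under $100, nowhere near $5K, which is why I suspect the cost is in the PDF-to-text step rather than the embedding calls.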
Has anyone else encountered this issue? Any ideas on what could be causing these high costs?
Thanks!