HACKER Q&A
📣 Blue_Cosma

Are embeddings too expensive for large datasets?


Hi HN,

I've recently spoken with two companies that mentioned the high cost of creating embeddings over their datasets for RAG applications. One, a private-equity firm, shared that generating embeddings for new data rooms could cost up to $5K, which limits how often they do it.

I'm having trouble understanding why it's so expensive. The embedding API calls themselves are relatively cheap (e.g., OpenAI charges about $0.13 per million tokens for its largest embedding model). As a rough rule of thumb, if a data room averages 25K pages at ~1,000 tokens per page, that's ~25M tokens, so the embedding step alone should cost just a few dollars. My guess is they might be running a vision LLM to convert the PDFs to text first, which could be driving up the cost. I haven't had a chance to discuss these details with their project teams yet.
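The back-of-envelope math above can be sketched as follows (the tokens-per-page ratio and the $0.13/M price are assumptions for illustration, not quoted vendor pricing):

```python
# Rough embedding-cost estimate for a document set.
# Assumptions (not authoritative): ~1,000 tokens per page of dense
# business documents, and $0.13 per million tokens as the embedding price.
TOKENS_PER_PAGE = 1_000
PRICE_PER_MILLION_TOKENS = 0.13  # USD

def embedding_cost(pages: int) -> float:
    """Estimated USD cost to embed a data room of `pages` pages."""
    tokens = pages * TOKENS_PER_PAGE
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(embedding_cost(25_000))  # 25K pages ≈ 25M tokens → $3.25
```

Even doubling the tokens-per-page assumption keeps the total well under $10, which is why a $5K figure suggests the cost is dominated by something other than the embedding calls themselves.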

Has anyone else encountered this issue? Any ideas on what could be causing these high costs?

Thanks!


  👤 devops000 Accepted Answer ✓
You might get a clearer picture by asking them exactly how they're creating the embeddings — the expensive step is often the PDF-to-text pipeline, not the embedding calls.