HACKER Q&A
📣 paul2495

Scaling a local FAISS and LLM RAG system (356k chunks): architectural advice


I’ve been building a local-only AI assistant for security analysis that uses a FAISS vector index and a local model for reasoning over parsed tool output. The current system works well, but I’m running into scaling issues as the dataset grows.

Current setup:

~356k chunks in a FAISS Flat index

384-d MiniLM embeddings

llama-cpp-python for inference

Metadata stored in a single pickle file (~1.5 GB)

Tool outputs (Nmap/YARA/Volatility/etc.) parsed into structured JSON before querying
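For context, the query path is roughly this shape today (heavily simplified sketch; the embedding call assumes sentence-transformers, and file names are illustrative):

```python
import pickle
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-d embeddings
index = faiss.read_index("chunks.faiss")          # Flat index, ~356k vectors

# The pain point: the whole ~1.5 GB metadata blob is deserialized up front.
with open("metadata.pkl", "rb") as f:
    metadata = pickle.load(f)                     # chunk id -> parsed tool output

def retrieve(question: str, k: int = 5):
    q = model.encode([question]).astype("float32")
    _, ids = index.search(q, k)                   # exhaustive scan over all chunks
    return [metadata[i] for i in ids[0]]
```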

Problems I’m running into:

Metadata pickle file loads entirely into RAM

No incremental indexing — have to rebuild the FAISS index from scratch

Query performance degrades with concurrent use

Want to scale to 1M+ chunks but not sure FAISS + pickle is the right long-term architecture

My questions for those who’ve scaled local or offline RAG systems:

How do you store metadata efficiently at this scale?

Is there a practical pattern for incremental FAISS updates? (A rough sketch of what I mean is below this list.)

Would a vector DB (Qdrant, Weaviate, Milvus) be a better fit for offline use?

Any lessons learned from running large FAISS indexes on consumer hardware?
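To make the incremental-update question concrete, this is the kind of pattern I mean (untested sketch; I’m assuming FAISS’s IndexIDMap2 with add_with_ids is the right primitive, and that metadata would move out of the pickle into something id-keyed like SQLite):

```python
import faiss
import numpy as np

d = 384  # MiniLM embedding dimension

# Wrap a flat index so vectors carry stable 64-bit ids and can be
# added or removed without rebuilding the whole index.
index = faiss.IndexIDMap2(faiss.IndexFlatIP(d))

# Add a new batch of chunks under explicit ids (e.g. row ids from SQLite).
vecs = np.random.rand(1000, d).astype("float32")      # placeholder embeddings
ids = np.arange(356_000, 357_000).astype("int64")
index.add_with_ids(vecs, ids)

# Drop superseded chunks by id instead of rebuilding from scratch.
index.remove_ids(np.array([356_010, 356_011], dtype="int64"))

# Persist incrementally; reload on startup instead of re-embedding everything.
faiss.write_index(index, "chunks.faiss")
index = faiss.read_index("chunks.faiss")

# Metadata would live in SQLite keyed by the same id, so only the rows
# for the top-k hits get read, not a 1.5 GB pickle.
```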

Not looking for product feedback — just architectural guidance from people who’ve built similar systems.


  👤 andre-z Accepted Answer ✓
FAISS is not suitable for production. Dedicated vector search engines solve all the issues you mentioned: you store the metadata alongside the vectors as a JSON payload, so there is no separate pickle to manage. At least with Qdrant, it works like this: https://qdrant.tech/documentation/concepts/payload/
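In code, that pattern looks roughly like this (sketch using qdrant-client in local/embedded mode so it stays fully offline; the collection name and payload fields are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Embedded local mode: data lives on disk in this directory, no server needed.
client = QdrantClient(path="./qdrant_data")

client.create_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Metadata travels with the vector as a JSON payload; no separate pickle.
client.upsert(
    collection_name="chunks",
    points=[
        PointStruct(
            id=42,
            vector=[0.1] * 384,  # MiniLM embedding for the chunk
            payload={"tool": "nmap", "host": "10.0.0.5", "chunk": "..."},
        )
    ],
)

hits = client.search(collection_name="chunks", query_vector=[0.1] * 384, limit=5)
for hit in hits:
    print(hit.payload["tool"], hit.score)
```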