* Use a local embedding model like BERT to embed all the docs (in chunks of up to 512 tokens, BERT's sequence-length limit)
* Use an open LLM like MPT-30B or Falcon-40B
* Then, for each user query, do the following: generate an answer to the query with no context. Next, do an embeddings similarity search based on both a) the question and b) the generated no-context answer (the "HyDE" trick). Finally, feed the 3-6 most similar chunks to your LLM as context (with a prompt like: Please answer the user's question. Here's context: """[context]""". Here's the question: [question].) A sketch of this pipeline follows below.
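Here's a minimal sketch of that pipeline, assuming sentence-transformers for the embeddings. The model name is illustrative, and llm_generate() is a hypothetical stand-in for whatever inference call your chosen model (MPT-30B, Falcon-40B, ...) exposes; in practice you'd also embed the chunks once up front and cache them rather than re-encoding per query:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative; any local embedding model

def llm_generate(prompt: str) -> str:
    # Hypothetical stand-in: replace with your model's real inference call
    # (e.g., a transformers pipeline over MPT-30B or Falcon-40B).
    raise NotImplementedError

def chunk(text: str, max_tokens: int = 512) -> list[str]:
    # Crude whitespace chunking; swap in your embedding model's tokenizer
    # if you need an exact 512-token limit.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def top_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    draft = llm_generate(question)  # the no-context answer
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vecs = embedder.encode([question, draft], normalize_embeddings=True)
    # Score each chunk against both the question and the draft answer,
    # keeping the better of the two cosine similarities.
    sims = (chunk_vecs @ query_vecs.T).max(axis=1)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

PROMPT = ('Please answer the user\'s question. Here\'s context: """{context}""" '
          "Here's the question: {question}")

def answer(question: str, docs: list[str]) -> str:
    chunks = [c for doc in docs for c in chunk(doc)]
    context = "\n\n".join(top_chunks(question, chunks))
    return llm_generate(PROMPT.format(context=context, question=question))
```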
All of that said, I think that with the current state of open-source, commercially usable models, your results will be disappointing, and I don't expect end users will be happy with them. It sounds like you can't use GPT-4 (otherwise I'd say check this list I put together - https://llm-utils.org/List+of+tools+for+making+a+%22ChatGPT+...); if you can use GPT-4, you'll get noticeably better results.
The other thing you can do, of course, is show source links below the answer, pointing to the documents that contained the most relevant chunks. (You can also add a separate LLM prompt that asks the model which of the 3-6 chunks were highly relevant, and use that to re-rank the results. I think of LLMs as better at relevance ranking than embeddings, though they're much slower and more expensive at the task, so use them sparingly and let embeddings do the bulk of the filtering.)
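A hedged sketch of that re-ranking step, reusing the hypothetical llm_generate() stub from the sketch above (the 1-10 scoring prompt is just one way to phrase it):

```python
RERANK_PROMPT = ("On a scale of 1-10, how relevant is this passage to the "
                 "question?\nQuestion: {question}\nPassage: {chunk}\n"
                 "Reply with a single number.")

def rerank(question: str, chunks: list[str]) -> list[str]:
    def score(c: str) -> int:
        reply = llm_generate(RERANK_PROMPT.format(question=question, chunk=c))
        digits = "".join(ch for ch in reply if ch.isdigit())
        return int(digits) if digits else 0  # unparseable replies score 0
    return sorted(chunks, key=score, reverse=True)
```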
So answering your questions:
1 - Minimal GPU hours are needed with this approach, since you're not doing any fine-tuning, only inference. If using MPT-30B I'd suggest 1x H100 on Lambda Labs or FluidStack; if using Falcon-40B, 2x RTX 6000 Ada on Runpod. See also this table I put together - https://gpus.llm-utils.org/recommended-gpus-and-gpu-clouds-f...
2 - I'd suggest MPT-30B if you need a commercial-ok model, otherwise Guanaco-33B.
We'd be happy to help implement something. You'll certainly want an embedding database (i.e., a vector database). The open models are getting pretty good, but you'll want to stand up a testing framework. I have a decent model running on a desktop machine in my office with a reasonably priced consumer-grade Nvidia GPU.
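For the embedding-database piece, here's a minimal sketch using FAISS (my choice here is an assumption; Chroma, pgvector, and similar stores fill the same role). Vectors are L2-normalized so inner product equals cosine similarity:

```python
import faiss
import numpy as np

dim = 384  # e.g., the output size of all-MiniLM-L6-v2
index = faiss.IndexFlatIP(dim)

chunk_vecs = np.random.rand(1000, dim).astype("float32")  # stand-in for real chunk embeddings
faiss.normalize_L2(chunk_vecs)
index.add(chunk_vecs)

query_vec = np.random.rand(1, dim).astype("float32")  # stand-in for an embedded query
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)  # ids of the 5 nearest chunks
```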
We also have some tactics and practices around hallucination prevention that we'd be happy to share. Feel free to reach out: human at summitlabs.ai