While developing new features or testing locally, the LLM flow runs frequently and burns through a lot of tokens, so my OpenAI bill spikes.
I've made some efforts to stub LLM responses, but it adds a decent bit of complexity and work. I don't want to run a model locally with Ollama because I need the output to be high quality and fast.
Curious how others are handling similar situations.
LangChain examples:
[1] Caching https://python.langchain.com/v0.1/docs/modules/model_io/llms...
[2] Fake LLM https://js.langchain.com/v0.1/docs/integrations/llms/fake/
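For what it's worth, [1] is only a couple of lines in Python: wire up a SQLite-backed cache once and repeated prompts during local runs get served from disk instead of hitting the API again. A minimal sketch, assuming recent langchain / langchain-community / langchain-openai packages (adjust imports to your version):

```python
from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_openai import ChatOpenAI

# Cache every LLM call in a local SQLite file; identical prompts are
# answered from the cache instead of re-calling the OpenAI API.
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

llm = ChatOpenAI(model="gpt-4o-mini")  # model name is just an example

print(llm.invoke("Summarize: caching saves tokens").content)  # hits the API
print(llm.invoke("Summarize: caching saves tokens").content)  # served from cache
```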
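[2] is the JS version; the Python-side equivalent of that stub is FakeListLLM, which just cycles through canned responses, so you can run the whole flow in tests without spending a token. A sketch, assuming langchain_community and canned responses I made up for illustration:

```python
from langchain_community.llms.fake import FakeListLLM

# Returns the canned responses in order; no network calls, no tokens spent.
fake_llm = FakeListLLM(responses=[
    '{"sentiment": "positive", "score": 0.9}',
    '{"sentiment": "negative", "score": 0.2}',
])

print(fake_llm.invoke("Classify: I love this product"))  # first canned response
print(fake_llm.invoke("Classify: this is terrible"))     # second canned response
```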
tl;dr:
- Keep prompts short; combine prompts, or write more detailed prompts but send them to a smaller model
- Use simple and semantic cache lookups
- Classify tasks and route each one to the best LLM for it using an AI gateway (rough sketch below)
Portkey.ai could help with a lot of this.
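Not the Portkey API itself, but here's a toy version of the "classify and route" idea using the plain OpenAI SDK: a cheap heuristic (prompt length here, standing in for a real classifier) decides which model a request goes to, so easy tasks never touch the expensive model. The model names and routing rule are illustrative assumptions only:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical routing table: task label -> model
MODEL_BY_TASK = {
    "simple": "gpt-4o-mini",  # cheap model for easy tasks
    "complex": "gpt-4o",      # stronger model only when it's actually needed
}

def classify(prompt: str) -> str:
    """Naive stand-in classifier; a gateway would do this properly."""
    return "complex" if len(prompt) > 500 else "simple"

def complete(prompt: str) -> str:
    model = MODEL_BY_TASK[classify(prompt)]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("What's 2 + 2?"))  # routed to the cheap model
```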