HACKER Q&A
📣 XCSme

What are the drawbacks of caching LLM responses?


I recently added AI integration to my application. While it works great, I dislike two things:

  1. I pay for all user prompts, even for duplicate ones.
  2. I am at the response-time mercy of the LLM API.

I could easily cache all prompts locally in a KV store and simply return the cached answer for duplicate ones.

Why isn't everyone doing this?

I assume one reason is that LLM responses are not deterministic: the same query can return different responses. But that could be handled with a "forceRefresh" parameter on the query.
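
A minimal sketch of what that could look like, with SQLite standing in for the KV store and call_llm() as a placeholder for whatever API client the app already uses (all names here are made up for the sketch):

    import hashlib
    import json
    import sqlite3

    # Sketch of the KV-cache idea: key the exact prompt (plus anything else
    # that changes the output, like the model name) and return the stored
    # completion on a repeat, unless the caller forces a refresh.
    db = sqlite3.connect("llm_cache.db")
    db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

    def call_llm(prompt: str, model: str) -> str:
        # Placeholder for the real (paid, slow) API call.
        return f"dummy completion for {prompt!r} from {model}"

    def cached_completion(prompt: str, model: str, force_refresh: bool = False) -> str:
        key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
        if not force_refresh:
            row = db.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
            if row:
                return row[0]   # duplicate prompt: no API cost, no API latency
        response = call_llm(prompt, model)
        db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, response))
        db.commit()
        return response

Anything that changes the output (model, temperature, system prompt) probably belongs in the cache key, not just the user's prompt text.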


  👤 gwern Accepted Answer ✓
Two major ones: you now need to handle all the usual cache issues like invalidation (what happens when you want to upgrade or the model improves?), and you also now need to think about security - because a cache hit returns almost instantly while a miss waits on the LLM, anyone can probe your cache by timing responses to figure out what calls have been made and extract anything in the prompts, like passwords or PII (eg. just going token by token and trying the top possibilities each time).
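
For intuition, a toy simulation of that probe (the latencies, threshold, and "cached" prompt are all invented; against a real deployment, query_app() would just be an ordinary request to the shared endpoint):

    import time
    import string

    # A cache hit returns almost instantly, a miss waits on the LLM, so
    # response latency leaks which prompts other users have already sent.
    _simulated_cache = {"the password is hu"}   # made-up cache contents
    CACHE_HIT_THRESHOLD_S = 0.2                 # assumed hit/miss latency gap

    def query_app(prompt: str) -> None:
        if prompt not in _simulated_cache:
            time.sleep(0.5)   # stand-in for a real, slow LLM round trip

    def looks_cached(prompt: str) -> bool:
        start = time.monotonic()
        query_app(prompt)
        return time.monotonic() - start < CACHE_HIT_THRESHOLD_S

    def extend_guess(known_prefix: str, alphabet: str = string.ascii_lowercase) -> str:
        # Character by character for simplicity (the real version would go
        # token by token): a fast response means someone already sent that
        # exact prompt, confirming the guess.
        for ch in alphabet:
            if looks_cached(known_prefix + ch):
                return known_prefix + ch
        return known_prefix

    print(extend_guess("the password is h"))   # recovers "the password is hu"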

👤 XCSme
Just found this: https://github.com/zilliztech/GPTCache, which seems to address exactly this idea/issue.