Caching

ProxyLLM keeps a two-tier cache: an exact-string match first, then a semantic similarity search. Cache hits return in milliseconds and cost you nothing in upstream tokens.

How a request is cached

  1. The system prompt and the last user message are normalized (lowercased, punctuation stripped, whitespace collapsed) and hashed with SHA-256.
  2. Exact match: if the hash exists in Redis, we return the cached response immediately. No API call.
  3. Semantic match: on a miss, we embed the normalized text and scan the workspace's vector pool (dot product). If any stored vector scores ≥ the similarity threshold (default 0.93), we return its cached response.
  4. If both lookups miss, we forward the request to the LLM. The response is stored under the new hash, and its vector is added to the pool in the background.
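
The four steps above can be sketched in a few lines of Python. This is an illustration only, not ProxyLLM's implementation: the dict stands in for Redis, `toy_embed` is a made-up letter-frequency embedding, joining system and user text with a space is an assumption, and returning the best-scoring vector (rather than the first one over the threshold) is also an assumption.

```python
import hashlib
import math
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)  # ASCII punctuation only (assumption)


def normalize(system: str, user: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = f"{system} {user}".lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def cache_key(system: str, user: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(system, user).encode("utf-8")).hexdigest()


def toy_embed(text: str) -> list:
    """Hypothetical stand-in for a real embedding model: unit-length letter counts."""
    counts = [text.count(c) for c in string.ascii_lowercase]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


class TwoTierCache:
    def __init__(self, embed, threshold: float = 0.93):
        self.embed = embed
        self.threshold = threshold
        self.exact = {}    # hash -> response (stands in for Redis)
        self.vectors = []  # (vector, response) pairs (stands in for the vector pool)

    def lookup(self, system: str, user: str):
        # Tier 1: exact match on the hash of the normalized text.
        key = cache_key(system, user)
        if key in self.exact:
            return "exact", self.exact[key]
        # Tier 2: dot product against every stored vector, keep the best score.
        query = self.embed(normalize(system, user))
        best_score, best_response = -1.0, None
        for vector, response in self.vectors:
            score = sum(q * v for q, v in zip(query, vector))
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return "semantic", best_response
        return "miss", None

    def store(self, system: str, user: str, response: str) -> None:
        # Called after the upstream LLM has answered a double miss.
        self.exact[cache_key(system, user)] = response
        self.vectors.append((self.embed(normalize(system, user)), response))
```

Because the vectors in this sketch are unit length, the dot product equals cosine similarity, which is why a single threshold can be applied to the raw score.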

Response headers

Header                        Meaning
x-proxyllm-cache              hit or miss
x-proxyllm-cache-type         Only set on hit. exact or semantic.
x-proxyllm-cache-similarity   Only set on semantic hit. Cosine similarity, e.g. 0.9612.
x-proxyllm-cost               Upstream cost we charged (USD). 0.00000000 on cache hits.
x-proxyllm-cost-saved         Only on hit. Estimated USD you avoided by hitting cache instead of the upstream provider.
x-proxyllm-model              The model that was used (or would have been used) for this request.
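
If you want these headers in a typed form on the client side, something like the following may help. It is a sketch that works on any plain mapping of header names to string values; the `CacheInfo` class and `parse_cache_headers` function are hypothetical names, not part of any ProxyLLM SDK.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    hit: bool
    cache_type: Optional[str]          # "exact" or "semantic", only on a hit
    similarity: Optional[float]        # only on a semantic hit
    cost_usd: Optional[Decimal]        # upstream cost charged
    cost_saved_usd: Optional[Decimal]  # only on a hit
    model: Optional[str]


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    # HTTP header names are case-insensitive, so compare in lowercase.
    h = {name.lower(): value for name, value in headers.items()}

    def optional(name, convert):
        value = h.get(name)
        return convert(value) if value is not None else None

    return CacheInfo(
        hit=h.get("x-proxyllm-cache") == "hit",
        cache_type=h.get("x-proxyllm-cache-type"),
        similarity=optional("x-proxyllm-cache-similarity", float),
        cost_usd=optional("x-proxyllm-cost", Decimal),
        cost_saved_usd=optional("x-proxyllm-cost-saved", Decimal),
        model=h.get("x-proxyllm-model"),
    )
```

Costs are parsed as `Decimal` rather than `float` so the eight decimal places in the header survive intact when you sum them across requests.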

Plan limits

  • Free: exact + semantic cache, 24h TTL, 1,000-vector pool per workspace (FIFO eviction).
  • Pro: exact + semantic cache, 72h TTL, 10,000-vector pool.
  • Scale: exact + semantic cache, 168h (7d) TTL, unlimited vector pool.
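
To make the limits concrete, here is a small sketch of the three plans as data, plus a FIFO-evicting pool built on `collections.deque`. The names `PlanLimits`, `PLANS`, and `new_vector_pool` are illustrative. FIFO eviction is stated above only for the Free plan; applying the same policy to Pro is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanLimits:
    ttl_hours: int
    max_vectors: Optional[int]  # None means unlimited

    @property
    def ttl_seconds(self) -> int:
        return self.ttl_hours * 3600


PLANS = {
    "free": PlanLimits(ttl_hours=24, max_vectors=1_000),
    "pro": PlanLimits(ttl_hours=72, max_vectors=10_000),
    "scale": PlanLimits(ttl_hours=168, max_vectors=None),
}


def new_vector_pool(plan: str) -> deque:
    # A deque with maxlen drops its oldest item when a new one is appended
    # to a full pool, which is FIFO eviction. maxlen=None is unbounded.
    return deque(maxlen=PLANS[plan].max_vectors)
```

On the Free plan in this model, appending the 1,001st vector silently evicts the first one, so the oldest cached prompts stop matching semantically before their TTL runs out.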

When the cache is automatically bypassed

Some requests skip the cache entirely — no read, no write. We do this conservatively to avoid serving a response that was generated for a different conversation:

  • Multi-turn conversations (more than one message). The cache key uses only system + last user message, so two different conversations that share the same final user message would collide. Single-turn requests cache normally.
  • Tool use, extended thinking, and non-text content blocks on /v1/messages: tool results are intent-specific, thinking outputs depend on the thinking budget, and images/documents aren't text-comparable.
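
You can predict on the client whether a request is likely to be bypassed with a check along these lines. This is a sketch, not the proxy's actual logic: the field names (`messages`, `tools`, `thinking`, content blocks with a `type`) are assumed to follow the usual /v1/messages request shape, and `is_cacheable` is a hypothetical helper.

```python
from typing import Any, Mapping


def is_cacheable(body: Mapping[str, Any]) -> bool:
    """Return True if the request looks eligible for caching under the rules above."""
    messages = body.get("messages") or []
    # Multi-turn conversations are bypassed; only single-message requests cache.
    if len(messages) != 1:
        return False
    # Tool use is bypassed.
    if body.get("tools"):
        return False
    # Extended thinking is bypassed (assumed shape: {"type": "enabled", ...}).
    thinking = body.get("thinking")
    if isinstance(thinking, Mapping) and thinking.get("type") == "enabled":
        return False
    # Content may be a plain string or a list of typed blocks.
    content = messages[0].get("content")
    if isinstance(content, str):
        return True
    if not isinstance(content, list):
        return False
    # Any non-text block (image, document, tool result) forces a bypass.
    return all(isinstance(b, Mapping) and b.get("type") == "text" for b in content)
```

A check like this is useful mainly for explaining unexpected misses in your own logs; the proxy's response headers remain the source of truth.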

Skipping the cache manually

Add a small unique token to your prompt (e.g. a request ID or timestamp) when you genuinely need a fresh generation. The normalized text will differ — no cache hit. Use sparingly: bypassing the cache is what you're paying us to avoid.
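
A minimal sketch of that trick follows. One detail worth checking is that the token survives normalization: since punctuation and case are stripped, the token should be made of letters or digits. `normalize` re-implements the normalization described earlier as an assumption, and `with_fresh_token` is a hypothetical helper.

```python
import re
import string
import uuid
from typing import Optional

_PUNCT = str.maketrans("", "", string.punctuation)  # ASCII punctuation only (assumption)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower().translate(_PUNCT)).strip()


def with_fresh_token(prompt: str, token: Optional[str] = None) -> str:
    # uuid4().hex is lowercase hex digits, so normalization leaves it intact.
    token = token or uuid.uuid4().hex
    return f"{prompt}\n\nrequest-id {token}"


# Punctuation-only changes do NOT produce a new normalized text:
same = normalize("Summarize this!!!") == normalize("summarize this")
# An alphanumeric token does:
different = normalize(with_fresh_token("Summarize this")) != normalize("Summarize this")
```

The sketch only demonstrates that the exact-match key changes; it does not model the semantic tier.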