Caching

ProxyLLM keeps a two-tier cache: an exact-string match first, then a semantic similarity search. Cache hits return in milliseconds and cost you nothing in upstream tokens.

How a request is cached

The system prompt and the last user message are normalized (lowercased, punctuation stripped, whitespace collapsed) and hashed with SHA-256.
Exact match: if the hash exists in Redis, we return the cached response immediately. No API call.
Semantic match: on an exact-match miss, we embed the normalized text and scan the workspace's vector pool (dot product). If any stored vector scores ≥ the similarity threshold (default 0.93), we return its cached response.
If both lookups miss, we forward the request to the LLM. The response is stored under the new hash, and its vector is added to the pool in the background.
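
The steps above can be sketched in a few lines of Python. This is an illustration, not our actual implementation. It makes several assumptions: the function names are hypothetical, only ASCII punctuation is stripped, the system prompt and user message are joined with a newline, plain dicts and lists stand in for Redis and the vector pool, vectors are unit-length (so the dot product equals cosine similarity), and the best-scoring vector wins when several clear the threshold.

```python
import hashlib
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()

def cache_key(system: str, last_user: str) -> str:
    # Assumption: the two normalized parts are joined with a newline.
    normalized = normalize(system) + "\n" + normalize(last_user)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lookup(key, query_vec, exact_store, vector_pool, threshold=0.93):
    """Return (cache_type, response), or (None, None) when both tiers miss."""
    # Tier 1: exact hash match.
    if key in exact_store:
        return "exact", exact_store[key]
    # Tier 2: linear scan of (vector, response) pairs; keep the best score.
    best_score, best_response = -1.0, None
    for vec, response in vector_pool:
        score = dot(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    if best_response is not None and best_score >= threshold:
        return "semantic", best_response
    return None, None
```

Because of the normalization step, `cache_key("Be brief.", "What is Redis?")` and `cache_key("be brief", "what is  redis")` produce the same key in this sketch, so the second request would be an exact hit.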

Response headers

Header	Meaning
`x-proxyllm-cache`	`hit` or `miss`
`x-proxyllm-cache-type`	Only set on hit. `exact` or `semantic`.
`x-proxyllm-cache-similarity`	Only set on semantic hit. Cosine similarity, e.g. `0.9612`.
`x-proxyllm-cost`	Upstream cost we charged (USD). `0.00000000` on cache hits.
`x-proxyllm-cost-saved`	Only on hit. Estimated USD you avoided by hitting cache instead of the upstream provider.
`x-proxyllm-model`	The model that was used (or would have been used) for this request.
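
On the client side, these headers can be collected into one small structure. The sketch below assumes the headers arrive as a plain string-to-string mapping, as most HTTP clients provide. `CacheInfo` and `parse_cache_headers` are hypothetical names, and the sample values at the bottom are purely illustrative.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass
class CacheInfo:
    hit: bool
    cache_type: Optional[str]        # "exact" or "semantic"; None on a miss
    similarity: Optional[float]      # only present on semantic hits
    cost_usd: float
    cost_saved_usd: Optional[float]  # only present on hits
    model: Optional[str]

def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    # Header names are case-insensitive, so compare in lowercase.
    h = {name.lower(): value for name, value in headers.items()}
    similarity = h.get("x-proxyllm-cache-similarity")
    saved = h.get("x-proxyllm-cost-saved")
    return CacheInfo(
        hit=h.get("x-proxyllm-cache") == "hit",
        cache_type=h.get("x-proxyllm-cache-type"),
        similarity=float(similarity) if similarity is not None else None,
        cost_usd=float(h.get("x-proxyllm-cost", "0")),
        cost_saved_usd=float(saved) if saved is not None else None,
        model=h.get("x-proxyllm-model"),
    )

# Illustrative values only; real responses will differ.
sample = {
    "x-proxyllm-cache": "hit",
    "x-proxyllm-cache-type": "semantic",
    "x-proxyllm-cache-similarity": "0.9612",
    "x-proxyllm-cost": "0.00000000",
    "x-proxyllm-cost-saved": "0.00042000",
    "x-proxyllm-model": "example-model",
}
info = parse_cache_headers(sample)
```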

Plan limits

Free: exact + semantic cache, 24h TTL, 1,000-vector pool per workspace (FIFO eviction).
Pro: exact + semantic cache, 72h TTL, 10,000-vector pool.
Scale: exact + semantic cache, 168h (7d) TTL, unlimited vector pool.
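
To make the limits concrete, here is a rough model of them as data, with a bounded pool that evicts oldest-first. It is a sketch under assumptions: `PLAN_LIMITS` and `new_vector_pool` are hypothetical names, a `deque` stands in for the real vector store, and FIFO eviction is stated above only for Free, so applying it to Pro here is a guess.

```python
from collections import deque

PLAN_LIMITS = {
    "free":  {"ttl_hours": 24,  "max_vectors": 1_000},
    "pro":   {"ttl_hours": 72,  "max_vectors": 10_000},
    "scale": {"ttl_hours": 168, "max_vectors": None},  # None = unlimited
}

def new_vector_pool(plan: str) -> deque:
    # A deque with maxlen drops its oldest entry when a new one is
    # appended at capacity (FIFO); maxlen=None means unbounded.
    return deque(maxlen=PLAN_LIMITS[plan]["max_vectors"])

def ttl_seconds(plan: str) -> int:
    return PLAN_LIMITS[plan]["ttl_hours"] * 3600
```

In this model, appending a 1,001st vector to a Free pool silently drops the first one that was stored.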

When the cache is automatically bypassed

Some requests skip the cache entirely — no read, no write. We do this conservatively to avoid serving a response that was generated for a different conversation:

Multi-turn conversations (more than one message). The cache key uses only system + last user message, so two different conversations that share the same final user message would collide. Single-turn requests cache normally.
Tool use, extended thinking, and non-text content blocks on /v1/messages: tool results are intent-specific, thinking outputs depend on the thinking budget, and images/documents aren't text-comparable.
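
If you want to predict on the client whether a request will be eligible for caching, the rules above reduce to a small predicate. The sketch assumes a request body shaped like a typical /v1/messages payload (a `messages` list, optional `tools` and `thinking` fields, and content given as either a string or a list of typed blocks). Those field names and `should_bypass_cache` are assumptions for illustration, not a statement of how the proxy inspects requests.

```python
def should_bypass_cache(body: dict) -> bool:
    """Return True if the request would skip the cache under the rules above."""
    messages = body.get("messages", [])
    # Multi-turn conversation: more than one message.
    if len(messages) > 1:
        return True
    # Tool use or extended thinking requested (assumed field names).
    if body.get("tools") or body.get("thinking"):
        return True
    # Any non-text content block (image, document, tool result, ...).
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(block.get("type") != "text" for block in content):
                return True
    return False
```

A single user message with string content returns `False` here, so it would be cached normally.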

Skipping the cache manually

Add a small unique token to your prompt (e.g. a request ID or timestamp) when you genuinely need a fresh generation. The normalized text will differ — no cache hit. Use sparingly: bypassing the cache is what you're paying us to avoid.
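
One way to do this is to append a random ID to the user message. The helper name and the `[request-id: ...]` format below are arbitrary choices for illustration; any token that survives normalization and differs per request should work.

```python
import uuid

def with_cache_buster(prompt: str) -> str:
    # uuid4 hex is alphanumeric, so it survives punctuation stripping
    # and changes the normalized text (and therefore the hash).
    return f"{prompt}\n\n[request-id: {uuid.uuid4().hex}]"
```

The token guarantees a different hash, which rules out an exact match. A near-identical prompt could in principle still score above the semantic threshold, so it is worth confirming that the response carries `x-proxyllm-cache: miss`.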