Caching
ProxyLLM keeps a two-tier cache: an exact-string match first, then a semantic similarity search. Cache hits return in milliseconds and cost you nothing in upstream tokens.
How a request is cached
- The system prompt + last user message are normalized (lowercased, punctuation stripped, whitespace collapsed) and SHA-256'd.
- Exact match: if the hash exists in Redis, we return the cached response immediately. No API call.
- Semantic match: on a miss, we embed the normalized text and scan the workspace's vector pool (dot product). If any stored vector scores ≥ the similarity threshold (default 0.93), we return its cached response.
- On both misses, we forward to the LLM. The response is stored under the new hash + a vector is added in the background.
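The steps above can be sketched in a few lines of Python. This is a minimal illustration, not ProxyLLM's actual implementation. The function names, the way the system prompt and user message are joined, the exact punctuation set (ASCII only), and the choice to return the highest-scoring vector rather than the first one above the threshold are all assumptions.

```python
import hashlib
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace (assumed rules)."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


def cache_key(system_prompt: str, last_user_message: str) -> str:
    """SHA-256 hex digest of the normalized system prompt + last user message."""
    # Joining with a newline is an assumption; the docs only say "system + last user".
    normalized = normalize(system_prompt + "\n" + last_user_message)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def best_semantic_match(query, pool, threshold=0.93):
    """Scan (vector, response) pairs by dot product.

    Returns (response, score) for the best vector at or above the threshold,
    or None when nothing qualifies.
    """
    best = None
    for vector, response in pool:
        score = sum(q * v for q, v in zip(query, vector))
        if score >= threshold and (best is None or score > best[1]):
            best = (response, score)
    return best
```

Under these assumed rules, `cache_key("You are helpful.", "What's up?")` and `cache_key("you are helpful", "whats  UP")` produce the same digest, which is the point of normalizing before hashing: trivially different spellings of the same prompt land on the same exact-match entry.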
Response headers
| Header | Meaning |
|---|---|
| x-proxyllm-cache | hit or miss |
| x-proxyllm-cache-type | Only set on hit. exact or semantic. |
| x-proxyllm-cache-similarity | Only set on semantic hit. Cosine similarity, e.g. 0.9612. |
| x-proxyllm-cost | Upstream cost we charged (USD). 0.00000000 on cache hits. |
| x-proxyllm-cost-saved | Only on hit. Estimated USD you avoided by hitting cache instead of the upstream provider. |
| x-proxyllm-model | The model that was used (or would have been used) for this request. |
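A client can fold these headers into a small summary for logging or cost tracking. The sketch below assumes the headers arrive as a plain dict of strings and that the cost values are decimal strings like the `0.00000000` shown above. The function name and the returned field names are hypothetical.

```python
from decimal import Decimal


def summarize_cache_headers(headers: dict) -> dict:
    """Turn x-proxyllm-* response headers into typed values."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    h = {name.lower(): value for name, value in headers.items()}
    hit = h.get("x-proxyllm-cache") == "hit"

    similarity = h.get("x-proxyllm-cache-similarity")
    saved = h.get("x-proxyllm-cost-saved")

    return {
        "hit": hit,
        # Per the table, cache-type is only set on a hit.
        "cache_type": h.get("x-proxyllm-cache-type") if hit else None,
        # Only set on a semantic hit.
        "similarity": float(similarity) if similarity is not None else None,
        "cost_usd": Decimal(h.get("x-proxyllm-cost", "0")),
        "cost_saved_usd": Decimal(saved) if saved is not None else None,
        "model": h.get("x-proxyllm-model"),
    }
```

`Decimal` is used for the cost fields so that summing many small per-request amounts does not accumulate floating-point error.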
Plan limits
- Free: exact + semantic cache, 24h TTL, 1,000-vector pool per workspace (FIFO eviction).
- Pro: exact + semantic cache, 72h TTL, 10,000-vector pool.
- Scale: exact + semantic cache, 168h (7d) TTL, unlimited vector pool.
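To make the pool limits concrete, here is a rough model of a bounded, first-in-first-out vector pool. The `PLAN_LIMITS` table mirrors the numbers above. The names are hypothetical, and applying FIFO eviction to the Pro pool is an assumption, since the list only states it for Free.

```python
from collections import deque

# TTL and pool size per plan, copied from the list above.
# max_vectors=None stands for the unlimited Scale pool.
PLAN_LIMITS = {
    "free": {"ttl_hours": 24, "max_vectors": 1_000},
    "pro": {"ttl_hours": 72, "max_vectors": 10_000},
    "scale": {"ttl_hours": 168, "max_vectors": None},
}


def new_vector_pool(plan: str) -> deque:
    """Create an empty pool that drops its oldest entry once it is full."""
    # A deque with maxlen discards items from the opposite end on append,
    # which gives FIFO eviction; maxlen=None means unbounded.
    return deque(maxlen=PLAN_LIMITS[plan]["max_vectors"])
```

With a Free pool, appending a 1,001st vector silently evicts the first one, so the oldest cached prompt stops being eligible for semantic matches.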
When the cache is automatically bypassed
Some requests skip the cache entirely — no read, no write. We do this conservatively to avoid serving a response that was generated for a different conversation:
- Multi-turn conversations (more than one message). The cache key uses only system + last user message, so two different conversations that share the same final user message would collide. Single-turn requests cache normally.
- Tool use, extended thinking, and non-text content blocks on /v1/messages: tool results are intent-specific, thinking outputs depend on budget, and images/documents aren't text-comparable.
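If you want to predict on the client side whether a request will skip the cache, the rules above reduce to a short predicate. This is a sketch under assumptions: the field names (`messages`, `tools`, `thinking`, and content blocks carrying a `type`) follow a Messages-style request body and are not confirmed by this page, and the proxy's real check may differ.

```python
def should_bypass_cache(request: dict) -> bool:
    """Return True when the request matches one of the bypass rules above."""
    messages = request.get("messages", [])

    # Multi-turn: more than one message means the cache is skipped.
    if len(messages) > 1:
        return True

    # Tool use or extended thinking (assumed to appear as top-level fields).
    if request.get("tools") or request.get("thinking"):
        return True

    # Non-text content blocks such as images or documents.
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(block.get("type") != "text" for block in content):
                return True

    return False
```

A single user message with string content, or with only `text` blocks, returns `False` and would be cached normally.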
Skipping the cache manually
Add a small unique token to your prompt (e.g. a request ID or timestamp) when you genuinely need a fresh generation. The normalized text will differ — no cache hit. Use sparingly: bypassing the cache is what you're paying us to avoid.
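One way to add such a token is to append a random request ID. The helper below is a hypothetical example. It uses a hex UUID because lowercase letters and digits survive the normalization described earlier, whereas a token made only of punctuation would be stripped. It changes the exact-match hash; this sketch does not check how far a short token moves the semantic similarity score.

```python
import uuid


def with_cache_buster(prompt: str) -> str:
    """Append a unique request ID so the normalized prompt text differs."""
    return f"{prompt}\n[request-id: {uuid.uuid4().hex}]"
```

Each call yields a different string, so two otherwise identical prompts no longer share a cache key.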