Anthropic SDK

ProxyLLM exposes a native Anthropic Messages endpoint at POST /v1/messages. Swap the official Anthropic SDK's base URL to point at ProxyLLM and every request gets cached and rate-limited like the OpenAI route — no other code changes.

Quickstart

Managed mode is restricted to gpt-4o-mini on every tier, so any Claude model on /v1/messages requires BYOK on Pro or Scale — your own Anthropic API key, billed directly to your provider account. ProxyLLM still caches your responses, logs them, applies rate limits, and shows you costs.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.proxyllm.dev",
  apiKey: process.env.PROXYLLM_API_KEY, // your pl_… key
});

const msg = await client.messages.create({
  model: "claude-haiku-4-5-20251001",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Say hello in Portuguese." }],
});
console.log(msg.content);

Required fields

Anthropic's spec requires model, messages, and max_tokens on every request. ProxyLLM validates these at the route boundary and returns a 400 validation_error envelope if any of them is missing, which gives faster feedback than waiting for a passthrough 400 from upstream.
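If you want to catch the same problem before the request leaves your process, a small client-side pre-check can mirror the rule above. This is only a sketch: the MessagesRequest type and the missingRequiredFields helper are hypothetical names, not part of ProxyLLM or the Anthropic SDK, and treating an empty messages array as missing is an assumption.

```typescript
// Hypothetical client-side pre-check. ProxyLLM still validates server-side;
// this only mirrors the three required fields named above.
type MessagesRequest = {
  model?: string;
  max_tokens?: number;
  messages?: Array<{ role: "user" | "assistant"; content: unknown }>;
};

// Returns the names of required fields that are absent.
// Assumption: an empty string model or empty messages array counts as missing.
function missingRequiredFields(req: MessagesRequest): string[] {
  const missing: string[] = [];
  if (typeof req.model !== "string" || req.model.length === 0) {
    missing.push("model");
  }
  if (!Array.isArray(req.messages) || req.messages.length === 0) {
    missing.push("messages");
  }
  if (typeof req.max_tokens !== "number") {
    missing.push("max_tokens");
  }
  return missing;
}
```

Calling missingRequiredFields on a request object before client.messages.create lets you fail fast in tests or CI without spending a network round trip.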

Cross-format caching

ProxyLLM's cache is workspace-scoped and format-agnostic. A prompt sent through the OpenAI SDK once will cache-hit when the equivalent prompt comes back via the Anthropic SDK — and vice versa. The cached response is converted on retrieve, so the client sees the right shape for whichever SDK it's using.
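To make "format-agnostic" concrete, here is a rough sketch of how an OpenAI-shaped request and an Anthropic-shaped request could reduce to the same key material. It assumes the key is built from the system prompt plus the last user message, as described under "When caching is skipped". The types and function names are hypothetical, content is simplified to plain strings, and ProxyLLM's real key derivation (including whether the model name participates) is not documented on this page.

```typescript
// Sketch only: not ProxyLLM's actual implementation.
type OpenAIChatRequest = {
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
};
type AnthropicMessagesRequest = {
  system?: string;
  messages: Array<{ role: "user" | "assistant"; content: string }>;
};
type CacheKeyMaterial = { system: string; lastUser: string };

// Finds the text of the final user message, or "" if there is none.
function lastUserText(messages: Array<{ role: string; content: string }>): string {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m !== undefined && m.role === "user") return m.content;
  }
  return "";
}

// OpenAI carries the system prompt as a message with role "system".
function fromOpenAI(req: OpenAIChatRequest): CacheKeyMaterial {
  const system = req.messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n");
  return { system, lastUser: lastUserText(req.messages) };
}

// Anthropic carries the system prompt as a top-level field.
function fromAnthropic(req: AnthropicMessagesRequest): CacheKeyMaterial {
  return { system: req.system ?? "", lastUser: lastUserText(req.messages) };
}
```

Under these assumptions, two requests that carry the same system prompt and the same single user message produce identical key material regardless of which SDK built them.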

When caching is skipped

The cache is bypassed when the request contains:

  • multi-turn conversation: any request with more than one message bypasses cache. The cache key uses system + last user message, which can't differentiate two conversations that share the same final user message but have different prior context. Single-turn requests continue to cache normally — that's the dominant case.
  • tools: tool-use responses are intent-specific and risk going stale if cached.
  • thinking enabled: extended-thinking outputs are model-and-budget-specific.
  • non-text content blocks (images, documents, tool_result): the cache key is text-based.

Plain-text requests still cache normally and benefit from the full two-tier (exact + semantic) lookup. See Caching for the headers and TTL details.
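The bypass rules listed above can be restated as a small predicate, which is handy for predicting whether a given request is a cache candidate. This is a sketch of the documented rules, not ProxyLLM's code: the type and function names are hypothetical, and treating an empty tools array as "no tools" is an assumption.

```typescript
// Simplified request shape covering only the fields the bypass rules mention.
type ContentBlock = { type: string; text?: string };
type ChatMessage = { role: "user" | "assistant"; content: string | ContentBlock[] };
type CacheCheckRequest = {
  messages: ChatMessage[];
  tools?: unknown[];
  thinking?: { type: string };
};

// Returns the reason the cache would be bypassed, or null if the request
// looks cacheable under the four documented rules.
function cacheBypassReason(req: CacheCheckRequest): string | null {
  // Rule 1: more than one message means multi-turn.
  if (req.messages.length > 1) return "multi-turn";
  // Rule 2: tools present (assumption: an empty array does not count).
  if (req.tools !== undefined && req.tools.length > 0) return "tools";
  // Rule 3: extended thinking enabled.
  if (req.thinking !== undefined && req.thinking.type === "enabled") return "thinking";
  // Rule 4: any non-text content block (image, document, tool_result, ...).
  for (const m of req.messages) {
    if (Array.isArray(m.content) && m.content.some((b) => b.type !== "text")) {
      return "non-text content";
    }
  }
  return null;
}
```

A single user message with string content and no tools or thinking returns null, which corresponds to the plain-text case that goes through the two-tier lookup.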

Streaming

Pass stream: true and ProxyLLM emits the canonical Anthropic named-event SSE protocol (message_start → content_block_start → content_block_delta → content_block_stop → message_delta → message_stop) so the official SDK's .stream() helper works unchanged. On cache hits the sequence is synthesized from the stored response.
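For readers who consume the events directly rather than through the SDK helper, the sketch below shows how the text of a reply can be reassembled from that event sequence. The event types are heavily simplified (real events carry more fields), and collectText is a hypothetical helper name. The only parts taken from Anthropic's protocol are the event names and the text_delta shape inside content_block_delta.

```typescript
// Simplified view of the named events; real payloads include more fields.
type StreamEvent =
  | { type: "message_start" }
  | { type: "content_block_start"; index: number }
  | { type: "content_block_delta"; index: number; delta: { type: string; text?: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta" }
  | { type: "message_stop" };

// Concatenates the text carried by text_delta events, ignoring everything else.
function collectText(events: StreamEvent[]): string {
  let text = "";
  for (const ev of events) {
    if (ev.type === "content_block_delta" && ev.delta.type === "text_delta") {
      text += ev.delta.text ?? "";
    }
  }
  return text;
}
```

Because a cache hit replays the same event names in the same order, a consumer written this way should not need to distinguish live responses from cached ones.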

Plan limits on Claude models

Managed mode runs the same whitelist on every tier (gpt-4o-mini only) to bound our upstream cost exposure. Any Claude model (haiku, sonnet, opus) on /v1/messages returns 403 plan_limit_error in managed mode. To call those, enable BYOK on Pro or Scale: the whitelist no longer applies once your own Anthropic key is configured, and you keep all ProxyLLM features (cache, dashboard, rate limits, alerts). See Plan limits for the full matrix.
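Client code usually wants to tell a plan-limit rejection apart from other failures so it can surface a useful message. The sketch below does that from the HTTP status and the parsed response body. It assumes the error type sits at error.type in an Anthropic-style envelope; this page names the error types but does not show the envelope shape, so check a real response before relying on it. ErrorEnvelope, ErrorAction, and classifyError are hypothetical names.

```typescript
// Assumed envelope shape: { error: { type, message } }. Not confirmed by this page.
type ErrorEnvelope = { error?: { type?: string; message?: string } };

type ErrorAction = "fix_request" | "enable_byok" | "other";

// Maps the two error types named in this document to a suggested next step.
function classifyError(status: number, body: ErrorEnvelope): ErrorAction {
  const kind = body.error?.type;
  // 400 validation_error: a required field is missing from the request.
  if (status === 400 && kind === "validation_error") return "fix_request";
  // 403 plan_limit_error: a Claude model was requested in managed mode.
  if (status === 403 && kind === "plan_limit_error") return "enable_byok";
  return "other";
}
```

An "enable_byok" result is the cue to point the user at the BYOK setting on Pro or Scale rather than retrying the request.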

What ProxyLLM forwards verbatim

The tools, tool_choice, thinking, system (string or array), stop_sequences, temperature, top_p, top_k, and metadata.user_id fields are passed through unchanged. ProxyLLM adds x-proxyllm-* observability headers to the response.
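If you have access to the raw response headers, the observability headers can be picked out by prefix. The helper below is a minimal sketch. It takes header name/value pairs as an array so it has no runtime dependencies, and proxyllmHeaders is a hypothetical name. The individual header names are not listed on this page, so the code matches only on the x-proxyllm- prefix.

```typescript
// Collects every header whose name starts with "x-proxyllm-" (case-insensitive).
// Keys in the result are lower-cased.
function proxyllmHeaders(pairs: Array<[string, string]>): { [name: string]: string } {
  const prefix = "x-proxyllm-";
  const out: { [name: string]: string } = {};
  for (const [name, value] of pairs) {
    const lower = name.toLowerCase();
    if (lower.slice(0, prefix.length) === prefix) {
      out[lower] = value;
    }
  }
  return out;
}
```

With a fetch-style Headers object you can build the input array by pushing each [name, value] pair inside headers.forEach.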