Prompt caching

Prompt caching is one of the cheapest cost wins you can make. If your system prompt, few-shot examples, or RAG context stay the same across requests, you're paying full price to process them every single time. Caching lets you pay full price once, then reuse the processed prefix at a steep discount for as long as the cache entry lives (5 minutes by default).

How it works

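Add cache_control markers to content blocks in your request. Everything up to and including a marked block becomes a cached prefix, so put stable content first and per-request content after the last marker: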

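import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await client.messages.create({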
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a customer support agent for Acme Corp...",
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: fewShotExamples,  // large, stable
          cache_control: { type: "ephemeral" },
        },
        {
          type: "text",
          text: userQuery,  // changes per request
        },
      ],
    },
  ],
});

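The first request pays full price plus a 25% write premium on the cached portion for creating the cache entry. Subsequent requests within 5 minutes get 90% off the cached portion, provided the prefix is identical up to the marker. Each cache hit resets the TTL to a fresh 5 minutes, so a steady stream of requests keeps the entry alive.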
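To make the pricing concrete, here is a minimal sketch of how one request's input cost breaks down under the multipliers described above (1.25x for cache writes, 0.1x for cache reads). The `inputCostUsd` helper and the `Usage` interface are illustrative and not part of the SDK, and the base price of $5 per million input tokens is an assumption; substitute your model's actual rate.

```typescript
// Token counts as reported in the `usage` object of a response.
interface Usage {
  input_tokens: number;                // uncached tokens, billed at the base rate
  cache_creation_input_tokens: number; // tokens written to the cache
  cache_read_input_tokens: number;     // tokens served from the cache
}

const CACHE_WRITE_MULTIPLIER = 1.25; // base price plus the 25% write premium
const CACHE_READ_MULTIPLIER = 0.1;   // 90% off the base price

// Hypothetical helper: input cost in USD for a single request.
function inputCostUsd(usage: Usage, basePricePerMTok: number): number {
  const perToken = basePricePerMTok / 1_000_000;
  return perToken * (
    usage.input_tokens +
    usage.cache_creation_input_tokens * CACHE_WRITE_MULTIPLIER +
    usage.cache_read_input_tokens * CACHE_READ_MULTIPLIER
  );
}

// The first request writes a 3,500-token prefix; the second reads it back.
const firstRequest = inputCostUsd(
  { input_tokens: 500, cache_creation_input_tokens: 3500, cache_read_input_tokens: 0 },
  5, // assumed base price in USD per million input tokens
);
const secondRequest = inputCostUsd(
  { input_tokens: 500, cache_creation_input_tokens: 0, cache_read_input_tokens: 3500 },
  5,
);
console.log(firstRequest.toFixed(4), secondRequest.toFixed(4));
```

The first request comes out slightly more expensive than an uncached one; every hit after that is where the discount shows up.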

What to cache

Always cache: system prompts, persona instructions, few-shot examples, tool definitions, RAG retrieval context that's shared across a conversation.

Don't cache: the user's current message, one-off context, anything that changes per request.
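As a sketch of how these two lists map onto a request, the hypothetical `buildParams` helper below puts stable content first and the per-request query last. It assumes the prefix is assembled in the order tools, system, messages, and that tool definitions accept `cache_control`. The types are simplified stand-ins for the SDK's own, and the `lookup_order` tool is made up for illustration.

```typescript
type CacheControl = { type: "ephemeral" };
type TextBlock = { type: "text"; text: string; cache_control?: CacheControl };

interface Tool {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, unknown> };
  cache_control?: CacheControl;
}

const EPHEMERAL: CacheControl = { type: "ephemeral" };

// Hypothetical helper: assemble request parameters with stable content up front.
function buildParams(systemPrompt: string, examples: string, userQuery: string) {
  const tools: Tool[] = [
    {
      name: "lookup_order", // made-up tool for illustration
      description: "Look up an order by ID.",
      input_schema: { type: "object", properties: { order_id: { type: "string" } } },
      cache_control: EPHEMERAL, // marker on the last tool covers all tool definitions
    },
  ];
  const system: TextBlock[] = [
    { type: "text", text: systemPrompt, cache_control: EPHEMERAL },
  ];
  const content: TextBlock[] = [
    { type: "text", text: examples, cache_control: EPHEMERAL }, // stable across requests
    { type: "text", text: userQuery },                          // changes per request
  ];
  return { tools, system, messages: [{ role: "user" as const, content }] };
}

const params = buildParams("You are a support agent...", "Example 1...", "Where is my order?");
console.log(JSON.stringify(params, null, 2));
```

Only the final block, the user's query, carries no marker, so it never becomes part of a cached prefix.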

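Rule of thumb: if the content is the same across 3+ consecutive requests, cache it. The math is forgiving: a cache write costs 1.25x and a hit costs 0.1x, so a single hit within the TTL already more than recoups the 25% write premium (1.35x for two requests instead of 2x). The only losing case is a write that is never read.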
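The break-even point can be checked with a few lines of arithmetic. This sketch compares the cumulative cost of a stable prefix with and without caching, measured in units of one full-price pass over that prefix. It assumes every request after the first lands within the TTL, and the function names are illustrative.

```typescript
const WRITE_COST = 1.25; // first request: full price plus the 25% premium
const READ_COST = 0.1;   // later requests: 90% off

// Cumulative prefix cost after `requests` requests, relative to one uncached pass.
function costWithCaching(requests: number): number {
  return requests === 0 ? 0 : WRITE_COST + (requests - 1) * READ_COST;
}

function costWithoutCaching(requests: number): number {
  return requests;
}

// Smallest number of requests at which caching is strictly cheaper.
function breakEvenRequests(): number {
  let n = 1;
  while (costWithCaching(n) >= costWithoutCaching(n)) n++;
  return n;
}

for (const n of [1, 2, 3, 10]) {
  console.log(n, costWithCaching(n).toFixed(2), costWithoutCaching(n).toFixed(2));
}
console.log("caching is cheaper from request", breakEvenRequests());
```

Under these assumptions a lone request costs 25% more with caching than without, and caching pulls ahead as soon as a second request reuses the prefix.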

Measuring cache performance

The API response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens. Track these to compute your hit rate:

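const read = message.usage.cache_read_input_tokens ?? 0;
const written = message.usage.cache_creation_input_tokens ?? 0;
// Share of cacheable tokens served from cache; 0 when nothing was cacheable.
const hitRate = read + written > 0 ? read / (read + written) : 0;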

As a rough target, a healthy production system has a cache hit rate above 80%. A rate below 60% usually means your requests aren't sharing enough prefix, or the TTL is expiring between requests.
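A single response's ratio is noisy, so in practice you would likely aggregate over many requests. The sketch below is one way to do that. `CacheStats` is a hypothetical helper, not an SDK class, and it treats missing or null usage fields as zero.

```typescript
interface CacheUsage {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

class CacheStats {
  private read = 0;
  private written = 0;

  // Call once per response, passing the response's usage object.
  record(usage: CacheUsage): void {
    this.read += usage.cache_read_input_tokens ?? 0;
    this.written += usage.cache_creation_input_tokens ?? 0;
  }

  // Token-weighted hit rate; 0 when no cacheable tokens have been seen.
  hitRate(): number {
    const total = this.read + this.written;
    return total === 0 ? 0 : this.read / total;
  }
}

const stats = new CacheStats();
stats.record({ cache_creation_input_tokens: 3500, cache_read_input_tokens: 0 }); // one miss
for (let i = 0; i < 9; i++) {
  stats.record({ cache_creation_input_tokens: 0, cache_read_input_tokens: 3500 }); // nine hits
}
console.log(stats.hitRate()); // 0.9
```

Weighting by tokens rather than by request count means a miss on a large prefix hurts the number more than a miss on a small one, which matches how the bill behaves.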

Real savings

A typical customer support prompt: 2,000-token system prompt + 1,500-token examples + 500-token user query.

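Metric	Without caching	With caching (cache hit)
Input tokens billed	4,000 at full price	500 at full price + 3,500 at the cached rate
Input cost per request (Opus)	$0.020	about $0.004
Monthly input cost (10K requests)	$200	about $43
Savings	n/a	about 79%

These figures cover input tokens only, use the rate implied by $0.020 for 4,000 tokens ($5 per million), and assume nearly every request is a cache hit. A cached request costs 500 tokens at full price plus 3,500 tokens at 10% of it, or $0.00425; occasional cache writes add a little on top.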
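The with-caching figures depend on how often requests actually hit the cache. Here is a rough projection using the same prompt sizes. The $5 per million input tokens is the rate implied by the $0.020 figure for 4,000 tokens. `monthlyInputCostUsd` and its parameters are illustrative, and output tokens are ignored.

```typescript
const BASE_PRICE_PER_TOKEN = 5 / 1_000_000; // assumed: $0.020 / 4,000 tokens
const PREFIX_TOKENS = 3500; // system prompt + examples
const QUERY_TOKENS = 500;   // per-request user query

// Projected monthly input cost for a given request-level hit rate (0..1).
function monthlyInputCostUsd(requests: number, hitRate: number): number {
  const hitTokens = PREFIX_TOKENS * 0.1;   // cached read, 90% off
  const missTokens = PREFIX_TOKENS * 1.25; // cache write, 25% premium
  const prefixTokensBilled = hitRate * hitTokens + (1 - hitRate) * missTokens;
  return requests * BASE_PRICE_PER_TOKEN * (QUERY_TOKENS + prefixTokensBilled);
}

const uncached = 10_000 * BASE_PRICE_PER_TOKEN * (PREFIX_TOKENS + QUERY_TOKENS);
for (const rate of [1.0, 0.8, 0.6]) {
  const cost = monthlyInputCostUsd(10_000, rate);
  const savings = (1 - cost / uncached) * 100;
  console.log(`hit rate ${rate}: $${cost.toFixed(2)} (${savings.toFixed(1)}% saved)`);
}
```

Misses cost more than an uncached request because they pay the write premium, so savings fall off faster than the hit rate does.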

What's next

Caching saves money on real-time requests. For batch workloads (classification, extraction, evaluation), the Batches API offers a further 50% cost reduction. That's the next lesson.
