Lesson 3 of 6
Few-shot examples
What you'll learn
- Decide when zero-shot beats few-shot (and vice versa)
- Pick the right number and ordering of examples
- Use diverse examples to avoid pattern-matching traps
Few-shot prompting is the single most effective technique for improving output quality. Instead of describing what you want in more and more words, you show the model what you want with concrete examples. This lesson covers when to use it, how many examples to provide, and how to avoid the traps.
Zero-shot vs. few-shot
Zero-shot means you give the model an instruction with no examples. For many tasks — especially ones the model has seen heavily in training like summarization or translation — zero-shot works fine.
Few-shot means you provide input/output pairs before the actual task. The model uses these pairs to infer the pattern you want. This is called in-context learning: the model is not being retrained, it is learning your pattern at inference time from the examples in the prompt.
Here is a zero-shot classification prompt:
Classify this support ticket as "billing", "technical", or "account":
"I can't log in since the password reset yesterday."
Claude will probably get this right. But what about ambiguous cases? What if your internal definition of "account" differs from what the model assumes? That is where few-shot shines.
The same task with examples
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=64,
    system="Classify support tickets. Respond with only the category.",
    messages=[
        {"role": "user", "content": "Ticket: \"I was charged twice for March.\"\nCategory:"},
        {"role": "assistant", "content": "billing"},
        {"role": "user", "content": "Ticket: \"The API returns 500 when I send a batch over 100 items.\"\nCategory:"},
        {"role": "assistant", "content": "technical"},
        {"role": "user", "content": "Ticket: \"I need to change the email on my team account.\"\nCategory:"},
        {"role": "assistant", "content": "account"},
        {"role": "user", "content": "Ticket: \"I can't log in since the password reset yesterday.\"\nCategory:"},
    ],
)
print(message.content[0].text)
With three examples, the model now understands your exact taxonomy. The login issue could be "technical" or "account" — but your example of "change the email on my team account" being classified as "account" signals that credential-related issues fall under "account" in your system.
How many examples?
Research and practical experience converge on this guidance:
- 0 examples: Fine when the task is unambiguous and well-known.
- 2-3 examples: The sweet spot for most classification and formatting tasks.
- 4-5 examples: Useful when the task has subtle distinctions or edge cases.
- More than 5: Rarely improves quality and costs more tokens. If you need more than 5, consider whether fine-tuning or a better instruction would be more efficient.
The curve of quality vs. number of examples usually flattens between 3 and 5. Beyond that, you are paying for tokens without meaningful improvement.
Picking the right examples
This is where most people go wrong. The temptation is to pick examples that are easy and similar to each other. Resist it.
Rule 1: Cover the edges. If you have 3 categories, make sure at least one example hits each category. If you show 3 examples that are all "billing", the model may default to "billing" for ambiguous inputs.
Rule 2: Include an ambiguous case. Show the model a borderline example and your intended classification. This teaches the boundary more effectively than any instruction.
Rule 3: Vary the surface form. If all your example tickets are one sentence and formal, the model may struggle with a ticket that is three sentences and casual. Show different lengths, tones, and phrasings.
Rule 4: Order matters less than you think, but not zero. Recent research suggests that putting the most representative examples last (closest to the actual query) can slightly improve accuracy. In practice, consistent formatting matters more than order.
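The rules above are easier to follow if you keep labeled examples in one place and build the message turns programmatically. A sketch, assuming the turn format from the API example earlier in this lesson (both helper names are hypothetical):

```python
def few_shot_messages(examples: list[tuple[str, str]], query: str) -> list[dict]:
    """Build alternating user/assistant turns from (ticket, category) pairs."""
    messages = []
    for ticket, category in examples:
        messages.append({"role": "user", "content": f'Ticket: "{ticket}"\nCategory:'})
        messages.append({"role": "assistant", "content": category})
    # The actual query goes last, in the same format as the examples.
    messages.append({"role": "user", "content": f'Ticket: "{query}"\nCategory:'})
    return messages

def covers_all(examples: list[tuple[str, str]], labels: set[str]) -> bool:
    """Rule 1 check: every label appears in at least one example."""
    return labels <= {category for _, category in examples}
```

Asserting `covers_all(...)` before sending a request turns Rule 1 from a guideline into a guardrail.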
The pattern-matching trap
Here is a subtle failure. Suppose all your examples have negative sentiment and the real input is positive:
Example 1: "This is terrible" → negative
Example 2: "Worst experience ever" → negative
Example 3: "I hate this product" → negative
Actual: "I love this product" → ???
A weak model might classify "I love this product" as negative because it learned "all inputs in this context are negative" rather than "classify by sentiment." This is the pattern-matching trap. The fix is diversity: always include examples from different classes and different tones.
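One mechanical way to guarantee that diversity is to pick the example subset round-robin across classes, so no class repeats before every class has appeared once. A sketch (the `balanced_subset` helper is illustrative):

```python
from collections import defaultdict

def balanced_subset(pool: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Pick up to k examples from pool, cycling across labels so no label dominates."""
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))
    picked, made_progress = [], True
    while len(picked) < k and made_progress:
        made_progress = False
        for label in by_label:
            if by_label[label] and len(picked) < k:
                picked.append(by_label[label].pop(0))
                made_progress = True
    return picked
```

Given a pool that is mostly negative, `balanced_subset(pool, 2)` would still return one negative and one positive example, sidestepping the trap above.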
Using XML tags for structure
Claude works particularly well with XML-delimited examples. Instead of relying on conversational turn-taking, you can structure examples explicitly:
<examples>
<example>
<input>I was charged twice for March.</input>
<output>billing</output>
</example>
<example>
<input>The API returns 500 on large batches.</input>
<output>technical</output>
</example>
<example>
<input>I need to transfer ownership of my account.</input>
<output>account</output>
</example>
</examples>
<input>I can't log in since the password reset.</input>
<output>
This format is clean, unambiguous, and easy to generate programmatically.
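As a sketch of generating that block programmatically: the builder below renders (input, output) pairs in the layout above and ends with an open `<output>` tag for the model to complete. The function name is hypothetical; the XML escaping matters because real tickets can contain `<`, `>`, or `&`.

```python
from xml.sax.saxutils import escape

def build_examples_block(pairs: list[tuple[str, str]], query: str) -> str:
    """Render (input, output) pairs as XML examples, ending with an open <output> tag."""
    lines = ["<examples>"]
    for inp, out in pairs:
        lines += ["<example>",
                  f"<input>{escape(inp)}</input>",
                  f"<output>{escape(out)}</output>",
                  "</example>"]
    lines += ["</examples>", f"<input>{escape(query)}</input>", "<output>"]
    return "\n".join(lines)
```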
What's next
Examples show the model what to produce. But for complex tasks — math, logic, multi-step reasoning — you also need to show the model how to think. The next lesson, Chain-of-thought, teaches you to make the model reason step by step before answering.
Try it yourself
Pick a classification task. Run it zero-shot, then with 3 examples, then with 8. Plot accuracy. Where does the curve flatten?
Reflect
When you add examples, do you tend to pick similar ones or diverse ones? How might that bias the output?