Vision

Vision is now reliable enough for real workflows. Claude can read receipts, parse screenshots, understand diagrams, and extract data from documents. This lesson covers the API shape, the cost trade-offs, and the patterns that work in production.

Passing images via base64

The most common way to send an image is as a base64-encoded string inside the message content array:

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic();

const imageBuffer = readFileSync("receipt.jpg");
const base64Image = imageBuffer.toString("base64");

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: base64Image,
          },
        },
        {
          type: "text",
          text: "Extract the store name, date, and total from this receipt. Return JSON.",
        },
      ],
    },
  ],
});

The content field is an array of content blocks. You can mix image and text blocks in any order. Supported formats are JPEG, PNG, GIF, and WebP.
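
One easy mistake is sending a media_type that does not match the file. The sketch below shows one way to derive it from the file name. The helper names mediaTypeFor and imageBlock are hypothetical (they are not part of the SDK), and the mapping covers only the four formats listed above.

```typescript
// Hypothetical helpers (not part of the SDK) for building base64 image blocks.
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

const MEDIA_TYPES = new Map<string, ImageMediaType>([
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["png", "image/png"],
  ["gif", "image/gif"],
  ["webp", "image/webp"],
]);

// Infer the media type from the file extension; throw on anything unsupported.
function mediaTypeFor(fileName: string): ImageMediaType {
  const ext = fileName.toLowerCase().split(".").pop() ?? "";
  const mediaType = MEDIA_TYPES.get(ext);
  if (mediaType === undefined) {
    throw new Error(`Unsupported image format: ${fileName}`);
  }
  return mediaType;
}

// Build an image content block in the same shape as the request above.
function imageBlock(fileName: string, base64Data: string) {
  return {
    type: "image" as const,
    source: {
      type: "base64" as const,
      media_type: mediaTypeFor(fileName),
      data: base64Data,
    },
  };
}
```

With this in place, imageBlock("receipt.jpg", base64Image) could replace the hand-written image block in the content array.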

Passing images via URL

If your image is publicly accessible, you can pass a URL directly:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "url",
            url: "https://example.com/receipt.jpg",
          },
        },
        {
          type: "text",
          text: "What is the total amount on this receipt?",
        },
      ],
    },
  ],
});

URL-based images avoid the base64 encoding overhead and keep your request payload smaller. Use this when images are already hosted.
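
To put a number on that overhead: base64 encodes every 3 bytes as 4 characters, so the encoded image is about one third larger than the file. The sketch below works through the arithmetic; base64Length and base64Overhead are illustrative helper names, not SDK functions.

```typescript
// Length in characters of the padded base64 encoding of `byteCount` bytes.
// Base64 maps each group of 3 bytes (the last group padded) to 4 characters.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Extra payload added by base64, as a fraction of the original size.
function base64Overhead(byteCount: number): number {
  return base64Length(byteCount) / byteCount - 1;
}

// A 1.5 MB image becomes a 2,000,000-character string in the request body.
const encodedLength = base64Length(1_500_000);
```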

Resolution and cost

Image tokens are calculated based on resolution. Claude resizes images to fit within its processing limits:

Opus 4.7 supports up to 3.75 megapixels (roughly 2576px on the long edge).
Images are scaled down if they exceed the limit. Smaller images use fewer tokens and cost less.
A typical 1000x750 image uses around 800-1000 tokens.
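
For a quick estimate, Anthropic's vision documentation gives a rule of thumb for images that do not need resizing: tokens are roughly width times height divided by 750. The sketch below applies that rule. Treat the result as an approximation, since exact counts can differ by model; estimateImageTokens is a hypothetical helper name.

```typescript
// Rough token estimate for an image that is already within the model's size limit.
// Rule of thumb: tokens ≈ (width * height) / 750. This is an approximation only.
function estimateImageTokens(widthPx: number, heightPx: number): number {
  return Math.ceil((widthPx * heightPx) / 750);
}

// The 1000x750 example above lands at the top of the quoted range.
const estimatedTokens = estimateImageTokens(1000, 750); // 1000
```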

Image tokens are billed at the same rate as input text tokens. With Sonnet 4.6 at $3 per million input tokens, a 1000-token image costs about $0.003. That is cheap for one image, but it adds up when you process thousands.
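
The arithmetic scales linearly, which makes batch costs easy to project. The sketch below assumes you already have a per-image token estimate; imageBatchCostUsd is a hypothetical helper, and the $3 price is the Sonnet 4.6 figure quoted above.

```typescript
// Input cost in USD for a batch of images.
// tokensPerImage: estimated tokens for one image.
// pricePerMillionUsd: the model's input price per million tokens.
function imageBatchCostUsd(
  tokensPerImage: number,
  imageCount: number,
  pricePerMillionUsd: number,
): number {
  return (tokensPerImage * imageCount * pricePerMillionUsd) / 1_000_000;
}

const singleImageCost = imageBatchCostUsd(1000, 1, 3); // about 0.003
const tenThousandImagesCost = imageBatchCostUsd(1000, 10_000, 3); // 30
```

This counts image input tokens only. The text prompt and the model's output are billed on top of it.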

Practical guidance: If you only need to read text from a receipt, resize the image to about 1000px on the long edge before sending. For clearly printed text, accuracy is usually comparable at a fraction of the token cost. If you need fine detail, such as small text in a dense diagram, send the full resolution up to the model's limit.
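
The target dimensions for such a resize can be computed without any image library. The sketch below preserves the aspect ratio and never upscales; fitLongEdge is a hypothetical helper, and the actual pixel resizing is left to whatever image library you already use.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scale dimensions so the longer edge is at most maxLongEdge, keeping the aspect ratio.
// Images already within the limit are returned unchanged (never upscaled).
function fitLongEdge(size: Size, maxLongEdge: number): Size {
  const longEdge = Math.max(size.width, size.height);
  if (longEdge <= maxLongEdge) {
    return { width: size.width, height: size.height };
  }
  const scale = maxLongEdge / longEdge;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

// A 12-megapixel phone photo shrinks to 1000x750 before upload.
const resizeTarget = fitLongEdge({ width: 4032, height: 3024 }, 1000);
```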

Document understanding with PDFs

Claude can process PDF documents. Send them as base64-encoded content with the document type:

const pdfBuffer = readFileSync("invoice.pdf");
const base64Pdf = pdfBuffer.toString("base64");

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "document",
          source: {
            type: "base64",
            media_type: "application/pdf",
            data: base64Pdf,
          },
        },
        {
          type: "text",
          text: "Extract all line items with their quantities and prices. Return as a JSON array.",
        },
      ],
    },
  ],
});

PDFs are rendered page-by-page. Token cost scales with page count and visual complexity. For large PDFs, consider sending only the pages you need.

Structured data extraction

Vision is most powerful when combined with structured output. Tell Claude exactly what shape you need:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: { type: "base64", media_type: "image/png", data: screenshotBase64 },
        },
        {
          type: "text",
          text: `Analyze this UI screenshot. Return JSON:
{
  "page_type": "login | dashboard | settings | other",
  "primary_action": "string — the main CTA button text",
  "navigation_items": ["string"],
  "error_messages": ["string"] | null
}`,
        },
      ],
    },
  ],
});

This pattern is the foundation of automated testing, accessibility audits, and content extraction pipelines. Claude reads the image, interprets its structure, and returns the fields you asked for. Because the reply is still model-generated text, parse and validate it before passing it downstream.
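
Since the reply arrives as text, it is worth checking it against the requested shape before using it. The sketch below is one possible validator for the schema in the prompt above. The function name parseScreenshotAnalysis and the error messages are assumptions made for illustration.

```typescript
// Shape requested in the prompt above.
interface ScreenshotAnalysis {
  page_type: "login" | "dashboard" | "settings" | "other";
  primary_action: string;
  navigation_items: string[];
  error_messages: string[] | null;
}

const PAGE_TYPES = ["login", "dashboard", "settings", "other"];

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parse the model's text reply and validate it against the requested shape.
// Tolerates prose or a code fence around the JSON object.
function parseScreenshotAnalysis(replyText: string): ScreenshotAnalysis {
  const start = replyText.indexOf("{");
  const end = replyText.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found in reply");

  const parsed: unknown = JSON.parse(replyText.slice(start, end + 1));
  if (typeof parsed !== "object" || parsed === null) throw new Error("Reply is not a JSON object");
  const obj = parsed as Record<string, unknown>;

  const pageType = obj["page_type"];
  const primaryAction = obj["primary_action"];
  const navigationItems = obj["navigation_items"];
  const errorMessages = obj["error_messages"];

  if (typeof pageType !== "string" || PAGE_TYPES.indexOf(pageType) === -1) {
    throw new Error("Invalid page_type");
  }
  if (typeof primaryAction !== "string") throw new Error("Invalid primary_action");
  if (!isStringArray(navigationItems)) throw new Error("Invalid navigation_items");

  let errors: string[] | null;
  if (errorMessages === null) errors = null;
  else if (isStringArray(errorMessages)) errors = errorMessages;
  else throw new Error("Invalid error_messages");

  return {
    page_type: pageType as ScreenshotAnalysis["page_type"],
    primary_action: primaryAction,
    navigation_items: navigationItems,
    error_messages: errors,
  };
}
```

The input would typically be the text of the first text block in response.content. If validation fails, a retry or a fallback path is usually safer than using a partial result.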

Multiple images in one request

You can send several images in a single message:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: beforeImage } },
        { type: "image", source: { type: "base64", media_type: "image/png", data: afterImage } },
        { type: "text", text: "Compare these two screenshots. What changed between them?" },
      ],
    },
  ],
});

This is useful for visual diff comparisons, before/after analysis, and multi-page document processing. Token cost is the sum of all images plus the text.
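
When a request carries several images, a short text label before each one ("Image 1:", "Image 2:") lets the prompt refer to them unambiguously. The sketch below builds such a content array; labeledImageContent and the local block types are hypothetical and only mirror the block shapes used in the request above.

```typescript
type ImageSource = {
  type: "base64";
  media_type: "image/jpeg" | "image/png" | "image/gif" | "image/webp";
  data: string;
};
type ImageBlock = { type: "image"; source: ImageSource };
type TextBlock = { type: "text"; text: string };
type ContentBlock = ImageBlock | TextBlock;

// Interleave "Image N:" labels with the images, then append the question.
function labeledImageContent(sources: ImageSource[], question: string): ContentBlock[] {
  const content: ContentBlock[] = [];
  sources.forEach((source, index) => {
    content.push({ type: "text", text: `Image ${index + 1}:` });
    content.push({ type: "image", source });
  });
  content.push({ type: "text", text: question });
  return content;
}
```

The returned array can be passed as the content of the user message, and the question can then say "What changed between Image 1 and Image 2?".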

Vision best practices

Resize before sending. Match resolution to your accuracy needs. No point sending a 4K screenshot when you only need to read a button label.
Be specific in your text prompt. "Extract the total" works better than "What do you see?" Claude's vision output is guided by the text instruction.
Combine with tool use. For complex pipelines, use vision to extract data and tools to act on it — for example, read an invoice with vision, then call a create_expense tool with the extracted amounts.
Test with edge cases. Blurry images, rotated text, low contrast — test these before production. Claude handles them better than you might expect, but always verify.
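
To make the tool-use practice concrete, here is a minimal sketch of what a create_expense tool definition and the matching response handling might look like. The schema fields (vendor, total, currency) and the findToolInput helper are assumptions made for illustration; only the tool name comes from the text above.

```typescript
// Hypothetical tool definition; the fields in input_schema are illustrative.
const createExpenseTool = {
  name: "create_expense",
  description: "Record an expense extracted from an invoice or receipt.",
  input_schema: {
    type: "object" as const,
    properties: {
      vendor: { type: "string", description: "Vendor name as printed on the invoice" },
      total: { type: "number", description: "Invoice total" },
      currency: { type: "string", description: "ISO 4217 currency code, for example USD" },
    },
    required: ["vendor", "total", "currency"],
  },
};

// Minimal view of the response content blocks this sketch cares about.
type ResponseBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Return the input of the first tool_use block for the named tool, if any.
function findToolInput(content: ResponseBlock[], toolName: string): unknown {
  for (const block of content) {
    if (block.type === "tool_use" && block.name === toolName) {
      return block.input;
    }
  }
  return undefined;
}
```

In a real request, the definition would go in the tools array alongside the image message, and findToolInput would be applied to the content of the response. Validate the extracted amounts before writing them anywhere.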

What's next

You can now make Claude see. The next lesson, Prompt caching, teaches you how to cut costs by caching the parts of your prompt that do not change between requests. For workloads with a large shared prefix, this is often the biggest cost optimization available.
