Lesson 2 of 6
Tokens and the context window
What you'll learn
- Define a token and count tokens in real text
- Explain why the context window is finite — and what happens at the edge
- Plan prompts that fit and degrade gracefully
Tokens are the model's heartbeat. The context window is its memory. Together they explain almost every strange LLM behavior you will ever encounter — from garbled output to forgotten instructions to wildly different costs for the same task.
Tokens are not words
When you type "unhappiness" into an AI model, the model does not see one word. It sees something like two tokens: "un" and "happiness". The exact split depends on the tokenizer, but the principle is universal: LLMs do not operate on words. They operate on tokens — sub-word pieces that the tokenizer algorithm has learned are efficient building blocks.
Common English words are usually one token. Less common words get split. "Tokenization" might become "token" + "ization". Numbers often get split digit by digit. Code has its own patterns: variable names get fragmented, but common keywords like "function" or "return" stay whole.
Here are some rough rules of thumb for English text:
- 1 token is approximately 4 characters or 0.75 words
- A typical page of text is around 300-400 tokens
- 1,000 tokens is roughly 750 words
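These rules of thumb can be folded into a quick back-of-the-envelope estimator. This is a minimal sketch of the heuristics above, not a real tokenizer; the function name and the 4-characters-per-token ratio are assumptions for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text, assuming ~4 characters per token.

    A heuristic only; actual counts depend on the model's tokenizer.
    """
    return max(1, round(len(text) / 4))

# A 44-character sentence comes out to roughly 11 tokens:
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))

# And the 1,000-tokens-to-750-words rule of thumb:
print(int(1000 * 0.75))
```

For anything cost-sensitive, measure with the provider's actual tokenizer instead of estimating.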
Arabic tokenizes very differently
This is where it gets important for our audience. Arabic text generally requires more tokens per word than English. Why? Because most tokenizers were trained predominantly on English data. The tokenizer learned efficient sub-word units for English patterns, but Arabic's rich morphology (root systems, prefixes, suffixes, diacritics) results in less efficient tokenization.
A practical example: the Arabic word "يتعلّمون" (they learn) might be split into 3-4 tokens, while the English "learn" is a single token. The same meaning, but the Arabic version costs 3-4 times more in token budget. This has real consequences:
- Arabic prompts use more of your context window
- Arabic responses cost more per word in API pricing
- You may need to be more concise in Arabic prompts to stay within limits
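The budget impact is easy to make concrete with arithmetic. The 3x inflation factor below is an illustrative assumption drawn from the "يتعلّمون" example above; real ratios vary by tokenizer and by text:

```python
# Same content, different token costs.
english_tokens_per_word = 1.33  # inverse of ~0.75 words per token
arabic_inflation = 3.0          # ASSUMPTION: Arabic words cost ~3x the tokens

words = 500  # a roughly one-page prompt
english_tokens = words * english_tokens_per_word
arabic_tokens = english_tokens * arabic_inflation

print(f"English: ~{english_tokens:.0f} tokens")
print(f"Arabic:  ~{arabic_tokens:.0f} tokens")
# The same one-page prompt consumes roughly three times more of the
# context window, and costs roughly three times more per API call.
```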
This is not a flaw in Arabic — it is a flaw in tokenizer training that the industry is actively working to fix. But as a practitioner, you need to know it exists.
The context window: your model's working memory
The context window is the total number of tokens the model can hold at one time. Think of it as a whiteboard: everything the model can "see" — your system prompt, the conversation history, your current message, and the response it is generating — must fit on this whiteboard. Anything that does not fit simply does not exist for the model.
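One way to internalize the whiteboard model is to write the accounting down. A minimal sketch, in which the 200,000-token default window and every token count in the example calls are illustrative assumptions; a real application would measure each component with the model's tokenizer:

```python
def fits_in_window(system_prompt: int, history: int, message: int,
                   max_response: int, window: int = 200_000) -> bool:
    """Return True if every component fits on the 'whiteboard' at once.

    All arguments are token counts. The default window size and the
    example numbers below are assumptions for illustration.
    """
    return system_prompt + history + message + max_response <= window

# Mid-conversation, everything still fits:
print(fits_in_window(system_prompt=2_000, history=150_000,
                     message=5_000, max_response=8_000))

# Forty thousand tokens of history later, it no longer does:
print(fits_in_window(system_prompt=2_000, history=190_000,
                     message=5_000, max_response=8_000))
```

Note that the response budget counts too: a model cannot generate a long answer if the input has already consumed the whole window.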
Here are the context windows for major models as of April 2026:
- Claude Opus 4.7: 1,000,000 tokens (roughly 750,000 words)
- Claude Sonnet 4.6: 1,000,000 tokens
- Claude Haiku 4.5: 200,000 tokens
- GPT-4o: 128,000 tokens
- Gemini 2.5 Pro: 1,000,000 tokens
A million tokens sounds enormous — and it is. You could fit an entire novel, plus all your conversation history, plus a large codebase. But context windows fill up faster than you think, especially in long conversations or when processing large documents.
What happens when you exceed the window
Different systems handle context overflow differently, but the general principle is the same: something gets dropped. In most chat interfaces, the oldest messages in the conversation are silently removed to make room for new ones. The model does not "remember" those dropped messages. It does not know they existed.
This is why you sometimes notice an AI assistant "forgetting" instructions you gave early in a long conversation. Those instructions were pushed out of the context window by newer messages. The model is not being careless — it literally cannot see them anymore.
Some systems use more sophisticated strategies. They might summarize older messages instead of dropping them entirely. They might keep the system prompt pinned at the beginning even as middle messages are removed. But the fundamental constraint remains: the window is finite.
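The simplest of these strategies, drop the oldest messages but keep the system prompt pinned, can be sketched in a few lines. The message representation and the per-message token counts here are hypothetical, not any particular vendor's format:

```python
def trim_history(messages, budget):
    """Drop the oldest non-system messages until the total fits the budget.

    `messages` is a list of (role, token_count) pairs whose first entry is
    the pinned system prompt. Dropped messages simply cease to exist for
    the model; it never knows they were there.
    """
    system, rest = messages[0], list(messages[1:])
    while rest and system[1] + sum(t for _, t in rest) > budget:
        rest.pop(0)  # silently remove the oldest message
    return [system] + rest

conversation = [("system", 50), ("user", 400), ("assistant", 300),
                ("user", 200), ("assistant", 100)]
trimmed = trim_history(conversation, budget=700)
print([role for role, _ in trimmed])
```

Running this drops the first user message: the total of 1,050 tokens exceeds the 700-token budget, and removing the oldest 400-token message brings it down to 650.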
Practical implications
Understanding tokens and context windows changes how you work with AI:
Plan for the budget. If you are building a system prompt that will be used in a long conversation, keep it concise. Every token in your system prompt is a token unavailable for conversation history and responses.
Put important information early and late. Research shows that LLMs pay strongest attention to the beginning and end of the context window. Information buried in the middle of a very long context is more likely to be missed. This is sometimes called the "lost in the middle" effect.
Structure long documents. If you are feeding a long document to an AI, add clear headings and structure. This helps the model's attention mechanisms find relevant sections even in a large context.
Watch your costs. API pricing is per token — both input and output. A prompt with 10,000 tokens of context costs roughly 10 times more than a prompt with 1,000 tokens, even if the question is identical. With Claude Opus 4.7 at $5 per million input tokens and $25 per million output tokens, this adds up.
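Those numbers translate directly into a small cost formula. The default prices below are the Claude Opus 4.7 figures quoted in this lesson; treat them as a snapshot and check current pricing before relying on them:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 5.0, output_price: float = 25.0) -> float:
    """Total API cost in dollars; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Identical question, ten times the attached context, ten times the input cost:
print(request_cost(1_000, 0))
print(request_cost(10_000, 0))

# Output tokens are priced higher, so long responses dominate small prompts:
print(request_cost(1_000, 1_000))
```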
What's next
Now you know what the model predicts (the next token) and the memory it works within (the context window). But what happens when the patterns the model learned during training do not match reality? In the next lesson, we tackle the most important trust question: hallucinations — why they happen and how to catch them.
Tokens and the context window
Tokens are not whole words. They are sub-word pieces of text that the model uses as building blocks. A word like "unhappiness" becomes two tokens: "un" and "happiness". Arabic tokenizes less efficiently because most tokenizers were trained primarily on English, so a word like "يتعلّمون" may be split into three or four tokens. This means Arabic text consumes more of the context window and costs more through APIs.
The context window is the model's working memory. Everything it sees (the system prompt, the conversation history, your current message, and the response being generated) must fit within it. Claude Opus 4.7 holds a million tokens, but the window fills up faster than you expect, especially in long conversations. When the limit is exceeded, the oldest messages are silently dropped and the model loses your earliest instructions without realizing it.
The practical takeaway: keep your prompts concise, put important information at the beginning and the end, and remember that every token costs both money and memory.
Try it yourself
Open a tokenizer tool (like the one on platform.openai.com) and tokenize the same paragraph in English and Arabic. Compare the token counts. Most people are surprised by the difference.
Reflect
How does knowing about token counts change how you would structure a long prompt? Have you ever hit a context limit without realizing why?