Lesson 2 of 6
Tokens and the context window
What you'll learn
- Define a token and count tokens in real text
- Explain why the context window is finite — and what happens at the edge
- Plan prompts that fit and degrade gracefully
Tokens are the model's heartbeat. The context window is its memory. Together they explain almost every strange LLM behavior you will ever encounter — from garbled output to forgotten instructions to wildly different costs for the same task.
Tokens are not words
When you type "unhappiness" into an AI model, the model does not see one word. It sees something like two tokens: "un" and "happiness". The exact split depends on the tokenizer, but the principle is universal: LLMs do not operate on words. They operate on tokens — sub-word pieces that the tokenizer algorithm has learned are efficient building blocks.
Common English words are usually one token. Less common words get split. "Tokenization" might become "token" + "ization". Numbers often get split digit by digit. Code has its own patterns: variable names get fragmented, but common keywords like "function" or "return" stay whole.
Here are some rough rules of thumb for English text:
- 1 token is approximately 4 characters or 0.75 words
- A typical page of text is around 300-400 tokens
- 1,000 tokens is roughly 750 words
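These rules of thumb can be folded into a quick back-of-the-envelope estimator. This is a minimal sketch of the heuristics above, not a real tokenizer; the function name and the 4-characters-per-token ratio are assumptions for illustration:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text, assuming ~4 characters per token.

    A heuristic only; actual counts depend on the model's tokenizer.
    """
    return max(1, round(len(text) / 4))

# A 44-character sentence comes out to roughly 11 tokens:
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))

# And the 1,000-tokens-to-750-words rule of thumb:
print(int(1000 * 0.75))
```

For anything cost-sensitive, measure with the provider's actual tokenizer instead of estimating.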
Arabic tokenizes very differently
This is where it gets important for our audience. Arabic text generally requires more tokens per word than English. Why? Because most tokenizers were trained predominantly on English data. The tokenizer learned efficient sub-word units for English patterns, but Arabic's rich morphology (root systems, prefixes, suffixes, diacritics) results in less efficient tokenization.
A practical example: the Arabic word "يتعلّمون" (they learn) might be split into 3-4 tokens, while the English "learn" is a single token. The same meaning, but the Arabic version costs 3-4 times more in token budget. This has real consequences:
- Arabic prompts use more of your context window
- Arabic responses cost more per word in API pricing
- You may need to be more concise in Arabic prompts to stay within limits
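The budget impact is easy to make concrete with arithmetic. The 3x inflation factor below is an illustrative assumption drawn from the "يتعلّمون" example above; real ratios vary by tokenizer and by text:

```python
# Same content, different token costs.
english_tokens_per_word = 1.33  # inverse of ~0.75 words per token
arabic_inflation = 3.0          # ASSUMPTION: Arabic words cost ~3x the tokens

words = 500  # a roughly one-page prompt
english_tokens = words * english_tokens_per_word
arabic_tokens = english_tokens * arabic_inflation

print(f"English: ~{english_tokens:.0f} tokens")
print(f"Arabic:  ~{arabic_tokens:.0f} tokens")
# The same one-page prompt consumes roughly three times more of the
# context window, and costs roughly three times more per API call.
```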
This is not a flaw in Arabic — it is a flaw in tokenizer training that the industry is actively working to fix. But as a practitioner, you need to know it exists.
The context window: your model's working memory
The context window is the total number of tokens the model can hold at one time. Think of it as a whiteboard: everything the model can "see" — your system prompt, the conversation history, your current message, and the response it is generating — must fit on this whiteboard. Anything that does not fit simply does not exist for the model.
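One way to internalize the whiteboard model is to write the accounting down. A minimal sketch, in which the 200,000-token default window and every token count in the example calls are illustrative assumptions; a real application would measure each component with the model's tokenizer:

```python
def fits_in_window(system_prompt: int, history: int, message: int,
                   max_response: int, window: int = 200_000) -> bool:
    """Return True if every component fits on the 'whiteboard' at once.

    All arguments are token counts. The default window size and the
    example numbers below are assumptions for illustration.
    """
    return system_prompt + history + message + max_response <= window

# Mid-conversation, everything still fits:
print(fits_in_window(system_prompt=2_000, history=150_000,
                     message=5_000, max_response=8_000))

# Forty thousand tokens of history later, it no longer does:
print(fits_in_window(system_prompt=2_000, history=190_000,
                     message=5_000, max_response=8_000))
```

Note that the response budget counts too: a model cannot generate a long answer if the input has already consumed the whole window.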
Here are the context windows for major models as of April 2026:
- Claude Opus 4.7: 1,000,000 tokens (roughly 750,000 words)
- Claude Sonnet 4.6: 1,000,000 tokens
- Claude Haiku 4.5: 200,000 tokens
- GPT-4o: 128,000 tokens
- Gemini 2.5 Pro: 1,000,000 tokens
A million tokens sounds enormous — and it is. You could fit an entire novel, plus all your conversation history, plus a large codebase. But context windows fill up faster than you think, especially in long conversations or when processing large documents.
What happens when you exceed the window
Different systems handle context overflow differently, but the general principle is the same: something gets dropped. In most chat interfaces, the oldest messages in the conversation are silently removed to make room for new ones. The model does not "remember" those dropped messages. It does not know they existed.
This is why you sometimes notice an AI assistant "forgetting" instructions you gave early in a long conversation. Those instructions were pushed out of the context window by newer messages. The model is not being careless — it literally cannot see them anymore.
Some systems use more sophisticated strategies. They might summarize older messages instead of dropping them entirely. They might keep the system prompt pinned at the beginning even as middle messages are removed. But the fundamental constraint remains: the window is finite.
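The simplest of these strategies, drop the oldest messages but keep the system prompt pinned, can be sketched in a few lines. The message representation and the per-message token counts here are hypothetical, not any particular vendor's format:

```python
def trim_history(messages, budget):
    """Drop the oldest non-system messages until the total fits the budget.

    `messages` is a list of (role, token_count) pairs whose first entry is
    the pinned system prompt. Dropped messages simply cease to exist for
    the model; it never knows they were there.
    """
    system, rest = messages[0], list(messages[1:])
    while rest and system[1] + sum(t for _, t in rest) > budget:
        rest.pop(0)  # silently remove the oldest message
    return [system] + rest

conversation = [("system", 50), ("user", 400), ("assistant", 300),
                ("user", 200), ("assistant", 100)]
trimmed = trim_history(conversation, budget=700)
print([role for role, _ in trimmed])
```

Running this drops the first user message: the total of 1,050 tokens exceeds the 700-token budget, and removing the oldest 400-token message brings it down to 650.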
Practical implications
Understanding tokens and context windows changes how you work with AI:
Plan for the budget. If you are building a system prompt that will be used in a long conversation, keep it concise. Every token in your system prompt is a token unavailable for conversation history and responses.
Put important information early and late. Research shows that LLMs pay strongest attention to the beginning and end of the context window. Information buried in the middle of a very long context is more likely to be missed. This is sometimes called the "lost in the middle" effect.
Structure long documents. If you are feeding a long document to an AI, add clear headings and structure. This helps the model's attention mechanisms find relevant sections even in a large context.
Watch your costs. API pricing is per token — both input and output. A prompt with 10,000 tokens of context costs roughly 10 times more than a prompt with 1,000 tokens, even if the question is identical. With Claude Opus 4.7 at $5 per million input tokens and $25 per million output tokens, this adds up.
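Those numbers translate directly into a small cost formula. The default prices below are the Claude Opus 4.7 figures quoted in this lesson; treat them as a snapshot and check current pricing before relying on them:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 5.0, output_price: float = 25.0) -> float:
    """Total API cost in dollars; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Identical question, ten times the attached context, ten times the input cost:
print(request_cost(1_000, 0))
print(request_cost(10_000, 0))

# Output tokens are priced higher, so long responses dominate small prompts:
print(request_cost(1_000, 1_000))
```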
What's next
Now you know what the model predicts (the next token) and the memory it works within (the context window). But what happens when the patterns the model learned during training do not match reality? In the next lesson, we tackle the most important trust question: hallucinations — why they happen and how to catch them.
Tokens and the context window
Tokens are not whole words. They are sub-word pieces of text that the model uses as building blocks. A word like "unhappiness" becomes two tokens: "un" and "happiness". Arabic tokenizes less efficiently because most tokenizers were trained primarily on English, so a word like "يتعلّمون" may be split into three or four tokens. This means Arabic text consumes more of the context window and costs more through APIs.
The context window is the model's working memory. Everything it sees (the system prompt, the conversation history, your current message, and the response being generated) must fit within it. Claude Opus 4.7 holds a million tokens, but the window fills up faster than you expect, especially in long conversations. When the limit is exceeded, the oldest messages are silently dropped and the model loses your earliest instructions without realizing it.
The practical takeaway: keep your prompts concise, put important information at the beginning and the end, and remember that every token costs both money and memory.
Try it yourself
Open a tokenizer tool (like the one on platform.openai.com) and tokenize the same paragraph in English and Arabic. Compare the token counts. Most people are surprised by the difference.
Reflect
How does knowing about token counts change how you would structure a long prompt? Have you ever hit a context limit without realizing why?