The Claude 4 Model Family
Anthropic offers three models optimised for different speed, capability, and cost trade-offs. Choosing the right model for each task is the single most impactful decision for cost efficiency and output quality in production systems.
Claude Opus 4.6
Most capable model. Best for complex reasoning, nuanced analysis, and agentic tasks that require deep judgment. Highest cost, slowest speed. Use when the quality of the output directly impacts a high-stakes decision — architecture reviews, legal analysis, campaign copy that will be seen by thousands.
claude-opus-4-6
Claude Sonnet 4.6
Best balance of intelligence and speed. Ideal for most production workloads: coding, data extraction, document analysis, customer support. Sonnet handles the large majority of real-world tasks at one fifth of Opus's listed price. It is the default choice: upgrade to Opus only when you can measure the quality difference.
claude-sonnet-4-6
Claude Haiku 4.5
Fastest and most cost-efficient. Best for high-throughput applications, simple classification, quick summaries, and latency-sensitive tasks. Haiku is not a "dumb" model — it is surprisingly capable for structured tasks. Use it wherever speed matters more than depth.
claude-haiku-4-5-20251001
Detailed Comparison
| Dimension | Opus 4.6 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|
| Context window | 200K tokens | 200K tokens | 200K tokens |
| Max output | 32K tokens | 16K tokens | 8K tokens |
| Input price (per 1M tokens) | $15.00 | $3.00 | $0.80 |
| Output price (per 1M tokens) | $75.00 | $15.00 | $4.00 |
| Latency (time to first token) | ~2-5 seconds | ~0.5-2 seconds | ~0.2-0.5 seconds |
| Extended thinking | Yes | Yes | No |
| Tool use | Yes | Yes | Yes |
| Vision (image input) | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes |
| Prompt caching | Yes (90% discount) | Yes (90% discount) | Yes (90% discount) |
| Batch API | Yes (50% discount) | Yes (50% discount) | Yes (50% discount) |
Start development with Sonnet. Upgrade to Opus only for tasks where quality materially differs. Use Haiku for pre-processing, routing, classification, and any latency-sensitive path. At the prices above, Opus costs 18.75x as much as Haiku on both input ($15.00 vs. $0.80) and output ($75.00 vs. $4.00).
All prices shown are per 1 million tokens (input / output) and were current as of April 2026. Verify the latest rates at anthropic.com/pricing before building cost models.
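The arithmetic behind the comparison fits in a few lines. The sketch below hard-codes the list prices from the table (they may have changed since) and uses a hypothetical helper, `request_cost`, to price a single request and apply the 50% Batch API discount.

```python
# List prices copied from the comparison table, in USD per 1M tokens: (input, output).
# Assumption: these are this document's figures; check current pricing before relying on them.
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.80, 4.00),
}


def request_cost(model: str, in_tokens: int, out_tokens: int, batch: bool = False) -> float:
    """Return the USD cost of one request (hypothetical helper)."""
    price_in, price_out = PRICES[model]
    cost = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    # The Batch API row in the table lists a 50% discount.
    return cost * 0.5 if batch else cost


if __name__ == "__main__":
    for model in PRICES:
        cost = request_cost(model, 1_000, 500)
        print(f"{model:<7} ${cost:.5f} for 1,000 input / 500 output tokens")
    ratio = PRICES["opus"][0] / PRICES["haiku"][0]
    print(f"Opus / Haiku input price ratio: {ratio:.2f}x")
```

Short keys such as `"opus"` stand in for full model IDs to keep the example compact.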
When to Use Each Model — Decision Framework
The most common mistake is using Opus for everything. The second most common mistake is using Haiku for everything. Here is a practical decision framework:
| Use Case | Model | Rationale |
|---|---|---|
| Email classification (complaint/question/praise) | Haiku | Simple classification; Haiku is usually as accurate as Sonnet here and responds several times faster |
| Content moderation / safety screening | Haiku | Binary or low-cardinality decisions at high throughput |
| Routing (which department handles this ticket?) | Haiku | Fast routing enables downstream processing without delay |
| Quick summaries (1-2 sentences) | Haiku | Short outputs don't benefit from Opus's depth |
| Data extraction from structured documents | Haiku or Sonnet | Haiku for simple forms; Sonnet for complex tables |
| Product descriptions, email drafts | Sonnet | Needs nuance and brand voice but not deep reasoning |
| Code generation and editing | Sonnet | Strong coding capability at reasonable cost |
| Code review (routine) | Sonnet | Catches most issues; Opus for security-critical reviews |
| Customer support responses | Sonnet | Needs empathy, accuracy, and context awareness |
| Document analysis (contracts, reports) | Sonnet | Good comprehension of long documents at lower cost |
| Architecture and design decisions | Opus | Multi-factor reasoning with long-term implications |
| Security audits and vulnerability analysis | Opus | Subtle issues require deep, systematic reasoning |
| Complex mathematical or logical proofs | Opus | Extended thinking enables multi-step reasoning chains |
| Campaign copy (high visibility) | Opus | Quality ceiling matters when thousands will read the output |
| Legal document review | Opus | Nuanced interpretation where missed details have real consequences |
Model Selection Flowchart
Is this a simple classification, routing, or yes/no decision?
    YES → Use Haiku
    NO ↓
Does this task require deep multi-step reasoning or extended thinking?
    YES → Use Opus
    NO ↓
Is this high-stakes? (customer-facing, security, legal, or visible to many people)
    YES → Run with Sonnet first. Compare quality with Opus on 5 samples.
        If Opus is measurably better → Use Opus
        If quality is similar → Stay with Sonnet
    NO ↓
Use Sonnet (default for everything else)
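The flowchart maps directly onto a few `if` statements. This is a minimal sketch, assuming each question can be answered with a boolean; `choose_model` and its parameter names are hypothetical, and it returns a tier name rather than a model ID.

```python
def choose_model(
    simple_decision: bool,
    deep_reasoning: bool,
    high_stakes: bool,
    opus_measurably_better: bool = False,
) -> str:
    """Walk the flowchart top to bottom and return "haiku", "sonnet", or "opus"."""
    if simple_decision:
        return "haiku"
    if deep_reasoning:
        return "opus"
    if high_stakes:
        # Result of the side-by-side comparison on a handful of samples.
        return "opus" if opus_measurably_better else "sonnet"
    return "sonnet"  # default for everything else


if __name__ == "__main__":
    print(choose_model(simple_decision=True, deep_reasoning=False, high_stakes=False))
    print(choose_model(simple_decision=False, deep_reasoning=False, high_stakes=True))
```

The order of the checks matters: a simple yes/no decision goes to Haiku even when it is customer-facing, exactly as in the flowchart.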
Cost Modelling — Real Numbers
Here is what ThreadCo's monthly Claude usage looks like with intelligent model routing vs. using Sonnet for everything:
| Task | Volume/Month | Avg Tokens (In/Out) | Routed Model | Routed Cost | All-Sonnet Cost |
|---|---|---|---|---|---|
| Email classification | 2,000 | 400 / 30 | Haiku | $0.88 | $3.30 |
| Product descriptions | 500 | 600 / 150 | Sonnet | $2.03 | $2.03 |
| Code reviews | 200 | 2000 / 500 | Sonnet | $2.70 | $2.70 |
| Customer replies | 500 | 800 / 200 | Sonnet | $2.70 | $2.70 |
| Campaign copy | 8 | 1000 / 800 | Opus | $0.60 | $0.12 |
| Architecture reviews | 4 | 5000 / 2000 | Opus | $0.90 | $0.18 |
| Monthly Total | | | | $9.81 | $11.03 |
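At ThreadCo's scale (small team), the savings are modest: $1.22 per month, about 11%. Cost scales linearly with volume, so the same task mix at 100x volume (a medium company processing 200,000 emails/month) saves about $122 per month. Savings reach thousands per month only at much higher volumes, or when a larger share of traffic is simple enough for Haiku. The principle matters more than the absolute numbers: never pay for reasoning you don't need.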
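The table's arithmetic can be reproduced with a short script. This is a sketch, not ThreadCo's actual tooling: `WORKLOAD`, `monthly_cost`, and `totals` are hypothetical names, and the prices are the list prices quoted earlier in this document. The table rounds each row to the cent, so the script's unrounded totals can differ from it by up to half a cent.

```python
# USD per 1M tokens: (input, output). Assumption: list prices quoted in this document.
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.80, 4.00),
}

# (task, requests per month, avg input tokens, avg output tokens, routed model)
WORKLOAD = [
    ("Email classification", 2000, 400, 30, "haiku"),
    ("Product descriptions", 500, 600, 150, "sonnet"),
    ("Code reviews", 200, 2000, 500, "sonnet"),
    ("Customer replies", 500, 800, 200, "sonnet"),
    ("Campaign copy", 8, 1000, 800, "opus"),
    ("Architecture reviews", 4, 5000, 2000, "opus"),
]


def monthly_cost(volume: int, in_tokens: int, out_tokens: int, model: str) -> float:
    """USD per month for one task type on one model."""
    price_in, price_out = PRICES[model]
    return volume * (in_tokens * price_in + out_tokens * price_out) / 1_000_000


def totals(scale: int = 1) -> tuple[float, float]:
    """Return (routed total, all-Sonnet total) with every volume multiplied by `scale`."""
    routed = sum(monthly_cost(v * scale, i, o, m) for _, v, i, o, m in WORKLOAD)
    all_sonnet = sum(monthly_cost(v * scale, i, o, "sonnet") for _, v, i, o, _ in WORKLOAD)
    return routed, all_sonnet


if __name__ == "__main__":
    routed, all_sonnet = totals()
    saving = all_sonnet - routed
    print(f"Routed: ${routed:.3f}  All-Sonnet: ${all_sonnet:.3f}")
    print(f"Saving: ${saving:.2f} ({saving / all_sonnet:.1%})")
```

The `scale` argument multiplies every volume, which makes it easy to see how the saving grows with traffic for your own task mix.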
Latency Considerations
Cost is not the only reason to choose the right model. Latency directly impacts user experience:
Real-Time Chat (< 1s TTFT)
If your application is a live chatbot where users are waiting for a response, you need Haiku or Sonnet. Opus's 2-5 second time to first token feels sluggish in a chat interface. Haiku's sub-500ms TTFT feels instant.
Background Processing (latency irrelevant)
For batch jobs, nightly reports, or async workflows, latency does not matter. Use the best model for quality and optimise cost with the Batch API (50% discount). There is no reason to sacrifice quality for speed on a task nobody is waiting for.
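For work like this, requests are submitted together instead of one at a time. The sketch below only builds the request list; `build_batch_requests` is a hypothetical helper, and the `custom_id` / `params` shape is an assumption based on the Python SDK's Message Batches interface (`client.messages.batches.create`), so check it against the current SDK documentation.

```python
def build_batch_requests(prompts: dict[str, str], model: str, max_tokens: int) -> list[dict]:
    """Turn {custom_id: prompt} into a list of batch request entries."""
    return [
        {
            "custom_id": custom_id,  # used later to match each result to its input
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for custom_id, prompt in prompts.items()
    ]


if __name__ == "__main__":
    requests = build_batch_requests(
        {"report-001": "Summarise last night's error log.",
         "report-002": "Summarise last night's sales figures."},
        model="claude-sonnet-4-6",
        max_tokens=500,
    )
    print(len(requests), "requests ready")
    # Submission is not executed here; it needs the anthropic package and an API key:
    #   import anthropic
    #   client = anthropic.Anthropic()
    #   batch = client.messages.batches.create(requests=requests)
    #   print(batch.id, batch.processing_status)
```

Batches complete asynchronously, so the calling job has to poll for results and match them back by `custom_id`.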
IDE Assistant (1-3s acceptable)
In Claude Code, users expect a brief pause before responses appear. Sonnet hits this sweet spot. Opus is acceptable for complex tasks where users understand the model is "thinking." Haiku is useful for quick inline completions.
Streaming Mitigates Latency
Streaming returns tokens as they are generated, so the user sees the response building in real time. It does not shorten time to first token, but once the first words arrive the user is reading rather than waiting. With streaming, even Opus feels far more responsive than it would if the full response arrived all at once.
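To see the effect on your own prompts, measure time to first chunk and total time separately. This is a sketch that assumes the Python SDK's streaming helper (`client.messages.stream(...)` used as a context manager, with `text_stream` yielding text chunks); `measure_ttft` is a hypothetical name, and the client is passed in so the function stays independent of how you construct it.

```python
import time


def measure_ttft(client, model: str, prompt: str, max_tokens: int = 300) -> tuple[float, float, str]:
    """Return (seconds to first text chunk, total seconds, full text)."""
    start = time.perf_counter()
    first_chunk_at = None
    chunks = []
    with client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()  # first visible output
            chunks.append(text)
    end = time.perf_counter()
    # If nothing was streamed, report the total time as the time to first chunk.
    ttft = (first_chunk_at if first_chunk_at is not None else end) - start
    return ttft, end - start, "".join(chunks)


# Usage (not executed here; needs the anthropic package and an API key):
#   import anthropic
#   ttft, total, _ = measure_ttft(anthropic.Anthropic(), "claude-sonnet-4-6", "Say hello.")
#   print(f"first chunk after {ttft:.2f}s, complete after {total:.2f}s")
```

The gap between the two numbers is the time the user spends reading a growing response instead of staring at a spinner.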
ShopMate -- Model Router
```python
# shopmate/router.py -- Right model for each ShopMate task
from enum import Enum


class Task(Enum):
    CLASSIFY = "classify"  # sentiment check, category label
    WRITE = "write"        # product descriptions, email replies
    CAMPAIGN = "campaign"  # seasonal campaign copy, Maya reviews


# Model choice + max tokens per task
ROUTING = {
    Task.CLASSIFY: ("claude-haiku-4-5-20251001", 30),   # fastest, cheapest
    Task.WRITE: ("claude-haiku-4-5-20251001", 250),     # good quality, low cost
    Task.CAMPAIGN: ("claude-sonnet-4-6", 800),          # best quality for Maya
}

# Cost per 1 million tokens (input / output); verify at anthropic.com/pricing
PRICES = {
    "claude-haiku-4-5-20251001": (0.80, 4.00),  # $0.80 input / $4.00 output per 1M tokens
    "claude-sonnet-4-6": (3.00, 15.00),         # $3.00 input / $15.00 output per 1M tokens
}

# Rough average prompt size assumed for every task in the estimate below
AVG_INPUT_TOKENS = 400


def route(task: Task) -> tuple[str, int]:
    return ROUTING[task]


def monthly_cost_estimate() -> None:
    volumes = {Task.CLASSIFY: 2000, Task.WRITE: 500, Task.CAMPAIGN: 8}
    total = 0.0
    print(f"{'Task':<15} {'Model':<35} {'Vol':>5} {'Cost':>8}")
    print("-" * 68)
    for task, vol in volumes.items():
        model, max_out = route(task)
        cin, cout = PRICES[model]
        # Upper bound: assumes every reply uses its full max_tokens budget
        cost = (AVG_INPUT_TOKENS / 1e6 * cin + max_out / 1e6 * cout) * vol
        total += cost
        print(f"{task.value:<15} {model:<35} {vol:>5} ${cost:>7.2f}")
    print(f"\nEstimated monthly total: ${total:.2f}")


if __name__ == "__main__":
    monthly_cost_estimate()
```
Advanced — Dynamic Model Routing
For production systems, you can build a dynamic router that uses Haiku to classify the complexity of incoming requests and routes them to the appropriate model automatically:
```python
# shopmate/dynamic_router.py -- Haiku classifies, then routes to the right model
import anthropic

client = anthropic.Anthropic()


def classify_complexity(user_message: str) -> str:
    """Use Haiku to classify whether a request needs Haiku, Sonnet, or Opus."""
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        system="""Classify the complexity of this user request.
Reply with ONLY one word: haiku, sonnet, or opus.
- haiku: simple lookup, yes/no, classification
- sonnet: writing, coding, analysis, most tasks
- opus: complex reasoning, architecture, security, multi-step logic""",
        messages=[{"role": "user", "content": user_message}],
    )
    return resp.content[0].text.strip().lower()


MODEL_MAP = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-6",
    "opus": "claude-opus-4-6",
}


def smart_route(user_message: str) -> str:
    """Route a message to the appropriate model based on complexity."""
    complexity = classify_complexity(user_message)
    model = MODEL_MAP.get(complexity, "claude-sonnet-4-6")  # default to Sonnet
    print(f"Routing to {model} (classified as: {complexity})")
    resp = client.messages.create(
        model=model,
        max_tokens=1000,
        messages=[{"role": "user", "content": user_message}],
    )
    return resp.content[0].text


# The Haiku classification call costs fractions of a cent
# but saves dollars when it routes simple requests away from Opus
```
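At scale, the Haiku classification call adds a few hundred milliseconds of latency but can reduce total cost by 40-60%, depending on the traffic mix. The key insight: the classification itself is nearly free. Ten output tokens from Haiku cost $0.00004 at the listed rate, and assuming about 100 input tokens for the system prompt and a short message adds $0.00008, so each call costs on the order of $0.0001. Even if Haiku "over-routes" 20% of requests to a more expensive model, the savings on the other 80% more than compensate.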
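The router above falls back to Sonnet whenever the classifier's reply is not exactly one of the three labels, which includes harmless variations such as "Sonnet." with a trailing full stop. A slightly more tolerant parser is sketched below; `parse_label` is a hypothetical helper, not part of the original router.

```python
import re

VALID_LABELS = ("haiku", "sonnet", "opus")
_LABEL_RE = re.compile(r"\b(" + "|".join(VALID_LABELS) + r")\b")


def parse_label(raw: str, default: str = "sonnet") -> str:
    """Return the first known tier name found in the classifier's reply."""
    match = _LABEL_RE.search(raw.lower())
    # Anything unrecognised falls back to the default tier, as in the router above.
    return match.group(1) if match else default


if __name__ == "__main__":
    for reply in ["opus", "Sonnet.", "  HAIKU\n", "not sure"]:
        print(repr(reply), "->", parse_label(reply))
```

In `smart_route`, this would replace the bare `.strip().lower()` result before the `MODEL_MAP` lookup. Logging every reply that hits the default is a cheap way to spot a drifting classification prompt.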
Context Window and Output Length Differences
While all Claude 4 models share the same 200K input context window, their maximum output lengths differ — and this affects which tasks each model can handle:
Opus: 32K Output
Opus can generate up to 32,000 tokens in a single response — roughly 24,000 words. This is enough for entire documents, comprehensive analysis reports, or large code files. Use Opus when you need long, detailed output that maintains quality throughout.
Sonnet: 16K Output
Sonnet generates up to 16,000 tokens — roughly 12,000 words. This is sufficient for most tasks: code files, email drafts, analysis summaries, and documentation. If you need longer output, split the request into sequential parts.
Haiku: 8K Output
Haiku generates up to 8,000 tokens — roughly 6,000 words. This is intentionally limited to match its use case: quick classifications, short summaries, and concise responses. If your task routinely needs more than 8K tokens of output, upgrade to Sonnet.
If Claude's response reaches the max_tokens limit, it is cut off mid-sentence and the API response reports a stop_reason of "max_tokens". This is a common source of broken JSON and incomplete code. Set max_tokens high enough for your expected output; a higher limit does not increase cost, because you pay only for tokens actually generated. If output is truncated, ask Claude to continue from where it stopped.
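Truncation can be detected and handled in code. The sketch below assumes the Messages API reports `stop_reason == "max_tokens"` on a truncated reply, and asks for a continuation by sending the partial answer back followed by a short instruction. `complete_with_continuation`, the wording of the follow-up message, and the `max_rounds` cap are all illustrative choices.

```python
CONTINUE_PROMPT = "Continue exactly where you stopped. Do not repeat anything."


def complete_with_continuation(
    client, model: str, prompt: str, max_tokens: int = 1000, max_rounds: int = 3
) -> tuple[str, bool]:
    """Return (text, still_truncated), re-asking while the reply hits max_tokens."""
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = client.messages.create(model=model, max_tokens=max_tokens, messages=messages)
        text = resp.content[0].text
        parts.append(text)
        if resp.stop_reason != "max_tokens":
            return "".join(parts), False  # finished normally
        # Feed the partial answer back and ask for the rest.
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": CONTINUE_PROMPT})
    return "".join(parts), True  # gave up after max_rounds


# Usage (not executed here; needs the anthropic package and an API key):
#   import anthropic
#   text, truncated = complete_with_continuation(
#       anthropic.Anthropic(), "claude-sonnet-4-6", "Write the full report.", max_tokens=4000)
```

Stitched output can still have a rough seam, so validate structured results (for example with `json.loads`) before using them.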
Benchmarks vs Real-World Performance
Published benchmarks (MMLU, HumanEval, MATH, etc.) are useful for comparing models at a high level, but they do not always predict performance on your specific tasks. Here is how to think about benchmarks practically:
| Benchmark | What It Measures | Real-World Relevance |
|---|---|---|
| HumanEval / SWE-bench | Code generation and bug fixing | High — directly maps to coding assistance tasks |
| MMLU | Broad knowledge across 57 subjects | Moderate — shows general knowledge but not task-specific skill |
| MATH | Mathematical problem solving | Moderate — relevant for quantitative analysis tasks |
| Agentic benchmarks (SWE-bench Verified, TAU-bench) | Multi-step autonomous task completion | High — directly maps to Claude Code's agentic capabilities |
| Long context benchmarks | Recall and reasoning over long documents | High — relevant for document analysis and codebase review |
The most reliable way to choose a model for your use case is to test it yourself: run 20 representative examples through each model, score the outputs, and compare. Your specific data, prompts, and quality standards matter more than any benchmark.
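One way to turn those scores into a decision is sketched below, under an assumption the text does not state: prefer the cheapest model whose average score is within a chosen tolerance of the best average. The function name, the tolerance value, and the sample scores are all hypothetical.

```python
from statistics import mean


def cheapest_adequate_model(
    scores: dict[str, list[float]], cost_order: list[str], tolerance: float = 0.5
) -> str:
    """Pick the cheapest model whose mean score is within `tolerance` of the best mean.

    `cost_order` lists model names from cheapest to most expensive.
    """
    means = {model: mean(values) for model, values in scores.items()}
    best = max(means.values())
    for model in cost_order:
        if model in means and means[model] >= best - tolerance:
            return model
    raise ValueError("cost_order does not include any scored model that qualifies")


if __name__ == "__main__":
    # Made-up 1-10 scores for three examples per model, for illustration only.
    sample = {"haiku": [6, 7, 6], "sonnet": [8, 9, 8], "opus": [9, 9, 8]}
    print(cheapest_adequate_model(sample, ["haiku", "sonnet", "opus"]))
```

A mean hides failures on individual examples, so it is worth reading the lowest-scoring outputs for each model before committing to a choice.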
Model Versioning and Updates
Anthropic releases new model versions periodically. Understanding the versioning scheme prevents surprises in production:
Pinned vs. Latest
Model IDs like claude-sonnet-4-6 point to the latest release. For production stability, use the date-pinned version (e.g., claude-haiku-4-5-20251001) so your system behaviour doesn't change when Anthropic releases an update.
Testing New Versions
When a new model version is released, test it against your existing prompts before switching. Run your test suite with both the old and new model versions and compare output quality. Some prompts may need adjustment for the new version.
Deprecation Notices
Anthropic provides advance notice before deprecating model versions. Subscribe to the Anthropic changelog and set calendar reminders to migrate before the retirement date. A deprecated model keeps working until that date; once it is retired, your API calls will fail.
A/B Testing Models
For critical workflows, run A/B tests between model versions. Route 10% of traffic to the new model, compare quality metrics, then gradually increase if results are good. This is the safest way to upgrade in production.
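A traffic split like this is usually made deterministic, so that the same user always sees the same model during the test. The sketch below hashes a user ID into a stable bucket; `ab_bucket` and the idea of keying on a user ID are assumptions for illustration, and mapping buckets to concrete model IDs is left to your configuration.

```python
import hashlib


def ab_bucket(user_id: str, new_share: float = 0.10) -> str:
    """Return "new" for roughly `new_share` of users and "current" for the rest."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a position in [0, 1).
    position = int.from_bytes(digest[:4], "big") / 2**32
    return "new" if position < new_share else "current"


if __name__ == "__main__":
    sample = [f"user-{i}" for i in range(10_000)]
    share = sum(ab_bucket(u) == "new" for u in sample) / len(sample)
    print(f"{share:.1%} of sample users routed to the new model")
```

Raising `new_share` keeps every user who was already in the "new" bucket there, which makes a gradual ramp-up from 10% straightforward.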
Hands-On Exercises
Pick a task your team does daily (e.g., writing a customer email reply). Run the same prompt through Haiku, Sonnet, and Opus. Score each output 1-10 for quality. Is the quality difference between Sonnet and Opus worth the cost difference? Document your findings for your team's model routing table.
List your team's top 10 Claude use cases. For each, estimate: requests per month, average input tokens, and average output tokens. Calculate the monthly cost using (a) Sonnet for everything, (b) intelligent routing with all three models. What is the percentage savings?
Create a simple routing table for your project: a spreadsheet or markdown table mapping each task type to a model. Include columns for: task name, model, max_tokens, expected quality (1-10), and cost per request. Review it monthly and adjust based on actual usage data.
Using the Claude Code chat panel, time how long each model takes to respond to the same prompt. Use a stopwatch or the API's response headers. Create a simple benchmark: (a) short classification task, (b) medium writing task, (c) complex reasoning task. Plot the results. How does latency change with output length?
Using the dynamic_router.py code above as a starting point, test it with 20 different user messages ranging from simple ("Is this a complaint?") to complex ("Review this microservices architecture for single points of failure"). Check whether Haiku classifies the complexity correctly at least 80% of the time. Adjust the classification prompt if needed.