Advanced Techniques | AI Training Hub

Claude Track

Module 10

Good Wasn't Good Enough: Maya's team could get passable descriptions from Claude, but not great ones. Adding a reasoning step ("identify the key selling point first, then write") and using the conversation as a refinement loop — "make it 20% shorter", "remove the passive voice" — pushed quality from acceptable to on-brand.

Advanced Techniques in Claude Code

These techniques separate good results from exceptional ones. The prompting techniques (chain of thought, iterative refinement, task decomposition, and CLAUDE.md instructions) work directly in the Claude Code chat panel in VS Code, with no code required. The module then covers API-level features for teams building production integrations: streaming, batching, prompt caching, extended thinking, and parallel tool use.

Chain of Thought

Add "Think step by step before answering" to any complex request. Claude reasons through the problem out loud, which surfaces assumptions you can catch and correct before the final output appears. This is the single highest-impact technique for reasoning-heavy tasks.

Iterative Refinement

Treat the conversation as a drafting loop. After Claude's first response, follow up: "Make it 20% shorter." "Remove all passive voice." "Add a specific example." Each turn tightens the result. 2-3 refinement turns typically produce better output than a single elaborate prompt.

Task Decomposition

Break large tasks into steps across messages: "Step 1: list the issues in @report.md." Review, then: "Step 2: draft fixes for the top 3." Errors in step 1 don't corrupt step 2, and you can redirect between steps.

Custom Instructions via CLAUDE.md

Claude Code in VS Code reads a CLAUDE.md file at the root of your project. Put standing instructions here (tone, conventions, banned phrases) so every session starts with the right context automatically. This is your single most important setup step.

Streaming Responses

Streaming returns Claude's response token by token as it is generated, rather than waiting for the complete response. This is critical for user-facing applications where perceived responsiveness matters.

Python — Streaming with the Anthropic SDK

import anthropic
client = anthropic.Anthropic()

# Basic streaming — tokens arrive as they're generated
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a product description for the Sunset Gradient Tee."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Streaming with event handling — more control
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyse ThreadCo's Q1 sales data."}]
) as stream:
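    for event in stream:
        # Only text deltas carry .text; other delta types (e.g. tool input JSON) do not
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
        elif event.type == "message_stop":
            print("\n\n[Stream complete]")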

    # Access the final message after streaming
    final = stream.get_final_message()
    print(f"Total tokens: {final.usage.input_tokens} in, {final.usage.output_tokens} out")

Approach	When to Use	Latency Feel
Non-streaming	Background processing, batch jobs, when you need the complete response before acting	User waits until full response is ready
Streaming	Chat interfaces, real-time applications, any user-facing interaction	First tokens appear in < 1 second, feels responsive
Streaming + tool use	Agentic applications where Claude uses tools mid-response	Text streams, pauses for tool execution, resumes
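
The "Latency Feel" column can be measured instead of guessed. The sketch below times any iterator of text chunks and reports time to first chunk against total time. `measure_stream` and `fake_stream` are hypothetical names introduced for illustration; `fake_stream` stands in for `stream.text_stream` so the example runs without an API key.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume text chunks and return (full_text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            # Record the moment the first chunk arrives
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        # Empty stream: nothing arrived, so report the total time instead
        first_chunk_at = total
    return "".join(parts), first_chunk_at, total


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for stream.text_stream, so the sketch runs offline."""
    for word in ["Soft ", "organic ", "cotton."]:
        time.sleep(0.05)  # simulated generation delay per chunk
        yield word


text, ttft, total = measure_stream(fake_stream())
print(f"First chunk after {ttft:.2f}s, complete after {total:.2f}s: {text!r}")
```

With a real request, passing `stream.text_stream` in place of `fake_stream()` inside the `with client.messages.stream(...)` block should give the same two timings.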

Prompt Caching

Prompt caching dramatically reduces cost when you send the same system prompt or context with every request. Cached tokens cost 90% less than regular input tokens after the first request.

Python — Prompt Caching

import anthropic
client = anthropic.Anthropic()

# The system prompt and brand guide are the same for every product description request
# Marking them with cache_control means they're cached after the first call
BRAND_GUIDE = """ThreadCo Brand Voice Guide

Tone: Friendly, direct, slightly playful. Never corporate.
Sustainability is a feature, not a badge — mention it naturally.
Forbidden words: vibrant, perfect, stylish, must-have, luxurious
No exclamation marks in product copy.
Always mention the material (organic cotton, recycled polyester, etc.)
Always mention where it's made (Portugal, Turkey, etc.)
Two sentences maximum per product description.

... (imagine this is 2,000 words of detailed brand guidelines) ..."""

def describe_product(product_name: str, details: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=150,
        system=[
            {
                "type": "text",
                "text": "You are a ThreadCo copywriter."
            },
            {
                "type": "text",
                "text": BRAND_GUIDE,
                "cache_control": {"type": "ephemeral"}  # Cache this block
            }
        ],
        messages=[{"role": "user", "content": f"Write a product description for {product_name}. Details: {details}"}]
    )
    return resp.content[0].text

# First call: full price (brand guide is processed and cached)
describe_product("Sunset Gradient Tee", "Organic cotton, made in Portugal, amber-to-rose gradient")

# Second call: 90% cheaper for the cached brand guide portion
describe_product("Midnight Pocket Tee", "Organic cotton, made in Portugal, navy, single chest pocket")

# Across 2,000 product descriptions, the saving scales with the size of the
# cached block and the model's input price

Caching Requirements

To be cached, the prompt up to and including the marked block must be at least 1,024 tokens (for Claude Sonnet/Opus) or 2,048 tokens (for Haiku); check the current documentation, because minimums vary by model version. Short system prompts won't benefit from caching. The cache has a 5-minute TTL by default: a request within 5 minutes gets a cache hit and refreshes the timer. After 5 minutes of inactivity, the cache expires and the next request pays full price again.
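
A rough way to check whether caching pays off is to compare the input cost of the repeated block with and without it. The sketch below assumes a single cache write followed by cache reads that all land inside the TTL. The 1.25x write surcharge and the price passed in are assumptions to verify against current pricing, and `caching_cost` is a hypothetical helper, not part of the SDK.

```python
def caching_cost(
    cached_tokens: int,
    requests: int,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumption: surcharge for writing to the cache
    read_multiplier: float = 0.10,   # the 90% discount on cached tokens
) -> dict:
    """Estimate input cost of one repeated block, with and without caching.

    Assumes every request after the first arrives within the cache TTL,
    so there is one cache write and (requests - 1) cache reads.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    per_token = price_per_mtok / 1_000_000
    without = cached_tokens * requests * per_token
    with_cache = cached_tokens * per_token * (
        write_multiplier + read_multiplier * (requests - 1)
    )
    return {"without": without, "with": with_cache, "saved": without - with_cache}


# Hypothetical inputs: a 3,000-token guide, 2,000 requests, 3.00 per million input tokens
estimate = caching_cost(cached_tokens=3_000, requests=2_000, price_per_mtok=3.00)
print({name: round(value, 2) for name, value in estimate.items()})
```

Under these assumptions a block sent only once costs slightly more with caching than without, which is why caching only pays off for content that is genuinely reused within the TTL.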

Batch API — 50% Cost Savings

The Batch API processes multiple requests asynchronously at 50% of the standard price. Ideal for any workload that doesn't need real-time results.

Python — Batch API for Product Descriptions

import anthropic
client = anthropic.Anthropic()

# Prepare batch requests — one per product
products = [
    {"name": "Sunset Gradient Tee", "details": "Organic cotton, amber-to-rose gradient"},
    {"name": "Midnight Pocket Tee", "details": "Organic cotton, navy, chest pocket"},
    {"name": "Wave Print Crop Tee", "details": "Recycled cotton blend, wave pattern"},
    # ... imagine 2,000 products
]

requests = []
for i, product in enumerate(products):
    requests.append({
        "custom_id": f"product-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 150,
            "messages": [{
                "role": "user",
                "content": f"Write a 2-sentence ThreadCo product description for {product['name']}. {product['details']}. Mention sustainability. No exclamation marks."
            }]
        }
    })

# Submit the batch (processed asynchronously)
batch = client.messages.batches.create(requests=requests)
print(f"Batch {batch.id} submitted with {len(requests)} requests")
print(f"Status: {batch.processing_status}")

# Check status later (or poll)
# batch = client.messages.batches.retrieve(batch.id)
# Once processing_status == "ended", iterate over the results:
# for entry in client.messages.batches.results(batch.id):
#     print(entry.custom_id, entry.result.type)

Feature	Standard API	Batch API
Pricing	Full price	50% discount on all tokens
Response time	Seconds (real-time)	Up to 24 hours (usually faster)
Best for	Interactive applications, chatbots	Bulk processing, nightly reports, content generation at scale
Max requests per batch	N/A	100,000
Streaming	Yes	No (results returned when complete)
Tool use	Yes	Yes
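
Two bookkeeping steps tend to come up around the batch limit in the table: splitting an oversized request list, and matching results back to products by `custom_id` (results are not guaranteed to come back in submission order). A minimal sketch follows; `chunk_requests` and `index_by_custom_id` are hypothetical helper names, not SDK functions.

```python
from typing import Dict, Iterator, List

MAX_REQUESTS_PER_BATCH = 100_000  # limit quoted in the table above


def chunk_requests(
    requests: List[dict], size: int = MAX_REQUESTS_PER_BATCH
) -> Iterator[List[dict]]:
    """Yield consecutive slices of the request list, each small enough for one batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(requests), size):
        yield requests[start:start + size]


def index_by_custom_id(requests: List[dict]) -> Dict[str, dict]:
    """Map custom_id to its request so each result can be matched to its product."""
    index: Dict[str, dict] = {}
    for request in requests:
        custom_id = request["custom_id"]
        if custom_id in index:
            raise ValueError(f"duplicate custom_id: {custom_id}")
        index[custom_id] = request
    return index


# Small demonstration with a tiny chunk size so the split is visible
demo = [{"custom_id": f"product-{i}", "params": {}} for i in range(5)]
print([len(chunk) for chunk in chunk_requests(demo, size=2)])
print(sorted(index_by_custom_id(demo)))
```

Each chunk would then be submitted as its own batch, and the index consulted when reading results.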

Extended Thinking

Extended thinking lets Claude reason before responding. The model writes its reasoning into separate "thinking" blocks, where it can work through complex problems, explore multiple approaches, and self-correct, and then delivers a polished final response.

Python — Extended Thinking

import anthropic
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # How many tokens Claude can spend thinking
    },
    messages=[{
        "role": "user",
        "content": """Review ThreadCo's pricing strategy:
- Current: £28 average price, 42% margin, 500 orders/week
- Competitors: £25-£45 range
- Customer feedback: "good value" but some price sensitivity
- Goal: Increase margin to 50% without losing more than 10% volume

Propose three strategies, model the financial impact of each,
and recommend one with justification."""
    }]
)

# The response contains both thinking blocks and text blocks
for block in resp.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars]")
        # Optionally inspect the thinking for debugging
    elif block.type == "text":
        print(block.text)

When to Use Extended Thinking

Extended thinking costs more (thinking tokens are billed as output tokens) and is slower. Use it for: complex mathematical analysis, multi-constraint optimisation, security vulnerability analysis, architectural decision-making, and any problem requiring 10+ logical steps. Do NOT use it for simple classification, short writing tasks, or straightforward coding, where the overhead is not justified.
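
Because the thinking budget comes out of `max_tokens`, it helps to validate the two numbers together before sending a request. The sketch below assumes the budget must be at least 1,024 tokens and strictly below `max_tokens`; treat both limits as assumptions to confirm against the current API documentation. `thinking_request_params` is a hypothetical helper.

```python
MIN_THINKING_BUDGET = 1024  # assumption: smallest budget the API accepts


def thinking_request_params(max_tokens: int, budget_tokens: int) -> dict:
    """Build the max_tokens/thinking arguments, rejecting inconsistent values."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if budget_tokens >= max_tokens:
        # The visible answer needs room too, so the budget must leave some tokens over
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


params = thinking_request_params(max_tokens=16000, budget_tokens=10000)
answer_room = params["max_tokens"] - params["thinking"]["budget_tokens"]
print(f"Tokens left for the visible answer: {answer_room}")
```

The returned dictionary can be unpacked into the call, for example `client.messages.create(model=..., messages=..., **params)`.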

Parallel Tool Use

Claude can request multiple tool calls in a single response. When tools are independent (e.g., checking stock for 3 different SKUs), you can execute them in parallel for faster results.

Python — Handling Parallel Tool Calls

import asyncio
import json  # needed for json.dumps when building tool_result blocks

import anthropic
client = anthropic.Anthropic()

async def execute_tool_async(name: str, input: dict) -> dict:
    """Execute a tool call asynchronously."""
    # In production, this would be an async database query or API call
    await asyncio.sleep(0.1)  # Simulate network latency
    return {"sku": input["sku"], "in_stock": 23}

async def handle_parallel_tools(response) -> list:
    """Execute all tool calls from a response in parallel."""
    tool_calls = [b for b in response.content if b.type == "tool_use"]

    # Run all tool calls concurrently
    tasks = [execute_tool_async(tc.name, tc.input) for tc in tool_calls]
    results = await asyncio.gather(*tasks)

    # Map results back to tool_use_ids
    return [
        {"type": "tool_result", "tool_use_id": tc.id, "content": json.dumps(result)}
        for tc, result in zip(tool_calls, results)
    ]

# When Claude calls 3 stock checks at once, they all execute simultaneously
# Total latency = max(individual latency), not sum(individual latencies)
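
The latency claim can be checked offline with a simulated tool. In this sketch `check_stock` is a hypothetical stand-in for a real stock lookup, and the 0.1-second delay is an arbitrary value chosen to make the difference visible.

```python
import asyncio
import time


async def check_stock(sku: str) -> dict:
    """Hypothetical tool: pretend to look up stock with 0.1 s of latency."""
    await asyncio.sleep(0.1)
    return {"sku": sku, "in_stock": 23}


async def run_sequential(skus):
    """Await each lookup before starting the next one."""
    return [await check_stock(sku) for sku in skus]


async def run_parallel(skus):
    """Start every lookup at once; gather returns results in input order."""
    return list(await asyncio.gather(*(check_stock(sku) for sku in skus)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


skus = ["TEE-001", "TEE-002", "TEE-003"]
seq_result, seq_time = timed(run_sequential(skus))
par_result, par_time = timed(run_parallel(skus))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both versions return the same results in the same order, which is what makes the `zip(tool_calls, results)` mapping in the listing above safe.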

Using CLAUDE.md for Standing Instructions

CLAUDE.md — ThreadCo project instructions

# ThreadCo — Claude Standing Instructions

## Brand Voice
- Friendly, direct, slightly playful. Never corporate.
- Sustainability is a feature, not a badge. Mention it naturally.
- Forbidden words: vibrant, perfect, stylish, must-have, luxurious
- No exclamation marks in product copy.

## Code Conventions
- TypeScript everywhere. Prefer `const` over `let`.
- All async functions must handle errors explicitly.
- Test files live in __tests__/ next to the source file.

## Review Checklist
When reviewing any pull request, always check:
1. Does it have tests?
2. Are error states handled?
3. Does the UI copy match the brand voice above?

Refinement Loop in Practice

Turn	What you type	What changes
1	Write a product description for the Sunset Gradient Tee (@sunset-tee.md)	First draft — correct but generic
2	Good start. Make it exactly 2 sentences. Cut any adjectives that don't describe a physical property.	Tighter, more concrete
3	The second sentence is too functional. Rewrite it to evoke the feeling of wearing it, not just the spec.	Sensory, brand-authentic
4	Perfect. Now apply the same approach to @midnight-tee.md and @wave-crop.md.	Batch — Claude uses the refined style as the implicit example

The Refinement Principle: Claude's first response is a starting point, not a final product. Plan for 2-3 turns of refinement. Each turn should give one specific, actionable instruction, not a vague "make it better." Specific refinement instructions produce dramatically better results than rewriting the entire prompt.
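
For teams driving the same loop through the API rather than the chat panel, refinement amounts to a message list that grows by one instruction per turn. The sketch below only builds that list. `add_turn` and `next_request` are hypothetical helpers, and the assistant replies are placeholders rather than real model output.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_turn(history: List[Message], instruction: str, reply: str) -> List[Message]:
    """Return a new history with one user instruction and the reply it produced."""
    return history + [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": reply},
    ]


def next_request(history: List[Message], instruction: str) -> List[Message]:
    """Messages for the next turn: everything so far plus one new instruction."""
    return history + [{"role": "user", "content": instruction}]


history: List[Message] = []
history = add_turn(
    history,
    "Write a product description for the Sunset Gradient Tee.",
    "<first draft placeholder>",
)
history = add_turn(history, "Make it exactly 2 sentences.", "<second draft placeholder>")

# One specific instruction per turn, sent together with the full history
messages = next_request(history, "Rewrite the second sentence to evoke how it feels to wear.")
print([m["role"] for m in messages])
```

The resulting `messages` list is what would be passed as the `messages` argument on the next call, so Claude sees every earlier draft and instruction.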
    

Cost Savings Summary — All Techniques

Technique	Savings	Best For	Implementation Effort
Model routing (Haiku/Sonnet/Opus)	40-80% vs all-Opus	Mixed workloads with varying complexity	Low — just a routing function
Prompt caching	90% on cached tokens	Repeated system prompts, brand guidelines	Low — add cache_control to API calls
Batch API	50% on all tokens	Non-urgent bulk processing	Medium — async processing pipeline
Shorter prompts	Variable (10-30%)	Everything — fewer input tokens = lower cost	Low — prompt engineering
Lower max_tokens	Prevents waste	Tasks where you know the output length	Low — set per task type
Combined: routing + caching + batching	Up to 95% vs naive approach	Production systems at scale	Medium-High — architecture work
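
The "routing function" in the first row can be as small as the sketch below. The task categories, the 4,000-character cutoff, and the tier labels are all hypothetical; a real router would map tiers to the model IDs your account uses and tune the rules against your own workload.

```python
# Hypothetical task categories; adjust to the work your team actually sends
SIMPLE_TASKS = {"classification", "extraction", "short_copy"}
COMPLEX_TASKS = {"pricing_analysis", "architecture_review"}


def route_model(task_type: str, input_chars: int) -> str:
    """Pick a model tier ("haiku", "sonnet" or "opus") from simple task properties."""
    if task_type in COMPLEX_TASKS:
        return "opus"  # reasoning-heavy work goes to the largest tier
    if task_type in SIMPLE_TASKS and input_chars < 4_000:
        return "haiku"  # short, simple tasks go to the cheapest tier
    return "sonnet"  # everything else takes the middle tier


for task, size in [("short_copy", 300), ("pricing_analysis", 1_200), ("summary", 9_000)]:
    print(task, "->", route_model(task, size))
```

Defaulting to the middle tier keeps unknown task types from landing on either the cheapest or the most expensive model by accident.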

Slash Commands and Workflow Shortcuts

Command	What It Does	When to Use
/clear	Resets the conversation — clears all context	When switching to an unrelated task
/compact	Summarises conversation history to free context space	When a long session starts degrading in quality
/review	Triggers a code review on the current file	Quick review before committing
/ (just a slash)	Shows all available commands	To discover what's available

CLAUDE.md vs .windsurfrules — Not the Same File

CLAUDE.md is the standing-instructions file for Claude Code in VS Code. If you also use Windsurf, that IDE uses a separate file called .windsurfrules — same concept, different tool, different filename. Do not confuse them or use one in place of the other.

Hands-On Exercises

Exercise 1 — Refinement Loop

Take a writing task (product description, email, documentation). Get Claude's first draft. Then refine it across exactly 3 turns, each with one specific instruction. Compare the final version to the first draft. How much better is it? Try with 5 turns — is there a point of diminishing returns?

Exercise 2 — Streaming Implementation

Using the streaming code from this module, build a simple Python script that streams Claude's response to the terminal. Measure the time to first token (TTFT) vs the total response time. Then modify it to use non-streaming and compare the user experience. How much faster does streaming feel?

Exercise 3 — Prompt Caching ROI

Identify the longest system prompt or context block you send repeatedly. Calculate: (a) its token count, (b) how many requests use it per day, (c) the cost without caching, (d) the cost with caching. What is your monthly savings? Is the block over 1,024 tokens (the minimum for caching)?

Exercise 4 — Batch vs Real-Time

Take a batch task (e.g., 50 product descriptions). Process them two ways: (a) 50 individual API calls, (b) one batch API submission. Compare: total cost, total wall-clock time, and output quality. Is the quality identical? Note: the Batch API may take up to 24 hours, so plan ahead.

Exercise 5 — Extended Thinking Experiment

Take a complex business problem (pricing analysis, architecture decision, or risk assessment). Run it twice with Claude Opus: (a) without extended thinking, (b) with extended thinking (budget: 10,000 tokens). Compare the depth, accuracy, and nuance of the two responses. Inspect the thinking block — does it show useful reasoning that improved the final answer?
