Advanced Techniques in Claude Code
These techniques separate good results from exceptional ones. All of them work directly in the Claude Code chat panel in VS Code — no code required. This module also covers API-level advanced features: streaming, batching, prompt caching, extended thinking, and parallel tool use for teams building production integrations.
Chain of Thought
Add "Think step by step before answering" to any complex request. Claude reasons through the problem out loud, which surfaces assumptions you can catch and correct before the final output appears. This is the single highest-impact technique for reasoning-heavy tasks.
Iterative Refinement
Treat the conversation as a drafting loop. After Claude's first response, follow up: "Make it 20% shorter." "Remove all passive voice." "Add a specific example." Each turn tightens the result. 2-3 refinement turns typically produce better output than a single elaborate prompt.
Task Decomposition
Break large tasks into steps across messages: "Step 1: list the issues in @report.md." Review, then: "Step 2: draft fixes for the top 3." Errors in step 1 don't corrupt step 2, and you can redirect between steps.
Custom Instructions via CLAUDE.md
VS Code's Claude Code supports a CLAUDE.md file at the root of your project. Put standing instructions here — tone, conventions, banned phrases — so every session starts with the right context automatically. This is your single most important setup step.
Streaming Responses
Streaming returns Claude's response token by token as it is generated, rather than waiting for the complete response. This is critical for user-facing applications where perceived responsiveness matters.
import anthropic client = anthropic.Anthropic() # Basic streaming — tokens arrive as they're generated with client.messages.stream( model="claude-sonnet-4-6", max_tokens=1024, messages=[{"role": "user", "content": "Write a product description for the Sunset Gradient Tee."}] ) as stream: for text in stream.text_stream: print(text, end="", flush=True) # Streaming with event handling — more control with client.messages.stream( model="claude-sonnet-4-6", max_tokens=1024, messages=[{"role": "user", "content": "Analyse ThreadCo's Q1 sales data."}] ) as stream: for event in stream: if event.type == "content_block_delta": print(event.delta.text, end="", flush=True) elif event.type == "message_stop": print(" [Stream complete]") # Access the final message after streaming final = stream.get_final_message() print(f"Total tokens: {final.usage.input_tokens} in, {final.usage.output_tokens} out")
| Approach | When to Use | Latency Feel |
|---|---|---|
| Non-streaming | Background processing, batch jobs, when you need the complete response before acting | User waits until full response is ready |
| Streaming | Chat interfaces, real-time applications, any user-facing interaction | First tokens appear in < 1 second, feels responsive |
| Streaming + tool use | Agentic applications where Claude uses tools mid-response | Text streams, pauses for tool execution, resumes |
Prompt Caching
Prompt caching dramatically reduces cost when you send the same system prompt or context with every request. Cached tokens cost 90% less than regular input tokens after the first request.
import anthropic client = anthropic.Anthropic() # The system prompt and brand guide are the same for every product description request # Marking them with cache_control means they're cached after the first call BRAND_GUIDE = """ThreadCo Brand Voice Guide Tone: Friendly, direct, slightly playful. Never corporate. Sustainability is a feature, not a badge — mention it naturally. Forbidden words: vibrant, perfect, stylish, must-have, luxurious No exclamation marks in product copy. Always mention the material (organic cotton, recycled polyester, etc.) Always mention where it's made (Portugal, Turkey, etc.) Two sentences maximum per product description. ... (imagine this is 2,000 words of detailed brand guidelines) ...""" def describe_product(product_name: str, details: str) -> str: resp = client.messages.create( model="claude-sonnet-4-6", max_tokens=150, system=[ { "type": "text", "text": "You are a ThreadCo copywriter." }, { "type": "text", "text": BRAND_GUIDE, "cache_control": {"type": "ephemeral"} # Cache this block } ], messages=[{"role": "user", "content": f"Write a product description for {product_name}. Details: {details}"}] ) return resp.content[0].text # First call: full price (brand guide is processed and cached) describe_product("Sunset Gradient Tee", "Organic cotton, made in Portugal, amber-to-rose gradient") # Second call: 90% cheaper for the cached brand guide portion describe_product("Midnight Pocket Tee", "Organic cotton, made in Portugal, navy, single chest pocket") # At 2,000 product descriptions, prompt caching saves ~$50+ per batch
For a block to be cached, it must be at least 1,024 tokens (for Claude Sonnet/Opus) or 2,048 tokens (for Haiku). Short system prompts won't benefit from caching. The cache has a 5-minute TTL — if you send another request within 5 minutes, the cache hit applies. After 5 minutes of inactivity, the cache expires and the next request pays full price again.
Batch API — 50% Cost Savings
The Batch API processes multiple requests asynchronously at 50% of the standard price. Ideal for any workload that doesn't need real-time results.
import anthropic client = anthropic.Anthropic() # Prepare batch requests — one per product products = [ {"name": "Sunset Gradient Tee", "details": "Organic cotton, amber-to-rose gradient"}, {"name": "Midnight Pocket Tee", "details": "Organic cotton, navy, chest pocket"}, {"name": "Wave Print Crop Tee", "details": "Recycled cotton blend, wave pattern"}, # ... imagine 2,000 products ] requests = [] for i, product in enumerate(products): requests.append({ "custom_id": f"product-{i}", "params": { "model": "claude-haiku-4-5-20251001", "max_tokens": 150, "messages": [{ "role": "user", "content": f"Write a 2-sentence ThreadCo product description for {product['name']}. {product['details']}. Mention sustainability. No exclamation marks." }] } }) # Submit the batch — processes asynchronously batch = client.batches.create(requests=requests) print(f"Batch {batch.id} submitted with {len(requests)} requests") print(f"Status: {batch.processing_status}") # Check status later (or poll) # batch = client.batches.retrieve(batch.id) # Results available via batch.results_url when processing_status == "ended"
| Feature | Standard API | Batch API |
|---|---|---|
| Pricing | Full price | 50% discount on all tokens |
| Response time | Seconds (real-time) | Up to 24 hours (usually faster) |
| Best for | Interactive applications, chatbots | Bulk processing, nightly reports, content generation at scale |
| Max requests per batch | N/A | 100,000 |
| Streaming | Yes | No (results returned when complete) |
| Tool use | Yes | Yes |
Extended Thinking
Extended thinking lets Claude reason internally before responding. The model uses a hidden "thinking" block where it can work through complex problems, explore multiple approaches, and self-correct — then delivers a polished final response.
import anthropic client = anthropic.Anthropic() resp = client.messages.create( model="claude-opus-4-6", max_tokens=16000, thinking={ "type": "enabled", "budget_tokens": 10000 # How many tokens Claude can spend thinking }, messages=[{ "role": "user", "content": """Review ThreadCo's pricing strategy: - Current: £28 average price, 42% margin, 500 orders/week - Competitors: £25-£45 range - Customer feedback: "good value" but some price sensitivity - Goal: Increase margin to 50% without losing more than 10% volume Propose three strategies, model the financial impact of each, and recommend one with justification.""" }] ) # The response contains both thinking blocks and text blocks for block in resp.content: if block.type == "thinking": print(f"[Thinking: {len(block.thinking)} chars]") # Optionally inspect the thinking for debugging elif block.type == "text": print(block.text)
Extended thinking costs more (thinking tokens are billed, though at a reduced rate) and is slower. Use it for: complex mathematical analysis, multi-constraint optimisation, security vulnerability analysis, architectural decision-making, and any problem requiring 10+ logical steps. Do NOT use it for: simple classification, short writing tasks, or straightforward coding — the overhead is not justified.
Parallel Tool Use
Claude can request multiple tool calls in a single response. When tools are independent (e.g., checking stock for 3 different SKUs), you can execute them in parallel for faster results.
import asyncio, anthropic client = anthropic.Anthropic() async def execute_tool_async(name: str, input: dict) -> dict: """Execute a tool call asynchronously.""" # In production, this would be an async database query or API call await asyncio.sleep(0.1) # Simulate network latency return {"sku": input["sku"], "in_stock": 23} async def handle_parallel_tools(response) -> list: """Execute all tool calls from a response in parallel.""" tool_calls = [b for b in response.content if b.type == "tool_use"] # Run all tool calls concurrently tasks = [execute_tool_async(tc.name, tc.input) for tc in tool_calls] results = await asyncio.gather(*tasks) # Map results back to tool_use_ids return [ {"type": "tool_result", "tool_use_id": tc.id, "content": json.dumps(result)} for tc, result in zip(tool_calls, results) ] # When Claude calls 3 stock checks at once, they all execute simultaneously # Total latency = max(individual latency), not sum(individual latencies)
Using CLAUDE.md for Standing Instructions
# ThreadCo — Claude Standing Instructions
## Brand Voice
- Friendly, direct, slightly playful. Never corporate.
- Sustainability is a feature, not a badge. Mention it naturally.
- Forbidden words: vibrant, perfect, stylish, must-have, luxurious
- No exclamation marks in product copy.
## Code Conventions
- TypeScript everywhere. Prefer `const` over `let`.
- All async functions must handle errors explicitly.
- Test files live in __tests__/ next to the source file.
## Review Checklist
When reviewing any pull request, always check:
1. Does it have tests?
2. Are error states handled?
3. Does the UI copy match the brand voice above?
Refinement Loop in Practice
| Turn | What you type | What changes |
|---|---|---|
| 1 | Write a product description for the Sunset Gradient Tee (@sunset-tee.md) | First draft — correct but generic |
| 2 | Good start. Make it exactly 2 sentences. Cut any adjectives that don't describe a physical property. | Tighter, more concrete |
| 3 | The second sentence is too functional. Rewrite it to evoke the feeling of wearing it, not just the spec. | Sensory, brand-authentic |
| 4 | Perfect. Now apply the same approach to @midnight-tee.md and @wave-crop.md. | Batch — Claude uses the refined style as the implicit example |
Cost Savings Summary — All Techniques
| Technique | Savings | Best For | Implementation Effort |
|---|---|---|---|
| Model routing (Haiku/Sonnet/Opus) | 40-80% vs all-Opus | Mixed workloads with varying complexity | Low — just a routing function |
| Prompt caching | 90% on cached tokens | Repeated system prompts, brand guidelines | Low — add cache_control to API calls |
| Batch API | 50% on all tokens | Non-urgent bulk processing | Medium — async processing pipeline |
| Shorter prompts | Variable (10-30%) | Everything — fewer input tokens = lower cost | Low — prompt engineering |
| Lower max_tokens | Prevents waste | Tasks where you know the output length | Low — set per task type |
| Combined: routing + caching + batching | Up to 95% vs naive approach | Production systems at scale | Medium-High — architecture work |
Slash Commands and Workflow Shortcuts
| Command | What It Does | When to Use |
|---|---|---|
| /clear | Resets the conversation — clears all context | When switching to an unrelated task |
| /compact | Summarises conversation history to free context space | When a long session starts degrading in quality |
| /review | Triggers a code review on the current file | Quick review before committing |
| / (just a slash) | Shows all available commands | To discover what's available |
Claude Code has built-in slash commands: /clear resets the conversation, /compact summarises history to free up context, /review triggers a code review on the current file. Type / in the chat panel to see all available commands.
CLAUDE.md is the standing-instructions file for Claude Code in VS Code. If you also use Windsurf, that IDE uses a separate file called .windsurfrules — same concept, different tool, different filename. Do not confuse them or use one in place of the other.
Hands-On Exercises
Take a writing task (product description, email, documentation). Get Claude's first draft. Then refine it across exactly 3 turns, each with one specific instruction. Compare the final version to the first draft. How much better is it? Try with 5 turns — is there a point of diminishing returns?
Using the streaming code from this module, build a simple Python script that streams Claude's response to the terminal. Measure the time to first token (TTFT) vs the total response time. Then modify it to use non-streaming and compare the user experience. How much faster does streaming feel?
Identify the longest system prompt or context block you send repeatedly. Calculate: (a) its token count, (b) how many requests use it per day, (c) the cost without caching, (d) the cost with caching. What is your monthly savings? Is the block over 1,024 tokens (the minimum for caching)?
Take a batch task (e.g., 50 product descriptions). Process them two ways: (a) 50 individual API calls, (b) one batch API submission. Compare: total cost, total wall-clock time, and output quality. Is the quality identical? Note: the Batch API may take up to 24 hours, so plan ahead.
Take a complex business problem (pricing analysis, architecture decision, or risk assessment). Run it twice with Claude Opus: (a) without extended thinking, (b) with extended thinking (budget: 10,000 tokens). Compare the depth, accuracy, and nuance of the two responses. Inspect the thinking block — does it show useful reasoning that improved the final answer?