Multimodal Capabilities
All Claude 4 models support vision input. Send images, screenshots, charts, diagrams, and PDF pages for analysis alongside text. This module covers what Claude can see, what it cannot, how to get the best results from image inputs, and practical patterns for document processing, chart analysis, and UI review.
Image Analysis
Send any image (JPEG, PNG, GIF, WebP) up to 5MB. Claude can describe content, extract text via OCR, identify objects, read handwriting, compare images, and answer questions about what an image shows. All Claude 4 models support vision, so you don't need a special "vision" model.
Document Understanding
Pass PDF pages as images to extract tables, charts, and structured data. This works well for invoices, forms, research papers, and technical diagrams. Claude can usually preserve table structure and convert visual tables to JSON, CSV, or markdown.
Chart and Data Extraction
Send screenshots of dashboards or charts and ask Claude to extract underlying data, identify trends, or generate equivalent code to reproduce the visualisation. Claude can read bar charts, line charts, pie charts, scatter plots, and basic infographics.
UI Screenshot Analysis
Share screenshots of UIs for debugging, accessibility review, or automated testing. Claude can identify elements, suggest improvements, and generate test selectors. This is invaluable for design reviews and cross-browser issue diagnosis.
How Vision Works in Claude
Understanding how Claude processes images helps you get better results:
Images Are Tokenised
When you send an image, Claude converts it into tokens, just like text. A typical image costs 1,000-5,000 tokens, depending mainly on its pixel dimensions. Images therefore consume context window space: a conversation with 20 images uses roughly 20,000-100,000 tokens, which can be up to half of a 200K-token context window.
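A rough budget check makes the token cost concrete. The sketch below assumes the rule of thumb that an image costs about (width × height) / 750 tokens; treat that constant as an assumption and confirm real costs from the token usage reported by the API. The function names are illustrative.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate token cost of one image (assumed rule: pixels / 750)."""
    return round(width_px * height_px / 750)


def context_share(image_sizes: list[tuple[int, int]], context_tokens: int = 200_000) -> float:
    """Fraction of the context window consumed by a set of images."""
    total = sum(estimate_image_tokens(w, h) for w, h in image_sizes)
    return total / context_tokens


# Twenty 1000x1000 screenshots against a 200K-token window
share = context_share([(1000, 1000)] * 20)
print(f"{share:.0%} of the context window")
```

Running an estimate like this before a long multi-image conversation shows whether to crop or downscale first.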
Resolution Matters
Claude processes images at various resolutions. Very small images (under 200x200 pixels) may lose detail. Very large images are downscaled. The sweet spot is 1000-2000 pixels on the longest edge. Cropping to the relevant area is often better than sending a full-screen screenshot.
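The sizing guidance above can be turned into a small pre-flight calculation. This is a minimal sketch in pure Python: the 2000-pixel ceiling and 200-pixel floor come from the text, while the function names are hypothetical.

```python
def target_size(width: int, height: int, max_edge: int = 2000) -> tuple[int, int]:
    """Scale (width, height) down so the longest edge is at most max_edge.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def is_too_small(width: int, height: int, min_edge: int = 200) -> bool:
    """Flag images whose longest edge is below the size where detail tends to be lost."""
    return max(width, height) < min_edge


print(target_size(4000, 3000))
print(is_too_small(150, 120))
```

The resulting size can be passed to whichever imaging library you use for the actual resize, for example Pillow's `Image.resize`.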
Text in Images Is Readable
Claude has strong OCR capabilities: it can read printed text, handwritten text, text in diagrams, and text overlaid on images. However, very small text (roughly 10-12px or smaller in a screenshot) or low-contrast text may be misread. For critical OCR, verify the extracted text against the original image.
Multiple Images Per Message
You can send multiple images in a single message. Claude can compare them, identify differences, or process them as a batch. For example: "Here are screenshots of our checkout flow on mobile and desktop — identify any responsive design issues."
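One way to assemble such a message is to put a short text label before each image so that the question can refer to images by number. The sketch below only builds the `content` list, using the same base64 image block shape as the API examples in this module; the helper names and the labelling scheme are assumptions, not an official pattern.

```python
import base64


def image_block(image_bytes: bytes, media_type: str) -> dict:
    """Wrap raw image bytes as a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode(),
        },
    }


def comparison_content(images: list[tuple[str, bytes, str]], question: str) -> list[dict]:
    """Build one user message: a text label before each image, then the question.

    images is a list of (label, image_bytes, media_type) tuples.
    """
    content: list[dict] = []
    for index, (label, image_bytes, media_type) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index}: {label}"})
        content.append(image_block(image_bytes, media_type))
    content.append({"type": "text", "text": question})
    return content
```

The returned list is what you would pass as the `content` of a single `{"role": "user", ...}` message.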
Supported Image Formats and Limits
| Format | Supported | Notes |
|---|---|---|
| JPEG (.jpg, .jpeg) | Yes | Most common. Good for photos and screenshots. |
| PNG (.png) | Yes | Best for screenshots, diagrams, and text-heavy images. |
| GIF (.gif) | Yes | Only the first frame is analysed for animated GIFs. |
| WebP (.webp) | Yes | Efficient format, works well. |
| SVG (.svg) | No | Convert to PNG first. SVG is XML-based — you can share the XML directly. |
| PDF (.pdf) | Via API | Send individual pages as images. Claude Code can read PDFs directly with @. |
| TIFF (.tiff) | No | Convert to PNG or JPEG first. |

Size and count limits:

| Limit | Value |
|---|---|
| Maximum file size | 5MB per image via the API (base64 encoded); standard upload limits in Claude.ai |
| Maximum images per message | 20 images (but be mindful of token cost) |
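A pre-upload check can catch unsupported formats and oversized files before a request is made. This is a minimal sketch using the formats and the 5MB figure from the table above; `check_image` is a hypothetical helper, and whether you measure the raw or the base64-encoded size is left to the caller.

```python
from pathlib import Path

# Values taken from the table above; adjust if your platform's limits differ.
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}
MAX_IMAGE_BYTES = 5 * 1024 * 1024


def check_image(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (True, media_type) if the file can be sent, else (False, reason)."""
    suffix = Path(filename).suffix.lower()
    media_type = SUPPORTED_MEDIA_TYPES.get(suffix)
    if media_type is None:
        return False, f"unsupported format {suffix or '(none)'}: convert to PNG or JPEG"
    if size_bytes > MAX_IMAGE_BYTES:
        return False, f"{size_bytes} bytes exceeds the {MAX_IMAGE_BYTES}-byte limit"
    return True, media_type


print(check_image("diagram.svg", 20_000))
print(check_image("invoice.PNG", 350_000))
```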
Vision Best Practices
Crop Before Sending
If you only care about one part of a screenshot, crop it. A cropped image of a form field is more useful than a full-screen screenshot where the form field is 10% of the pixels. Less noise = better results.
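When the region of interest is known in pixel coordinates, a little padding around it keeps nearby labels in view. The sketch below computes a crop box only; the function name and the default padding are illustrative assumptions.

```python
def padded_crop_box(
    region: tuple[int, int, int, int],
    image_size: tuple[int, int],
    padding: int = 40,
) -> tuple[int, int, int, int]:
    """Expand a (left, top, right, bottom) region by padding, clamped to the image."""
    left, top, right, bottom = region
    width, height = image_size
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(width, right + padding),
        min(height, bottom + padding),
    )


# A form field near the bottom of a 1280x1000 screenshot
print(padded_crop_box((100, 900, 500, 960), (1280, 1000)))
```

The tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it can be passed to that method directly if you use Pillow.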
Add Text Context
Always accompany images with text instructions: "This is a screenshot of our checkout page. Focus on the discount code field at the bottom." Text context guides Claude's attention to what matters in the image.
Specify What You Want Extracted
Don't just say "analyse this image." Be specific: "Extract the table in this image as a JSON array of objects with columns: date, amount, description." or "Read the text in the red error banner."
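Specific instructions are easy to generate from a column list, which keeps extraction prompts consistent across a codebase. This is a minimal sketch; the wording of the prompt and the function name are assumptions to adapt, not a tested template.

```python
def table_extraction_prompt(columns: list[str], output_format: str = "JSON") -> str:
    """Build a specific extraction instruction for a table in an image."""
    column_list = ", ".join(columns)
    return (
        f"Extract the table in this image as {output_format} "
        f"with exactly these columns: {column_list}. "
        "Use null for any cell you cannot read. "
        "Return only the extracted data, with no commentary."
    )


print(table_extraction_prompt(["date", "amount", "description"]))
```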
Use High-Contrast Images
Claude performs best on images with clear contrast — dark text on light backgrounds, distinct colours for chart segments. If you're creating diagrams for Claude to analyse, use bold lines, large text, and high contrast.
Document Processing Patterns
One of the most practical multimodal use cases is processing documents — invoices, receipts, contracts, forms, and research papers.
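```python
import base64
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads the API key from the environment

# Media types for the image formats Claude accepts
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

INVOICE_PROMPT = """Extract the following from this invoice as JSON:
{
  "vendor_name": "",
  "invoice_number": "",
  "date": "YYYY-MM-DD",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0.00, "total": 0.00}],
  "subtotal": 0.00,
  "tax": 0.00,
  "total": 0.00,
  "payment_terms": ""
}
Return ONLY valid JSON. If a field is not visible, use null."""


def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from an invoice image."""
    path = Path(image_path)
    data = base64.b64encode(path.read_bytes()).decode()
    # Match the extension case-insensitively; fall back to JPEG as before
    mime = MEDIA_TYPES.get(path.suffix.lower(), "image/jpeg")
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        # Raised from 500: a long invoice cut off mid-object is not valid JSON
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": INVOICE_PROMPT},
            ],
        }],
    )
    # Raises json.JSONDecodeError if the reply is not pure JSON
    return json.loads(resp.content[0].text)
```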
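Because the function above passes the reply straight to `json.loads`, a reply wrapped in a markdown code fence or containing a misread number will either fail or slip through silently. The sketch below shows one possible safety net: a tolerant parser plus an arithmetic check on the extracted fields. Both helpers are hypothetical and assume the JSON shape requested in the prompt.

```python
import json


def parse_json_reply(reply_text: str) -> dict:
    """Parse a JSON object from a reply, tolerating a surrounding code fence."""
    text = reply_text.strip()
    fence = "`" * 3
    if text.startswith(fence) and text.endswith(fence):
        # Drop both fences and an optional "json" language tag
        text = text[len(fence):-len(fence)].removeprefix("json").strip()
    return json.loads(text)


def invoice_problems(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return arithmetic inconsistencies that deserve a human look."""
    problems = []
    for i, item in enumerate(invoice.get("line_items") or []):
        qty, price, total = item.get("quantity"), item.get("unit_price"), item.get("total")
        if None not in (qty, price, total) and abs(qty * price - total) > tolerance:
            problems.append(f"line {i}: {qty} x {price} != {total}")
    subtotal, tax, total = invoice.get("subtotal"), invoice.get("tax"), invoice.get("total")
    if None not in (subtotal, tax, total) and abs(subtotal + tax - total) > tolerance:
        problems.append(f"subtotal {subtotal} + tax {tax} != total {total}")
    return problems
```

An empty list from `invoice_problems` does not prove the extraction is right, only that the numbers are internally consistent; fields set to null are skipped rather than flagged.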
Chart and Dashboard Analysis
Claude can read charts and extract approximate data values. This is useful for analysing dashboards, competitor reports, and any visual data presentation.
| Chart Type | Claude's Accuracy | Tips |
|---|---|---|
| Bar charts | High — can read values within ~5% accuracy | Ensure axis labels are visible. Ask for exact values, not just trends. |
| Line charts | Good — trends are reliable, exact values less so | Ask about trends and turning points rather than precise data points. |
| Pie charts | Good — segments over 10% are reliable | Small segments (under 5%) may be misread. Ask Claude to estimate. |
| Scatter plots | Moderate — can identify clusters and outliers | Best for qualitative analysis (patterns) rather than exact coordinates. |
| Tables in images | High — strong OCR for printed tables | Specify the output format (JSON, CSV, markdown) for structured extraction. |
| Complex dashboards | Good — can identify individual charts within a dashboard | Ask about one chart at a time for best accuracy. |
You:
[Attached: screenshot of ThreadCo's Shopify analytics dashboard]
Analyse this dashboard and answer:
1. What is the trend in daily orders over the past 30 days?
2. Which product category has the highest revenue?
3. What is the conversion rate shown, and how does it compare to
e-commerce benchmarks (typically 2-3%)?
4. Are there any anomalies or concerning patterns?
Present your findings as a brief executive summary for Maya.
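When the true values behind a chart are available, the accuracy figures in the table above can be checked directly. This is a minimal sketch that computes per-point relative error and compares it with a tolerance such as the 5% quoted for bar charts; the function names are illustrative.

```python
def relative_errors(extracted: list[float], actual: list[float]) -> list[float]:
    """Per-point relative error |extracted - actual| / |actual|."""
    if len(extracted) != len(actual):
        raise ValueError("extracted and actual must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("relative error is undefined for an actual value of 0")
    return [abs(e - a) / abs(a) for e, a in zip(extracted, actual)]


def within_tolerance(extracted: list[float], actual: list[float], tolerance: float = 0.05) -> bool:
    """True if every extracted value is within tolerance of its actual value."""
    return all(err <= tolerance for err in relative_errors(extracted, actual))


# Values Claude read from a bar chart versus the source spreadsheet
print(within_tolerance([102.0, 195.0, 310.0], [100.0, 200.0, 300.0]))
```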
UI Screenshot Review
Claude can review UI screenshots for design issues, accessibility problems, and consistency:
You:
[Attached: screenshot of ThreadCo's product page on mobile]
Review this mobile product page for:
1. Accessibility: Are there contrast issues? Is text readable?
Are interactive elements large enough for touch (44x44px minimum)?
2. UX: Is the call-to-action ("Add to Cart") prominent?
Is the price clearly visible? Is the size selector intuitive?
3. Responsiveness: Does the layout use the mobile viewport well?
Are there any elements that look broken or misaligned?
4. Brand consistency: Does it match ThreadCo's brand (clean,
sustainable, no-nonsense)?
For each issue, suggest a specific fix.
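Contrast and touch-target findings from a review like this can be verified numerically. The sketch below implements the WCAG 2.x contrast-ratio formula and the 44x44px check from the prompt; it assumes you have sampled the foreground and background colours yourself (for example with a colour picker), since the code does not read the screenshot.

```python
def _linear(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB colour, from 0 (black) to 1 (white)."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: tuple[int, int, int], background: tuple[int, int, int]) -> float:
    """WCAG contrast ratio, from 1 (identical colours) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def touch_target_ok(width_px: int, height_px: int, minimum: int = 44) -> bool:
    """True if an interactive element meets the minimum touch size on both axes."""
    return width_px >= minimum and height_px >= minimum


# Mid-grey text on a white background
print(round(contrast_ratio((119, 119, 119), (255, 255, 255)), 2))
```

WCAG AA asks for at least 4.5:1 for normal-size text, so a ratio below that confirms a contrast issue that Claude flagged.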
Honest Limitations of Vision
No Image Generation
Claude can analyse images but cannot create, edit, or manipulate them. It cannot draw diagrams, generate product mockups, or create marketing visuals. For image generation, use tools like DALL-E, Midjourney, or Stable Diffusion — Claude can write the prompts for these tools.
Spatial Reasoning Limits
Claude can identify objects and read text in images, but precise spatial reasoning (exact pixel positions, precise measurements, counting many similar objects) is less reliable. It can say "the button is in the top-right corner" but not "the button is at position 342, 87."
People and Faces
Claude will not identify real people by name from photos, even public figures. This is a deliberate safety measure. Claude can describe what it sees (a person wearing a blue shirt, approximately 30 years old) but will not say "this is [name]."
Small or Low-Quality Text
Very small text (under ~10px), blurry text, or text with poor contrast against the background may be misread or missed entirely. If OCR accuracy is critical, ensure the image is high-resolution and the text is clearly visible.
Visual Hallucinations
As with text responses, Claude can hallucinate details when analysing images. It might "read" text that is not there or misidentify objects. For critical applications (medical images, legal documents, safety-relevant analysis), always have a human reviewer verify Claude's visual analysis. Vision hallucinations are harder to catch than text hallucinations because you have to look at the original image to verify them.
ShopMate -- Photo-to-Description
```python
# shopmate/vision/photo_description.py
# Maya uploads a product photo -> ShopMate writes the description
import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

# Media types for the image formats Claude accepts
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

SYSTEM = """You are a ThreadCo copywriter. When given a product photo:
1. Identify the garment type, colours, print/design, and apparent material
2. Write a 2-sentence product description in ThreadCo's voice:
   - Friendly, direct, never corporate
   - Always mention sustainability
   - No exclamation marks
   - Forbidden words: vibrant, perfect, stylish, must-have"""


def describe_from_photo(image_path: str, extra_info: str = "") -> str:
    """Generate a product description directly from a product photo."""
    path = Path(image_path)
    data = base64.b64encode(path.read_bytes()).decode()
    # Match the extension case-insensitively; fall back to JPEG as before
    mime = MEDIA_TYPES.get(path.suffix.lower(), "image/jpeg")
    prompt = "Write the ThreadCo product description."
    if extra_info:
        prompt += f" Extra info: {extra_info}"
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=150,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": mime, "data": data}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return resp.content[0].text


def batch_describe_photos(photo_dir: str) -> list[dict]:
    """Process a folder of product photos in one go."""
    results = []
    # The original matched only *.jpg; this accepts every supported format
    for path in sorted(Path(photo_dir).iterdir()):
        if path.suffix.lower() not in MEDIA_TYPES:
            continue  # skip anything that is not a supported image
        desc = describe_from_photo(str(path))
        results.append({"file": path.name, "description": desc})
        print(f"{path.name}: {desc[:80]}...")
    return results


# Usage: Maya drags her product photos into a folder, runs one command
# results = batch_describe_photos("new_season_photos/")
```
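A system prompt states the style rules, but nothing in the code above checks that a reply follows them. The sketch below is one possible post-check for the mechanical rules in `SYSTEM` (forbidden words, exclamation marks, two sentences). The function name is hypothetical and the sentence splitter is a naive heuristic, so treat its results as prompts for review rather than verdicts.

```python
import re

FORBIDDEN_WORDS = ("vibrant", "perfect", "stylish", "must-have")


def style_violations(description: str) -> list[str]:
    """List the ways a description breaks the mechanical rules in the system prompt."""
    violations = []
    lowered = description.lower()
    for word in FORBIDDEN_WORDS:
        # Whole-word match, so "perfectly" is not reported as "perfect"
        if re.search(rf"\b{re.escape(word)}\b", lowered):
            violations.append(f"forbidden word: {word}")
    if "!" in description:
        violations.append("contains an exclamation mark")
    # Naive split: sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", description.strip()) if s]
    if len(sentences) != 2:
        violations.append(f"expected 2 sentences, found {len(sentences)}")
    return violations


print(style_violations("A vibrant tee!"))
```

Running this over the output of `batch_describe_photos` gives a quick consistency report for a whole folder.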
Comparing Images
Send two or more images in one message for comparison tasks:
You:
[Image 1: ThreadCo's product page — current version]
[Image 2: ThreadCo's product page — new design mockup]
Compare these two versions of our product page:
1. What changed between the current version and the new design?
2. Which version has better visual hierarchy for the price and CTA?
3. Are there any accessibility regressions in the new design?
4. Rate both designs 1-10 for mobile usability.
Be specific — reference elements by position and describe what you see.
Hands-On Exercises
Take a screenshot of a document with mixed formatting (tables, headings, body text, footnotes). Ask Claude to extract all the text. Compare Claude's extraction with the original. Calculate the accuracy rate. Try with different image qualities (high-res vs low-res) and note how quality affects accuracy.
Take a screenshot of a chart from a report or dashboard. Ask Claude to extract the underlying data as a JSON array. Compare the extracted values with the actual data. How accurate are the numbers? For which chart types is Claude most/least accurate?
Screenshot a page from your application (or any website). Ask Claude to review it for accessibility, using the structured review prompt from this module. Verify each finding — are the contrast issues real? Are the touch targets actually too small? Compare Claude's findings with an automated accessibility tool like axe or Lighthouse.
Using the batch_describe_photos code pattern, process 5-10 product photos (or any set of similar images). Review the consistency of Claude's output across the batch. Are the descriptions consistent in style, length, and tone? Adjust the system prompt until batch consistency reaches an acceptable level.
Deliberately test Claude's vision limitations: send a very small image, a blurry photo, an image with tiny text, and an image of many similar objects. Ask Claude to describe what it sees. For each test, note where Claude succeeds and where it struggles. Use these findings to set expectations for your team's vision-based workflows.
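For the text-extraction exercise, the accuracy rate needs a definition. One simple option, sketched below, is character-level similarity from the standard library's `difflib`, computed after collapsing whitespace so that layout differences are not counted as errors. This metric is an assumption for the exercise; word error rate or exact field matching may suit your documents better.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Collapse runs of whitespace so line breaks and spacing are ignored."""
    return " ".join(text.split())


def extraction_accuracy(extracted: str, original: str) -> float:
    """Character-level similarity in [0, 1] between an extraction and the ground truth."""
    return SequenceMatcher(None, normalise(extracted), normalise(original)).ratio()


original = "Invoice total: 700.00\nDue within 30 days"
extracted = "Invoice total: 100.00 Due within 30 days"
print(f"{extraction_accuracy(extracted, original):.1%}")
```

Note that a single misread digit barely moves this score, which is why numeric fields deserve a separate exact comparison.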