Multimodal | AI Training Hub

Claude Track

Module 17

Claude Track -- Module 17

ShopMate Sees the Products: Maya uploads a product photo and ShopMate writes the description from the image -- no manual brief needed. She also uploads a competitor's catalogue page and gets a comparison report. The product upload workflow that used to take 10 minutes per item now takes 90 seconds.

Multimodal Capabilities

All Claude 4 models support vision input. Send images, screenshots, charts, diagrams, and PDF pages for analysis alongside text. This module covers what Claude can see, what it cannot, how to get the best results from image inputs, and practical patterns for document processing, chart analysis, and UI review.

Image Analysis

Send any image (JPEG, PNG, GIF, WebP) up to 5MB. Claude can describe content, extract text via OCR, identify objects, read handwriting, compare images, and answer questions about the visual. All Claude 4 models support vision — you don't need a special "vision" model.

Document Understanding

Pass PDF pages as images for extraction of tables, charts, and structured data. Ideal for invoices, forms, research papers, and technical diagrams. Claude preserves table structure and can convert visual tables to JSON, CSV, or markdown.

Chart and Data Extraction

Send screenshots of dashboards or charts and ask Claude to extract underlying data, identify trends, or generate equivalent code to reproduce the visualisation. Claude can read bar charts, line charts, pie charts, scatter plots, and basic infographics.

UI Screenshot Analysis

Share screenshots of UIs for debugging, accessibility review, or automated testing. Claude can identify elements, suggest improvements, and generate test selectors. This is invaluable for design reviews and cross-browser issue diagnosis.

How Vision Works in Claude

Understanding how Claude processes images helps you get better results:

Images Are Tokenised

When you send an image, Claude converts it into tokens — just like text. A typical image costs 1,000-5,000 tokens depending on its size and complexity. This means images consume context window space. A conversation with 20 images can use a significant portion of your 200K token budget.

Resolution Matters

Claude processes images at various resolutions. Very small images (under 200x200 pixels) may lose detail. Very large images are downscaled. The sweet spot is 1000-2000 pixels on the longest edge. Cropping to the relevant area is often better than sending a full-screen screenshot.

Text in Images Is Readable

Claude has strong OCR capabilities — it can read printed text, handwritten text, text in diagrams, and text overlaid on images. However, very small text (under 12px in a screenshot) or low-contrast text may be misread. For critical OCR, verify the extracted text.

Multiple Images Per Message

You can send multiple images in a single message. Claude can compare them, identify differences, or process them as a batch. For example: "Here are screenshots of our checkout flow on mobile and desktop — identify any responsive design issues."

Supported Image Formats and Limits

Format	Supported	Notes
JPEG (.jpg, .jpeg)	Yes	Most common. Good for photos and screenshots.
PNG (.png)	Yes	Best for screenshots, diagrams, and text-heavy images.
GIF (.gif)	Yes	Only the first frame is analysed for animated GIFs.
WebP (.webp)	Yes	Efficient format, works well.
SVG (.svg)	No	Convert to PNG first. SVG is XML-based — you can share the XML directly.
PDF (.pdf)	Via API	Send individual pages as images. Claude Code can read PDFs directly with @.
TIFF (.tiff)	No	Convert to PNG or JPEG first.
Maximum file size		5MB per image (base64 encoded via API), or standard upload limits in Claude.ai
Maximum images per message		20 images (but be mindful of token cost)

Vision Best Practices

Crop Before Sending

If you only care about one part of a screenshot, crop it. A cropped image of a form field is more useful than a full-screen screenshot where the form field is 10% of the pixels. Less noise = better results.

Add Text Context

Always accompany images with text instructions: "This is a screenshot of our checkout page. Focus on the discount code field at the bottom." Text context guides Claude's attention to what matters in the image.

Specify What You Want Extracted

Don't just say "analyse this image." Be specific: "Extract the table in this image as a JSON array of objects with columns: date, amount, description." or "Read the text in the red error banner."

Use High-Contrast Images

Claude performs best on images with clear contrast — dark text on light backgrounds, distinct colours for chart segments. If you're creating diagrams for Claude to analyse, use bold lines, large text, and high contrast.

Document Processing Patterns

One of the most practical multimodal use cases is processing documents — invoices, receipts, contracts, forms, and research papers.

Python — Invoice Data Extraction

import anthropic, base64, json
from pathlib import Path
client = anthropic.Anthropic()

def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from an invoice image."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode()
    mime = "image/png" if image_path.endswith(".png") else "image/jpeg"

    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=500,
        messages=[{"role":"user", "content":[
            {"type":"image", "source":{"type":"base64", "media_type":mime, "data":data}},
            {"type":"text", "text": """Extract the following from this invoice as JSON:
{
  "vendor_name": "",
  "invoice_number": "",
  "date": "YYYY-MM-DD",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0.00, "total": 0.00}],
  "subtotal": 0.00,
  "tax": 0.00,
  "total": 0.00,
  "payment_terms": ""
}
Return ONLY valid JSON. If a field is not visible, use null."""}
        ]}]
    )
    return json.loads(resp.content[0].text)

Chart and Dashboard Analysis

Claude can read charts and extract approximate data values. This is useful for analysing dashboards, competitor reports, and any visual data presentation.

Chart Type	Claude's Accuracy	Tips
Bar charts	High — can read values within ~5% accuracy	Ensure axis labels are visible. Ask for exact values, not just trends.
Line charts	Good — trends are reliable, exact values less so	Ask about trends and turning points rather than precise data points.
Pie charts	Good — segments over 10% are reliable	Small segments (under 5%) may be misread. Ask Claude to estimate.
Scatter plots	Moderate — can identify clusters and outliers	Best for qualitative analysis (patterns) rather than exact coordinates.
Tables in images	High — strong OCR for printed tables	Specify the output format (JSON, CSV, markdown) for structured extraction.
Complex dashboards	Good — can identify individual charts within a dashboard	Ask about one chart at a time for best accuracy.

Chat Panel — Dashboard Analysis

You:
[Attached: screenshot of ThreadCo's Shopify analytics dashboard]

Analyse this dashboard and answer:
1. What is the trend in daily orders over the past 30 days?
2. Which product category has the highest revenue?
3. What is the conversion rate shown, and how does it compare to
   e-commerce benchmarks (typically 2-3%)?
4. Are there any anomalies or concerning patterns?

Present your findings as a brief executive summary for Maya.

UI Screenshot Review

Claude can review UI screenshots for design issues, accessibility problems, and consistency:

Chat Panel — UI Accessibility Review

You:
[Attached: screenshot of ThreadCo's product page on mobile]

Review this mobile product page for:
1. Accessibility: Are there contrast issues? Is text readable?
   Are interactive elements large enough for touch (44x44px minimum)?
2. UX: Is the call-to-action ("Add to Cart") prominent?
   Is the price clearly visible? Is the size selector intuitive?
3. Responsiveness: Does the layout use the mobile viewport well?
   Are there any elements that look broken or misaligned?
4. Brand consistency: Does it match ThreadCo's brand (clean,
   sustainable, no-nonsense)?

For each issue, suggest a specific fix.

Honest Limitations of Vision

No Image Generation

Claude can analyse images but cannot create, edit, or manipulate them. It cannot draw diagrams, generate product mockups, or create marketing visuals. For image generation, use tools like DALL-E, Midjourney, or Stable Diffusion — Claude can write the prompts for these tools.

Spatial Reasoning Limits

Claude can identify objects and read text in images, but precise spatial reasoning (exact pixel positions, precise measurements, counting many similar objects) is less reliable. It can say "the button is in the top-right corner" but not "the button is at position 342, 87."

People and Faces

Claude will not identify real people by name from photos, even public figures. This is a deliberate safety measure. Claude can describe what it sees (a person wearing a blue shirt, approximately 30 years old) but will not say "this is [name]."

Small or Low-Quality Text

Very small text (under ~10px), blurry text, or text with poor contrast against the background may be misread or missed entirely. If OCR accuracy is critical, ensure the image is high-resolution and the text is clearly visible.

Hallucination in Vision

Like text-based responses, Claude can hallucinate details in images. It might "read" text that is not there or misidentify objects. For critical applications (medical images, legal documents, safety-relevant analysis), always verify Claude's visual analysis with a human reviewer. Vision hallucinations are harder to catch than text hallucinations because you have to look at the original image to verify.

ShopMate -- Photo-to-Description

Python -- shopmate/vision/photo_description.py

# shopmate/vision/photo_description.py
# Maya uploads a product photo -> ShopMate writes the description
import anthropic, base64
from pathlib import Path
client = anthropic.Anthropic()

SYSTEM = """You are a ThreadCo copywriter. When given a product photo:
1. Identify the garment type, colours, print/design, and apparent material
2. Write a 2-sentence product description in ThreadCo's voice:
   - Friendly, direct, never corporate
   - Always mention sustainability
   - No exclamation marks
   - Forbidden words: vibrant, perfect, stylish, must-have"""

def describe_from_photo(image_path: str, extra_info: str = "") -> str:
    """Generate a product description directly from a product photo."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode()
    mime = "image/png" if image_path.endswith(".png") else "image/jpeg"
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=150,
        system=SYSTEM,
        messages=[{"role":"user", "content":[
            {"type":"image","source":{"type":"base64","media_type":mime,"data":data}},
            {"type":"text", "text": f"Write the ThreadCo product description.{' Extra info: ' + extra_info if extra_info else ''}"}
        ]}]
    )
    return resp.content[0].text

def batch_describe_photos(photo_dir: str) -> list[dict]:
    """Process a folder of product photos in one go."""
    results = []
    for path in Path(photo_dir).glob("*.jpg"):
        desc = describe_from_photo(str(path))
        results.append({"file": path.name, "description": desc})
        print(f"{path.name}: {desc[:80]}...")
    return results

# Usage: Maya drags her product photos into a folder, runs one command
# results = batch_describe_photos("new_season_photos/")

Comparing Images

Send two or more images in one message for comparison tasks:

Chat Panel — Image Comparison

You:
[Image 1: ThreadCo's product page — current version]
[Image 2: ThreadCo's product page — new design mockup]

Compare these two versions of our product page:
1. What changed between the current version and the new design?
2. Which version has better visual hierarchy for the price and CTA?
3. Are there any accessibility regressions in the new design?
4. Rate both designs 1-10 for mobile usability.

Be specific — reference elements by position and describe what you see.

Hands-On Exercises

Exercise 1 — OCR Accuracy Test

Take a screenshot of a document with mixed formatting (tables, headings, body text, footnotes). Ask Claude to extract all the text. Compare Claude's extraction with the original. Calculate the accuracy rate. Try with different image qualities (high-res vs low-res) and note how quality affects accuracy.

Exercise 2 — Chart Data Extraction

Take a screenshot of a chart from a report or dashboard. Ask Claude to extract the underlying data as a JSON array. Compare the extracted values with the actual data. How accurate are the numbers? For which chart types is Claude most/least accurate?

Exercise 3 — UI Review

Screenshot a page from your application (or any website). Ask Claude to review it for accessibility, using the structured review prompt from this module. Verify each finding — are the contrast issues real? Are the touch targets actually too small? Compare Claude's findings with an automated accessibility tool like axe or Lighthouse.

Exercise 4 — Batch Photo Processing

Using the batch_describe_photos code pattern, process 5-10 product photos (or any set of similar images). Review the consistency of Claude's output across the batch. Are the descriptions consistent in style, length, and tone? Adjust the system prompt until batch consistency reaches an acceptable level.

Exercise 5 — Limitation Testing

Deliberately test Claude's vision limitations: send a very small image, a blurry photo, an image with tiny text, and an image of many similar objects. Ask Claude to describe what it sees. For each test, note where Claude succeeds and where it struggles. Use these findings to set expectations for your team's vision-based workflows.

<-- Skills Next: API Usage -->