Vision Model Prompting

How to effectively prompt AI models that can see and understand images.

7 min read
2 quiz questions

Vision-capable models like GPT-4o, Claude 3.5 Sonnet, and Gemini can analyze images alongside text. This opens up powerful use cases: analyzing charts, extracting data from screenshots, understanding diagrams, reviewing UI designs, and much more. But effective vision prompting requires different techniques than text-only prompting.

Vision models convert images into tokens that are processed alongside your text prompt. A typical image uses 500-2,000 tokens depending on resolution and detail level. The model "sees" the image much like a human would — it can identify objects, read text, understand spatial relationships, and interpret charts. However, it has limitations: fine details in large images can be missed, exact counts are unreliable, and small text may be misread.

  1. Tell the model what to look for: "This is a screenshot of our dashboard. Focus on the revenue chart in the upper right."
  2. Be specific about what information you need: "Extract all text from this image" vs. "What does the headline say?"
  3. Provide context about the image: "This is a wireframe for our mobile app's login screen."
  4. Ask structured questions: "List every data point visible in this chart as a table."
  5. Use high-resolution images when details matter, lower resolution when you just need general understanding.

Image Analysis

General-purpose image analysis prompt with structured extraction.

[Attach image]

Analyze this image and provide:
1. A brief description of what's shown
2. [SPECIFIC DATA OR INFORMATION] you can extract
3. Any issues, anomalies, or notable details
4. Confidence level for each extraction (high/medium/low)

Context: This image is a [TYPE: screenshot/chart/diagram/photo] from [CONTEXT].

  • Chart/graph data extraction: Upload a chart and ask the model to extract data points into a table
  • UI/UX review: Share a screenshot and get accessibility, usability, and design feedback
  • Document OCR: Extract text from photos of documents, whiteboards, or handwritten notes
  • Code from screenshots: Convert screenshot of code (e.g., from a tutorial video) into actual code
  • Diagram interpretation: Explain architecture diagrams, flowcharts, or system designs

UI Design Review

Professional UX review of any UI screenshot.

[Attach screenshot]

Review this UI design as a UX expert. Evaluate:

1. Visual hierarchy: Is the most important action obvious?
2. Accessibility: Color contrast, text size, touch target sizes
3. Consistency: Do similar elements look and behave similarly?
4. Cognitive load: Is the user being asked to process too much?
5. Mobile-friendliness: Would this work on a small screen?

For each issue found, rate severity (critical/moderate/minor) and suggest a specific fix.
Vision models sometimes hallucinate text that isn't in the image. Always verify extracted text against the original image, especially for numbers and proper nouns.

Prompt Templates

Chart Data Extractor

Extracts data from chart images into structured tables.

[Attach chart image]

Extract all data from this chart into a structured table. Include:
- All axis labels and values
- Every data point you can identify
- The chart title and any legends
- Units of measurement

Format as a markdown table. Flag any values you're unsure about with an asterisk (*).

Whiteboard OCR

Transcribes and organizes content from whiteboard photos or handwritten notes.

[Attach whiteboard/handwriting photo]

Transcribe everything written on this whiteboard/document. Organize the content logically:
1. Main headings/topics
2. Supporting points under each heading
3. Any diagrams or arrows (describe the relationships they show)
4. Action items or circled/highlighted text

For illegible text, write [illegible] and describe what you can make out.

Test Your Knowledge

Knowledge Check

1 / 2

What is the most important thing to do when prompting a vision model?

Key Takeaways

  • Vision prompting follows the same principle as text: be specific about what you want
  • Tell the model what the image is and what to focus on for best results
  • Images consume 500-2,000 tokens — factor this into your context budget
  • Verify extracted text and numbers against the original image due to hallucination risk
  • Chart extraction, UI review, and document OCR are among the most valuable vision use cases