Visual Analysis with Gemini
When to Use This Skill
Automatically invoke this skill when:
- •User asks to analyze an image or video file
- •User requests OCR or text extraction from images
- •User wants to understand image content, objects, or scenes
- •User needs detailed visual descriptions
- •User asks about what's in a screenshot or photo
- •User requests video summarization or event detection
Examples That Trigger This Skill
- •"What's in this image?"
- •"Analyze screenshot.png"
- •"Extract text from this receipt"
- •"Describe what you see in photo.jpg"
- •"What objects are in this image?"
- •"Summarize what happens in video.mp4"
How to Use
- •Identify the file: Get the file path from the user's request
- •Verify the file exists: Use the Read tool to check that the file is accessible
- •Call Gemini: Use the analyze_visual tool from the gemini-api MCP server
  - •Pass the file path
  - •Include the user's specific question as the prompt (or use a general analysis prompt)
  - •Choose a model: gemini-1.5-flash for speed, gemini-1.5-pro for quality
- •Present results: Return Gemini's analysis to the user
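The steps above can be sketched in JavaScript as follows. This is a minimal illustration, not the skill's actual implementation: `extractFilePath`, `analyzeRequest`, and the injected `callTool` function are hypothetical names, and the extension list and the default to gemini-1.5-flash are assumptions. Only the tool name `analyze_visual` and its three parameters come from this document.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: return the first token in the request that looks like
// an image or video file name, or null if there is none.
// The extension list is an assumption, not a statement of what Gemini accepts.
function extractFilePath(request) {
  const match = request.match(
    /[\w.\/\\~-]+\.(?:png|jpe?g|gif|webp|mp4|mov|webm)\b/i
  );
  return match ? match[0] : null;
}

// Hypothetical workflow: identify the file, verify it exists, call the tool,
// and hand the result back to the caller for presentation.
// `callTool(name, params)` is supplied by the caller (for example, a thin
// wrapper around an MCP client) and is expected to return a promise.
async function analyzeRequest(request, callTool, cwd = process.cwd()) {
  // Step 1: identify the file.
  const found = extractFilePath(request);
  if (!found) {
    throw new Error("No image or video file found in the request");
  }

  // Step 2: verify the file exists before spending an API call on it.
  const filePath = path.resolve(cwd, found);
  if (!fs.existsSync(filePath)) {
    throw new Error(`File not accessible: ${filePath}`);
  }

  // Step 3: call Gemini, using the user's request as the prompt.
  return callTool("analyze_visual", {
    file_path: filePath,
    prompt: request,
    model: "gemini-1.5-flash", // assumed default; switch to "gemini-1.5-pro" for quality
  });
}
```

Passing `callTool` in as an argument keeps the sketch independent of any particular MCP client library and makes it easy to substitute a stub when testing.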
Tool Parameters
javascript
{
  "file_path": "/absolute/path/to/image.jpg",
  "prompt": "What objects are visible in this image?",
  "model": "gemini-1.5-flash" // or "gemini-1.5-pro"
}
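One way to assemble and sanity-check these parameters before calling the tool is sketched below. The function name `buildAnalyzeVisualParams`, the fallback prompt text, and the validation rules are assumptions for illustration: the absolute-path check is inferred from the example path above, and the model list contains only the two models this document names.

```javascript
const path = require("node:path");

// The two model names mentioned in this document; treating them as the
// complete set of valid values is an assumption.
const KNOWN_MODELS = ["gemini-1.5-flash", "gemini-1.5-pro"];

// Hypothetical fallback used when the user asks no specific question.
const GENERAL_PROMPT = "Describe this file in detail.";

// Build the parameter object for analyze_visual, failing early on bad input.
function buildAnalyzeVisualParams(filePath, prompt, model = "gemini-1.5-flash") {
  if (!path.isAbsolute(filePath)) {
    throw new Error(`file_path must be absolute: ${filePath}`);
  }
  if (!KNOWN_MODELS.includes(model)) {
    throw new Error(`Unknown model: ${model}`);
  }
  return {
    file_path: filePath,
    // Use the user's question when present, otherwise the general prompt.
    prompt: prompt && prompt.trim() ? prompt.trim() : GENERAL_PROMPT,
    model,
  };
}
```

For example, `buildAnalyzeVisualParams("/absolute/path/to/image.jpg", "What objects are visible in this image?")` yields the same object as the one shown above.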
Capabilities
- •Image Analysis: Detailed object detection, scene understanding, composition analysis
- •OCR: Extract and read text from images (signs, documents, screenshots)
- •Video Analysis: Summarize events, detect actions, identify changes over time
- •Spatial Reasoning: Understand object relationships and layout
- •Multi-frame Processing: Analyze video clips frame by frame
Best Practices
- •For quick analysis, use gemini-1.5-flash
- •For detailed or complex images, use gemini-1.5-pro
- •Include specific questions in the prompt for targeted analysis
- •For videos, mention the timeframe of interest, if relevant
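These practices can be captured in two small helpers, sketched below. Both `pickModel` and `withTimeframe` are hypothetical names, and the wording appended for the timeframe is only one possible phrasing, not something the tool requires.

```javascript
// Pick the faster model by default and the higher-quality one when the
// caller flags the image or video as detailed or complex.
function pickModel({ detailed = false } = {}) {
  return detailed ? "gemini-1.5-pro" : "gemini-1.5-flash";
}

// Append a timeframe hint to a video question when one is given.
// `timeframe` is a hypothetical { start, end } pair of timestamp strings.
function withTimeframe(question, timeframe) {
  if (!timeframe) return question;
  return `${question} Focus on the segment from ${timeframe.start} to ${timeframe.end}.`;
}

// Example: a targeted question about part of a video.
const prompt = withTimeframe("Summarize what happens.", {
  start: "00:30",
  end: "01:15",
});
const model = pickModel({ detailed: true });
```

Here `prompt` asks about the 00:30 to 01:15 segment specifically, and `model` is "gemini-1.5-pro" because the caller asked for detailed analysis.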