visual-analysis

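Analyze images or videos using Gemini's multimodal capabilities. Use when the user asks about image content, needs OCR, object detection, scene understanding, or video analysis.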

SKILL.md
--- frontmatter
name: visual-analysis
description: Analyze images or videos using Gemini's multimodal capabilities. Use when the user asks about image content, needs OCR, object detection, scene understanding, or video analysis.
allowed-tools: gemini-api, Read, Glob

Visual Analysis with Gemini

When to Use This Skill

Automatically invoke this skill when:

  • User asks to analyze an image or video file
  • User requests OCR or text extraction from images
  • User wants to understand image content, objects, or scenes
  • User needs detailed visual descriptions
  • User asks about what's in a screenshot or photo
  • User requests video summarization or event detection

Examples That Trigger This Skill

  • "What's in this image?"
  • "Analyze screenshot.png"
  • "Extract text from this receipt"
  • "Describe what you see in photo.jpg"
  • "What objects are in this image?"
  • "Summarize what happens in video.mp4"
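
The trigger phrases above either name a media file or mention a visual task. As a rough illustration only, the sketch below flags such requests with a hypothetical helper, `looksLikeVisualRequest`. This helper is not part of the skill or of gemini-api, and the extension and keyword lists are assumptions chosen to cover the examples listed here. In practice the agent decides from the skill description, not from a pattern match.

```javascript
// Hypothetical helper, not part of the skill. Both lists are assumptions.
const MEDIA_EXTENSIONS = /\.(png|jpe?g|gif|webp|mp4|mov|webm)\b/i;
const VISUAL_KEYWORDS = /\b(image|photo|screenshot|video|receipt|ocr)\b/i;

// Returns true when the message names a media file or mentions a visual task.
function looksLikeVisualRequest(message) {
  return MEDIA_EXTENSIONS.test(message) || VISUAL_KEYWORDS.test(message);
}

console.log(looksLikeVisualRequest("Analyze screenshot.png"));
console.log(looksLikeVisualRequest("Refactor this function"));
```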

How to Use

  1. Identify the file: Get the file path from the user's request
  2. Verify file exists: Use the Read tool to confirm the file exists and is accessible
  3. Call Gemini: Use the analyze_visual tool from the gemini-api MCP server
    • Pass the file path
    • Include the user's specific question as the prompt (or a general analysis prompt if none was given)
    • Choose a model: gemini-1.5-flash for speed, gemini-1.5-pro for quality
  4. Present results: Return Gemini's analysis to the user
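
The four steps can be sketched as a single function. This is a minimal sketch, not the skill's actual implementation: `fileExists` and `callTool` are injected stand-ins (assumptions) for the Read tool check and for whatever MCP client performs the call, and the fallback prompt text is invented for illustration. Only the tool name `analyze_visual` and the three parameter names come from this document.

```javascript
// Sketch of the workflow. `fileExists` and `callTool` are hypothetical
// injected functions; the fallback prompt wording is an assumption.
async function analyzeVisual({ filePath, question, fileExists, callTool }) {
  // Step 2: verify the file is accessible before calling Gemini.
  if (!(await fileExists(filePath))) {
    throw new Error(`File not accessible: ${filePath}`);
  }
  // Step 3: pass the path, the user's question (or a general prompt), and a model.
  const args = {
    file_path: filePath,
    prompt: question || "Describe this file in detail.",
    model: "gemini-1.5-flash",
  };
  // Step 4: hand Gemini's analysis back to the caller.
  return callTool("analyze_visual", args);
}

// Usage with stubs in place of the real tools.
analyzeVisual({
  filePath: "/tmp/photo.jpg",
  question: "",
  fileExists: async () => true,
  callTool: async (name, args) => `${name} called with prompt: ${args.prompt}`,
}).then(console.log);
```

Injecting the two functions keeps the sketch independent of any particular MCP client library.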

Tool Parameters

javascript
{
  "file_path": "/absolute/path/to/image.jpg",
  "prompt": "What objects are visible in this image?",
  "model": "gemini-1.5-flash"  // or "gemini-1.5-pro"
}
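
Before sending these parameters, it can help to check them. The sketch below is one possible check, under stated assumptions: the tool accepts only the two model names mentioned in this document, and "absolute" means a POSIX-style path starting with `/` (a Windows path would need a different test). `validateParams` is a hypothetical name. Note also that the `//` comment in the example above is illustrative; strict JSON does not allow comments.

```javascript
// Assumption: only the two models named in this document are accepted.
const ALLOWED_MODELS = ["gemini-1.5-flash", "gemini-1.5-pro"];

// Hypothetical validator: returns a list of problems, empty when the params look usable.
function validateParams(params) {
  const errors = [];
  // Assumption: absolute means a POSIX path beginning with "/".
  if (typeof params.file_path !== "string" || !params.file_path.startsWith("/")) {
    errors.push("file_path must be an absolute path");
  }
  if (typeof params.prompt !== "string" || params.prompt.trim() === "") {
    errors.push("prompt must be a non-empty string");
  }
  if (!ALLOWED_MODELS.includes(params.model)) {
    errors.push(`model must be one of: ${ALLOWED_MODELS.join(", ")}`);
  }
  return errors;
}

console.log(validateParams({
  file_path: "/absolute/path/to/image.jpg",
  prompt: "What objects are visible in this image?",
  model: "gemini-1.5-flash",
}));
```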

Capabilities

  • Image Analysis: Detailed object detection, scene understanding, composition analysis
  • OCR: Extract and read text from images (signs, documents, screenshots)
  • Video Analysis: Summarize events, detect actions, identify changes over time
  • Spatial Reasoning: Understand object relationships and layout
  • Multi-frame Processing: Analyze video clips frame by frame

Best Practices

  • For quick analysis, use gemini-1.5-flash
  • For detailed or complex images, use gemini-1.5-pro
  • Put specific questions in the prompt to get targeted analysis
  • For videos, state the timeframe of interest in the prompt when only part of the clip matters
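
These practices can be folded into two small helpers. This is a sketch under stated assumptions: `chooseModel` and `buildPrompt` are hypothetical names, the `detailed` flag is an invented way to signal a complex image, and the timeframe wording is illustrative, not a format Gemini requires.

```javascript
// Hypothetical helper: pick the faster model unless detail is requested.
function chooseModel({ detailed = false } = {}) {
  return detailed ? "gemini-1.5-pro" : "gemini-1.5-flash";
}

// Hypothetical helper: use the user's question if present, and append the
// timeframe of interest for videos. The phrasing is an assumption.
function buildPrompt(question, timeframe) {
  const base = question && question.trim() ? question.trim() : "Describe the content in detail.";
  return timeframe
    ? `${base} Focus on the segment from ${timeframe.start} to ${timeframe.end}.`
    : base;
}

console.log(chooseModel({ detailed: true }));
console.log(buildPrompt("Summarize what happens in video.mp4", { start: "00:30", end: "01:15" }));
```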