ai-multimodal

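Process and generate multimedia content with the Google Gemini API for better vision capabilities. Capabilities include: analyzing audio files (transcription with timestamps, summarization, speech understanding, and music and sound analysis up to 9.5 hours); understanding images (better image analysis than Claude models, with captioning, reasoning, object detection, design element extraction, OCR, visual Q&A, segmentation, and multi-image handling); processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours); extracting information from documents (PDF tables, forms, charts, diagrams, and multi-page content); generating images (text-to-image with Imagen 4, plus editing, composition, and refinement); and generating videos (text-to-video with Veo 3, producing 8-second clips with native audio). Use this tool when you need to work with audio or video files, analyze images or screenshots (instead of Claude's default vision capabilities, falling back to them only if needed), process PDF documents, extract structured data from media, create images or videos from text prompts, or implement multimodal AI features. It supports Gemini 3/2.5, Imagen 4, and Veo 3 models, with context windows of up to 2M tokens.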

SKILL.md
--- frontmatter
name: ai-multimodal
description: Process and generate multimedia content using Google Gemini API for better vision capabilities. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, handle multiple images), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image with Imagen 4, editing, composition, refinement), generate videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots (instead of default vision capabilities of Claude, only fallback to Claude's vision capabilities if needed), processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
license: MIT
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit

AI Multimodal

Process audio, images, videos, and documents, and generate images and videos, using Google Gemini's multimodal API.

Setup

bash
export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
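
The setup above can be sanity-checked before running any script. The following is a minimal sketch, not the logic of check_setup.py: the function names are hypothetical, and the import names (google.genai, dotenv, PIL) are assumed to correspond to the three pip packages listed.

```python
import importlib.util
import os

# Assumed mapping from pip package name to importable module name.
REQUIRED_MODULES = {
    "google-genai": "google.genai",
    "python-dotenv": "dotenv",
    "pillow": "PIL",
}


def module_available(name):
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "google") is missing or malformed.
        return False


def missing_requirements(env=None, is_available=module_available):
    """List human-readable problems with the environment; empty means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    for package, module in REQUIRED_MODULES.items():
        if not is_available(module):
            problems.append(f"missing package: {package}")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

Unlike the bundled checker, this sketch makes no live API call; it only confirms that the key is present and the packages are importable.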

Quick Start

Verify setup: python scripts/check_setup.py

Analyze media: python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

  • TIP: When asked to analyze an image, check whether the gemini command is available. If it is, use "<prompt to analyze image>" | gemini -y -m gemini-2.5-flash. If it is not, use python scripts/gemini_batch_process.py --files <file> --task analyze.

Generate content: python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"
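
The fallback described in the tip could be sketched as follows. This is an illustration under assumptions: build_analyze_command is a hypothetical helper, and the prompt is assumed to name the image file itself when it is piped to the gemini command.

```python
import shutil


def build_analyze_command(image_path, prompt, gemini_on_path=None):
    """Return (argv, stdin_text) for analyzing one image.

    Prefers the gemini CLI when it is on PATH, otherwise falls back to the
    batch script. Pass gemini_on_path explicitly to skip the PATH lookup.
    """
    if gemini_on_path is None:
        gemini_on_path = shutil.which("gemini") is not None
    if gemini_on_path:
        # The prompt is piped on stdin, mirroring `"<prompt>" | gemini ...`.
        return ["gemini", "-y", "-m", "gemini-2.5-flash"], prompt
    argv = [
        "python", "scripts/gemini_batch_process.py",
        "--files", image_path,
        "--task", "analyze",
    ]
    return argv, None
```

The returned pair can be handed to subprocess.run(argv, input=stdin_text, text=True); keeping command construction separate from execution makes the branch easy to check without either tool installed.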

Models

  • Image generation: imagen-4.0-generate-001 (standard), imagen-4.0-ultra-generate-001 (quality), imagen-4.0-fast-generate-001 (speed)
  • Video generation: veo-3.1-generate-preview (8s clips with audio)
  • Analysis: gemini-2.5-flash (recommended), gemini-2.5-pro (advanced)
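
One way to picture per-task model selection is a small lookup table. The mapping below is an assumption built from the model names listed above; it is not taken from gemini_batch_process.py, whose actual defaults may differ, and pick_model is a hypothetical name.

```python
# Assumed defaults: the recommended or standard model for each task.
DEFAULT_MODELS = {
    "analyze": "gemini-2.5-flash",
    "transcribe": "gemini-2.5-flash",
    "extract": "gemini-2.5-flash",
    "generate": "imagen-4.0-generate-001",
    "generate-video": "veo-3.1-generate-preview",
}


def pick_model(task, override=None):
    """Return the explicit override if given, else the assumed task default."""
    if override:
        return override
    try:
        return DEFAULT_MODELS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

For example, pick_model("generate", override="imagen-4.0-fast-generate-001") selects the speed-oriented Imagen variant while leaving the other defaults untouched.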

Scripts

  • gemini_batch_process.py: CLI orchestrator for transcribe|analyze|extract|generate|generate-video that auto-resolves API keys, picks sensible default models per task, chooses between sending files inline and uploading them through the File API, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 + Veo workflows.
  • media_optimizer.py: ffmpeg/Pillow-based preflight tool that compresses/resizes/converts audio, image, and video inputs, enforces target sizes/bitrates, splits long clips into hour chunks, and batch-processes directories so media stays within Gemini limits.
  • document_converter.py: Gemini-powered converter that uploads PDFs/images/Office docs, applies a markdown-preserving prompt, batches multiple files, auto-names outputs under docs/assets, and exposes CLI flags for model, prompt, auto-file naming, and verbose logging.
  • check_setup.py: Interactive readiness checker that verifies directory layout, centralized env resolver, required Python deps, and GEMINI_API_KEY availability/format, then performs a live Gemini API call and prints remediation instructions if anything fails.

Use --help for options.
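
To illustrate what splitting long clips into hour chunks means in practice, here is a minimal sketch of the boundary arithmetic only. It is not the code in media_optimizer.py; hour_chunks and its parameters are hypothetical, and the actual cutting would still be done by a tool such as ffmpeg.

```python
def hour_chunks(duration_seconds, chunk_seconds=3600):
    """Return (start, end) second offsets covering the clip in order.

    The final chunk is shorter when the duration is not a multiple of
    chunk_seconds; a zero-length clip yields no chunks.
    """
    if duration_seconds < 0 or chunk_seconds <= 0:
        raise ValueError("duration must be >= 0 and chunk size must be > 0")
    chunks = []
    start = 0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        chunks.append((start, end))
        start = end
    return chunks
```

A 2.5-hour recording (9000 seconds) yields three chunks, the last covering the final 30 minutes.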

References

Load for detailed guidance:

| Topic | File | Description |
| --- | --- | --- |
| Audio | references/audio-processing.md | Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes. |
| Images | references/vision-understanding.md | Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases. |
| Image Gen | references/image-generation.md | Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios. |
| Video | references/video-analysis.md | Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns. |
| Video Gen | references/video-generation.md | Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates. |

Limits

Formats: Audio (WAV/MP3/AAC, 9.5h), Images (PNG/JPEG/WEBP, 3.6k), Video (MP4/MOV, 6h), PDF (1k pages)

Size: 20MB inline, 2GB File API
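
The size limits suggest a simple routing rule: send small files inline and upload larger ones through the File API. The sketch below assumes the limits are measured in binary megabytes and gigabytes and apply to the file alone; choose_transport and transport_for_file are hypothetical names, not functions from the bundled scripts.

```python
import os

# Assumed interpretation of the stated limits (MB and GB as powers of 1024).
INLINE_LIMIT_BYTES = 20 * 1024 * 1024
FILE_API_LIMIT_BYTES = 2 * 1024 * 1024 * 1024


def choose_transport(size_bytes):
    """Classify a payload as 'inline', 'file_api', or 'too_large' by size."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    return "too_large"


def transport_for_file(path):
    """Apply choose_transport to a file on disk."""
    return choose_transport(os.path.getsize(path))
```

Because an inline request also carries the prompt and encoding overhead, a real implementation would probably leave some headroom below the inline limit; files classified as too_large are candidates for compression or splitting first.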

Resources