Multimodal AI

Vision and audio AI from the command line: image analysis, OCR, audio transcription, and text-to-speech.

Quick Start

bash
npx ai-multimodal vision ./image.png "Describe this"

What It Does

  • Analyze images with GPT-4 Vision
  • Extract text from images (OCR)
  • Transcribe audio with Whisper
  • Generate speech from text

Usage

bash
# Vision
npx ai-multimodal vision ./photo.jpg "What's in this?"

# OCR
npx ai-multimodal ocr ./screenshot.png

# Transcribe
npx ai-multimodal transcribe ./audio.mp3

# Text to speech
npx ai-multimodal tts "Hello" ./output.mp3
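
When there are several files to process, the subcommands above can be wrapped in a small script. The following is a minimal sketch, not part of the tool: the `run` and `process_file` helpers and the `DRY_RUN` variable are hypothetical names, and the extension-to-subcommand mapping is an assumption based only on the file types shown in the examples. By default it prints the commands it would run rather than executing them.

```bash
#!/usr/bin/env bash
# Sketch: pick an ai-multimodal subcommand from a file's extension.
# run, process_file, and DRY_RUN are illustrative names, not part of the CLI.

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Map a file to a subcommand. Only the extensions that appear in the
# usage examples are handled; anything else is skipped with an error.
process_file() {
  local file=$1
  case "${file##*.}" in
    png|jpg) run npx ai-multimodal ocr "$file" ;;
    mp3)     run npx ai-multimodal transcribe "$file" ;;
    *)       printf 'skip: %s\n' "$file" >&2; return 1 ;;
  esac
}

# Process every file passed on the command line, continuing past skips.
for file in "$@"; do
  process_file "$file" || true
done
```

Saved under a name of your choosing and called with a list of files, the script prints one command per recognized file. Setting `DRY_RUN=0` runs the commands instead of printing them.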

Part of the LXGIC Dev Toolkit

One of 110+ free developer tools from LXGIC Studios.

License

MIT. Free forever.