SmolVLM - Local Image Analysis
Analyze images locally using SmolVLM-2B, a compact vision-language model that runs on Apple Silicon through mlx-vlm.
Quick Usage
Describe an Image
bash
python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png
Ask a Question About an Image
bash
python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png "What text is visible?"
Specific Tasks
bash
# Extract text (OCR)
python ~/.claude/skills/smolvlm/scripts/view_image.py screenshot.png "Extract all text"

# UI analysis
python ~/.claude/skills/smolvlm/scripts/view_image.py ui.png "Describe the UI elements"

# Detailed description
python ~/.claude/skills/smolvlm/scripts/view_image.py photo.jpg --detailed
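To run the same prompt over several images, the script can be driven from Python. The following is a minimal sketch, assuming the script takes the image path and an optional prompt as positional arguments exactly as in the commands above, and that it prints its answer to stdout (not confirmed here). `build_command` and `describe_all` are hypothetical helper names, and `sys.executable` assumes mlx-vlm is installed for the interpreter running this code.

```python
import subprocess
import sys
from pathlib import Path

# Script location as given in the usage examples above.
SCRIPT = Path("~/.claude/skills/smolvlm/scripts/view_image.py").expanduser()


def build_command(image, prompt=None):
    """Return the argv list for one view_image.py call (hypothetical helper)."""
    cmd = [sys.executable, str(SCRIPT), str(image)]
    if prompt:
        cmd.append(prompt)
    return cmd


def describe_all(images, prompt=None):
    """Run the script once per image and collect stdout keyed by image path.

    Assumption: the script writes its answer to stdout.
    """
    results = {}
    for image in images:
        proc = subprocess.run(
            build_command(image, prompt),
            capture_output=True,
            text=True,
            check=False,  # keep going if one image fails
        )
        results[str(image)] = proc.stdout.strip()
    return results
```

For example, `describe_all(Path("shots").glob("*.png"), "Extract all text")` would call the script once per PNG in a hypothetical `shots` directory.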
Effective Prompts
General Description
- "Describe this image" - Basic description
- "Describe this image in detail, including colors, composition, and any text" - Comprehensive
Text Extraction (OCR)
- "Extract all visible text from this image"
- "What text appears in this screenshot?"
- "Read the text in this document"
UI/Screenshot Analysis
- "Describe the user interface elements"
- "What buttons and controls are visible?"
- "Identify the application and its current state"
Visual Question Answering
- "How many [objects] are in this image?"
- "What color is the [object]?"
- "Is there a [object] in this image?"
Code/Technical
- "What programming language is shown?"
- "Describe what this code does"
- "Identify any errors in this code screenshot"
Model Details
| Spec | Value |
|---|---|
| Model | SmolVLM-2B-Instruct |
| Size | ~4GB |
| Peak Memory | 5.8GB |
| Speed | ~94 tok/s (M-series) |
| Supported Formats | PNG, JPG, JPEG, GIF, WebP |
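Before invoking the model, a file can be screened against the supported formats listed in the table. The sketch below checks only the file extension; whether the script itself validates by extension or by file content is not stated here, so treat this as a convenience pre-check. `is_supported` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the "Supported Formats" row above.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def is_supported(path):
    """Return True if the file extension matches a supported image format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```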
Requirements
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.10+
- mlx-vlm package: uv pip install mlx-vlm --system
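The requirements above can be checked from Python before the first run. This is a rough sketch using only the standard library; `check_requirements` is a hypothetical helper, and it assumes a native arm64 Python build (under Rosetta, `platform.machine()` reports `x86_64` even on Apple Silicon).

```python
import importlib.util
import platform
import sys


def check_requirements():
    """Return a dict mapping each requirement to True or False."""
    return {
        "macos": platform.system() == "Darwin",
        "apple_silicon": platform.machine() == "arm64",
        "python_3_10_plus": sys.version_info >= (3, 10),
        # The mlx-vlm package is imported under the module name mlx_vlm.
        "mlx_vlm_installed": importlib.util.find_spec("mlx_vlm") is not None,
    }
```

Printing the result shows which requirement, if any, is missing on the current machine.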
Troubleshooting
"Model not found": First run downloads the model (~4GB). Wait for completion.
Out of memory: Close other applications. Model needs ~6GB free RAM.
Slow first inference: Model loading takes 10-15s on first use; subsequent calls are faster.
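For the out-of-memory case, a quick check of installed RAM can rule out machines that are too small. The sketch below compares total physical memory with the 5.8GB peak from the Model Details table. It says nothing about how much memory is currently free, so closing other applications may still be necessary. `total_ram_gb` and `may_fit` are hypothetical helpers, and the `os.sysconf` names used are the POSIX ones available on macOS and Linux.

```python
import os

PEAK_MEMORY_GB = 5.8  # "Peak Memory" value from the Model Details table


def total_ram_gb():
    """Total physical RAM in GB, read through POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def may_fit(total_gb, peak_gb=PEAK_MEMORY_GB):
    """True if installed RAM exceeds the model's peak usage.

    A necessary condition only: other processes also need memory.
    """
    return total_gb > peak_gb
```

Calling `may_fit(total_ram_gb())` gives a first indication before the ~4GB model download is started.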