Video to Markdown Conversion

Convert video files to structured Markdown documents with timestamps using speech recognition.

When to Use

•User wants to transcribe a video file
•User needs to convert video content to text documentation
•User wants to create meeting notes from recorded video
•User asks to extract speech/dialogue from video

Workflow

code

Video (mp4/mkv/webm/avi)
    ↓ [ffmpeg - audio extraction]
Audio (16kHz mono WAV)
    ↓ [Whisper - speech recognition]
Transcription with timestamps
    ↓ [formatting]
Markdown document
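
The first two stages of the pipeline above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual main.py: the function names build_ffmpeg_command, extract_audio, and transcribe are hypothetical, the ffmpeg flags are one common way to produce 16 kHz mono WAV, and whisper is imported lazily so the command builder works without the package installed.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_ffmpeg_command(video: str, wav: str) -> list:
    """Build an ffmpeg command that extracts 16 kHz mono 16-bit PCM audio."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video,          # input video
        "-vn",                # drop the video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # uncompressed 16-bit PCM (WAV)
        wav,
    ]


def extract_audio(video: str) -> str:
    """Run ffmpeg and return the path of the extracted WAV file."""
    wav = str(Path(video).with_suffix(".wav"))
    subprocess.run(build_ffmpeg_command(video, wav), check=True)
    return wav


def transcribe(wav: str, model_name: str = "small",
               language: Optional[str] = None) -> list:
    """Transcribe with openai-whisper; each segment has 'start', 'end', 'text'."""
    import whisper  # imported here so the rest of the module works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(wav, language=language)
    return result["segments"]
```

Keeping the command builder separate from the subprocess call makes the ffmpeg arguments easy to inspect or log before anything is executed.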

Prerequisites

Tip: Environment Setup

•uv: See the uv installation guide for a quickstart

•ffmpeg - For audio extraction
•Python 3.12+ with uv package manager
•CUDA GPU (recommended) - For faster transcription
•openai-whisper package

Step-by-Step Instructions

1. Check Environment

bash

# Verify ffmpeg
ffmpeg -version

# Verify GPU (if available)
nvidia-smi --query-gpu=name,memory.total --format=csv
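
The same checks can be scripted, for example to fail early before a long transcription. A minimal sketch, assuming it is enough to confirm that the executables are on PATH; check_environment is a hypothetical helper name, and a missing nvidia-smi only means no NVIDIA GPU tooling was found, not that transcription cannot run.

```python
import shutil


def check_environment() -> dict:
    """Report whether the required (ffmpeg) and optional (nvidia-smi) tools are on PATH."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


status = check_environment()
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'missing'}")
```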

2. Setup Project

bash

# Initialize uv project
uv init --python 3.12

# Install Whisper and PyTorch
# For a CUDA build of torch, first add the pytorch-cu126 index to pyproject.toml
uv add openai-whisper torch

See pyproject.toml template for CUDA configuration.

3. Run Conversion

bash

uv run python main.py "video.mp4" -l zh -m large-v3

4. Output Format

The generated Markdown includes:

•Document header with metadata (generation time, duration, language)
•Transcribed content with timestamps [HH:MM:SS → HH:MM:SS]
•Timeline table appendix
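
The formatting step could look roughly like the sketch below. It assumes segments are dictionaries with start, end, and text keys (the shape Whisper returns); format_timestamp and segments_to_markdown are hypothetical names, and the exact header and table layout of the real output may differ. Generation time is left out of the header here to keep the function deterministic.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_markdown(segments: list, title: str, language: str) -> str:
    """Render a header, timestamped paragraphs, and a timeline table."""
    duration = segments[-1]["end"] if segments else 0.0
    lines = [
        f"# {title}",
        "",
        f"- Duration: {format_timestamp(duration)}",
        f"- Language: {language}",
        "",
        "## Transcript",
        "",
    ]
    for seg in segments:
        span = f"[{format_timestamp(seg['start'])} → {format_timestamp(seg['end'])}]"
        lines += [f"**{span}** {seg['text'].strip()}", ""]
    # Appendix: one table row per segment
    lines += ["## Timeline", "", "| Start | End | Text |", "| --- | --- | --- |"]
    for seg in segments:
        lines.append(
            f"| {format_timestamp(seg['start'])} | {format_timestamp(seg['end'])} "
            f"| {seg['text'].strip()} |"
        )
    return "\n".join(lines) + "\n"
```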

Model Selection Guide

Model	VRAM	Speed	Accuracy	Recommended For
tiny	~1GB	★★★★★	★	Quick previews
base	~1GB	★★★★	★★	Draft transcripts
small	~2GB	★★★	★★★	General use
medium	~5GB	★★	★★★★	Good quality
large-v3	~10GB	★	★★★★★	Best accuracy
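
If the model should be chosen automatically, the table can be turned into a small lookup. A minimal sketch: pick_model is a hypothetical helper, the VRAM figures are the approximate values from the table, and a real GPU needs some headroom beyond them.

```python
# Approximate VRAM needs in GB, copied from the table above.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def pick_model(vram_gb: float) -> str:
    """Return the most accurate model whose VRAM estimate fits; else 'tiny'."""
    choice = "tiny"
    # Dict order runs from smallest to largest model, so the last fit wins.
    for name, need in MODEL_VRAM_GB.items():
        if need <= vram_gb:
            choice = name
    return choice
```

For example, pick_model(8) selects medium, because large-v3 needs about 10 GB.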

Language Codes

Common codes for the -l parameter:

•zh - Chinese
•en - English
•ja - Japanese
•ko - Korean
•auto - Auto-detect (default)
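
Whisper's Python API has no "auto" language code; its transcribe call detects the language when the language argument is None. A script that accepts auto therefore needs a small translation step, sketched here with the hypothetical helper resolve_language (how main.py actually handles this is an assumption).

```python
from typing import Optional


def resolve_language(code: str) -> Optional[str]:
    """Map the CLI value to Whisper's `language` argument (None = auto-detect)."""
    code = code.strip().lower()
    return None if code == "auto" else code
```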

Troubleshooting

CUDA Not Available

If PyTorch shows CUDA available: False:

•Check CUDA installation: echo $env:CUDA_PATH (PowerShell syntax)
•Reinstall torch with CUDA index in pyproject.toml
•Delete uv.lock and run uv sync
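
To see whether the problem is a CPU-only torch build or a driver issue, a short diagnostic can help. A minimal sketch, assuming it is run inside the project environment (for example with uv run python); cuda_report is a hypothetical helper name.

```python
def cuda_report() -> dict:
    """Summarize whether the installed torch build can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    return {
        "torch_installed": True,
        "torch_version": str(torch.__version__),
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": bool(torch.cuda.is_available()),
    }


print(cuda_report())
```

If built_with_cuda is None, a CPU-only wheel was resolved, which points at the index configuration in pyproject.toml rather than at the GPU driver.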

Triton Warning on Windows

code

UserWarning: Failed to launch Triton kernels...

This is expected on Windows, because Triton officially supports only Linux. Whisper falls back to a slower implementation of the affected step, so the warning does not affect transcription quality.
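
If the warning clutters the output, it can be filtered with the standard warnings module before transcription starts. A minimal sketch: silence_triton_warning is a hypothetical helper, and the pattern matches only the beginning of the message shown above, so other warnings are left untouched.

```python
import warnings


def silence_triton_warning() -> None:
    """Ignore the UserWarning whose message starts with the Triton failure text."""
    warnings.filterwarnings(
        "ignore",
        message=r"Failed to launch Triton kernels",  # regex, matched at message start
        category=UserWarning,
    )
```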

Model Download Fails

If the model downloads but its SHA256 checksum does not match:

•Delete the corrupted model file: ~/.cache/whisper/<model>.pt
•Retry on a stable network connection
•Consider testing with a smaller model first
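
Deleting the cached file can also be scripted. A minimal sketch, assuming the default cache location shown above; remove_cached_model is a hypothetical helper, and the cache_dir parameter exists so a different download directory can be passed in.

```python
from pathlib import Path
from typing import Optional


def remove_cached_model(model: str, cache_dir: Optional[Path] = None) -> bool:
    """Delete <cache_dir>/<model>.pt; return True if a file was removed."""
    cache_dir = cache_dir or Path.home() / ".cache" / "whisper"
    target = cache_dir / f"{model}.pt"
    if target.is_file():
        target.unlink()
        return True
    return False
```

After removal, the next run downloads the model again.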

Example Output

See example output for sample Markdown structure.

CLI Reference

code

usage: main.py [-h] [-o OUTPUT] [-m MODEL] [-l LANGUAGE] video

positional arguments:
  video                 Input video file path

options:
  -o, --output          Output Markdown file path (default: same as video)
  -m, --model           Whisper model (tiny/base/small/medium/large-v3)
  -l, --language        Language code (zh/en/ja/ko or auto)
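
An argparse setup that accepts the interface above might look like the following. This is a sketch, not the actual main.py: the default model ("small") is an assumption, and "same as video" is interpreted here as the video path with a .md suffix.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the positional video argument and the -o/-m/-l options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("video", help="Input video file path")
    parser.add_argument("-o", "--output",
                        help="Output Markdown file path (default: same as video)")
    parser.add_argument("-m", "--model", default="small",  # default is an assumption
                        help="Whisper model (tiny/base/small/medium/large-v3)")
    parser.add_argument("-l", "--language", default="auto",
                        help="Language code (zh/en/ja/ko or auto)")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and fill in the derived output path."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        # Assumed meaning of "same as video": same path with a .md suffix.
        args.output = str(Path(args.video).with_suffix(".md"))
    return args
```

With this sketch, parse_args(["video.mp4", "-l", "zh", "-m", "large-v3"]) yields language zh, model large-v3, and output video.md.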