Autocut Shorts
This is the main orchestration skill that combines all other skills to automatically create short-form content from long videos.
What It Does
This skill automates the entire workflow:
- •Download video from YouTube URL (if provided)
- •Transcribe audio using Whisper or Gemini API
- •Perform speaker diarization (pyannote or Gemini) - identifies who speaks when
- •Detect highlights using combined analysis:
  - •Transcript analysis (hooks, viral phrases)
  - •Speaker dynamics (debates, interactions, overlapping speech)
  - •Laughter detection (humorous moments)
  - •Sentiment analysis (emotional peaks)
  - •Scene detection (cut points)
- •Select best segments (15-60 seconds each)
- •Trim video to highlight segments
- •Resize to 9:16 portrait format (1080x1920)
- •Add burned-in subtitles with speaker labels
- •Export multiple clips ready for upload
When to Use
- •User wants to create TikTok clips from a YouTube video
- •Converting podcasts to short-form content
- •Finding viral moments in vlogs or tutorials
- •Repurposing gaming content for Shorts/Reels
- •Batch processing multiple videos
Available Scripts
scripts/autocut.py
Main autocut workflow script.
Usage:
python skills/autocut-shorts/scripts/autocut.py <video_or_url> [options]
Options:
- •--source: Source type (file, youtube); auto-detected
- •--num-clips: Number of clips to generate (default: 5)
- •--min-duration: Minimum clip duration in seconds (default: 15)
- •--max-duration: Maximum clip duration in seconds (default: 60)
- •--platform: Target platform (tiktok, shorts, reels, facebook); default: tiktok
- •--output-dir: Output directory (default: ./shorts/)
- •--transcription-model: Transcription model (auto, whisper, gemini); default: auto
- •--diarization-model: Speaker diarization (auto, pyannote, gemini, none); default: auto
- •--huggingface-token: HuggingFace token for pyannote (or use env var)
- •--focus-speaker: Extract clips only for a specific speaker (SPEAKER_00, etc.)
- •--gemini-api-key: Gemini API key (or use env var)
- •--skip-transcribe: Skip transcription if you already have a transcript
- •--skip-diarization: Skip speaker diarization
- •--skip-scenes: Skip scene detection
- •--skip-laughter: Skip laughter detection
- •--skip-sentiment: Skip sentiment analysis
- •--transcript-path: Use an existing transcript file
- •--style: Subtitle style (tiktok, shorts, reels); default: tiktok
Examples:
Basic autocut from file:
python skills/autocut-shorts/scripts/autocut.py video.mp4
Autocut from YouTube URL:
python skills/autocut-shorts/scripts/autocut.py "https://www.youtube.com/watch?v=VIDEO_ID"
Generate 10 clips for Instagram Reels:
python skills/autocut-shorts/scripts/autocut.py video.mp4 --num-clips 10 --platform reels --style reels
Use Gemini for transcription:
python skills/autocut-shorts/scripts/autocut.py video.mp4 --transcription-model gemini
Custom duration range:
python skills/autocut-shorts/scripts/autocut.py video.mp4 --min-duration 20 --max-duration 45
Use existing transcript:
python skills/autocut-shorts/scripts/autocut.py video.mp4 --transcript-path video.srt --skip-transcribe
scripts/quick_cut.py
Quick cut without full analysis (faster).
Usage:
python skills/autocut-shorts/scripts/quick_cut.py <video_path> [options]
Options:
- •--timestamps: JSON file with timestamps to cut
- •--output-dir: Output directory
- •--platform: Target platform
Example:
python skills/autocut-shorts/scripts/quick_cut.py video.mp4 --timestamps cuts.json
Workflow Steps
Step 1: Download (Optional)
If URL provided:
- •Downloads from YouTube using yt-dlp
- •Best quality MP4
- •Saves to temp directory
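As a rough sketch of this step, the snippet below shows one way the source type could be auto-detected and a yt-dlp command assembled. The function names, the host list, and the output template are assumptions made for illustration; they are not taken from autocut.py.

```python
from urllib.parse import urlparse

# Hypothetical host list; the real script may recognise more URL forms.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def detect_source(video_or_url: str) -> str:
    """Return "youtube" for YouTube URLs, otherwise "file"."""
    parsed = urlparse(video_or_url)
    if parsed.scheme in ("http", "https") and parsed.hostname in YOUTUBE_HOSTS:
        return "youtube"
    return "file"


def build_download_command(url: str, temp_dir: str) -> list[str]:
    """Build a yt-dlp command that prefers MP4 streams and saves into temp_dir."""
    return [
        "yt-dlp",
        # Best MP4 video + M4A audio, else best single MP4 file, else best available.
        "-f", "bv*[ext=mp4]+ba[ext=m4a]/b[ext=mp4]/b",
        # Output template: <temp_dir>/<video id>.<extension>
        "-o", f"{temp_dir}/%(id)s.%(ext)s",
        url,
    ]
```

The command is returned as a list so it can be passed to subprocess.run without shell quoting issues.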
Step 2: Transcribe
Extracts audio and transcribes:
- •Auto mode: Chooses based on requirements
- •Whisper: Local processing, good for privacy
- •Gemini: Cloud processing, better quality + features
Step 3: Detect Highlights
Runs detection modules:
- •Transcript analysis: Viral phrases, hooks, questions
- •Laughter detection: Funny moments (if enabled)
- •Sentiment analysis: Emotional peaks (if enabled)
- •Scene detection: Visual cut points (if enabled)
Step 4: Score and Rank
Combines all signals:
Virality Score = 35% Transcript (hooks, viral content) + 25% Laughter (humor) + 25% Sentiment (emotion) + 15% Scenes (visual transitions)
Ranks all segments and selects top N.
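A minimal sketch of the scoring and ranking step, using the weights stated above. The Segment class and function names are hypothetical, and treating a skipped detector as a score of 0 is an assumption; the actual script may renormalise the weights instead.

```python
from dataclasses import dataclass, field

# Weights from the Virality Score formula above.
WEIGHTS = {"transcript": 0.35, "laughter": 0.25, "sentiment": 0.25, "scenes": 0.15}


@dataclass
class Segment:
    """Hypothetical container for one candidate highlight."""
    start: float
    end: float
    scores: dict[str, float] = field(default_factory=dict)  # per-signal scores in [0, 1]


def virality_score(segment: Segment) -> float:
    # Assumption: a signal that was skipped or not detected contributes 0.
    return sum(weight * segment.scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def select_top(segments: list[Segment], num_clips: int = 5) -> list[Segment]:
    """Rank candidates by virality score, highest first, and keep the top N."""
    return sorted(segments, key=virality_score, reverse=True)[:num_clips]
```

For example, a segment with transcript 0.9 and laughter 0.2 scores 0.365, while one with sentiment 1.0 and scenes 1.0 scores 0.40 and ranks above it.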
Step 5: Trim
For each highlight:
- •Extends 2-3 seconds before/after for context
- •Trims using FFmpeg (stream copy for speed)
- •Validates duration constraints
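The trimming step could look roughly like the sketch below. The helper names and the 2-second default padding are assumptions; only standard FFmpeg options (-ss, -i, -t, -c copy, -y) are used.

```python
from typing import Optional


def padded_window(start: float, end: float, video_duration: float,
                  padding: float = 2.0, min_duration: float = 15.0,
                  max_duration: float = 60.0) -> Optional[tuple[float, float]]:
    """Extend a highlight by `padding` seconds on each side, clamped to the video.

    Returns None when the padded clip violates the duration constraints.
    """
    clip_start = max(0.0, start - padding)
    clip_end = min(video_duration, end + padding)
    duration = clip_end - clip_start
    if not (min_duration <= duration <= max_duration):
        return None
    return clip_start, clip_end


def build_trim_command(source: str, output: str,
                       clip_start: float, clip_end: float) -> list[str]:
    """Build an FFmpeg stream-copy command for the given window."""
    return [
        "ffmpeg", "-y",                        # overwrite output if it exists
        "-ss", f"{clip_start:.3f}",            # seek before decoding the input
        "-i", source,
        "-t", f"{clip_end - clip_start:.3f}",  # clip length in seconds
        "-c", "copy",                          # copy streams, no re-encode
        output,
    ]
```

With stream copy, FFmpeg can only start a clip on a keyframe, so the actual cut may begin slightly earlier than requested. The context padding helps absorb that.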
Step 6: Resize to Portrait
Converts to 9:16:
- •Smart crop (focus on subjects)
- •1080x1920 resolution
- •Maintains quality
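To illustrate the geometry of the 9:16 conversion, here is a sketch that builds an FFmpeg filter string for a plain center crop. This is a simpler stand-in for the subject-aware smart crop described above, not the skill's own method, and the function name is hypothetical.

```python
def center_crop_filter(src_width: int, src_height: int,
                       out_width: int = 1080, out_height: int = 1920) -> str:
    """Return an FFmpeg -vf string: center crop to the target ratio, then scale."""
    target_ratio = out_width / out_height  # 9:16 = 0.5625
    if src_width / src_height > target_ratio:
        # Source is wider than 9:16: keep full height, crop the sides.
        crop_h = src_height
        crop_w = int(src_height * target_ratio)
    else:
        # Source is narrower than (or equal to) 9:16: keep full width, crop top and bottom.
        crop_w = src_width
        crop_h = int(src_width / target_ratio)
    # Round down to even dimensions, which common codecs such as H.264 expect.
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2
    x = (src_width - crop_w) // 2
    y = (src_height - crop_h) // 2
    return f"crop={crop_w}:{crop_h}:{x}:{y},scale={out_width}:{out_height}"
```

For a 1920x1080 source this yields crop=606:1080:657:0,scale=1080:1920. A smart crop would replace the centered x and y offsets with ones derived from the detected subject position.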
Step 7: Add Subtitles
Burns in captions:
- •Platform-specific styling
- •White text with black outline
- •Bottom position
- •Readable size (24-28px)
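Before captions can be burned in, the transcript cues for a clip have to be re-timed relative to the clip start. The sketch below writes them as SRT with speaker labels; the cue tuple layout and function names are assumptions for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(max(0.0, seconds) * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[tuple[float, float, str, str]], clip_start: float) -> str:
    """Convert (start, end, speaker, text) cues to SRT, shifted so the clip starts at 0."""
    blocks = []
    for index, (start, end, speaker, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start - clip_start)} --> {srt_timestamp(end - clip_start)}\n"
            f"{speaker}: {text}\n"
        )
    return "\n".join(blocks)
```

The resulting file can then be handed to whatever renderer applies the platform styling (white text, black outline, bottom position).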
Step 8: Export
Saves final clips:
- •Named: {original}_short_{index}.mp4
- •Organized in output directory
- •JSON report with metadata
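The naming pattern above can be sketched as follows. The three-digit zero padding is inferred from the example directory listing in this document, and the function name is hypothetical.

```python
from pathlib import Path


def clip_output_path(original: str, index: int, output_dir: str = "./shorts/") -> Path:
    """Build <output_dir>/<original stem>_short_<index>.mp4 with a zero-padded index."""
    stem = Path(original).stem  # "talks/video.mp4" -> "video"
    return Path(output_dir) / f"{stem}_short_{index:03d}.mp4"
```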
Output Format
Directory Structure
shorts/
  video_short_001.mp4
  video_short_002.mp4
  video_short_003.mp4
  report.json
JSON Report
{
"success": true,
"source": {
"type": "youtube",
"url": "https://youtube.com/watch?v=...",
"title": "Video Title",
"duration": 1200.5
},
"processing": {
"transcription_model": "gemini-flash-lite-latest",
"detection_methods": ["transcript", "laughter", "sentiment", "scenes"],
"platform": "tiktok"
},
"results": {
"total_clips": 5,
"clips": [
{
"rank": 1,
"filename": "video_short_001.mp4",
"start_time": 45.2,
"end_time": 72.5,
"duration": 27.3,
"virality_score": 0.92,
"text": "This is the key moment...",
"output_path": "shorts/video_short_001.mp4"
}
],
"total_duration": 135.5,
"avg_virality_score": 0.78
},
"performance": {
"total_time": 180.5,
"transcription_time": 45.2,
"analysis_time": 67.3,
"processing_time": 68.0
}
}
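A downstream script might consume the report along these lines. The field names come from the sample above; the helper names and the 0.6 threshold in the usage note are illustrative assumptions.

```python
import json


def load_report(path: str) -> dict:
    """Read report.json from the output directory."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def clips_above(report: dict, min_score: float) -> list[dict]:
    """Return clips whose virality_score is at least min_score, in report order."""
    if not report.get("success"):
        return []
    clips = report.get("results", {}).get("clips", [])
    return [clip for clip in clips if clip["virality_score"] >= min_score]
```

For instance, clips_above(load_report("shorts/report.json"), 0.6) would keep only clips scoring 0.6 or higher.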
Platform Presets
TikTok
- •Resolution: 1080x1920
- •Duration: 15-60 seconds
- •Subtitle style: TikTok
- •Output naming: _tiktok_{index}.mp4
YouTube Shorts
- •Resolution: 1080x1920
- •Duration: 15-60 seconds
- •Subtitle style: Shorts
- •Output naming: _shorts_{index}.mp4
Instagram Reels
- •Resolution: 1080x1920
- •Duration: 15-90 seconds
- •Subtitle style: Reels
- •Output naming: _reels_{index}.mp4
Facebook Reels
- •Resolution: 1080x1920
- •Duration: 15-90 seconds
- •Subtitle style: Default
- •Output naming: _facebook_{index}.mp4
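The presets above map naturally onto a lookup table. The sketch below is one possible representation; the dictionary layout and the helper function are assumptions, while the values are copied from the lists above.

```python
# Values copied from the preset lists; the structure itself is hypothetical.
PLATFORM_PRESETS = {
    "tiktok":   {"resolution": (1080, 1920), "duration": (15, 60), "style": "tiktok"},
    "shorts":   {"resolution": (1080, 1920), "duration": (15, 60), "style": "shorts"},
    "reels":    {"resolution": (1080, 1920), "duration": (15, 90), "style": "reels"},
    "facebook": {"resolution": (1080, 1920), "duration": (15, 90), "style": "default"},
}


def fits_platform(duration: float, platform: str) -> bool:
    """Check whether a clip duration (seconds) is within the platform's range."""
    low, high = PLATFORM_PRESETS[platform]["duration"]
    return low <= duration <= high
```

A 75-second clip, for example, fits the reels and facebook presets but not tiktok or shorts.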
Viral Detection Algorithm
High-Value Signals
Transcript (35% weight):
- •Viral phrases ("you won't believe", "this changes everything")
- •Hooks ("let me tell you", "here's the secret")
- •Questions and answers
- •Story beats
Laughter (25% weight):
- •Explicit laughter markers
- •High-confidence laughter detection
- •Audience reactions
Sentiment (25% weight):
- •Positive emotions (excitement, joy)
- •Surprise moments
- •Negative emotions (controversy, drama)
- •Emotional intensity > 0.7
Scenes (15% weight):
- •Scene transitions
- •Visual changes
- •Topic shifts
Scoring
virality_score = (
transcript_score * 0.35 +
laughter_score * 0.25 +
sentiment_score * 0.25 +
scene_score * 0.15
)
- •Premium Clips (0.8-1.0): Must include
- •Excellent Clips (0.6-0.8): High priority
- •Good Clips (0.4-0.6): Consider including
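The score bands can be turned into a small classifier. Since the stated ranges share their boundary values, this sketch assumes a boundary score belongs to the higher tier; the function name and the label for scores under 0.4 are also assumptions.

```python
def clip_tier(score: float) -> str:
    """Map a virality score in [0, 1] to the tier names used above."""
    if score >= 0.8:
        return "premium"      # must include
    if score >= 0.6:
        return "excellent"    # high priority
    if score >= 0.4:
        return "good"         # consider including
    return "below threshold"  # not covered by the tiers above
```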
AI Agent Decision Guide
When to Use pyannote vs Gemini Diarization
Default Behavior (--diarization-model auto): The AI agent automatically selects based on context:
def choose_diarization_model(user_request: str, likely_multi_speaker: bool) -> str:
    """Map keywords in the user's request to a --diarization-model value."""
    request = user_request.lower()

    def mentions(*keywords: str) -> bool:
        return any(keyword in request for keyword in keywords)

    # An explicit opt-out is checked first so words like "podcast" cannot shadow it:
    if mentions("no diarization", "skip speakers"):
        return "none"  # User doesn't want speaker detection
    # Use pyannote when:
    if mentions("podcast", "interview"):
        return "pyannote"  # Multi-speaker, needs accuracy
    if mentions("accurate", "precise"):
        return "pyannote"  # User explicitly wants accuracy
    if mentions("panel", "debate"):
        return "pyannote"  # Complex multi-speaker scenarios
    if mentions("overlapping", "talk over"):
        return "pyannote"  # Overlapping speech detection
    if mentions("privacy", "offline"):
        return "pyannote"  # Local processing needed
    # Use Gemini when:
    if mentions("quick", "fast"):
        return "gemini"  # Speed priority
    if mentions("single speaker", "monologue"):
        return "gemini"  # Simple scenario
    # Default for ambiguous cases; likely_multi_speaker is the caller's estimate for the video:
    return "pyannote" if likely_multi_speaker else "gemini"
Decision Matrix:
| Scenario | Recommended | Reason |
|---|---|---|
| Podcast with 2-3 hosts | pyannote | High accuracy for multi-speaker |
| Interview (host + guest) | pyannote | Precise speaker separation |
| Panel discussion | pyannote | Handles 4+ speakers well |
| Single speaker vlog | gemini | Faster, good enough |
| Gaming commentary | gemini | Usually 1-2 speakers |
| Tutorial video | gemini | Single speaker, speed matters |
| Debate/competitive | pyannote | Overlapping speech detection |
| Privacy-sensitive | pyannote | Local processing |
Examples by Use Case:
# Podcast - use pyannote automatically
python skills/autocut-shorts/scripts/autocut.py podcast.mp4
# Interview - use pyannote for accuracy
python skills/autocut-shorts/scripts/autocut.py interview.mp4
# Vlog - use gemini (single speaker, faster)
python skills/autocut-shorts/scripts/autocut.py vlog.mp4
# Force pyannote explicitly
python skills/autocut-shorts/scripts/autocut.py video.mp4 --diarization-model pyannote
# Skip diarization for simple content
python skills/autocut-shorts/scripts/autocut.py tutorial.mp4 --diarization-model none
# Extract only host's segments
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 --focus-speaker SPEAKER_00
Smart Defaults
The agent automatically detects:
- •Content type (podcast, vlog, tutorial, gaming, etc.)
- •Likely speaker count based on audio patterns
- •User priority (speed vs accuracy vs privacy)
- •Available resources (GPU, internet, API keys)
Override any time:
Users can always override the automatic choice with the --diarization-model flag.
Integration
This skill uses all other skills:
- •youtube-downloader: Download from URL
- •video-transcriber: Transcribe audio
- •scene-detector: Find visual cut points
- •laughter-detector: Find funny moments
- •sentiment-analyzer: Find emotional peaks
- •highlight-scanner: Combine all signals
- •video-trimmer: Cut segments
- •portrait-resizer: Convert to 9:16
- •subtitle-overlay: Add captions
Common Use Cases
Podcast to Shorts
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 --num-clips 10 --platform shorts
Vlog Highlights
python skills/autocut-shorts/scripts/autocut.py vlog.mp4 --num-clips 5 --platform tiktok
YouTube to TikTok
python skills/autocut-shorts/scripts/autocut.py "https://youtube.com/watch?v=..." --platform tiktok
Tutorial Clips
python skills/autocut-shorts/scripts/autocut.py tutorial.mp4 --min-duration 30 --max-duration 60
Performance
Processing Time (approximate):
- •1-minute video: ~30-60 seconds
- •10-minute video: ~3-5 minutes
- •30-minute video: ~8-12 minutes
- •1-hour video: ~15-25 minutes
Breakdown:
- •Download: 5-30 seconds (depends on video)
- •Transcription: 20-60 seconds
- •Detection: 10-30 seconds per method
- •Trimming: 1-5 seconds per clip
- •Resizing: 5-10 seconds per clip
- •Subtitles: 5-10 seconds per clip
Error Handling
- •Download failure: Retries up to 3 times
- •Transcription failure: Falls back to alternative model
- •No highlights found: Returns error with suggestions
- •Processing failure: Reports which step failed
- •Partial success: Reports successful clips vs failed
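The retry behaviour for downloads could be sketched as a small wrapper like the one below. The helper name, the fixed delay, and the choice to retry on any exception are assumptions; the actual script may use backoff or catch narrower error types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3,
                 delay_seconds: float = 0.0) -> T:
    """Call `action` until it succeeds, giving up after `attempts` tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # assumption: every failure is retryable
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Chain the last failure so the report can show which step failed and why.
    raise RuntimeError(f"failed after {attempts} attempts") from last_error
```

A download step would be wrapped as with_retries(lambda: download(url)), where download stands for whatever function performs the transfer.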
Tips
- •Use Gemini transcription for best highlight detection
- •Request more clips than you need, then filter by virality score
- •15-30 second clips perform best on TikTok
- •30-60 second clips work well for Shorts/Reels
- •Keep 2-3 second buffer around highlights
- •Test different platforms for best engagement
- •For faster processing, run transcript-only analysis by passing --skip-scenes, --skip-laughter, and --skip-sentiment
- •Batch process multiple videos for efficiency
References
- •OpusClip: https://www.opus.pro/
- •Vizard.ai: https://vizard.ai/
- •TikTok specs: https://www.tiktok.com/business/en-US/solutions/tiktok-specs
- •YouTube Shorts specs: https://support.google.com/youtube/answer/10059066
- •Instagram Reels specs: https://help.instagram.com/609412256345459