
consolidate-transcripts

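Consolidate a channel's transcripts into a single file, sorted by date (newest first), up to 800K tokens. Use when preparing LLM context or analyzing transcripts in bulk.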

SKILL.md
--- frontmatter
name: consolidate-transcripts
description: Consolidate transcripts from a channel into a single file, sorted by date (newest first), up to 800K tokens. Use when preparing transcripts for LLM context or bulk analysis.

Consolidate Transcripts

Why? LLMs have context limits. This skill merges multiple transcripts into a single file and counts tokens as it goes, so you can feed as much of a channel's content as fits to Claude or GPT without exceeding those limits.

Quick Start

bash
python scripts/consolidate_transcripts.py <channel_name>

Output: ~/Documents/YTScriber/<channel_name>/<channel_name>-consolidated.md

[!NOTE] This feature is currently a standalone script. A ytscriber consolidate CLI command is planned for a future release.


Workflow

1. Identify the Channel

List available channels:

bash
ls ~/Documents/YTScriber/

2. Choose Token Limit

| Use Case | Recommended Limit | Flag |
|---|---|---|
| Claude (200K context) | 150000 | --limit 150000 |
| GPT-4 Turbo (128K) | 100000 | --limit 100000 |
| Full archive (Claude Pro) | 800000 | (default) |
| Quick sample | 50000 | --limit 50000 |

[!TIP] The default 800K limit leaves ~200K tokens for prompts and responses when using Claude's 1M context.
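
The limits above all follow the same arithmetic: context window minus a reserve for the prompt and the response. A minimal sketch of that calculation, assuming a hypothetical helper named `token_limit` (it is not part of the script):

```python
def token_limit(context_window: int, reserve: int) -> int:
    """Return the transcript budget left after reserving room for prompt and reply."""
    if reserve < 0 or reserve >= context_window:
        raise ValueError("reserve must be non-negative and smaller than the context window")
    return context_window - reserve


# The default case from the tip: a 1M window with about 200K held back.
print(token_limit(1_000_000, 200_000))  # 800000
# The 200K-context row in the table keeps 50K in reserve.
print(token_limit(200_000, 50_000))     # 150000
```

If your prompts or expected responses are long, increase the reserve and pass the smaller result to --limit.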

3. Run Consolidation

bash
python scripts/consolidate_transcripts.py <channel_name> [--limit TOKENS] [--verbose]

Examples:

bash
# Default (800K tokens)
python scripts/consolidate_transcripts.py library-of-minds

# Custom limit for GPT-4
python scripts/consolidate_transcripts.py aws-reinvent-2025 --limit 100000

# Verbose output showing all included files
python scripts/consolidate_transcripts.py dwarkesh-patel --verbose

4. Verify Output

Check the consolidated file was created:

bash
ls -la ~/Documents/YTScriber/<channel_name>/*-consolidated.md

Parameters

| Option | Description | Default |
|---|---|---|
| channel_name | Folder name in data directory | Required |
| --limit, -l | Maximum tokens to include | 800000 |
| --verbose, -v | Show detailed file list | False |
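
To make the effect of --limit concrete, here is a rough sketch of how a token budget could be applied to transcripts that are already sorted newest first. It is an illustration, not the script's actual code: the function names are hypothetical, and stopping at the first transcript that does not fit is an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def select_within_limit(
    transcripts: Iterable[Tuple[str, str]],
    limit: int,
    count_tokens: Callable[[str], int],
) -> Tuple[List[str], int]:
    """Pick (name, text) transcripts in order until the next one would exceed `limit`."""
    included: List[str] = []
    total = 0
    for name, text in transcripts:
        cost = count_tokens(text)
        if total + cost > limit:
            break  # assumption: stop at the first transcript that does not fit
        included.append(name)
        total += cost
    return included, total


def cl100k_counter() -> Callable[[str], int]:
    """Build a counter backed by tiktoken's cl100k_base encoding (needs `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the selection logic runs without it

    encoding = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats strings like "<|endoftext|>" as ordinary text
    return lambda text: len(encoding.encode(text, disallowed_special=()))


# Demo with a whitespace word counter so the example has no dependencies.
docs = [("new.md", "a b c"), ("mid.md", "d e"), ("old.md", "f g h i")]
print(select_within_limit(docs, 5, lambda t: len(t.split())))
# (['new.md', 'mid.md'], 5)
```

Passing the counter in as a parameter keeps the selection logic independent of any particular tokenizer; swap in `cl100k_counter()` to count with tiktoken.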

Output Format

The consolidated file includes:

  1. Header — Generation metadata, total transcripts, token/word counts
  2. Table of Contents — Dates, titles, tokens, words per transcript
  3. Transcripts — Full text with title, date, author, source URL
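
The three parts listed above could be assembled roughly as follows. This is only a sketch of the structure: the `Transcript` class, the `render` function, and the exact heading and line layout are all assumptions, and the real file may look different.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    title: str
    date: str    # e.g. "2025-01-31"; empty string if unknown
    author: str
    url: str
    text: str
    tokens: int


def render(channel: str, generated: str, items: List[Transcript]) -> str:
    """Build header, table of contents, and transcript bodies as one string."""
    total_tokens = sum(t.tokens for t in items)
    total_words = sum(len(t.text.split()) for t in items)
    # 1. Header: generation metadata and totals
    lines = [
        f"# {channel} (consolidated)",
        f"Generated: {generated}",
        f"Transcripts: {len(items)} | Tokens: {total_tokens} | Words: {total_words}",
        "",
        "## Table of Contents",
    ]
    # 2. Table of contents: one line per transcript
    for t in items:
        words = len(t.text.split())
        lines.append(f"- {t.date or 'undated'} | {t.title} | {t.tokens} tokens | {words} words")
    # 3. Transcripts: metadata followed by the full text
    for t in items:
        lines += [
            "",
            f"## {t.title}",
            f"Date: {t.date or 'undated'}",
            f"Author: {t.author}",
            f"Source: {t.url}",
            "",
            t.text,
        ]
    return "\n".join(lines) + "\n"
```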

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: tiktoken | tiktoken not installed | pip install tiktoken |
| No transcripts found | Empty transcripts folder | Run ytscriber download first |
| FileNotFoundError | Channel doesn't exist | Check ls ~/Documents/YTScriber/ for valid names |
| Output file is small | Few transcripts available | Use --verbose to see what was included |
| Token count seems wrong | Old tiktoken version | pip install --upgrade tiktoken |
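
Several of these problems can be caught before running the script. The sketch below shows one possible preflight check. The `preflight` function is hypothetical, and it treats any file under the channel folder as a transcript, which is an assumption about the folder layout.

```python
import importlib.util
from pathlib import Path
from typing import List, Sequence


def preflight(channel_dir: Path, required_modules: Sequence[str] = ("tiktoken",)) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []
    for module in required_modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing module: {module} (try: pip install {module})")
    if not channel_dir.is_dir():
        problems.append(f"channel folder not found: {channel_dir}")
    elif not any(p.is_file() for p in channel_dir.rglob("*")):
        # assumption: any file below the channel folder counts as a transcript
        problems.append(f"no transcripts found in {channel_dir}")
    return problems


for problem in preflight(Path.home() / "Documents" / "YTScriber" / "library-of-minds"):
    print(problem)
```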

Common Mistakes

  1. Wrong channel name — Use the folder name exactly as shown in ls ~/Documents/YTScriber/, not the YouTube channel name.
  2. Forgetting to download transcripts first — Consolidation requires transcripts to exist. Run ytscriber download first.
  3. Using too high a limit — If the consolidated file plus your prompt exceeds your LLM's context window, the model may reject the request or truncate the input. Use the limit guide above.
  4. Expecting real-time updates — Re-run consolidation after downloading new transcripts.
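
The first mistake is easy to guard against by checking the name against the folders that actually exist. A minimal sketch, assuming a hypothetical `resolve_channel` helper (the script itself may report a missing channel differently):

```python
import difflib
from pathlib import Path


def resolve_channel(root: Path, name: str) -> Path:
    """Return the channel folder, or raise FileNotFoundError with close-match suggestions."""
    channels = sorted(p.name for p in root.iterdir() if p.is_dir())
    if name in channels:
        return root / name
    # Suggest folder names that look similar to what was typed
    guesses = difflib.get_close_matches(name, channels, n=3, cutoff=0.5)
    hint = f" Did you mean: {', '.join(guesses)}?" if guesses else ""
    raise FileNotFoundError(f"No channel folder named {name!r} in {root}.{hint}")
```

For example, calling `resolve_channel(Path.home() / "Documents" / "YTScriber", "dwarkesh_patel")` when only a `dwarkesh-patel` folder exists would raise an error that suggests the correct folder name.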

Reference

  • Transcripts sorted newest first (descending by date)
  • Files without dates in filename are placed last
  • Token counting uses tiktoken's cl100k_base encoding (the GPT-4 tokenizer); counts for Claude are approximate, since Claude uses a different tokenizer
  • Consolidated files are gitignored (not committed)
  • Re-running overwrites the previous consolidated file
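
The ordering rules above (newest first, undated files last) can be sketched as follows. The filename date format is not specified here, so the `YYYY-MM-DD` pattern is an assumption, and `filename_date` and `newest_first` are hypothetical names.

```python
import re
from datetime import date
from typing import List, Optional

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # assumed filename date format


def filename_date(name: str) -> Optional[date]:
    """Extract a date from a filename, or return None if there is no valid date."""
    match = DATE_RE.search(name)
    if match is None:
        return None
    try:
        return date(*map(int, match.groups()))
    except ValueError:  # digits that do not form a real date, e.g. 2025-13-40
        return None


def newest_first(names: List[str]) -> List[str]:
    """Sort dated files descending by date and append undated files at the end."""
    dated = [n for n in names if filename_date(n) is not None]
    undated = [n for n in names if filename_date(n) is None]
    return sorted(dated, key=filename_date, reverse=True) + sorted(undated)


print(newest_first(["2024-05-01-a.md", "notes.md", "2025-01-15-b.md"]))
# ['2025-01-15-b.md', '2024-05-01-a.md', 'notes.md']
```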