semantic-chunking-rules

Name: semantic-chunking-rules
Rating: 76
Author: cdeistopened

当用户希望将文字稿按时间戳分割为适合 RAG 的语义分块时使用。当您需要对长文档进行分段，以支持检索增强生成时，此技能同样适用。若需清理文字稿，请参考 transcript-polisher。

SKILL.md

--- frontmatter

name: semantic-chunking-rules
version: 1.0.0
description: "When the user wants to split transcripts into RAG-ready semantic chunks with timestamps. Also use when segmenting long documents for retrieval-augmented generation. For transcript cleanup, see transcript-polisher."

Semantic Chunking Rules

Use this skill when splitting transcripts into RAG-ready chunks.

Chunk targets

•Size: 500-1500 words (or tokens).
•Break at topic shifts and timestamp markers.
•Preserve episode metadata in each chunk.

Output schema

See references/chunk_schema.json.

Script

bash

python3 scripts/semantic_chunker.py --input <transcript.md> --output <chunks.json>

Heuristics

•Treat intro/outro and ads as separate chunks.
•Prefer clean boundaries over exact length.
•Keep timestamps on every chunk for citations.

References

•references/chunk_schema.json
•references/topic_types.md

Related Skills

•Upstream: transcript-polisher
•Feeds into: local-markdown-search, podcast-wiki-pipeline, multi-source-research