chunking

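Enable this skill when the user needs to split documents into chunks for RAG or search. Triggers include: chunking, splitting, setting the chunk size, text splitters, token limits, and overlap settings.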

SKILL.md
---
name: chunking
description: "Use when user needs to split documents into chunks for RAG or search. Triggers on: chunking, split, chunk size, text splitter, token limit, overlap."
---

Chunking - Document Chunking

Split long documents into smaller chunks suitable for vectorization and retrieval.

Chunking Strategy Selection

| Scenario | chunk_size | overlap | Notes |
| --- | --- | --- | --- |
| Precise Q&A | 256-512 | 50 | More precise matching |
| Summarization | 1024-2048 | 100 | More complete context |
| Code documentation | By function/class | 0 | Keep code complete |
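
The size and overlap recommendations above could be kept as named presets so that every caller uses the same settings. The following is a minimal sketch, not part of the skill itself: `CHUNK_PRESETS` and `chunk_params` are hypothetical names, and taking 512 and 1024 from the recommended ranges is an assumption to tune for your corpus. Code documentation is left out because it is split by function or class rather than by size.

```python
# Hypothetical presets derived from the strategy table; the exact values
# picked from each recommended range are assumptions, not fixed rules.
CHUNK_PRESETS = {
    "precise_qa": {"chunk_size": 512, "chunk_overlap": 50},
    "summarization": {"chunk_size": 1024, "chunk_overlap": 100},
}


def chunk_params(scenario: str) -> dict:
    """Return a fresh copy of the splitter keyword arguments for a scenario."""
    try:
        return dict(CHUNK_PRESETS[scenario])
    except KeyError:
        raise ValueError(f"Unknown scenario: {scenario!r}") from None


print(chunk_params("precise_qa"))
```

The returned dictionary uses the same keyword names as the splitters in the examples that follow, so it can be unpacked into a constructor, for instance `RecursiveCharacterTextSplitter(**chunk_params("precise_qa"))`.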

Code Examples

Basic Chunking

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

text = "Your long document content..."
chunks = splitter.split_text(text)
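
To see how `chunk_size` and `chunk_overlap` interact, it may help to look at a dependency-free sliding window. This is a simplified illustration only: `sliding_window_chunks` is a hypothetical helper, and it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` first tries to break on the configured separators.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why the last character of one chunk reappears as the first character of the next.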

Token-based Chunking

python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_text(text)

Preserving Metadata

python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

# Document with metadata
doc = Document(
    page_content="Document content...",
    metadata={"source": "doc.pdf", "page": 1}
)

chunks = splitter.split_documents([doc])
# Each chunk retains original metadata
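
Beyond the inherited metadata, it is often useful to record where each chunk sits within its source so that neighbouring chunks can be fetched at query time. The sketch below assumes a hypothetical `SimpleDoc` dataclass standing in for a document object with `page_content` and `metadata`, and a hypothetical `chunk_index` key; neither comes from the library.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def with_chunk_index(chunks: list[SimpleDoc]) -> list[SimpleDoc]:
    """Return new chunks whose metadata records their position per source."""
    counters: dict = {}  # source -> number of chunks seen so far
    indexed = []
    for chunk in chunks:
        source = chunk.metadata.get("source")
        position = counters.get(source, 0)
        counters[source] = position + 1
        # Copy the metadata so the input chunks are left unmodified.
        indexed.append(
            SimpleDoc(chunk.page_content, {**chunk.metadata, "chunk_index": position})
        )
    return indexed


parts = [
    SimpleDoc("first part", {"source": "doc.pdf", "page": 1}),
    SimpleDoc("second part", {"source": "doc.pdf", "page": 1}),
]
for item in with_chunk_index(parts):
    print(item.metadata)
```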

Markdown Chunking

python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
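markdown_text = "# Title\n\nIntro text.\n\n## Section\n\nSection text."
chunks = splitter.split_text(markdown_text)
# Each chunk is a Document; its header titles are stored in metadata under "h1", "h2", "h3"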
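
The idea behind header-based splitting can be sketched without any library: walk the lines, track which headers are currently in effect, and start a new section whenever a header appears. `split_by_headers` below is a hypothetical, simplified illustration, not the library's implementation; in particular it does not skip `#` lines inside fenced code blocks.

```python
def split_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Group Markdown lines into sections tagged with their active headers."""
    sections = []
    path = {}  # header level -> title currently in effect
    body = []  # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {f"h{level}": title for level, title in sorted(path.items())}
            sections.append({"content": content, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        is_header = 1 <= hashes <= max_level and stripped[hashes:hashes + 1] == " "
        if is_header:
            flush()
            # A new header replaces its own level and discards deeper ones.
            for level in [lvl for lvl in path if lvl >= hashes]:
                del path[level]
            path[hashes] = stripped[hashes:].strip()
        else:
            body.append(line)
    flush()
    return sections


sample = "# Guide\nIntro\n## Setup\nInstall it\n# Other\nText"
for section in split_by_headers(sample):
    print(section)
```

Keeping the header path in each section's metadata lets a retriever show or filter on the heading a chunk came from, even though the heading text is no longer in the chunk body.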

Code Chunking

python
from langchain.text_splitter import (
    Language,
    RecursiveCharacterTextSplitter
)

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

code = "def hello():\n    print('hello')\n"  # replace with your source code
chunks = splitter.split_text(code)
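
When Python code should be split strictly by function or class, as the strategy table suggests for code documentation, the standard-library `ast` module offers one possible approach. The helper below is a hypothetical sketch that assumes Python 3.8 or later (for `end_lineno`); it keeps only top-level definitions, so imports and other module-level statements are dropped.

```python
import ast


def split_python_by_definition(source: str) -> list[str]:
    """Return the source text of each top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[first - 1:node.end_lineno]))
    return chunks


example = "import os\n\ndef f(x):\n    return x\n\nclass A:\n    def m(self):\n        pass\n"
for piece in split_python_by_definition(example):
    print(piece)
    print("---")
```

Because every chunk is a complete definition, no overlap is needed, which matches the overlap of 0 recommended for code documentation.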

Chunk Quality Check

python
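def analyze_chunks(chunks):
    # Accept plain strings as well as Document objects (which expose page_content)
    sizes = [len(getattr(c, "page_content", c)) for c in chunks]
    if not sizes:
        print("No chunks to analyze")
        return
    print(f"Total chunks: {len(chunks)}")
    print(f"Average size: {sum(sizes)/len(sizes):.0f} characters")
    print(f"Min: {min(sizes)}, Max: {max(sizes)}")

analyze_chunks(chunks)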

Next Steps

After chunking:

  • Vectorization: Use core:embedding
  • Storage: Use core:indexing