Data Pipeline (Pretraining-First)
Use this skill when the user needs to build a dataset for from-scratch LLM training. The default output is a reproducible corpus package ready for tokenizer training and pretraining.
Scope
- Ingest raw text from web pages, documents, dumps, and domain sources.
- Extract clean text, normalize formatting, and preserve provenance metadata.
- Remove duplicates and near-duplicates before quality filtering.
- Filter noisy/low-value content and remove sensitive data.
- Balance domains to avoid over-representation in training data.
Extraction
Preferred text extraction stack:
- `trafilatura` for robust web/article extraction.
- `resiliparse` for resilient HTML parsing and cleanup fallback.
Extraction requirements:
- Keep source URL / dataset id for each sample.
- Track extraction failures and retry queue separately.
- Store raw + cleaned snapshots when feasible for auditability.
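The requirements above can be sketched as a small extractor chain that records provenance and routes failures to a retry queue. This is a minimal sketch, not a definitive implementation: `ExtractionResult`, `extract_all`, `default_extractors`, and the `min_chars` cutoff are hypothetical names and values, and the `trafilatura.extract` and `resiliparse` `extract_plain_text` calls should be checked against the installed versions.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    records: list = field(default_factory=list)      # cleaned samples with provenance
    retry_queue: list = field(default_factory=list)  # sources where every extractor failed


def default_extractors():
    """Build the trafilatura -> resiliparse chain (hypothetical helper).

    Imports are lazy so the rest of the module works without these packages.
    """
    import trafilatura
    from resiliparse.extract.html2text import extract_plain_text

    return [
        ("trafilatura", lambda html: trafilatura.extract(html)),
        ("resiliparse", lambda html: extract_plain_text(html, main_content=True)),
    ]


def extract_all(pages, extractors, min_chars=50):
    """Try each extractor in order for every (source_url, raw_html) pair."""
    result = ExtractionResult()
    for url, html in pages:
        for name, extract in extractors:
            try:
                text = extract(html)
            except Exception:
                text = None  # treat a crash like an empty extraction
            if text and len(text.strip()) >= min_chars:
                result.records.append({
                    "source_url": url,    # provenance
                    "extractor": name,    # which stage produced the text
                    "raw_html": html,     # raw snapshot for auditability
                    "text": text.strip(),
                })
                break
        else:
            # No extractor produced usable text: queue for a later retry.
            result.retry_queue.append({"source_url": url, "reason": "all extractors failed"})
    return result
```

A call such as `extract_all(pages, default_extractors())` would run the preferred stack; passing a different list of `(name, callable)` pairs swaps extractors without touching the bookkeeping.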
Deduplication
Use MinHash-based near-deduplication with datasketch:
- Exact dedup first (hash of normalized text).
- Near-dedup next (MinHash + LSH) to collapse near-identical copies such as reposts, templated variants, and lightly edited text.
- Preserve one canonical copy per duplicate cluster.
- Report dedup ratio and retained sample counts by domain.
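The two dedup stages can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the 5-word shingle size, the 0.8 Jaccard threshold, and the "first document seen is canonical" rule are all hypothetical choices, not values given above. `datasketch` is imported lazily so the exact-dedup stage runs without it.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Normalization used only for hashing: NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def exact_dedup(docs):
    """Keep the first document seen for each hash of the normalized text."""
    seen, kept = set(), {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[doc_id] = text
    return kept


def shingles(text, k=5):
    """Set of k-word shingles; short texts yield a single shingle."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def near_dedup(docs, threshold=0.8, num_perm=128):
    """Greedy MinHash + LSH pass: drop a doc if a retained doc is similar enough."""
    from datasketch import MinHash, MinHashLSH  # lazy import, see lead-in

    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = {}
    for doc_id, text in docs.items():
        mh = MinHash(num_perm=num_perm)
        for s in shingles(text):
            mh.update(s.encode("utf-8"))
        if lsh.query(mh):
            continue  # an already-retained document is the canonical copy
        lsh.insert(doc_id, mh)
        kept[doc_id] = text
    return kept
```

Running `near_dedup(exact_dedup(docs))` and comparing `len()` before and after each stage gives the retained counts needed for the dedup report.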
Quality Filtering
Apply a FineWeb-Edu-style quality filter pipeline:
- Remove boilerplate, spam, machine-generated garbage, and template-heavy pages.
- Penalize low-information text (keyword stuffing, repetitive n-grams, nav clutter).
- Prefer coherent long-form educational or domain-relevant content.
- Keep configurable quality thresholds and log reasons for removals.
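As a rough illustration of configurable thresholds with logged removal reasons, the sketch below applies a few heuristic checks. Every threshold value and reason label here is a hypothetical placeholder to be tuned per corpus, and the sketch covers only heuristic pre-filters; a FineWeb-Edu-style pipeline would typically add a learned educational-quality classifier, which is not shown.

```python
from collections import Counter

# Hypothetical defaults; tune per corpus and language.
DEFAULT_THRESHOLDS = {
    "min_words": 50,
    "max_dup_trigram_frac": 0.30,  # share of word trigrams that occur more than once
    "max_top_word_frac": 0.20,     # share of the single most frequent word
    "min_alpha_frac": 0.60,        # share of alphabetic or whitespace characters
}


def quality_reasons(text, thresholds=DEFAULT_THRESHOLDS):
    """Return the list of removal reasons; an empty list means keep."""
    words = text.split()
    if not words or len(words) < thresholds["min_words"]:
        return ["too_short"]  # frequency checks are unreliable on tiny texts

    reasons = []
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if trigrams:
        counts = Counter(trigrams)
        repeated = sum(c for c in counts.values() if c > 1)
        if repeated / len(trigrams) > thresholds["max_dup_trigram_frac"]:
            reasons.append("repetitive_ngrams")

    top_count = Counter(w.lower() for w in words).most_common(1)[0][1]
    if top_count / len(words) > thresholds["max_top_word_frac"]:
        reasons.append("keyword_stuffing")

    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / len(text) < thresholds["min_alpha_frac"]:
        reasons.append("low_alpha_ratio")
    return reasons
```

Aggregating the returned labels with a `Counter` across the corpus yields the per-reason filter statistics for the report.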
PII and Safety
- Detect and remove likely PII (emails, phone numbers, government IDs, addresses where applicable).
- Redact or drop records containing sensitive personal information.
- Never include secrets/tokens from scraped configs, docs, or logs.
- Emit a PII-removal summary and residual-risk warning.
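A minimal redaction pass with per-type counts might look like the sketch below. The patterns are deliberately simple illustrations, not a complete PII detector: they will miss many formats and can match non-PII such as long numeric IDs, which is why a residual-risk warning is still needed. The pattern names and placeholder format are assumptions, and the standard `re` module is used here to keep the example self-contained.

```python
import re
from collections import Counter

# Order matters: the SSN pattern runs before the broader phone pattern,
# which would otherwise also match SSN-shaped strings.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?<!\w)\+?\d[\d\s().-]{8,}\d(?!\w)"),
}


def redact_pii(text):
    """Replace matches with [LABEL] placeholders and count them by type."""
    counts = Counter()
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        if n:
            counts[label] += n
    return text, counts


def pii_summary(texts):
    """Redact a batch and return (redacted_texts, total counts by type)."""
    totals = Counter()
    redacted = []
    for t in texts:
        clean, counts = redact_pii(t)
        redacted.append(clean)
        totals.update(counts)
    return redacted, dict(totals)
```

Whether to redact in place or drop the whole record is a policy decision; a count threshold on the returned `Counter` is one simple way to switch between the two.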
Domain Balancing
Use balancing to avoid dataset collapse toward the largest source:
- Set target domain mix (for example: general, code, scientific, medical, legal).
- Downsample dominant domains; upsample scarce high-quality domains conservatively.
- Produce final domain histogram and token share per domain.
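One way to sketch the balancing step is to sample toward a target mix while capping how far a scarce domain may be repeated. The function names, the `max_upsample` cap of 2.0, and the use of whitespace-split word counts as a stand-in for tokenizer tokens are all assumptions made for illustration.

```python
import random


def balance_domains(docs_by_domain, target_mix, total, max_upsample=2.0, seed=0):
    """Sample documents per domain to approximate target_mix.

    docs_by_domain: {domain: [text, ...]}
    target_mix:     {domain: fraction}, fractions expected to sum to 1
    total:          desired number of documents in the output
    max_upsample:   cap on output size relative to the available documents
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = {}
    for domain, share in target_mix.items():
        docs = docs_by_domain.get(domain, [])
        want = round(total * share)
        n = min(want, int(len(docs) * max_upsample))
        if n <= len(docs):
            balanced[domain] = rng.sample(docs, n)  # downsample without replacement
        else:
            # Conservative upsample: keep every doc once, then add capped repeats.
            balanced[domain] = docs + rng.choices(docs, k=n - len(docs))
    return balanced


def domain_report(balanced):
    """Document counts and approximate token share (whitespace tokens) per domain."""
    tokens = {d: sum(len(t.split()) for t in docs) for d, docs in balanced.items()}
    total = sum(tokens.values())
    return {
        d: {"docs": len(balanced[d]), "token_share": tokens[d] / total if total else 0.0}
        for d in balanced
    }
```

Because of the cap, the realized mix can fall short of the target for scarce domains; the report makes that gap visible instead of hiding it behind heavy repetition.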
Deliverables
Produce these artifacts:
- Cleaned corpus shards (jsonl/parquet/txt).
- `dataset_manifest.json` with source counts, dedup metrics, and filter stats.
- `data_pipeline_report.md` with quality, PII, and domain-balance summaries.
- Reproducible commands/configs used to generate the dataset.
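For illustration, a manifest writer could look like the sketch below. The field names inside the JSON and the definition of the dedup ratio (removed documents divided by input documents) are assumptions; the text above only fixes the file name and the three kinds of content.

```python
import json
from pathlib import Path


def build_manifest(source_counts, dedup_input, dedup_retained, filter_stats):
    """Assemble the manifest dict from per-stage statistics (hypothetical schema)."""
    removed = dedup_input - dedup_retained
    return {
        "source_counts": source_counts,
        "dedup": {
            "input": dedup_input,
            "retained": dedup_retained,
            # Assumed definition: fraction of input documents removed as duplicates.
            "dedup_ratio": round(removed / dedup_input, 4) if dedup_input else 0.0,
        },
        "filter_stats": filter_stats,
    }


def write_manifest(path, manifest):
    """Write the manifest as stable, diff-friendly JSON."""
    Path(path).write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
```

Sorting keys and fixing the indent keeps the file byte-stable across runs with identical inputs, which helps when the manifest is committed alongside the pipeline configs.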
Python Dependencies
Core dependencies for this skill:
- `trafilatura`
- `resiliparse`
- `datasketch`
- `pandas`
- `pyarrow`
- `regex`
- `ftfy`
- `langdetect`