JSONL Word Counter
Count words and compute summary statistics for text fields in one or more JSONL files.
Quick Start
bash
# Auto-discover and count all text fields
python .claude/skills/count-words/scripts/counter.py data/processed/squad_humanized.jsonl
# Count a specific field
python .claude/skills/count-words/scripts/counter.py file.jsonl --fields text_ai_humanized
# Compare multiple files
python .claude/skills/count-words/scripts/counter.py file1.jsonl file2.jsonl --fields text_ai_humanized
Features
1. Auto-Discovery
Automatically finds all text_* fields in your JSONL files if you don't specify fields.
bash
python counter.py squad_humanized.jsonl # Discovers: text_human, text_ai_base, text_ai_humanized
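To illustrate what auto-discovery amounts to, here is a minimal sketch that collects every top-level key starting with `text_` from a stream of JSONL lines. The function name `discover_text_fields` and its `prefix` parameter are hypothetical and chosen for this example; the actual logic in `counter.py` may differ.

```python
import json


def discover_text_fields(lines, prefix="text_"):
    """Return the sorted names of all top-level keys that start with `prefix`.

    `lines` is any iterable of JSONL lines, for example an open file handle.
    Hypothetical helper for illustration, not the skill's real implementation.
    """
    fields = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if isinstance(record, dict):
            fields.update(key for key in record if key.startswith(prefix))
    return sorted(fields)


sample = [
    '{"id": 1, "text_human": "a b", "text_ai_base": "c"}',
    "",
    '{"id": 2, "text_ai_humanized": "d e f"}',
]
print(discover_text_fields(sample))
```

With a real file, the same function could be called as `discover_text_fields(open("squad_humanized.jsonl", encoding="utf-8"))`.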
2. Specific Field Counting
Count only the fields you care about.
bash
# Single field
python counter.py file.jsonl --fields text_ai_humanized
# Multiple fields
python counter.py file.jsonl --fields text_human text_ai_base text_ai_humanized
3. Multi-File Comparison
Compare word counts across multiple files (e.g., different humanizers).
bash
python counter.py \
  squad_humanized_gpt4.jsonl \
  squad_humanized_stealthgpt.jsonl \
  --fields text_ai_humanized
4. Comprehensive Statistics
For each field, provides:
- Total word count across all entries
- Total character count
- Number of entries with the field
- Average words per entry
- Min/max word counts
- Per-file breakdown (when multiple files)
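The statistics listed above can be sketched in a few lines of Python. This example assumes a "word" is a whitespace-delimited token, which is an assumption made here and not something the skill documents. The helper name `field_stats` and the result keys are hypothetical.

```python
def field_stats(records, field):
    """Summarize word and character counts for one field across parsed records.

    Assumption: words are whitespace-delimited tokens (str.split()).
    Hypothetical helper for illustration only.
    """
    texts = [r[field] for r in records if isinstance(r.get(field), str)]
    word_counts = [len(text.split()) for text in texts]
    entries = len(texts)
    total_words = sum(word_counts)
    return {
        "entries": entries,
        "total_words": total_words,
        "total_chars": sum(len(text) for text in texts),
        "avg_words": total_words / entries if entries else 0.0,
        "min_words": min(word_counts, default=0),
        "max_words": max(word_counts, default=0),
    }


records = [
    {"text_human": "one two three"},
    {"text_human": "four five"},
    {"other": "no text field here"},
]
print(field_stats(records, "text_human"))
```

Records that lack the field are skipped, so the entry count can be lower than the number of lines in the file.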
Common Use Cases
Compare Humanizer Output Sizes
bash
python counter.py \
  data/processed/squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl \
  data/processed/squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl \
  --fields text_ai_humanized
Output shows:
- Which humanizer generates longer text
- Average words per entry for each
- Total words generated by each humanizer
Analyze Pipeline Stages
bash
python counter.py squad_humanized.jsonl --fields text_human text_ai_base text_ai_humanized
See how text length changes through:
- Human baseline → AI baseline → Humanized
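One way to quantify the change between stages is to compute the average length per stage and the percent change from each stage to the next. The sketch below assumes whitespace-delimited words; `stage_averages` and `relative_changes` are hypothetical helper names, not part of `counter.py`.

```python
STAGES = ["text_human", "text_ai_base", "text_ai_humanized"]


def stage_averages(records, stages=STAGES):
    """Average whitespace-delimited word count per stage field."""
    averages = {}
    for stage in stages:
        counts = [
            len(r[stage].split()) for r in records if isinstance(r.get(stage), str)
        ]
        averages[stage] = sum(counts) / len(counts) if counts else 0.0
    return averages


def relative_changes(averages, stages=STAGES):
    """Percent change in average length between consecutive stages."""
    changes = {}
    for prev, curr in zip(stages, stages[1:]):
        if averages[prev]:  # avoid dividing by zero for empty stages
            delta = averages[curr] - averages[prev]
            changes[f"{prev} -> {curr}"] = 100.0 * delta / averages[prev]
    return changes


records = [
    {"text_human": "a b", "text_ai_base": "a b c d", "text_ai_humanized": "a b c"},
]
averages = stage_averages(records)
print(averages)
print(relative_changes(averages))
```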
Check Single Field Across Dataset
bash
python counter.py xsum_group_h.jsonl --fields text_human
Get statistics on:
- Total corpus size
- Average document length
- Length range (min/max word counts)
Output Format
By Field (Aggregate)
code
📈 STATISTICS BY FIELD
══════════════════════════════════════════════════════════════════════
🔹 Field: text_ai_humanized
Total words (all files): 199,813
Total entries with field: 599
Average words per entry: 333.6
Breakdown by file:
squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl
Entries: 300
Words: 90,952 (avg: 303.2)
Range: 152-405 words
squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl
Entries: 299
Words: 108,861 (avg: 364.1)
Range: 174-512 words
By File (Detailed)
code
📄 STATISTICS BY FILE
══════════════════════════════════════════════════════════════════════
📁 squad_humanized_gpt4.jsonl
Total words (all fields): 90,952
🔹 text_ai_humanized
Entries: 300/300
Total words: 90,952
Average: 303.2 words/entry
Range: 152-405 words
Grand Total
code
📊 GRAND TOTAL
══════════════════════════════════════════════════════════════════════
Total words across all files and fields: 199,813
Total entries counted: 599
Overall average: 333.6 words/entry
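The grand total follows directly from the per-file numbers in the sample output: the overall average is total words divided by total entries. The short check below reproduces that arithmetic using only the figures shown above; the variable names are illustrative.

```python
# Per-file figures copied from the sample "Breakdown by file" output.
per_file = {
    "squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl": {"entries": 300, "words": 90_952},
    "squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl": {"entries": 299, "words": 108_861},
}

total_words = sum(stats["words"] for stats in per_file.values())
total_entries = sum(stats["entries"] for stats in per_file.values())
overall_avg = total_words / total_entries

for name, stats in per_file.items():
    print(f"{name}: avg {stats['words'] / stats['entries']:.1f}")
print(f"{total_words:,} words / {total_entries} entries = {overall_avg:.1f} words/entry")
```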
Examples
Example 1: Quick field count
bash
python counter.py data/processed/xsum_group_h.jsonl --fields text_human
Example 2: Compare two humanizers
bash
python counter.py \
  squad_gpt4.jsonl \
  squad_stealthgpt.jsonl \
  --fields text_ai_humanized
Example 3: Analyze all pipeline stages
bash
python counter.py squad_final.jsonl \
  --fields text_human text_ai_base text_ai_humanized
Example 4: Multiple files, multiple fields
bash
python counter.py file1.jsonl file2.jsonl file3.jsonl \
  --fields text_human text_ai_humanized
Integration with Claude Code
When working with Claude Code, simply mention:
code
Count words in data/processed/squad_humanized.jsonl
Claude will automatically use this skill to provide word count statistics.
Tips
- Use --fields to focus: Don't count unnecessary fields
- Compare humanizers: Use multiple files with the same field
- Check text expansion: Compare text_ai_base vs text_ai_humanized
- Verify corpus size: Useful for dataset documentation
- Quality check: Unusually short/long entries may indicate issues
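For the quality check in the last tip, one possible approach is to flag entries whose word count is far from the median for that field. The sketch below is an illustration only: the helper name `flag_length_outliers` and the `low`/`high` thresholds (half and double the median) are arbitrary assumptions, not behavior of `counter.py`.

```python
from statistics import median


def flag_length_outliers(records, field, low=0.5, high=2.0):
    """Return (index, word_count) pairs whose length is far from the median.

    An entry is flagged when its whitespace-delimited word count is below
    `low * median` or above `high * median`. Thresholds are illustrative.
    """
    counts = [
        (index, len(record[field].split()))
        for index, record in enumerate(records)
        if isinstance(record.get(field), str)
    ]
    if not counts:
        return []
    mid = median(count for _, count in counts)
    return [(i, c) for i, c in counts if c < low * mid or c > high * mid]


# Synthetic records with 10, 11, 9, 2 and 30 words respectively.
records = [{"text_ai_humanized": " ".join(["w"] * n)} for n in (10, 11, 9, 2, 30)]
print(flag_length_outliers(records, "text_ai_humanized"))
```

Flagged indices refer to positions in the input list, so they can be mapped back to line numbers in the JSONL file.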
Related Skills
- analyze-jsonl: For error analysis and data quality
- read-jsonl: For reading specific entries by ID
Python API
python
from count_words.scripts.counter import WordCounter
# Single file, auto-discover fields
counter = WordCounter(['file.jsonl'])
results = counter.analyze()
print(counter.generate_report())
# Multiple files, specific field
counter = WordCounter(
    ['file1.jsonl', 'file2.jsonl'],
    fields=['text_ai_humanized']
)
results = counter.analyze()
# Access raw results
total_words = results['by_field']['text_ai_humanized'][0]['total_words']