JSONL Word Counter

Count words and compute summary statistics for text fields in one or more JSONL files.

Quick Start

bash

# Auto-discover and count all text fields
python .claude/skills/count-words/scripts/counter.py data/processed/squad_humanized.jsonl

# Count specific field
python .claude/skills/count-words/scripts/counter.py file.jsonl --fields text_ai_humanized

# Compare multiple files
python .claude/skills/count-words/scripts/counter.py file1.jsonl file2.jsonl --fields text_ai_humanized

Features

1. Auto-Discovery

If you do not pass --fields, the script discovers every text_* field in the input files and counts all of them.

bash

python counter.py squad_humanized.jsonl
# Discovers: text_human, text_ai_base, text_ai_humanized
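
The discovery step can be pictured with the sketch below. It is an illustration only, not the code in counter.py: the function name discover_text_fields is hypothetical, and it assumes that a field qualifies when its key starts with text_ and its value is a string.

```python
import json
from typing import Iterable


def discover_text_fields(lines: Iterable[str], prefix: str = "text_") -> list[str]:
    """Return the sorted names of string-valued fields that start with `prefix`.

    Hypothetical helper: the real script may apply different rules.
    """
    found = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        record = json.loads(line)
        if not isinstance(record, dict):
            continue  # only JSON objects carry named fields
        for key, value in record.items():
            if key.startswith(prefix) and isinstance(value, str):
                found.add(key)
    return sorted(found)


sample = [
    '{"id": 1, "text_human": "a b", "text_ai_base": "c d e"}',
    '{"id": 2, "text_ai_humanized": "f g"}',
]
print(discover_text_fields(sample))
# ['text_ai_base', 'text_ai_humanized', 'text_human']
```

Scanning every record rather than only the first one means a field that is missing from early entries is still picked up.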

2. Specific Field Counting

Count only the fields you care about.

bash

# Single field
python counter.py file.jsonl --fields text_ai_humanized

# Multiple fields
python counter.py file.jsonl --fields text_human text_ai_base text_ai_humanized

3. Multi-File Comparison

Compare word counts across multiple files (e.g., different humanizers).

bash

python counter.py \
  squad_humanized_gpt4.jsonl \
  squad_humanized_stealthgpt.jsonl \
  --fields text_ai_humanized

4. Comprehensive Statistics

For each field, the report provides:

•Total word count across all entries
•Total character count
•Number of entries with the field
•Average words per entry
•Min/max word counts
•Per-file breakdown (when multiple files)
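
As a rough sketch of how the per-field numbers listed above could be derived, the snippet below computes them for a list of already-parsed records. The function name field_stats is hypothetical, and splitting on whitespace is an assumption about how words are counted; counter.py may tokenize differently.

```python
from typing import Any


def field_stats(records: list[dict[str, Any]], field: str) -> dict[str, float]:
    """Summarize word and character counts for one field (hypothetical helper)."""
    word_counts = []
    total_chars = 0
    for record in records:
        value = record.get(field)
        if not isinstance(value, str):
            continue  # entry does not have this field as text
        word_counts.append(len(value.split()))  # assumption: whitespace-delimited words
        total_chars += len(value)

    entries = len(word_counts)
    total_words = sum(word_counts)
    return {
        "entries": entries,
        "total_words": total_words,
        "total_chars": total_chars,
        "avg_words": total_words / entries if entries else 0.0,
        "min_words": min(word_counts, default=0),
        "max_words": max(word_counts, default=0),
    }


records = [
    {"text_human": "one two three"},
    {"text_human": "four five"},
    {"other": "ignored"},
]
print(field_stats(records, "text_human"))
```

Entries that lack the field are skipped rather than counted as zero, so they do not drag the average down.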

Common Use Cases

Compare Humanizer Output Sizes

bash

python counter.py \
  data/processed/squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl \
  data/processed/squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl \
  --fields text_ai_humanized

The output shows:

•Which humanizer generates longer text
•Average words per entry for each
•Total words generated by each humanizer

Analyze Pipeline Stages

bash

python counter.py squad_humanized.jsonl --fields text_human text_ai_base text_ai_humanized

See how text length changes through:

•Human baseline → AI baseline → Humanized
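
To read the stage-to-stage change as a single number, one option is to divide each stage's average length by the human baseline. The sketch below is an illustration under stated assumptions: stage_averages and relative_to are hypothetical helpers, and words are counted by whitespace splitting.

```python
from typing import Any


def stage_averages(records: list[dict[str, Any]], stages: list[str]) -> dict[str, float]:
    """Average whitespace-delimited word count per stage field (hypothetical helper)."""
    averages = {}
    for field in stages:
        counts = [
            len(record[field].split())
            for record in records
            if isinstance(record.get(field), str)
        ]
        averages[field] = sum(counts) / len(counts) if counts else 0.0
    return averages


def relative_to(averages: dict[str, float], baseline: str) -> dict[str, float]:
    """Express each stage's average as a multiple of the baseline stage."""
    base = averages[baseline]
    if base == 0:
        raise ValueError(f"baseline field {baseline!r} has no words")
    return {field: avg / base for field, avg in averages.items()}


records = [
    {"text_human": "a b", "text_ai_base": "a b c", "text_ai_humanized": "a b c d"},
    {"text_human": "a b", "text_ai_base": "a b c", "text_ai_humanized": "a b c d"},
]
stages = ["text_human", "text_ai_base", "text_ai_humanized"]
averages = stage_averages(records, stages)
print(relative_to(averages, "text_human"))
# {'text_human': 1.0, 'text_ai_base': 1.5, 'text_ai_humanized': 2.0}
```

A ratio above 1.0 means the stage produces longer text than the baseline on average.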

Check Single Field Across Dataset

bash

python counter.py xsum_group_h.jsonl --fields text_human

Get statistics on:

•Total corpus size
•Average document length
•Length range (minimum and maximum words per entry)

Output Format

By Field (Aggregate)

code

📈 STATISTICS BY FIELD
══════════════════════════════════════════════════════════════════════

🔹 Field: text_ai_humanized
   Total words (all files): 199,813
   Total entries with field: 599
   Average words per entry: 333.6

   Breakdown by file:
      squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl
         Entries: 300
         Words: 90,952 (avg: 303.2)
         Range: 152-405 words

      squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl
         Entries: 299
         Words: 108,861 (avg: 364.1)
         Range: 174-512 words

By File (Detailed)

code

📄 STATISTICS BY FILE
══════════════════════════════════════════════════════════════════════

📁 squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl
   Total words (all fields): 90,952

   🔹 text_ai_humanized
      Entries: 300/300
      Total words: 90,952
      Average: 303.2 words/entry
      Range: 152-405 words

Grand Total

code

📊 GRAND TOTAL
══════════════════════════════════════════════════════════════════════
Total words across all files and fields: 199,813
Total entries counted: 599
Overall average: 333.6 words/entry
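
The sample figures are internally consistent, which can be checked by hand. The short script below recomputes the totals and averages from the per-file numbers in the sample output; it is a sanity check on the arithmetic, not part of the tool.

```python
# Per-file figures copied from the sample output: (entries, total words).
per_file = {
    "squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl": (300, 90_952),
    "squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl": (299, 108_861),
}

total_entries = sum(entries for entries, _ in per_file.values())
total_words = sum(words for _, words in per_file.values())
overall_average = total_words / total_entries

# Per-file averages, rounded to one decimal as in the report.
file_averages = {
    name: round(words / entries, 1) for name, (entries, words) in per_file.items()
}

print(total_entries)              # 599
print(total_words)                # 199813
print(round(overall_average, 1))  # 333.6
print(list(file_averages.values()))  # [303.2, 364.1]
```

The reported overall average matches total words divided by total entries.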

Examples

Example 1: Quick field count

bash

python counter.py data/processed/xsum_group_h.jsonl --fields text_human

Example 2: Compare two humanizers

bash

python counter.py \
  squad_gpt4.jsonl \
  squad_stealthgpt.jsonl \
  --fields text_ai_humanized

Example 3: Analyze all pipeline stages

bash

python counter.py squad_final.jsonl \
  --fields text_human text_ai_base text_ai_humanized

Example 4: Multiple files, multiple fields

bash

python counter.py file1.jsonl file2.jsonl file3.jsonl \
  --fields text_human text_ai_humanized

Integration with Claude Code

When working with Claude Code, describe the task in plain language, for example:

code

Count words in data/processed/squad_humanized.jsonl

Claude will automatically use this skill to provide word count statistics.

Tips

•Use --fields to focus: Don't count unnecessary fields
•Compare humanizers: Use multiple files with same field
•Check text expansion: Compare text_ai_base vs text_ai_humanized
•Verify corpus size: Useful for dataset documentation
•Quality check: Unusually short/long entries may indicate issues
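
For the quality-check tip, the report's min/max range tells you that an outlier exists but not which entry it is. The sketch below shows one way to locate such entries. The function flag_outliers and its thresholds are hypothetical, and words are again counted by whitespace splitting.

```python
from typing import Any


def flag_outliers(
    records: list[dict[str, Any]],
    field: str,
    min_words: int,
    max_words: int,
) -> list[tuple[int, int]]:
    """Return (record index, word count) for entries outside [min_words, max_words]."""
    flagged = []
    for index, record in enumerate(records):
        value = record.get(field)
        if not isinstance(value, str):
            continue  # missing fields are a separate problem, not a length outlier
        count = len(value.split())
        if count < min_words or count > max_words:
            flagged.append((index, count))
    return flagged


records = [
    {"text_ai_humanized": "short"},
    {"text_ai_humanized": "this entry has five words"},
    {"text_ai_humanized": "another entry with five words"},
    {"text_ai_humanized": " ".join(["word"] * 20)},
]
print(flag_outliers(records, "text_ai_humanized", min_words=2, max_words=10))
# [(0, 1), (3, 20)]
```

Sensible thresholds depend on the dataset; the per-file range in the report is a reasonable starting point for choosing them.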

Related Skills

•analyze-jsonl: For error analysis and data quality
•read-jsonl: For reading specific entries by ID

Python API

python

from count_words.scripts.counter import WordCounter

# Single file, auto-discover fields
counter = WordCounter(['file.jsonl'])
results = counter.analyze()
print(counter.generate_report())

# Multiple files, specific field
counter = WordCounter(
    ['file1.jsonl', 'file2.jsonl'],
    fields=['text_ai_humanized']
)
results = counter.analyze()

# Access raw results
total_words = results['by_field']['text_ai_humanized'][0]['total_words']
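
The last line above suggests that results['by_field'][field] is a list with one entry per file, each carrying a total_words key. Assuming that shape holds (it is inferred from this example, not confirmed against counter.py), a small helper can sum a field across all files. The helper name and the hand-built dictionary below are illustrative only.

```python
from typing import Any


def total_words_for_field(results: dict[str, Any], field: str) -> int:
    """Sum total_words over all per-file entries for `field`.

    Assumes results['by_field'][field] is a list of dicts with a
    'total_words' key; returns 0 when the field is absent.
    """
    per_file_entries = results.get("by_field", {}).get(field, [])
    return sum(entry.get("total_words", 0) for entry in per_file_entries)


# Hand-built stand-in for the value returned by counter.analyze(),
# using the word totals from the sample output.
example_results = {
    "by_field": {
        "text_ai_humanized": [
            {"total_words": 90_952},
            {"total_words": 108_861},
        ]
    }
}
print(total_words_for_field(example_results, "text_ai_humanized"))  # 199813
```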