
count-words

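Count words in JSONL text fields. Automatically detects text fields, or counts words in specific fields. Supports multi-file comparison. Use this when you need word count statistics, text length analysis, or a comparison of text sizes across files or fields.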

SKILL.md
--- frontmatter
name: count-words
description: Count words in JSONL text fields. Auto-discovers text fields or counts specific fields. Supports multiple files for comparison. Use when you need word count statistics, text length analysis, or comparing text sizes across files/fields.

JSONL Word Counter

Count and analyze word statistics in JSONL text fields across single or multiple files.

Quick Start

bash
# Auto-discover and count all text fields
python .claude/skills/count-words/scripts/counter.py data/processed/squad_humanized.jsonl

# Count specific field
python .claude/skills/count-words/scripts/counter.py file.jsonl --fields text_ai_humanized

# Compare multiple files
python .claude/skills/count-words/scripts/counter.py file1.jsonl file2.jsonl --fields text_ai_humanized

Features

1. Auto-Discovery

If you don't pass --fields, the script automatically finds every text_* field in your JSONL files.

bash
python counter.py squad_humanized.jsonl
# Discovers: text_human, text_ai_base, text_ai_humanized
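
The discovery step can be pictured roughly as follows. This is a minimal sketch, not the code in counter.py: the function name `discover_text_fields`, the `prefix` parameter, and the choice to keep only string-valued keys are assumptions made for illustration.

```python
import json
from pathlib import Path


def discover_text_fields(path, prefix="text_"):
    """Return the sorted names of string-valued keys that start with `prefix`.

    Hypothetical helper: the real script may discover fields differently.
    """
    fields = set()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if not isinstance(record, dict):
                continue  # only JSON objects carry named fields
            for key, value in record.items():
                # Assumption: a "text field" is a text_* key holding a string.
                if key.startswith(prefix) and isinstance(value, str):
                    fields.add(key)
    return sorted(fields)
```

Scanning every record, rather than only the first one, means a field that appears in just some entries is still picked up.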

2. Specific Field Counting

Count only the fields you care about.

bash
# Single field
python counter.py file.jsonl --fields text_ai_humanized

# Multiple fields
python counter.py file.jsonl --fields text_human text_ai_base text_ai_humanized

3. Multi-File Comparison

Compare word counts across multiple files (e.g., different humanizers).

bash
python counter.py \
  squad_humanized_gpt4.jsonl \
  squad_humanized_stealthgpt.jsonl \
  --fields text_ai_humanized

4. Comprehensive Statistics

For each field, the report provides:

  • Total word count across all entries
  • Total character count
  • Number of entries with the field
  • Average words per entry
  • Min/max word counts
  • Per-file breakdown (when multiple files)
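
As a rough illustration of how the per-field numbers above could be derived, the sketch below computes them for one field. It assumes that a "word" is a whitespace-delimited token and that entries without the field are skipped; `field_stats` and the result keys are hypothetical names, not the script's actual API.

```python
import json


def field_stats(lines, field):
    """Compute word statistics for one field over an iterable of JSONL lines."""
    counts = []
    total_chars = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(field) if isinstance(record, dict) else None
        if not isinstance(value, str):
            continue  # entry does not have the field
        counts.append(len(value.split()))  # assumption: split on whitespace
        total_chars += len(value)
    total_words = sum(counts)
    return {
        "total_words": total_words,
        "total_chars": total_chars,
        "entries": len(counts),
        "avg_words": total_words / len(counts) if counts else 0.0,
        "min_words": min(counts) if counts else 0,
        "max_words": max(counts) if counts else 0,
    }
```

Calling the function once per input file and once on the combined lines would give the per-file breakdown alongside the aggregate.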

Common Use Cases

Compare Humanizer Output Sizes

bash
python counter.py \
  data/processed/squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl \
  data/processed/squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl \
  --fields text_ai_humanized

Output shows:

  • Which humanizer generates longer text
  • Average words per entry for each
  • Total words generated by each humanizer

Analyze Pipeline Stages

bash
python counter.py squad_humanized.jsonl --fields text_human text_ai_base text_ai_humanized

See how text length changes from stage to stage: human baseline → AI baseline → humanized.
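
To make the stage comparison concrete, here is a small sketch that averages word counts per stage. It assumes whitespace-delimited words and the three field names used in this document; `stage_averages` is a hypothetical helper, not part of counter.py.

```python
import json

# Field names taken from the examples in this document.
STAGES = ["text_human", "text_ai_base", "text_ai_humanized"]


def stage_averages(lines, stages=STAGES):
    """Return the average word count per stage over an iterable of JSONL lines."""
    totals = {stage: 0 for stage in stages}
    seen = {stage: 0 for stage in stages}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            continue
        for stage in stages:
            value = record.get(stage)
            if isinstance(value, str):
                totals[stage] += len(value.split())  # assumption: whitespace words
                seen[stage] += 1
    # Stages that never appear get an average of 0.0 instead of a division error.
    return {s: (totals[s] / seen[s] if seen[s] else 0.0) for s in stages}
```

Dividing the text_ai_humanized average by the text_ai_base average then gives a simple expansion ratio for the humanizer.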

Check Single Field Across Dataset

bash
python counter.py xsum_group_h.jsonl --fields text_human

Get statistics on:

  • Total corpus size
  • Average document length
  • Length range (minimum and maximum word counts)

Output Format

By Field (Aggregate)

code
📈 STATISTICS BY FIELD
══════════════════════════════════════════════════════════════════════

🔹 Field: text_ai_humanized
   Total words (all files): 199,813
   Total entries with field: 599
   Average words per entry: 333.6

   Breakdown by file:
      squad_group_h_train_ai_gpt-4o_humanized_gpt-4o.jsonl
         Entries: 300
         Words: 90,952 (avg: 303.2)
         Range: 152-405 words

      squad_group_h_train_ai_gpt-4o_humanized_stealthgpt.jsonl
         Entries: 299
         Words: 108,861 (avg: 364.1)
         Range: 174-512 words

By File (Detailed)

code
📄 STATISTICS BY FILE
══════════════════════════════════════════════════════════════════════

📁 squad_humanized_gpt4.jsonl
   Total words (all fields): 90,952

   🔹 text_ai_humanized
      Entries: 300/300
      Total words: 90,952
      Average: 303.2 words/entry
      Range: 152-405 words

Grand Total

code
📊 GRAND TOTAL
══════════════════════════════════════════════════════════════════════
Total words across all files and fields: 199,813
Total entries counted: 599
Overall average: 333.6 words/entry

Examples

Example 1: Quick field count

bash
python counter.py data/processed/xsum_group_h.jsonl --fields text_human

Example 2: Compare two humanizers

bash
python counter.py \
  squad_gpt4.jsonl \
  squad_stealthgpt.jsonl \
  --fields text_ai_humanized

Example 3: Analyze all pipeline stages

bash
python counter.py squad_final.jsonl \
  --fields text_human text_ai_base text_ai_humanized

Example 4: Multiple files, multiple fields

bash
python counter.py file1.jsonl file2.jsonl file3.jsonl \
  --fields text_human text_ai_humanized

Integration with Claude Code

When working with Claude Code, describe the task in plain language, for example:

code
Count words in data/processed/squad_humanized.jsonl

Claude will automatically use this skill to provide word count statistics.

Tips

  1. Use --fields to focus: Don't count unnecessary fields
  2. Compare humanizers: Use multiple files with same field
  3. Check text expansion: Compare text_ai_base vs text_ai_humanized
  4. Verify corpus size: Useful for dataset documentation
  5. Quality check: Unusually short/long entries may indicate issues
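
The quality check in tip 5 could be sketched along these lines. The thresholds are arbitrary placeholders and `flag_length_outliers` is a hypothetical helper; the script itself reports only the min/max range, so a check like this would run separately.

```python
import json


def flag_length_outliers(lines, field, min_words=20, max_words=1000):
    """Return (line_number, word_count) pairs for entries outside the bounds.

    min_words and max_words are illustrative defaults, not recommendations.
    """
    flagged = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(field) if isinstance(record, dict) else None
        if not isinstance(value, str):
            continue  # entries without the field are not judged here
        word_count = len(value.split())  # assumption: whitespace-delimited words
        if word_count < min_words or word_count > max_words:
            flagged.append((line_number, word_count))
    return flagged
```

Line numbers are 1-based so that a flagged entry can be located directly in the JSONL file.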

Related Skills

  • analyze-jsonl: For error analysis and data quality
  • read-jsonl: For reading specific entries by ID

Python API

python
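import sys

# The skill directory is named count-words, which is not a valid Python package
# name, so add its scripts directory to the import path instead
# (path assumed from the Quick Start commands).
sys.path.insert(0, '.claude/skills/count-words/scripts')

from counter import WordCounter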

# Single file, auto-discover fields
counter = WordCounter(['file.jsonl'])
results = counter.analyze()
print(counter.generate_report())

# Multiple files, specific field
counter = WordCounter(
    ['file1.jsonl', 'file2.jsonl'],
    fields=['text_ai_humanized']
)
results = counter.analyze()

# Access raw results
total_words = results['by_field']['text_ai_humanized'][0]['total_words']