AgentSkillsCN

dct-profile

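Use this skill when the user wants to analyze data quality, profile data files, check value distributions, perform character analysis on text fields, identify data quality issues, or get statistics about dataset contents. Triggers include "profile this data", "analyze data quality", "check for nulls", "value distribution", "character frequency", "data statistics", "column profiling", or when doing exploratory data analysis or quality assessment.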

SKILL.md
--- frontmatter
name: dct-profile
description: Use this skill when the user wants to analyze data quality, profile data files, check value distributions, perform character analysis on text fields, identify data quality issues, or get statistics about dataset contents. Triggers include "profile this data", "analyze data quality", "check for nulls", "value distribution", "character frequency", "data statistics", "column profiling", or when doing exploratory data analysis or quality assessment.

DCT Profile - Data Quality Analysis

Analyze data files for value distributions, unique counts, and character frequencies.

When to Use

Use this skill when you need to:

  • Assess data quality before processing
  • Identify anomalies or outliers
  • Check for null/missing values
  • Analyze text field character distributions
  • Understand value cardinality
  • Validate data format compliance

Installation

bash
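# Build dct only when it is not already on PATH (run from the dct source directory)
command -v dct >/dev/null 2>&1 || { go build -o dct && chmod +x ./dct; }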

Usage

bash
dct prof <file> [flags]

Arguments

  • file: Data file to profile (CSV, JSON, NDJSON, or Parquet)

Flags

  • -o, --output <file>: Output to file instead of stdout

Examples

Profile a CSV file:

bash
dct prof data.csv

Profile a Parquet file:

bash
dct prof large.parquet

Save the profile report to a file:

bash
dct prof messy.csv -o data_quality_report.txt

Profile JSON data:

bash
dct prof data.json
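
When several files need profiling, the invocation above can be wrapped in a small loop. The following is a minimal sketch, not part of dct: the function name `profile_dir`, the `.ndjson` extension, and the `<input>.profile.txt` naming scheme are assumptions made for illustration. Only `dct prof <file> -o <file>` comes from this document.

```shell
# profile_dir SRC_DIR OUT_DIR
# Hypothetical helper: runs `dct prof <file> -o <report>` for every CSV,
# JSON, NDJSON, and Parquet file directly inside SRC_DIR.
profile_dir() {
  local src="$1" out="$2" f base
  mkdir -p "$out" || return 1
  for f in "$src"/*.csv "$src"/*.json "$src"/*.ndjson "$src"/*.parquet; do
    [ -e "$f" ] || continue            # skip patterns that matched nothing
    base="$(basename "$f")"
    # One report per input file, named after the input (assumed scheme).
    dct prof "$f" -o "$out/$base.profile.txt" || echo "failed: $f" >&2
  done
}
```

A call such as `profile_dir ./incoming ./reports` would leave one report per data file in `./reports`, which keeps the reports easy to match to their inputs.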

Output Sections

The profile report includes detailed analysis for each column:

1. Count Statistics

Basic cardinality information:

code
-- Field: `email` --
Count: 1000
Unique Count: 995

2. Value Occurrences

Most common values with their frequencies:

code
Value Occurrence
row: value -> count
0: user@example.com -> 1
1: admin@example.com -> 1
...
MOSTLY UNIQUE VALUES SHOWING SAMPLE...

For high-cardinality fields, the report shows only a sample of the unique values, as the MOSTLY UNIQUE VALUES SHOWING SAMPLE... line indicates.

3. String Length Statistics

For text fields, provides length metrics:

code
Value Summary - String Lengths
Min: 10
Mean: 22.500000
Max: 45

4. Character Frequency Analysis

Detailed character-level statistics:

code
Char Occurrence
row: rune -> count
00: '@' (hex: U+0040) (dec: 64) -> 1000
01: '.' (hex: U+002E) (dec: 46) -> 1000
02: 'e' (hex: U+0065) (dec: 101) -> 2500

Shows:

  • Character symbol
  • Hexadecimal code (U+XXXX)
  • Decimal code
  • Total occurrences

Data Quality Indicators

Look for these patterns in the output:

Missing/Null Values

  • Low count vs expected row count
  • <nil> values in occurrence list
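
Both indicators can be checked mechanically on a saved report. The sketch below assumes the line shapes from the sample output and further assumes that a null entry appears in the occurrence list as `N: <nil> -> count`; that exact shape is a guess based on the `row: value -> count` header. The function name `missing_report` is hypothetical.

```shell
# missing_report EXPECTED_ROWS: read a dct profile report on stdin and print
#   SHORT <field> <count>   when a field's Count is below EXPECTED_ROWS
#   NIL <field> <count>     when the occurrence list has a "<nil>" entry
missing_report() {
  awk -v expected="$1" '
    /^-- Field: / {
      name = $0
      sub(/^-- Field: `/, "", name)
      sub(/` --.*$/, "", name)
    }
    /^Count: / && $2 + 0 < expected + 0 { print "SHORT", name, $2 }
    /^[0-9]+: <nil> -> [0-9]+$/         { print "NIL", name, $NF }
  '
}
```

For a file expected to hold 1000 rows, `missing_report 1000 < profile.txt` prints nothing when every field is fully populated.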

Duplicates

  • Count significantly higher than unique count
  • Same value appearing multiple times

Encoding Issues

  • Unexpected characters in char occurrence
  • Non-ASCII characters (hex > U+007F)
  • Null bytes (U+0000)
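
The null-byte and non-ASCII checks lend themselves to a quick filter over the Char Occurrence rows. This is a sketch under the assumption that every row ends in `(dec: N) -> count` as in the sample output; the function name `flag_chars` and its output columns are hypothetical.

```shell
# flag_chars: read a dct profile report on stdin and list characters that are
# null bytes (U+0000) or non-ASCII code points (above U+007F).
# Output columns: field, code point, occurrences, kind (tab-separated).
flag_chars() {
  awk '
    /^-- Field: / {
      name = $0
      sub(/^-- Field: `/, "", name)
      sub(/` --.*$/, "", name)
    }
    # Rows are assumed to end like: (hex: U+0040) (dec: 64) -> 1000
    match($0, /\(dec: [0-9]+\) -> [0-9]+$/) {
      dec = substr($0, RSTART + 6)     # text after "(dec: "
      sub(/\).*/, "", dec)             # keep only the digits
      dec += 0
      if (dec == 0)       kind = "null-byte"
      else if (dec > 127) kind = "non-ascii"
      else                next
      printf "%s\tU+%04X\t%s\t%s\n", name, dec, $NF, kind
    }
  '
}
```

Non-ASCII characters are not errors in themselves (names and addresses often contain them), so the output is a review list rather than a failure list.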

Format Inconsistencies

  • Wide range in string lengths
  • Mixed formats in same column
  • Special characters in unexpected places
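
A wide string-length range can be spotted by comparing the Min and Max lines of each field. The sketch below assumes the `Value Summary - String Lengths` block shown earlier, with Min listed before Max; the function name `length_spread` and the idea of a fixed spread threshold are illustrative choices, not dct features.

```shell
# length_spread MAX_SPREAD: read a dct profile report on stdin and print the
# fields whose string-length range (Max - Min) exceeds MAX_SPREAD.
# Output columns: field, Min, Max (tab-separated).
length_spread() {
  awk -v limit="$1" '
    /^-- Field: / {
      name = $0
      sub(/^-- Field: `/, "", name)
      sub(/` --.*$/, "", name)
      in_len = 0
    }
    /^Value Summary - String Lengths/ { in_len = 1; next }
    in_len && /^Min: / { min = $2 }
    in_len && /^Max: / {
      if ($2 - min > limit + 0) printf "%s\t%d\t%d\n", name, min, $2
      in_len = 0                       # ignore Min/Max lines of other sections
    }
  '
}
```

A suitable threshold depends on the column: a spread of 35 is unremarkable for email addresses but suspicious for a fixed-width code.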

Best Practices

  • Profile first: Always profile new data sources before processing
  • Check all columns: Review each field's statistics
  • Look for outliers: Extreme min/max values may indicate errors
  • Character analysis: Check for encoding issues, especially in text fields
  • Save reports: Use -o to save profiles for documentation

Example Workflow

bash
# 1. Profile the data
dct prof incoming_data.csv -o profile.txt

# 2. Review the output for issues:
#    - Check count matches expectations
#    - Look for nulls in value occurrences
#    - Review character frequencies for encoding issues

# 3. Fix issues if found
#    - Handle nulls
#    - Fix encoding
#    - Remove duplicates

# 4. Re-profile after fixes
dct prof cleaned_data.csv
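
To see what a cleanup actually changed, the reports from steps 1 and 4 can be compared line by line. The sketch below uses the standard `diff` utility on two saved reports as a plain-text alternative to a dedicated comparison tool; the function name `reprofile_diff` and the temporary-file handling are hypothetical, and only `dct prof <file> -o <file>` is taken from this document.

```shell
# reprofile_diff BEFORE AFTER: profile both files and print a unified diff of
# the two reports. Returns 0 when the reports are identical, non-zero otherwise.
reprofile_diff() {
  local tmp status
  tmp="$(mktemp -d)" || return 1
  dct prof "$1" -o "$tmp/before.txt" &&
    dct prof "$2" -o "$tmp/after.txt" &&
    diff -u "$tmp/before.txt" "$tmp/after.txt"
  status=$?                            # diff exits 1 when the reports differ
  rm -rf "$tmp"
  return "$status"
}
```

For example, `reprofile_diff incoming_data.csv cleaned_data.csv` would show changed counts and any characters that disappeared from the Char Occurrence lists.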

Interpreting Results

Good Data Quality Signs

  • Count matches expected row count
  • Unique counts appropriate for field type
  • Character distributions match expected language/encoding
  • String lengths within reasonable bounds

Warning Signs

  • High null counts
  • Extreme string length variations
  • Unexpected special characters
  • Count well above unique count in a field expected to be unique (indicates duplicates)

Related Skills

  • dct-peek: Quick preview before detailed profiling
  • dct-infer: Generate schema after quality check
  • dct-diff: Compare profiles of two file versions

Performance Notes

  • Profiles entire file by default
  • May be slow on very large files (>1GB)
  • Consider sampling large files with dct peek first
  • Character analysis can be memory-intensive on wide text columns