DCT Profile - Data Quality Analysis

Analyze data files for value distributions, unique counts, and character frequencies.

When to Use

Use this skill when you need to:

•Assess data quality before processing
•Identify anomalies or outliers
•Check for null/missing values
•Analyze text field character distributions
•Understand value cardinality
•Validate data format compliance

Installation

bash

which dct || { go build -o dct && chmod +x ./dct; }

Usage

bash

dct prof <file> [flags]

Arguments

•file: Data file to profile (CSV, JSON, NDJSON, or Parquet)

Flags

•-o, --output <file>: Output to file instead of stdout

Examples

Profile a CSV file:

bash

dct prof data.csv

Profile a Parquet file:

bash

dct prof large.parquet

Save profile report:

bash

dct prof messy.csv -o data_quality_report.txt

Profile JSON data:

bash

dct prof data.json

Output Sections

The profile report includes detailed analysis for each column:

1. Count Statistics

Basic cardinality information:

code

-- Field: `email` --
Count: 1000
Unique Count: 995

2. Value Occurrences

Most common values with their frequencies:

code

Value Occurrence
row: value -> count
0: user@example.com -> 1
1: admin@example.com -> 1
...
MOSTLY UNIQUE VALUES SHOWING SAMPLE...

For high-cardinality fields, where most values are unique, the report shows only a sample of values.

3. String Length Statistics

For text fields, provides length metrics:

code

Value Summary - String Lengths
Min: 10
Mean: 22.500000
Max: 45

4. Character Frequency Analysis

Detailed character-level statistics:

code

Char Occurrence
row: rune -> count
00: '@' (hex: U+0040) (dec: 64) -> 1000
01: '.' (hex: U+002E) (dec: 46) -> 1000
02: 'e' (hex: U+0065) (dec: 101) -> 2500

Shows:

•Character symbol
•Hexadecimal code (U+XXXX)
•Decimal code
•Total occurrences

Data Quality Indicators

Look for these patterns in the output:

Missing/Null Values

•Count lower than the expected row count
•<nil> values in the occurrence list
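
To find these entries in a saved report without reading it line by line, a small awk filter can help. The sketch below is not part of dct: the function name nil_fields is hypothetical, and it assumes the report layout shown under Output Sections, with a null rendered as an occurrence row such as "3: <nil> -> 12".

```shell
# Print "<field> <count>" for every field whose value occurrences
# include a <nil> row. Assumes field names contain no spaces.
nil_fields() {
  awk '
    # "-- Field: `name` --" starts a new field section
    /^-- Field: / { field = $3; gsub(/`/, "", field) }
    # "N: <nil> -> count" is a null entry in the occurrence list
    /^[0-9]+: <nil> -> / { print field, $NF }
  ' "$@"
}
```

After saving a report with dct prof data.csv -o profile.txt, running nil_fields profile.txt lists only the affected fields. No output means no <nil> rows were found in the portion of values the report shows.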

Duplicates

•Count significantly higher than unique count
•Same value appearing multiple times
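
The gap between Count and Unique Count can be computed per field rather than read by eye. This is a minimal sketch under the assumption that each field section contains a "Count:" line followed by a "Unique Count:" line, as in the sample output above. The function name dup_counts is hypothetical.

```shell
# Print "<field> <count> <unique> <count - unique>" for each field.
# A non-zero last column means at least one value repeats.
dup_counts() {
  awk '
    /^-- Field: /     { field = $3; gsub(/`/, "", field) }
    /^Count: /        { count = $2 }
    /^Unique Count: / { print field, count, $3, count - $3 }
  ' "$@"
}
```

For the email sample shown earlier (Count 1000, Unique Count 995), dup_counts reports a difference of 5. Whether that matters depends on the field: repeats are expected in a status column but not in a primary key.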

Encoding Issues

•Unexpected characters in char occurrence
•Non-ASCII characters (code points above U+007F) in fields expected to be ASCII
•Null bytes (U+0000) or replacement characters (U+FFFD, shown as �)
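
The decimal code in each Char Occurrence row makes these checks easy to automate. The sketch below assumes rows follow the format shown under Character Frequency Analysis, ending in "(dec: N) -> count". The function name suspect_chars is hypothetical and not a dct feature.

```shell
# Print "<field> U+XXXX <count>" for null bytes and non-ASCII code points.
suspect_chars() {
  awk '
    /^-- Field: / { field = $3; gsub(/`/, "", field) }
    /\(dec: [0-9]+\) -> / {
      dec = $0
      sub(/.*\(dec: /, "", dec)   # drop everything before the decimal code
      sub(/\).*/, "", dec)        # drop everything after it
      # flag U+0000 and anything above U+007F
      if (dec + 0 == 0 || dec + 0 > 127)
        printf "%s U+%04X %s\n", field, dec + 0, $NF
    }
  ' "$@"
}
```

Non-ASCII characters are not errors in themselves. Accented letters in a name field are normal, so treat the output as a list of rows to review rather than a list of defects.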

Format Inconsistencies

•Wide range in string lengths
•Mixed formats in same column
•Special characters in unexpected places
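
A wide length range is easiest to judge when Min and Max sit side by side for every field. The following sketch assumes Min and Max lines appear under a "Value Summary - String Lengths" heading, as in the sample output. The function name length_spread is hypothetical.

```shell
# Print "<field> <min> <max> <max - min>" for each text field.
length_spread() {
  awk '
    /^-- Field: /                     { field = $3; gsub(/`/, "", field); in_len = 0 }
    /^Value Summary - String Lengths/ { in_len = 1 }
    # only read Min/Max inside the string-length summary
    in_len && /^Min: /                { min = $2 }
    in_len && /^Max: /                { print field, min, $2, $2 - min; in_len = 0 }
  ' "$@"
}
```

A spread of zero suggests a fixed-width format such as a date or an ID. A large spread in a column expected to be fixed-width is a sign of mixed formats.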

Best Practices

•Profile first: Always profile new data sources before processing
•Check all columns: Review each field's statistics
•Look for outliers: Extreme min/max string lengths may indicate errors
•Character analysis: Check for encoding issues, especially in text fields
•Save reports: Use -o to save profiles for documentation

Example Workflow

bash

# 1. Profile the data
dct prof incoming_data.csv -o profile.txt

# 2. Review the output for issues:
#    - Check count matches expectations
#    - Look for nulls in value occurrences
#    - Review character frequencies for encoding issues

# 3. Fix issues if found
#    - Handle nulls
#    - Fix encoding
#    - Remove duplicates

# 4. Re-profile after fixes
dct prof cleaned_data.csv

Interpreting Results

Good Data Quality Signs

•Count matches expected row count
•Unique counts appropriate for field type
•Character distributions match expected language/encoding
•String lengths within reasonable bounds

Warning Signs

•High null counts
•Extreme string length variations
•Unexpected special characters
•Count much higher than unique count in a field that should be unique

Related Skills

•dct-peek: Quick preview before detailed profiling
•dct-infer: Generate schema after quality check
•dct-diff: Compare profiles of two file versions

Performance Notes

•Profiles entire file by default
•May be slow on very large files (>1GB)
•Consider sampling large files with dct peek first
•Character analysis can be memory-intensive on wide text columns
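
One way to keep profiling fast on a large CSV is to profile only its first rows. The sketch below uses the standard head utility rather than any dct feature. The function name csv_head_sample is hypothetical, and it assumes one record per line (no quoted fields containing newlines).

```shell
# Write the header line plus the first N data rows of a CSV to stdout.
# N defaults to 10000.
csv_head_sample() {
  file=$1
  rows=${2:-10000}
  head -n "$((rows + 1))" "$file"
}
```

For example, csv_head_sample big.csv 10000 > sample.csv followed by dct prof sample.csv profiles the sample only. The first rows of a file are not a random sample, so counts and distributions from it are indicative at best. Profile the full file before drawing conclusions.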