DCT Profile - Data Quality Analysis
Analyze data files for value distributions, unique counts, and character frequencies.
When to Use
Use this skill when you need to:
- •Assess data quality before processing
- •Identify anomalies or outliers
- •Check for null/missing values
- •Analyze text field character distributions
- •Understand value cardinality
- •Validate data format compliance
Installation
bash
which dct || { go build -o dct && chmod +x ./dct; }
Usage
bash
dct prof <file> [flags]
Arguments
- •file: Data file to profile (CSV, JSON, NDJSON, or Parquet)
Flags
- •-o, --output <file>: Output to file instead of stdout
Examples
Profile a CSV file:
bash
dct prof data.csv
Profile a Parquet file:
bash
dct prof large.parquet
Save profile report:
bash
dct prof messy.csv -o data_quality_report.txt
Profile JSON data:
bash
dct prof data.json
Output Sections
The profile report includes detailed analysis for each column:
1. Count Statistics
Basic cardinality information:
code
-- Field: `email` --
Count: 1000
Unique Count: 995
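The gap between Count and Unique Count is the number of repeated values, so it can be pulled out of a saved report mechanically. The following is a minimal sketch, assuming the report uses the labels shown above (`Field:`, `Count:`, `Unique Count:`); the function name `profile_duplicates` is hypothetical and is not part of dct.

```shell
# Print "field count unique duplicates" for each field in a saved profile report.
# Assumption: the report contains "Field: `name`", "Count: N" and "Unique Count: M"
# in that order for every field, as in the sample output above.
profile_duplicates() {
  grep -oE 'Field: `[^`]*`|Unique Count: [0-9]+|Count: [0-9]+' "$1" |
    awk '
      /^Field: /        { field = $0; sub(/^Field: `/, "", field); sub(/`$/, "", field) }
      /^Count: /        { count = $2 }
      /^Unique Count: / { print field, count, $3, count - $3 }
    '
}
```

For example, `profile_duplicates profile.txt` after `dct prof data.csv -o profile.txt`. A last column well above zero on a field that should be unique (an ID or email) is worth investigating.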
2. Value Occurrences
Most common values with their frequencies:
code
Value Occurrence
row: value -> count
0: user@example.com -> 1
1: admin@example.com -> 1
...
MOSTLY UNIQUE VALUES SHOWING SAMPLE...
For high-cardinality fields, shows a sample of unique values.
3. String Length Statistics
For text fields, provides length metrics:
code
Value Summary - String Lengths
Min: 10
Mean: 22.500000
Max: 45
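A wide gap between the minimum and maximum length is easier to spot when the numbers sit side by side for every field. This is a rough sketch that assumes the `Field:`, `Min:` and `Max:` labels shown above and that `Min:` and `Max:` only appear in the string length summary; `length_spread` is a hypothetical helper name.

```shell
# Print "field min max range" for each field with a string length summary.
# Assumption: "Min: N" precedes "Max: M" within each field's section of the report.
length_spread() {
  grep -oE 'Field: `[^`]*`|Min: [0-9]+|Max: [0-9]+' "$1" |
    awk '
      /^Field: / { field = $0; sub(/^Field: `/, "", field); sub(/`$/, "", field) }
      /^Min: /   { min = $2 }
      /^Max: /   { print field, min, $2, $2 - min }
    '
}
```

Running `length_spread profile.txt` on the sample above would list `email` with a minimum of 10, a maximum of 45 and a range of 35.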
4. Character Frequency Analysis
Detailed character-level statistics:
code
Char Occurrence
row: rune -> count
00: '@' (hex: U+0040) (dec: 64) -> 1000
01: '.' (hex: U+002E) (dec: 46) -> 1000
02: 'e' (hex: U+0065) (dec: 101) -> 2500
Shows:
- •Character symbol
- •Hexadecimal code (U+XXXX)
- •Decimal code
- •Total occurrences
Data Quality Indicators
Look for these patterns in the output:
Missing/Null Values
- •Low count vs expected row count
- •<nil> values in the occurrence list
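To compare a field's Count against the expected row count, you need the expected number first. For a CSV with a single header line, a small helper like the one below gives it; `csv_data_rows` is a hypothetical name, and the sketch assumes one record per line (no quoted fields containing newlines).

```shell
# Count the data rows in a CSV, excluding the header line.
# Assumption: exactly one header line and no embedded newlines in field values.
csv_data_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}
```

If `csv_data_rows data.csv` prints 1000 but a field's Count in the profile is noticeably lower, that field is likely missing values in some rows.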
Duplicates
- •Count significantly higher than unique count
- •Same value appearing multiple times
Encoding Issues
- •Unexpected characters in char occurrence
- •Non-ASCII characters (hex > U+007F)
- •Null bytes (U+0000)
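These checks can be run against a saved report instead of being read by eye. The sketch below assumes character rows carry the `(hex: U+XXXX) (dec: N) -> count` fields shown in the Character Frequency Analysis sample; `suspicious_runes` is a hypothetical helper, not a dct command.

```shell
# Print "hex dec count" for every rune that is a null byte or outside ASCII.
# Assumption: each character row contains "(hex: U+XXXX) (dec: N) -> count".
suspicious_runes() {
  grep -oE '\(hex: U\+[0-9A-Fa-f]+\) \(dec: [0-9]+\) -> [0-9]+' "$1" |
    awk '{
      hex = $2; sub(/\)$/, "", hex)   # "U+00E9)" -> "U+00E9"
      dec = $4 + 0                    # "233)"    -> 233
      if (dec == 0 || dec > 127) print hex, dec, $6
    }'
}
```

Non-ASCII characters are not errors in themselves (names and addresses often contain them), so treat the output as a list to review rather than a list to delete.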
Format Inconsistencies
- •Wide range in string lengths
- •Mixed formats in same column
- •Special characters in unexpected places
Best Practices
- •Profile first: Always profile new data sources before processing
- •Check all columns: Review each field's statistics
- •Look for outliers: Extreme min/max values may indicate errors
- •Character analysis: Check for encoding issues, especially in text fields
- •Save reports: Use -o to save profiles for documentation
Example Workflow
bash
# 1. Profile the data
dct prof incoming_data.csv -o profile.txt

# 2. Review the output for issues:
#    - Check count matches expectations
#    - Look for nulls in value occurrences
#    - Review character frequencies for encoding issues

# 3. Fix issues if found
#    - Handle nulls
#    - Fix encoding
#    - Remove duplicates

# 4. Re-profile after fixes
dct prof cleaned_data.csv
Interpreting Results
Good Data Quality Signs
- •Count matches expected row count
- •Unique counts appropriate for field type
- •Character distributions match expected language/encoding
- •String lengths within reasonable bounds
Warning Signs
- •High null counts
- •Extreme string length variations
- •Unexpected special characters
- •Count/unique count ratio indicates duplicates
Related Skills
- •dct-peek: Quick preview before detailed profiling
- •dct-infer: Generate schema after quality check
- •dct-diff: Compare profiles of two file versions
Performance Notes
- •Profiles entire file by default
- •May be slow on very large files (>1GB)
- •Consider sampling large files with dct peek first
- •Character analysis can be memory-intensive on wide text columns
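The options of dct peek are not covered here, so as one possible way to sample a large CSV before profiling, the sketch below uses plain `head` instead. `sample_csv` is a hypothetical helper; it keeps the header plus the first N data rows, which is quick but biased toward the start of the file.

```shell
# Write the header line plus the first N data rows of a CSV to a new file.
# $1 = input CSV, $2 = output CSV, $3 = number of data rows to keep (default 1000)
sample_csv() {
  rows=${3:-1000}
  head -n "$((rows + 1))" "$1" > "$2"
}
```

For example, `sample_csv big.csv sample.csv 10000` followed by `dct prof sample.csv`. Counts from a sample will not match the full file, so use it to spot format and encoding problems rather than to verify row totals.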