Data Quality Assessment with qsv

Quality Dimensions

1. Completeness

Question: Are there missing values?

| Check | Command | What to Look For |
| --- | --- | --- |
| Null counts | `stats --cardinality --stats-jsonl` | `nullcount` column > 0 |
| Empty strings | `frequency --limit 10` | Empty string in top values |
| Sparsity | `stats` | `sparsity` field (ratio of nulls) |

Red flag: Sparsity > 0.5 means more than half the values are null.
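
To make the sparsity threshold concrete, here is a minimal Python sketch of the same ratio computed by hand. It does not call qsv. The function name `column_sparsity`, the sample data, and the choice to count whitespace-only cells as empty are all assumptions made for illustration.

```python
import csv
import io


def column_sparsity(csv_text):
    """Return {column: fraction of empty cells} for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    empty = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Assumption: missing and whitespace-only cells both count as empty.
            if not (row[name] or "").strip():
                empty[name] += 1
    return {name: (empty[name] / rows if rows else 0.0) for name in columns}


# Hypothetical sample: phone is empty in 3 of 4 rows.
SAMPLE = (
    "id,email,phone\n"
    "1,a@example.com,\n"
    "2,,\n"
    "3,c@example.com,555-0100\n"
    "4,d@example.com,\n"
)

sparsity = column_sparsity(SAMPLE)
# Columns over the 0.5 red-flag threshold.
flagged = [name for name, ratio in sparsity.items() if ratio > 0.5]
print(sparsity)
print(flagged)
```

Only `phone` crosses the threshold here; `email` has one gap in four rows and stays below it.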

2. Uniqueness

Question: Are there unwanted duplicates?

| Check | Command | What to Look For |
| --- | --- | --- |
| Duplicate rows | `dedup --dupes-output dupes.csv` | Non-empty dupes file |
| Cardinality | `stats --cardinality` | `cardinality` vs row count |
| Unique ratio | `stats` | If cardinality = row count, column is unique |

Red flag: Key columns (ID, email) with cardinality < row count.
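
The cardinality comparison can be sketched in plain Python as follows. This is an illustration of the logic, not of how qsv implements it. `uniqueness_report` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def uniqueness_report(csv_text, key_columns):
    """Compare per-column cardinality with the row count and count duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    row_count = len(rows)
    # Cardinality = number of distinct values in the column.
    cardinality = {col: len({row[col] for row in rows}) for col in key_columns}
    # A key column is suspect when it has fewer distinct values than rows.
    non_unique = [col for col in key_columns if cardinality[col] < row_count]
    # Full-row duplicates: every occurrence after the first of an identical row.
    seen = Counter(tuple(row.values()) for row in rows)
    duplicate_rows = sum(n - 1 for n in seen.values() if n > 1)
    return row_count, cardinality, non_unique, duplicate_rows


# Hypothetical sample: id 2 is repeated and one email is shared by two ids.
SAMPLE = (
    "id,email\n"
    "1,a@example.com\n"
    "2,b@example.com\n"
    "2,b@example.com\n"
    "3,a@example.com\n"
)

row_count, cardinality, non_unique, duplicate_rows = uniqueness_report(
    SAMPLE, ["id", "email"]
)
print(row_count, cardinality, non_unique, duplicate_rows)
```

The sample separates the two cases: `email` is non-unique even where the full rows differ, so a column-level cardinality check can catch problems that a row-level dedup would miss.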

3. Validity

Question: Do values match expected formats?

| Check | Command | What to Look For |
| --- | --- | --- |
| Schema validation | `validate schema.json` | Validation error count |
| Data types | `stats` | `type` column (String, Integer, Float, Date, etc.) |
| Format patterns | `search --flag` | Rows not matching expected regex |
| Value ranges | `stats` | `min`, `max` outside expected range |

Red flag: The type column shows "String" for data that should be numeric.
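
As a rough illustration of the format and type checks, the Python sketch below flags cells that fail a regex and lists the values that keep a column from parsing as numeric. The email pattern is deliberately loose and hypothetical, as are the function names and sample data. None of this reflects qsv internals.

```python
import csv
import io
import re

# Hypothetical pattern: a loose "something@something.tld" shape, not a full RFC check.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def invalid_rows(csv_text, column, pattern):
    """Return (data_row_number, value) pairs for cells that do not match pattern."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for index, row in enumerate(reader, start=1):
        value = row[column] or ""
        if not pattern.fullmatch(value):
            bad.append((index, value))
    return bad


def non_numeric_values(csv_text, column):
    """Return the non-empty values in a column that cannot be parsed as numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    offenders = []
    for row in reader:
        value = (row[column] or "").strip()
        if not value:
            continue  # empty cells are a completeness issue, not a type issue
        try:
            float(value)
        except ValueError:
            offenders.append(value)
    return offenders


SAMPLE = (
    "id,email,amount\n"
    "1,a@example.com,10\n"
    "2,not-an-email,12.5\n"
    "3,c@example,n/a\n"
    "4,d@example.org,7\n"
)

bad_emails = invalid_rows(SAMPLE, "email", EMAIL_RE)
bad_amounts = non_numeric_values(SAMPLE, "amount")
print(bad_emails)
print(bad_amounts)
```

A stray placeholder such as `n/a` in an otherwise numeric column is a typical reason for an unexpected "String" type, so listing the offending values is usually the next step after spotting the red flag.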

4. Consistency

Question: Are formats consistent across the dataset?

| Check | Command | What to Look For |
| --- | --- | --- |
| Date formats | `stats` | Mixed date types in same column |
| Case consistency | `frequency` | "NYC" vs "nyc" vs "Nyc" as separate values |
| Encoding | `sniff` | Non-UTF-8 encoding detected |
| Delimiters | `sniff` | Unexpected delimiter or quoting |
| Row lengths | `fixlengths --count` | Rows with wrong number of fields |

Red flag: The frequency output lists the same value in different cases or formats.
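
One way to spot case variants in a frequency listing is to group values by a normalized key and report the groups with more than one raw spelling. The sketch below does this in plain Python. `case_variants` and the sample data are hypothetical, and folding case plus trimming whitespace is just one possible definition of "same value".

```python
import csv
import io
from collections import Counter, defaultdict


def case_variants(csv_text, column):
    """Group values that differ only by case or surrounding whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    groups = defaultdict(Counter)
    for row in reader:
        raw = row[column] or ""
        # Normalized key: trimmed and case-folded.
        groups[raw.strip().casefold()][raw] += 1
    # Keep only keys that appear under more than one raw spelling.
    return {key: dict(counts) for key, counts in groups.items() if len(counts) > 1}


SAMPLE = "city\nNYC\nnyc\nNyc\nBoston\nBoston\nNYC\n"

variants = case_variants(SAMPLE, "city")
print(variants)
```

`Boston` appears twice with a single spelling and is not reported; only the `nyc` group, with three spellings, is.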

5. Accuracy

Question: Are values plausible?

| Check | Command | What to Look For |
| --- | --- | --- |
| Statistical outliers | `stats` | `mean`, `stddev`: values > 3 stddev from mean |
| Value distributions | `frequency --limit 20` | Unexpected dominant values |
| Range checks | `stats` | `min`/`max` outside plausible range |
| Cross-field checks | `sqlp` | SQL WHERE clauses for business rules |

Red flag: Latitude > 90 or < -90, negative ages, future birth dates.
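
The two numeric checks, the 3-stddev rule and the plausible-range test, can be sketched in Python as below. The function names and sample values are hypothetical. The sketch uses the population standard deviation, which is an assumption and may not match the figure a given tool reports.

```python
import statistics


def three_sigma_outliers(values):
    """Return values more than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    stddev = statistics.pstdev(values)  # assumption: population stddev
    if stddev == 0:
        return []
    return [v for v in values if abs(v - mean) > 3 * stddev]


def out_of_range(values, low, high):
    """Return values outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]


# Hypothetical ages: twenty plausible values and one data-entry error.
ages = list(range(28, 48)) + [500]
# Hypothetical latitudes: two fall outside the valid -90..90 range.
latitudes = [40.7, -91.2, 12.0, 95.5]

age_outliers = three_sigma_outliers(ages)
bad_latitudes = out_of_range(latitudes, -90, 90)
print(age_outliers)
print(bad_latitudes)
```

The 3-stddev rule needs enough rows to work: with the population standard deviation, no value in a sample of ten or fewer can be more than 3 stddev from the mean, so fixed range checks are the safer test on small files.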

Quality Assessment Workflow

1. sniff           -> Detect format, encoding, preamble issues
2. count           -> Establish baseline row count
3. headers         -> Verify expected columns exist
4. stats --cardinality --stats-jsonl -> Full statistical profile
5. frequency       -> Value distribution for categorical columns
6. validate        -> Schema validation (if schema available)
7. fixlengths --count -> Check for ragged rows
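
For scripting the sequence, the steps above can be expressed as an ordered list of command lines. The Python sketch below only builds and prints them; it does not run qsv. The subcommands and flags are copied from the workflow as written. The function name, the file names, and the placement of the input path as a trailing positional argument are assumptions to check against each command's usage text.

```python
import shlex


def build_workflow(csv_path, schema_path=None):
    """Return the profiling steps as argv lists, in the order given above."""
    steps = [
        ["qsv", "sniff", csv_path],
        ["qsv", "count", csv_path],
        ["qsv", "headers", csv_path],
        ["qsv", "stats", "--cardinality", "--stats-jsonl", csv_path],
        ["qsv", "frequency", csv_path],
    ]
    # Schema validation only applies when a schema is available.
    if schema_path is not None:
        steps.append(["qsv", "validate", csv_path, schema_path])
    steps.append(["qsv", "fixlengths", "--count", csv_path])
    return steps


# Hypothetical file names.
workflow = build_workflow("data.csv", schema_path="schema.json")
for argv in workflow:
    print(shlex.join(argv))
```

Keeping each step as an argv list rather than a single string avoids quoting problems if the list is later passed to `subprocess.run`.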

Quality Report Checklist

After profiling, report on:

  • Row count and column count
  • Null/empty counts per column (completeness)
  • Cardinality per column (uniqueness assessment)
  • Data types inferred per column (validity)
  • Min/max/mean for numeric columns (range plausibility)
  • Top frequency values for categorical columns (distribution)
  • Duplicate rows detected (uniqueness)
  • Schema violations if schema provided (validity)
  • Encoding and delimiter detected (consistency)

Common Data Quality Fixes

| Problem | Fix Command |
| --- | --- |
| Inconsistent case | `apply operations upper col` (or `lower`) |
| Leading/trailing whitespace | `apply operations trim col` |
| Duplicate rows | `dedup` |
| Ragged rows | `fixlengths` |
| Unsafe column names | `safenames` |
| Wrong encoding | `input` (normalizes to UTF-8) |
| Empty value replacement | `apply emptyreplace --replacement "N/A" col` |
| Invalid rows | `validate schema.json` + filter |
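
To show how several of these fixes interact, here is a small Python sketch that trims whitespace, normalizes case, replaces empty cells, and then drops duplicate rows. It is an illustration of the order of operations, not a replacement for the qsv commands. `clean_rows`, its parameters, and the sample data are hypothetical, and rows are assumed to be rectangular (no ragged rows).

```python
import csv
import io


def clean_rows(csv_text, case_columns=(), empty_replacement=None):
    """Trim, upper-case selected columns, fill empties, then drop duplicate rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    cleaned = []
    for row in reader:
        fixed = {}
        for name, value in row.items():
            value = (value or "").strip()
            if name in case_columns:
                value = value.upper()
            if value == "" and empty_replacement is not None:
                value = empty_replacement
            fixed[name] = value
        # Deduplicate on the normalized row, keeping the first occurrence.
        key = tuple(fixed.values())
        if key not in seen:
            seen.add(key)
            cleaned.append(fixed)
    return cleaned


SAMPLE = "id,city,note\n1, nyc ,ok\n1,NYC,ok\n2,Boston,\n"

cleaned = clean_rows(SAMPLE, case_columns=("city",), empty_replacement="N/A")
print(cleaned)
```

Normalizing before deduplicating matters: the first two sample rows only become identical once whitespace and case are fixed, so running the dedup step first would leave both in place.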