Data Quality Assessment with qsv
Quality Dimensions
1. Completeness
Question: Are there missing values?
| Check | Command | What to Look For |
|---|---|---|
| Null counts | stats --cardinality --stats-jsonl | nullcount column > 0 |
| Empty strings | frequency --limit 10 | Empty string in top values |
| Sparsity | stats | sparsity field (ratio of nulls) |
Red flag: Sparsity > 0.5 means more than half the values are null.
2. Uniqueness
Question: Are there unwanted duplicates?
| Check | Command | What to Look For |
|---|---|---|
| Duplicate rows | dedup --dupes-output dupes.csv | Non-empty dupes file |
| Cardinality | stats --cardinality | cardinality vs row count |
| Unique ratio | stats | If cardinality = row count, column is unique |
Red flag: Key columns (ID, email) with cardinality < row count.
3. Validity
Question: Do values match expected formats?
| Check | Command | What to Look For |
|---|---|---|
| Schema validation | validate schema.json | Validation error count |
| Data types | stats | type column (String, Integer, Float, Date, etc.) |
| Format patterns | search --flag | Rows not matching expected regex |
| Value ranges | stats | min, max outside expected range |
Red flag: Type column shows "String" for what should be numeric data.
4. Consistency
Question: Are formats consistent across the dataset?
| Check | Command | What to Look For |
|---|---|---|
| Date formats | stats | Mixed date types in same column |
| Case consistency | frequency | "NYC" vs "nyc" vs "Nyc" as separate values |
| Encoding | sniff | Non-UTF-8 encoding detected |
| Delimiters | sniff | Unexpected delimiter or quoting |
| Row lengths | fixlengths --count | Rows with wrong number of fields |
Red flag: Frequency shows same value in different cases/formats.
5. Accuracy
Question: Are values plausible?
| Check | Command | What to Look For |
|---|---|---|
| Statistical outliers | stats | mean, stddev - values > 3 stddev from mean |
| Value distributions | frequency --limit 20 | Unexpected dominant values |
| Range checks | stats | min/max outside plausible range |
| Cross-field checks | sqlp | SQL WHERE clauses for business rules |
Red flag: Latitude > 90 or < -90, negative ages, future birth dates.
Quality Assessment Workflow
code
1. sniff -> Detect format, encoding, preamble issues 2. count -> Establish baseline row count 3. headers -> Verify expected columns exist 4. stats --cardinality --stats-jsonl -> Full statistical profile 5. frequency -> Value distribution for categorical columns 6. validate -> Schema validation (if schema available) 7. fixlengths --count -> Check for ragged rows
Quality Report Checklist
After profiling, report on:
- • Row count and column count
- • Null/empty counts per column (completeness)
- • Cardinality per column (uniqueness assessment)
- • Data types inferred per column (validity)
- • Min/max/mean for numeric columns (range plausibility)
- • Top frequency values for categorical columns (distribution)
- • Duplicate rows detected (uniqueness)
- • Schema violations if schema provided (validity)
- • Encoding and delimiter detected (consistency)
Common Data Quality Fixes
| Problem | Fix Command |
|---|---|
| Inconsistent case | apply operations upper/lower col |
| Leading/trailing whitespace | apply operations trim col |
| Duplicate rows | dedup |
| Ragged rows | fixlengths |
| Unsafe column names | safenames |
| Wrong encoding | input (normalizes to UTF-8) |
| Empty value replacement | apply emptyreplace "N/A" col |
| Invalid rows | validate schema.json + filter |