Data Exploration Skill
Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.
Data Profiling Methodology
Phase 1: Structural Understanding
Before analyzing any data, understand its structure:
Table-level questions:
- •How many rows and columns?
- •What is the grain (one row per what)?
- •What is the primary key? Is it unique?
- •When was the data last updated?
- •How far back does the data go?
Column classification: Categorize each column as one of:
- •Identifier: Unique keys, foreign keys, entity IDs
- •Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
- •Metric: Quantitative values for measurement (revenue, count, duration, score)
- •Temporal: Dates and timestamps (created_at, updated_at, event_date)
- •Text: Free-form text fields (description, notes, name)
- •Boolean: True/false flags
- •Structural: JSON, arrays, nested structures
Phase 2: Column-Level Profiling
For each column, compute:
All columns:
- •Null count and null rate
- •Distinct count and cardinality ratio (distinct / total)
- •Most common values (top 5-10 with frequencies)
- •Least common values (bottom 5 to spot anomalies)
Numeric columns (metrics):
min, max, mean, median (p50) standard deviation percentiles: p1, p5, p25, p75, p95, p99 zero count negative count (if unexpected)
String columns (dimensions, text):
min length, max length, avg length empty string count pattern analysis (do values follow a format?) case consistency (all upper, all lower, mixed?) leading/trailing whitespace count
Date/timestamp columns:
min date, max date null dates future dates (if unexpected) distribution by month/week gaps in time series
Boolean columns:
true count, false count, null count true rate
Phase 3: Relationship Discovery
After profiling individual columns:
- •Foreign key candidates: ID columns that might link to other tables
- •Hierarchies: Columns that form natural drill-down paths (country > state > city)
- •Correlations: Numeric columns that move together
- •Derived columns: Columns that appear to be computed from others
- •Redundant columns: Columns with identical or near-identical information
Quality Assessment Framework
Completeness Score
Rate each column:
- •Complete (>99% non-null): Green
- •Mostly complete (95-99%): Yellow -- investigate the nulls
- •Incomplete (80-95%): Orange -- understand why and whether it matters
- •Sparse (<80%): Red -- may not be usable without imputation
Consistency Checks
Look for:
- •Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
- •Type inconsistency: Numbers stored as strings, dates in various formats
- •Referential integrity: Foreign keys that don't match any parent record
- •Business rule violations: Negative quantities, end dates before start dates, percentages > 100
- •Cross-column consistency: Status = "completed" but completed_at is null
Accuracy Indicators
Red flags that suggest accuracy issues:
- •Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
- •Default values: Suspiciously high frequency of a single value
- •Stale data: Updated_at shows no recent changes in an active system
- •Impossible values: Ages > 150, dates in the far future, negative durations
- •Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
Timeliness Assessment
- •When was the table last updated?
- •What is the expected update frequency?
- •Is there a lag between event time and load time?
- •Are there gaps in the time series?
Pattern Discovery Techniques
Distribution Analysis
For numeric columns, characterize the distribution:
- •Normal: Mean and median are close, bell-shaped
- •Skewed right: Long tail of high values (common for revenue, session duration)
- •Skewed left: Long tail of low values (less common)
- •Bimodal: Two peaks (suggests two distinct populations)
- •Power law: Few very large values, many small ones (common for user activity)
- •Uniform: Roughly equal frequency across range (often synthetic or random)
Temporal Patterns
For time series data, look for:
- •Trend: Sustained upward or downward movement
- •Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
- •Day-of-week effects: Weekday vs. weekend differences
- •Holiday effects: Drops or spikes around known holidays
- •Change points: Sudden shifts in level or trend
- •Anomalies: Individual data points that break the pattern
Segmentation Discovery
Identify natural segments by:
- •Finding categorical columns with 3-20 distinct values
- •Comparing metric distributions across segment values
- •Looking for segments with significantly different behavior
- •Testing whether segments are homogeneous or contain sub-segments
Correlation Exploration
Between numeric columns:
- •Compute correlation matrix for all metric pairs
- •Flag strong correlations (|r| > 0.7) for investigation
- •Note: Correlation does not imply causation -- flag this explicitly
- •Check for non-linear relationships (e.g., quadratic, logarithmic)
Schema Understanding and Documentation
Schema Documentation Template
When documenting a dataset for team use:
## Table: [schema.table_name] **Description**: [What this table represents] **Grain**: [One row per...] **Primary Key**: [column(s)] **Row Count**: [approximate, with date] **Update Frequency**: [real-time / hourly / daily / weekly] **Owner**: [team or person responsible] ### Key Columns | Column | Type | Description | Example Values | Notes | |--------|------|-------------|----------------|-------| | user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id | | event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values | | revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events | | created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column | ### Relationships - Joins to `users` on `user_id` - Joins to `products` on `product_id` - Parent of `event_details` (1:many on event_id) ### Known Issues - [List any known data quality issues] - [Note any gotchas for analysts] ### Common Query Patterns - [Typical use cases for this table]
Schema Exploration Queries
When connected to a data warehouse, use these patterns to discover schema:
-- List all tables in a schema (PostgreSQL) SELECT table_name, table_type FROM information_schema.tables WHERE table_schema = 'public' ORDER BY table_name; -- Column details (PostgreSQL) SELECT column_name, data_type, is_nullable, column_default FROM information_schema.columns WHERE table_name = 'my_table' ORDER BY ordinal_position; -- Table sizes (PostgreSQL) SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC; -- Row counts for all tables (general pattern) -- Run per-table: SELECT COUNT(*) FROM table_name
Lineage and Dependencies
When exploring an unfamiliar data environment:
- •Start with the "output" tables (what reports or dashboards consume)
- •Trace upstream: What tables feed into them?
- •Identify raw/staging/mart layers
- •Map the transformation chain from raw data to analytical tables
- •Note where data is enriched, filtered, or aggregated