Exploratory Data Analysis (EDA)
Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality.
When to use this skill
- New dataset: need orientation on structure, types, and distributions
- Before feature engineering: understand variable relationships
- Data quality investigation: find anomalies, missing-value patterns, and outliers
- Model preparation: validate assumptions about the data
Core EDA workflow
- Profile structure
  - Schema, types, cardinality
  - Missing value patterns
- Analyze distributions
  - Numerical: histograms, boxplots, skewness
  - Categorical: frequencies, rare categories
- Explore relationships
  - Correlation matrix (numerical)
  - Cross-tabulations (categorical)
  - Target-variable relationships
- Identify issues
  - Outliers, duplicates, inconsistencies
  - Class imbalance (classification)
  - Temporal patterns (time series)
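The first passes of this workflow can be sketched in plain pandas. This is a minimal illustration, not a prescribed implementation: the helper `profile_structure` and the toy columns (`age`, `segment`, `churned`) are hypothetical names invented for the example.

```python
import numpy as np
import pandas as pd


def profile_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, cardinality, and missing share."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=True),
            "missing_frac": df.isna().mean(),
        }
    )


# Hypothetical toy data, for illustration only.
df = pd.DataFrame(
    {
        "age": [34, 51, np.nan, 29, 29],
        "segment": ["a", "b", "a", None, "a"],
        "churned": [0, 1, 0, 0, 0],
    }
)

# Profile structure: schema, cardinality, missing values.
summary = profile_structure(df)
print(summary)

# Analyze distributions: skewness for numerical columns,
# frequencies for a categorical column (missing values kept as a bucket).
skewness = df.select_dtypes("number").skew()
segment_counts = df["segment"].value_counts(dropna=False)

# Identify issues: exact duplicate rows and class balance of the target.
n_duplicates = int(df.duplicated().sum())
class_share = df["churned"].value_counts(normalize=True)
```

On a real dataset the same calls apply unchanged; only the column names differ.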
Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Automated profiling | ydata-profiling (formerly pandas-profiling) | Fast, comprehensive reports |
| Interactive exploration | ipywidgets + plotly | Drill-down capability |
| Statistical tests | scipy.stats | Normality, correlations |
| Large datasets | Polars (lazy API) | Memory-efficient |
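As an illustration of the "Statistical tests" row, the sketch below runs a Shapiro-Wilk normality test and a Spearman rank correlation with `scipy.stats`. The variables `income` and `spend` are synthetic and hypothetical, and the choice of these two tests is an assumption for the example, not a recommendation from this skill.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data: a right-skewed variable and a noisy dependent one.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=500)
spend = 0.1 * income + rng.normal(0, 500, size=500)

# Shapiro-Wilk: a small p-value suggests a departure from normality.
stat_raw, p_raw = stats.shapiro(income)
# The same test after a log transform, a common remedy for right skew.
stat_log, p_log = stats.shapiro(np.log(income))

# Spearman rank correlation: monotonic association, robust to skew.
rho, p_rho = stats.spearmanr(income, spend)

print(f"Shapiro p (raw): {p_raw:.3g}, Shapiro p (log): {p_log:.3g}")
print(f"Spearman rho: {rho:.2f}")
```

With large samples, normality tests reject for even trivial deviations, so it helps to read the p-value alongside a histogram or Q-Q plot.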
Core implementation rules
1) Start with automated profiling
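```python
import polars as pl
from ydata_profiling import ProfileReport

# ProfileReport expects a pandas DataFrame; to_pandas() requires pyarrow.
df = pl.read_parquet("data.parquet")
# On very large tables, minimal=True skips the most expensive analyses.
profile = ProfileReport(df.to_pandas(), title="Data Profile")
profile.to_file("profile_report.html")
```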
2) Focus on actionable insights
- Document outliers worth investigating (not all outliers are problems)
- Flag features with high cardinality or rare categories
- Note strong correlations that may cause multicollinearity
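The three checks above could be sketched as small pandas helpers. The function names, the 1.5 × IQR fence, and the thresholds are assumptions chosen for illustration; suitable cutoffs depend on the dataset.

```python
import numpy as np
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def rare_categories(s: pd.Series, min_share: float = 0.01) -> list:
    """Category labels whose relative frequency is below min_share."""
    share = s.value_counts(normalize=True)
    return share[share < min_share].index.tolist()


def strong_correlations(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Absolute Pearson correlations at or above threshold, one entry per pair."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle so each pair appears only once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Hypothetical toy data, for illustration only.
df = pd.DataFrame(
    {
        "height_cm": [150, 160, 170, 180, 190, 175],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8, 68.9],
        "visits": [1, 2, 1, 3, 2, 40],
        "plan": ["basic", "basic", "basic", "pro", "pro", "trial"],
    }
)

outlier_rows = df[iqr_outliers(df["visits"])]       # candidates to investigate
rare = rare_categories(df["plan"], min_share=0.2)   # labels to group or review
collinear = strong_correlations(df, threshold=0.9)  # redundant feature pairs
```

The outlier mask only flags candidates; whether a flagged row is an error or a genuine extreme value remains a judgment call.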
3) Visualize for communication
- Distribution plots for key variables
- Correlation heatmap
- Missing value patterns
- Target relationship plots
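One way to cover these four plots in a single figure is sketched below with plain Matplotlib. The synthetic columns (`tenure`, `monthly_fee`, `total_paid`, `churned`) are hypothetical, and the 2 × 2 layout is just one possible arrangement.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame(
    {
        "tenure": rng.exponential(24, n),
        "monthly_fee": rng.normal(50, 10, n),
        "churned": (rng.random(n) < 0.3).astype(int),
    }
)
df["total_paid"] = df["tenure"] * df["monthly_fee"]
df.loc[rng.choice(n, 30, replace=False), "monthly_fee"] = np.nan

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
ax_dist, ax_corr, ax_miss, ax_target = axes.ravel()

# 1. Distribution of a key variable.
ax_dist.hist(df["tenure"], bins=30)
ax_dist.set_title("tenure distribution")

# 2. Correlation heatmap on a fixed [-1, 1] colour scale.
corr = df.corr()
im = ax_corr.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_corr.set_xticks(range(len(corr.columns)))
ax_corr.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_corr.set_yticks(range(len(corr.columns)))
ax_corr.set_yticklabels(corr.columns)
ax_corr.set_title("correlation")
fig.colorbar(im, ax=ax_corr)

# 3. Share of missing values per column.
missing_share = df.isna().mean()
ax_miss.bar(missing_share.index.tolist(), missing_share.to_numpy())
ax_miss.set_title("share missing")

# 4. Relationship between a feature and the target.
groups = [df.loc[df["churned"] == c, "tenure"].to_numpy() for c in (0, 1)]
ax_target.boxplot(groups)
ax_target.set_xticklabels(["churned=0", "churned=1"])
ax_target.set_title("tenure by target")

fig.tight_layout()
```

Calling `fig.savefig(...)` with a path of your choice exports the figure for a report.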
4) Validate assumptions
- Check for expected ranges/business rules
- Verify temporal consistency
- Confirm key relationships match domain knowledge
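Range and temporal checks like these can be expressed as named boolean masks, which keeps each rule readable and makes violations easy to list. The `orders` table and both rules are hypothetical examples, not rules taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "quantity": [2, 0, 5, -1],
        "created_at": pd.to_datetime(
            ["2024-01-03", "2024-01-05", "2024-01-04", "2024-01-09"]
        ),
        "shipped_at": pd.to_datetime(
            ["2024-01-04", "2024-01-06", "2024-01-02", "2024-01-10"]
        ),
    }
)

# Each rule is a mask that should be True for every valid row.
rules = {
    "quantity_positive": orders["quantity"] > 0,
    "shipped_after_created": orders["shipped_at"] >= orders["created_at"],
}

# Collect the ids of rows that break each rule.
violations = {
    name: orders.loc[~ok, "order_id"].tolist() for name, ok in rules.items()
}
print(violations)

# Temporal consistency: is the table in time order, and how large is the
# biggest gap between consecutive timestamps?
is_chronological = orders["created_at"].is_monotonic_increasing
largest_gap = orders["created_at"].sort_values().diff().max()
```

For checks that must run repeatedly in a pipeline, a dedicated validation framework is usually a better home than ad hoc masks.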
Common anti-patterns
- ❌ Skipping EDA and jumping straight to modeling
- ❌ Treating all outliers as errors
- ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)
- ❌ Over-plotting large datasets without sampling
- ❌ Not documenting findings for the team
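Two of these anti-patterns have cheap countermeasures, sketched below on synthetic data. The missingness check is only a rough screen: if missingness in one column depends on another observed column, the data are unlikely to be MCAR, but the check cannot separate MAR from MNAR. The column names and the simulated mechanism are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: income is missing more often for older people.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"age": rng.integers(18, 80, n).astype(float)})
df["income"] = 20000 + 800 * df["age"] + rng.normal(0, 5000, n)
p_missing = np.where(df["age"] > 60, 0.4, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Screen for non-random missingness: compare an observed variable
# between rows where income is missing and rows where it is present.
income_status = df["income"].isna().map({True: "missing", False: "observed"})
age_by_status = df.groupby(income_status)["age"].mean()
print(age_by_status)

# Sample before plotting so scatter plots stay readable and fast.
plot_sample = df.sample(n=min(1000, len(df)), random_state=0)
```

A large gap between the two group means, as in this simulation, is a signal to investigate how the data were collected before choosing an imputation strategy.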
Progressive disclosure
- ../references/automated-profiling.md: ydata-profiling, Sweetviz, D-Tale
- ../references/visualization-patterns.md: Matplotlib, Seaborn, Plotly patterns
- ../references/statistical-tests.md: SciPy statistical tests guide
- ../references/large-dataset-eda.md: Sampling, Polars, Dask approaches
Related skills
- @data-science-feature-engineering: Next step after EDA
- @data-science-model-evaluation: Validate modeling assumptions
- @data-engineering-quality: Data validation frameworks