Exploratory Data Analysis (EDA)
Analyze tabular datasets to understand distributions, data quality, and patterns.
When to Use
- Understanding a new dataset before modeling
- Checking data quality (missing values, outliers, duplicates)
- Analyzing target variable distribution
- Identifying class imbalance
- Generating summary statistics
Analysis Process
- Connect to data - Verify access and inspect schema
- Analyze target variable first - Understand class balance
- Check each column - Distribution, missing data, cardinality
- Document findings - Save reports for reproducibility
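
The first step of the process (connect, then inspect the schema) might look roughly like the sketch below. It uses only the Python standard library; `SAMPLE_CSV`, `load_rows`, and `inspect_schema` are hypothetical names introduced for illustration, and the inline sample stands in for a real data source.

```python
import csv
import io

# Hypothetical sample data; in practice this would be an open file handle.
SAMPLE_CSV = """id,age,plan,churned
1,34,basic,no
2,,pro,no
3,51,basic,yes
4,29,,no
"""

def load_rows(handle):
    """Read a CSV handle into a list of dicts, one dict per row."""
    return list(csv.DictReader(handle))

def inspect_schema(rows):
    """Return column names and row count to confirm access and shape."""
    columns = list(rows[0].keys()) if rows else []
    return {"columns": columns, "n_rows": len(rows)}

rows = load_rows(io.StringIO(SAMPLE_CSV))
schema = inspect_schema(rows)
print(schema)
```

Checking the column list and row count first makes it easier to spot a wrong delimiter or a truncated file before any deeper analysis.
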
Available Analyses
| Analysis | Description |
|---|---|
| Column Distribution | Value counts, percentages, cardinality assessment |
| Missing Data | Null counts, patterns (MCAR/MAR/MNAR) |
| Class Balance | Imbalance detection for classification targets |
| Summary Stats | Count, unique, nulls per column |
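
As a rough illustration of the Summary Stats row, the sketch below computes count, unique, and nulls per column over a list of row dicts. It assumes that "count" means non-missing values and that empty strings and `None` both count as missing; `is_missing` and `summary_stats` are hypothetical helper names.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or str(value).strip() == ""

def summary_stats(rows):
    """Return {column: {"count", "unique", "nulls"}} for a list of row dicts."""
    stats = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if not is_missing(v)]
        stats[col] = {
            "count": len(present),               # non-missing values
            "unique": len(set(present)),         # distinct non-missing values
            "nulls": len(values) - len(present), # missing values
        }
    return stats

rows = [
    {"age": "34", "plan": "basic"},
    {"age": "", "plan": "pro"},
    {"age": "51", "plan": "basic"},
]
stats = summary_stats(rows)
print(stats)
```
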
Column Distribution Analysis
For detailed analysis methodology and output format:
Quick Reference
Cardinality Levels:
| Level | Criteria | Action |
|---|---|---|
| Low | ≤10 unique | Good for categorical encoding |
| Medium | 11-100 or <1% of rows | May need encoding strategy |
| High | >100 and <50% of rows | Consider grouping/binning |
| Very High | ≥50% of rows | Likely identifier, exclude |
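
The cardinality levels can be expressed as a small classifier. The table's criteria overlap in places (for example, a column can have more than 100 unique values and still be under 1% of rows), so this sketch assumes the rows are evaluated top to bottom with the first match winning, and that a column at exactly 50% of rows falls into Very High. `cardinality_level` is a hypothetical name.

```python
def cardinality_level(n_unique, n_rows):
    """Classify a column by unique-value count, checking levels top to bottom."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    ratio = n_unique / n_rows
    if n_unique <= 10:
        return "Low"
    if n_unique <= 100 or ratio < 0.01:
        return "Medium"
    if ratio < 0.5:
        return "High"
    # Assumption: exactly 50% of rows is treated as Very High.
    return "Very High"

for n_unique, n_rows in [(3, 1000), (50, 1000), (300, 1000), (900, 1000)]:
    print(n_unique, n_rows, cardinality_level(n_unique, n_rows))
```

A different precedence (for example, checking the identifier case first) would give different answers on very small tables, so the order is worth settling explicitly.
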
Missing Data Thresholds:
| Percentage | Assessment |
|---|---|
| 0% | No missing data |
| <1% | Minimal - safe to drop or impute |
| 1-5% | Some - consider imputation strategy |
| >5% | Significant - investigate pattern |
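
A minimal sketch of how these thresholds could be applied follows. It assumes exactly 1% and exactly 5% both fall in the "Some" band, which is one reading of the table; `assess_missing` is a hypothetical name.

```python
def assess_missing(n_missing, n_rows):
    """Return (percentage, assessment) using the thresholds in the table above."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    pct = 100.0 * n_missing / n_rows
    if n_missing == 0:
        return pct, "No missing data"
    if pct < 1:
        return pct, "Minimal - safe to drop or impute"
    if pct <= 5:  # assumption: boundary values 1% and 5% count as "Some"
        return pct, "Some - consider imputation strategy"
    return pct, "Significant - investigate pattern"

for n_missing in (0, 1, 6, 20):
    print(n_missing, assess_missing(n_missing, 200))
```
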
Class Imbalance:
- >80% in top class: Imbalance detected
- >95% in top class: Extreme imbalance
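
The check might be sketched as below, assuming the thresholds are strict (a top-class share above 80% or above 95%). `class_balance` and the "No imbalance flagged" label are hypothetical, introduced only for this example.

```python
from collections import Counter

def class_balance(labels, imbalance=0.80, extreme=0.95):
    """Report the top class share and flag imbalance against two thresholds."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    top_class, top_count = counts.most_common(1)[0]
    share = top_count / total
    if share > extreme:
        status = "Extreme imbalance"
    elif share > imbalance:
        status = "Imbalance detected"
    else:
        status = "No imbalance flagged"  # label is an assumption of this sketch
    return {"top_class": top_class, "top_share": share, "status": status}

result = class_balance(["no"] * 90 + ["yes"] * 10)
print(result)
```
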
Output Format
```markdown
# Column Distribution: {column_name}
- **source**: path/to/data
- **column**: column_name
## Summary
- Total rows: N
- Null/missing: N (X%)
- Unique values: N
- Cardinality: Low|Medium|High|Very High
## Distribution
| Value | Count | Percentage | Cumulative |
|-------|-------|------------|------------|
## Observations
- Auto-generated insights
```
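
One way to produce a report in this shape is sketched below. It is a simplified illustration: `render_column_report` is a hypothetical function, percentages are computed over non-missing values (an assumption), and the Cardinality and Observations entries are left out for brevity.

```python
from collections import Counter

def render_column_report(source, column, values):
    """Render a column distribution report as markdown text."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    n_missing = total - len(present)
    pct_missing = 100.0 * n_missing / total if total else 0.0
    counts = Counter(present)
    denom = len(present) or 1  # avoid division by zero on all-missing columns
    lines = [
        f"# Column Distribution: {column}",
        f"- **source**: {source}",
        f"- **column**: {column}",
        "## Summary",
        f"- Total rows: {total}",
        f"- Null/missing: {n_missing} ({pct_missing:.1f}%)",
        f"- Unique values: {len(counts)}",
        "## Distribution",
        "| Value | Count | Percentage | Cumulative |",
        "|-------|-------|------------|------------|",
    ]
    cumulative = 0
    for value, count in counts.most_common():
        cumulative += count
        lines.append(
            f"| {value} | {count} | {100.0 * count / denom:.1f}% "
            f"| {100.0 * cumulative / denom:.1f}% |"
        )
    return "\n".join(lines)

report = render_column_report("path/to/data", "plan", ["basic", "pro", "basic", ""])
print(report)
```

Writing the returned string to a file alongside the data path would cover the "save reports" step.
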
Best Practices
- Start with schema inspection before deep analysis
- Check target variable first for classification tasks
- Missing data may not be random - investigate patterns
- Save reports for reproducibility
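
To investigate whether missing data is random, one simple starting point is to compare the missing rate of a column across the groups of another column. The sketch below assumes rows are dicts and that `None` and empty strings mark missing values; `missing_rate_by_group` and the sample columns are hypothetical. A large gap between groups hints that values are not missing completely at random, though this is a heuristic and not a formal test.

```python
from collections import defaultdict

def missing_rate_by_group(rows, target_col, group_col):
    """Return {group: fraction of rows where target_col is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row.get(group_col)
        totals[group] += 1
        if row.get(target_col) in (None, ""):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

rows = [
    {"plan": "basic", "income": ""},
    {"plan": "basic", "income": ""},
    {"plan": "basic", "income": "40000"},
    {"plan": "pro", "income": "90000"},
    {"plan": "pro", "income": "85000"},
]
rates = missing_rate_by_group(rows, "income", "plan")
print(rates)
```
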