# Data Analysis & Cleaning Skill
This skill helps analyze and clean datasets for machine learning workflows, handling the critical 60-80% of the work that happens before model training.
## When to Use This Skill

Trigger this skill when:

- User uploads a dataset (CSV, Excel, etc.) and asks to clean or analyze it
- User mentions "data cleaning", "preprocessing", "data quality", or "prepare for modeling"
- User asks to "check my data", "find issues in dataset", or "prepare data for ML"
- User wants to understand data distributions, missing patterns, or outliers
## Core Workflow
The skill follows a semi-automated approach: analyze thoroughly, present findings clearly, suggest actions, and let the user decide what to apply.
### Phase 1: Initial Data Profiling

1. Load and inspect the dataset
   - Read the file (CSV, Excel, etc.) from `/mnt/user-data/uploads/`
   - Display basic info: shape, column names, data types
   - Show the first few rows to understand the structure
2. Generate a comprehensive data profile (see the profiling sketch below). For each column, capture:
   - Data type (actual vs. inferred)
   - Missing value count and percentage
   - Unique value count
   - Basic statistics (mean, median, std for numeric; mode, frequency for categorical)
   - Sample values
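A minimal profiling sketch, assuming the file has already been read into a pandas DataFrame (the file name in the commented usage is hypothetical):

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, missingness, cardinality, and sample values."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing_pct": round(s.isna().mean() * 100, 2),
            "n_unique": s.nunique(dropna=True),
            "samples": s.dropna().unique()[:3].tolist(),
        })
    return pd.DataFrame(rows)

# df = pd.read_csv("/mnt/user-data/uploads/data.csv")  # file name hypothetical
# print(df.shape)
# print(profile_columns(df))
```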
### Phase 2: Issue Detection & Analysis
Systematically check for common data quality issues:
#### Missing Values

- Identify columns with missing data
- Calculate missing percentages
- Analyze missing patterns (MCAR, MAR, MNAR)
- Visualize missing data patterns if significant (a summary helper is sketched below)
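As a starting point, a small helper like this surfaces the columns that need attention (assuming the loaded DataFrame is named `df`):

```python
import pandas as pd

def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_pct", ascending=False
    )
```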
#### Outliers

- For numeric columns, detect outliers using:
  - The IQR method (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
  - The z-score method (|z| > 3)
  - Domain-specific thresholds, if applicable
- Report outlier counts and extreme values
- Consider whether outliers are errors or valid extreme cases (the sketch below applies both statistical methods)
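A sketch applying both rules to one numeric column; the 1.5 and 3 multipliers are the conventional defaults, not requirements:

```python
import pandas as pd

def detect_outliers(s: pd.Series) -> pd.DataFrame:
    """Flag outliers in a numeric series via the IQR and z-score rules."""
    s = s.dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    z_mask = ((s - s.mean()) / s.std()).abs() > 3
    return pd.DataFrame({"iqr_outlier": iqr_mask, "z_outlier": z_mask})
```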
#### Data Type Inconsistencies

- Check for mixed types in columns
- Identify columns that should be categorical but are stored as text/numeric
- Find date/time columns stored as strings
- Detect numeric values stored as strings, e.g., "1,234" or "$100" (one detection approach is sketched below)
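One way to probe for numbers stored as strings is to strip formatting and see whether the column parses; this is a sketch, and the 95% parse-rate threshold is an arbitrary choice:

```python
import pandas as pd

def try_numeric(s: pd.Series, min_parse_rate: float = 0.95) -> pd.Series:
    """Strip '$' and thousands separators, then attempt numeric conversion.

    Returns the converted series only if nearly all non-null values parse.
    """
    non_null = s.dropna().astype(str)
    converted = pd.to_numeric(
        non_null.str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    if len(converted) and converted.notna().mean() >= min_parse_rate:
        return pd.to_numeric(
            s.astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce"
        )
    return s  # leave unchanged if it doesn't look numeric
```

Date columns stored as strings can be probed the same way with `pd.to_datetime(s, errors="coerce")` and a similar parse-rate check.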
#### Duplicates

- Check for fully duplicate rows
- Identify potential duplicate records (similar but not identical)
- Check for duplicate keys/IDs, if applicable (see the sketch below)
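A minimal sketch of the exact-duplicate and duplicate-key checks (`key` stands in for whatever ID column the dataset actually has):

```python
from typing import Optional
import pandas as pd

def duplicate_report(df: pd.DataFrame, key: Optional[str] = None) -> None:
    """Print counts of exact duplicate rows and duplicated key values."""
    print(f"Fully duplicate rows: {df.duplicated().sum()}")
    if key is not None and key in df.columns:
        shared = df[df.duplicated(subset=[key], keep=False)]
        print(f"Rows sharing a duplicated '{key}' value: {len(shared)}")
```

Near-duplicate detection (similar but not identical records) generally needs fuzzy matching and is best raised with the user before attempting it.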
#### Distribution Analysis

- For numeric columns: check skewness and kurtosis
- For categorical columns: check cardinality and imbalanced classes
- Identify zero/near-zero variance columns (see the sketch below)
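A rough sketch of these checks; the |skewness| > 1 cutoff is a common rule of thumb, not a hard rule:

```python
import pandas as pd

def distribution_flags(df: pd.DataFrame) -> dict:
    """Flag heavily skewed numeric columns and (near-)constant columns."""
    numeric = df.select_dtypes("number")
    skew = numeric.skew().abs()
    return {
        "skewed": skew[skew > 1].index.tolist(),  # |skewness| > 1 rule of thumb
        "near_constant": [
            c for c in df.columns if df[c].nunique(dropna=True) <= 1
        ],
    }
```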
#### Data Integrity Issues

- Negative values where only positive values are expected (e.g., age, price)
- Out-of-range values, e.g., age > 150 or percentages > 100 (illustrative range checks are sketched below)
- Invalid categories or typos in categorical data
- Inconsistent formatting (e.g., "NY" vs. "New York" vs. "ny")
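Range checks require domain bounds, so the rules below are purely illustrative placeholders that the user should confirm:

```python
import pandas as pd

# Hypothetical bounds -- real ones come from the user's domain knowledge.
RANGE_RULES = {"age": (0, 120), "price": (0, None)}

def check_ranges(df: pd.DataFrame, rules: dict) -> dict:
    """Count values outside each column's expected (lo, hi) range."""
    violations = {}
    for col, (lo, hi) in rules.items():
        if col not in df.columns:
            continue
        mask = pd.Series(False, index=df.index)
        if lo is not None:
            mask |= df[col] < lo
        if hi is not None:
            mask |= df[col] > hi
        violations[col] = int(mask.sum())
    return violations
```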
### Phase 3: Generate Analysis Report

Create a comprehensive markdown report that includes:

1. **Executive Summary**
   - Dataset overview (rows, columns, file size)
   - Key quality metrics (overall completeness, issue count)
   - Critical issues requiring attention
2. **Detailed Findings by Issue Type**
   - Missing values: which columns, patterns, severity
   - Outliers: which columns, counts, examples
   - Type issues: specific problems and affected columns
   - Duplicates: count and sample duplicates
   - Distribution insights: skewed columns, imbalanced targets
3. **Column-by-Column Profile**
   - Organized table with all columns and their key statistics
   - Quality flags for each column
4. **Visualizations** (when helpful)
   - Missing value heatmap code (sketched below)
   - Distribution plots for key numeric columns
   - Correlation matrix, if there are many numeric columns
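For the heatmap, a sketch along these lines can accompany the report (the output path is an assumption, not a requirement):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_missing_heatmap(df, path="/mnt/user-data/outputs/missing_heatmap.png"):
    """Save a heatmap showing where values are missing (rows x columns)."""
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.heatmap(df.isna(), cbar=False, ax=ax)
    ax.set_title("Missing values by row and column")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```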
### Phase 4: Present Cleaning Recommendations

For each identified issue, provide:

- **Issue description**: What's wrong and why it matters
- **Severity**: Critical / High / Medium / Low
- **Suggested actions**: Specific strategies (e.g., "Impute with median" vs. "Drop column")
- **Trade-offs**: Pros and cons of each approach
- **Code snippets**: Ready-to-use Python/pandas code for the fix (see the example below)

**Present options, don't auto-apply:** Use the `ask_user_input` tool to let users choose which cleaning actions to apply.
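For instance, a recommendation for a column with missing values might pair each option with its trade-off (the column name and toy data here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the user's dataset
df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000]})

# Option A: impute with the median -- robust to skew, but shrinks variance
df["income_imputed"] = df["income"].fillna(df["income"].median())

# Option B: drop rows with missing income -- no imputation bias, but loses data
df_dropped = df.dropna(subset=["income"])
```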
### Phase 5: Apply Selected Cleaning Actions

After the user selects which actions to take:

1. Execute cleaning steps in the right order:
   - Remove duplicates first
   - Handle missing values
   - Fix data types
   - Address outliers (cap, remove, or transform)
   - Standardize formatting
   - Remove zero-variance columns
2. Track all transformations in a log file (a logging sketch follows this phase)
3. Validate the cleaned data:
   - Verify no new issues were introduced
   - Check data shape and types
   - Ensure key relationships are preserved
4. Save outputs:
   - Cleaned dataset: `/mnt/user-data/outputs/cleaned_data.csv` (or `.xlsx`)
   - Analysis report: `/mnt/user-data/outputs/data_analysis_report.md`
   - Cleaning log: `/mnt/user-data/outputs/cleaning_log.txt`
   - Optional: a cleaning script for reproducibility
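A sketch of the logging pattern, where each step runs on a copy and records its effect on the data's shape (the step name in the commented usage is illustrative):

```python
from pathlib import Path
import pandas as pd

def apply_and_log(df: pd.DataFrame, step: str, func, log: list) -> pd.DataFrame:
    """Apply one cleaning step to a copy of the data and log its effect."""
    before = df.shape
    out = func(df.copy())
    log.append(f"{step}: shape {before} -> {out.shape}")
    return out

# log = []
# df = apply_and_log(df, "drop exact duplicates",
#                    lambda d: d.drop_duplicates(), log)
# Path("/mnt/user-data/outputs/cleaning_log.txt").write_text("\n".join(log) + "\n")
```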
## Best Practices

### Code Quality

- Use pandas efficiently (vectorized operations, not loops)
- Handle edge cases (empty DataFrames, all-null columns)
- Preserve the original data (work on copies)
- Use appropriate data types (`category` for low-cardinality strings, integer types for IDs); see the sketch below
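A small sketch of the copy-and-downcast pattern; the cardinality cutoff of 50 is an arbitrary placeholder:

```python
import pandas as pd

def optimize_dtypes(df: pd.DataFrame, max_card: int = 50) -> pd.DataFrame:
    """Convert low-cardinality string columns to 'category' on a copy."""
    out = df.copy()  # preserve the original data
    for col in out.select_dtypes("object").columns:
        if out[col].nunique(dropna=True) <= max_card:
            out[col] = out[col].astype("category")
    return out
```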
### Statistical Rigor

- Don't blindly impute; consider the missingness mechanism
- For outliers, distinguish errors from valid extremes
- Incorporate domain knowledge (if the user provides context)
- Document assumptions made during cleaning
### Communication

- Use clear, jargon-free explanations (don't assume a technical background)
- Show examples of issues alongside actual data samples
- Explain trade-offs in simple terms
- Visualize when it aids understanding
### Output Quality

- Cleaned data should be immediately usable for modeling
- Reports should be well-formatted markdown with clear sections
- Include code snippets users can adapt
- Provide both summary and detailed views
### Error Handling

- If the file can't be read, suggest the correct format or return a helpful error message
- If the dataset is too large, sample strategically and note this (see the loading sketch below)
- If column names are unclear, ask the user for clarification
- If the data structure is unexpected, explain what was found vs. what was expected
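A defensive loading sketch; the one-million-row cap is an arbitrary threshold, not a recommendation:

```python
import pandas as pd

def load_dataset(path: str, max_rows: int = 1_000_000) -> pd.DataFrame:
    """Read a CSV/Excel file, sampling oversized datasets and noting it."""
    try:
        if path.endswith((".xlsx", ".xls")):
            df = pd.read_excel(path)
        else:
            df = pd.read_csv(path)
    except Exception as exc:
        raise ValueError(
            f"Could not read {path}: {exc}. Is it a valid CSV or Excel file?"
        ) from exc
    if len(df) > max_rows:
        print(f"{len(df):,} rows; analyzing a random sample of {max_rows:,}.")
        df = df.sample(max_rows, random_state=0)
    return df
```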
## Example Interaction Flow

User: "Can you analyze this customer dataset and clean it?"

Claude:

1. Loads the data from uploads
2. Profiles the dataset
3. Detects:
   - 23% missing values in the 'income' column
   - 47 duplicate customer records
   - 'signup_date' stored as a string instead of datetime
   - 12 outliers in 'age' (values > 120)
4. Generates the analysis report
5. Presents cleaning options via `ask_user_input`:
   - "Handle missing income: [Impute with median / Drop rows / Keep as-is]"
   - "Handle duplicates: [Keep first / Keep last / Manual review]"
   - "Fix age outliers: [Cap at 100 / Remove rows / Manual review]"
6. User selects preferred actions
7. Applies the transformations
8. Saves cleaned data + report + log
9. Shows a before/after comparison
## Tools and Libraries

Primary tools:

- `pandas` for data manipulation
- `numpy` for numerical operations
- `matplotlib`/`seaborn` for visualizations (optional, when helpful)
- Standard library (`csv`, `pathlib`, etc.)

Install if needed:

```bash
pip install pandas numpy matplotlib seaborn openpyxl --break-system-packages
```
## Output Files

Always produce:

1. **Cleaned dataset** - same format as the input, ready for modeling
2. **Analysis report** - a comprehensive markdown document
3. **Cleaning log** - a text file documenting all transformations

Optionally produce:

4. **Reproducible script** - a Python script to repeat the cleaning
5. **Visualizations** - PNG files for key insights
## Quality Checklist

Before finalizing outputs, verify:

- [ ] All identified issues addressed or explicitly noted as kept
- [ ] No new missing values introduced
- [ ] Data types are appropriate
- [ ] Shape changes documented
- [ ] Report is clear and actionable
- [ ] Files saved to `/mnt/user-data/outputs/`
- [ ] User has final say on all cleaning decisions