# Data Analysis & Cleaning Skill
This skill helps analyze and clean datasets for machine learning workflows, handling the critical 60-80% of the work that happens before model training.
## When to Use This Skill

Trigger this skill when:

- User uploads a dataset (CSV, Excel, etc.) and asks to clean or analyze it
- User mentions "data cleaning", "preprocessing", "data quality", or "prepare for modeling"
- User asks to "check my data", "find issues in dataset", or "prepare data for ML"
- User wants to understand data distributions, missing patterns, or outliers
## Core Workflow
The skill follows a semi-automated approach: analyze thoroughly, present findings clearly, suggest actions, and let the user decide what to apply.
### Phase 1: Initial Data Profiling

1. Load and inspect the dataset
   - Read the file (CSV, Excel, etc.) from `/mnt/user-data/uploads/`
   - Display basic info: shape, column names, data types
   - Show the first few rows to understand the structure
2. Generate a comprehensive data profile (see the profiling sketch below). For each column, capture:
   - Data type (actual vs. inferred)
   - Missing value count and percentage
   - Unique value count
   - Basic statistics (mean, median, std for numeric; mode, frequency for categorical)
   - Sample values
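A minimal profiling sketch, assuming the file has already been read into a pandas DataFrame (the file name in the commented usage is hypothetical):

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, missingness, cardinality, and sample values."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing_pct": round(s.isna().mean() * 100, 2),
            "n_unique": s.nunique(dropna=True),
            "samples": s.dropna().unique()[:3].tolist(),
        })
    return pd.DataFrame(rows)

# df = pd.read_csv("/mnt/user-data/uploads/data.csv")  # file name hypothetical
# print(df.shape)
# print(profile_columns(df))
```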
### Phase 2: Issue Detection & Analysis
Systematically check for common data quality issues:
#### Missing Values

- Identify columns with missing data
- Calculate missing percentages
- Analyze missing patterns (MCAR, MAR, MNAR)
- Visualize missing data patterns if significant (a summary helper is sketched below)
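As a starting point, a small helper like this surfaces the columns that need attention (assuming the loaded DataFrame is named `df`):

```python
import pandas as pd

def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_pct", ascending=False
    )
```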
#### Outliers

- For numeric columns, detect outliers using:
  - The IQR method (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
  - The z-score method (|z| > 3)
  - Domain-specific thresholds, if applicable
- Report outlier counts and extreme values
- Consider whether outliers are errors or valid extreme cases (the sketch below applies both statistical methods)
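A sketch applying both rules to one numeric column; the 1.5 and 3 multipliers are the conventional defaults, not requirements:

```python
import pandas as pd

def detect_outliers(s: pd.Series) -> pd.DataFrame:
    """Flag outliers in a numeric series via the IQR and z-score rules."""
    s = s.dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    z_mask = ((s - s.mean()) / s.std()).abs() > 3
    return pd.DataFrame({"iqr_outlier": iqr_mask, "z_outlier": z_mask})
```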
#### Data Type Inconsistencies

- Check for mixed types in columns
- Identify columns that should be categorical but are stored as text/numeric
- Find date/time columns stored as strings
- Detect numeric values stored as strings, e.g., "1,234" or "$100" (one detection approach is sketched below)
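One way to probe for numbers stored as strings is to strip formatting and see whether the column parses; this is a sketch, and the 95% parse-rate threshold is an arbitrary choice:

```python
import pandas as pd

def try_numeric(s: pd.Series, min_parse_rate: float = 0.95) -> pd.Series:
    """Strip '$' and thousands separators, then attempt numeric conversion.

    Returns the converted series only if nearly all non-null values parse.
    """
    non_null = s.dropna().astype(str)
    converted = pd.to_numeric(
        non_null.str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    if len(converted) and converted.notna().mean() >= min_parse_rate:
        return pd.to_numeric(
            s.astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce"
        )
    return s  # leave unchanged if it doesn't look numeric
```

Date columns stored as strings can be probed the same way with `pd.to_datetime(s, errors="coerce")` and a similar parse-rate check.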
#### Duplicates

- Check for fully duplicate rows
- Identify potential duplicate records (similar but not identical)
- Check for duplicate keys/IDs, if applicable (see the sketch below)
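A minimal sketch of the exact-duplicate and duplicate-key checks (`key` stands in for whatever ID column the dataset actually has):

```python
from typing import Optional
import pandas as pd

def duplicate_report(df: pd.DataFrame, key: Optional[str] = None) -> None:
    """Print counts of exact duplicate rows and duplicated key values."""
    print(f"Fully duplicate rows: {df.duplicated().sum()}")
    if key is not None and key in df.columns:
        shared = df[df.duplicated(subset=[key], keep=False)]
        print(f"Rows sharing a duplicated '{key}' value: {len(shared)}")
```

Near-duplicate detection (similar but not identical records) generally needs fuzzy matching and is best raised with the user before attempting it.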
#### Distribution Analysis

- For numeric columns: check skewness and kurtosis
- For categorical columns: check cardinality and imbalanced classes
- Identify zero/near-zero variance columns (see the sketch below)
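A rough sketch of these checks; the |skewness| > 1 cutoff is a common rule of thumb, not a hard rule:

```python
import pandas as pd

def distribution_flags(df: pd.DataFrame) -> dict:
    """Flag heavily skewed numeric columns and (near-)constant columns."""
    numeric = df.select_dtypes("number")
    skew = numeric.skew().abs()
    return {
        "skewed": skew[skew > 1].index.tolist(),  # |skewness| > 1 rule of thumb
        "near_constant": [
            c for c in df.columns if df[c].nunique(dropna=True) <= 1
        ],
    }
```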
#### Data Integrity Issues

- Negative values where only positive values are expected (e.g., age, price)
- Out-of-range values, e.g., age > 150 or percentages > 100 (illustrative range checks are sketched below)
- Invalid categories or typos in categorical data
- Inconsistent formatting (e.g., "NY" vs. "New York" vs. "ny")
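Range checks require domain bounds, so the rules below are purely illustrative placeholders that the user should confirm:

```python
import pandas as pd

# Hypothetical bounds -- real ones come from the user's domain knowledge.
RANGE_RULES = {"age": (0, 120), "price": (0, None)}

def check_ranges(df: pd.DataFrame, rules: dict) -> dict:
    """Count values outside each column's expected (lo, hi) range."""
    violations = {}
    for col, (lo, hi) in rules.items():
        if col not in df.columns:
            continue
        mask = pd.Series(False, index=df.index)
        if lo is not None:
            mask |= df[col] < lo
        if hi is not None:
            mask |= df[col] > hi
        violations[col] = int(mask.sum())
    return violations
```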
### Phase 3: Generate Analysis Report

Create a comprehensive markdown report that includes:

1. **Executive Summary**
   - Dataset overview (rows, columns, file size)
   - Key quality metrics (overall completeness, issue count)
   - Critical issues requiring attention
2. **Detailed Findings by Issue Type**
   - Missing values: which columns, patterns, severity
   - Outliers: which columns, counts, examples
   - Type issues: specific problems and affected columns
   - Duplicates: count and sample duplicates
   - Distribution insights: skewed columns, imbalanced targets
3. **Column-by-Column Profile**
   - Organized table with all columns and their key statistics
   - Quality flags for each column
4. **Visualizations** (when helpful)
   - Missing value heatmap code (sketched below)
   - Distribution plots for key numeric columns
   - Correlation matrix, if there are many numeric columns
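For the heatmap, a sketch along these lines can accompany the report (the output path is an assumption, not a requirement):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_missing_heatmap(df, path="/mnt/user-data/outputs/missing_heatmap.png"):
    """Save a heatmap showing where values are missing (rows x columns)."""
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.heatmap(df.isna(), cbar=False, ax=ax)
    ax.set_title("Missing values by row and column")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```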
### Phase 4: Present Cleaning Recommendations

For each identified issue, provide:

- **Issue description**: What's wrong and why it matters
- **Severity**: Critical / High / Medium / Low
- **Suggested actions**: Specific strategies (e.g., "Impute with median" vs. "Drop column")
- **Trade-offs**: Pros and cons of each approach
- **Code snippets**: Ready-to-use Python/pandas code for the fix (see the example below)

**Present options, don't auto-apply:** Use the `ask_user_input` tool to let users choose which cleaning actions to apply.
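For instance, a recommendation for a column with missing values might pair each option with its trade-off (the column name and toy data here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the user's dataset
df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000]})

# Option A: impute with the median -- robust to skew, but shrinks variance
df["income_imputed"] = df["income"].fillna(df["income"].median())

# Option B: drop rows with missing income -- no imputation bias, but loses data
df_dropped = df.dropna(subset=["income"])
```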
### Phase 5: Apply Selected Cleaning Actions

After the user selects which actions to take:

1. Execute cleaning steps in the right order:
   - Remove duplicates first
   - Handle missing values
   - Fix data types
   - Address outliers (cap, remove, or transform)
   - Standardize formatting
   - Remove zero-variance columns
2. Track all transformations in a log file (a logging sketch follows this phase)
3. Validate the cleaned data:
   - Verify no new issues were introduced
   - Check data shape and types
   - Ensure key relationships are preserved
4. Save outputs:
   - Cleaned dataset: `/mnt/user-data/outputs/cleaned_data.csv` (or `.xlsx`)
   - Analysis report: `/mnt/user-data/outputs/data_analysis_report.md`
   - Cleaning log: `/mnt/user-data/outputs/cleaning_log.txt`
   - Optional: a cleaning script for reproducibility
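A sketch of the logging pattern, where each step runs on a copy and records its effect on the data's shape (the step name in the commented usage is illustrative):

```python
from pathlib import Path
import pandas as pd

def apply_and_log(df: pd.DataFrame, step: str, func, log: list) -> pd.DataFrame:
    """Apply one cleaning step to a copy of the data and log its effect."""
    before = df.shape
    out = func(df.copy())
    log.append(f"{step}: shape {before} -> {out.shape}")
    return out

# log = []
# df = apply_and_log(df, "drop exact duplicates",
#                    lambda d: d.drop_duplicates(), log)
# Path("/mnt/user-data/outputs/cleaning_log.txt").write_text("\n".join(log) + "\n")
```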
## Best Practices

### Code Quality

- Use pandas efficiently (vectorized operations, not loops)
- Handle edge cases (empty DataFrames, all-null columns)
- Preserve the original data (work on copies)
- Use appropriate data types (`category` for low-cardinality strings, integer types for IDs); see the sketch below
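A small sketch of the copy-and-downcast pattern; the cardinality cutoff of 50 is an arbitrary placeholder:

```python
import pandas as pd

def optimize_dtypes(df: pd.DataFrame, max_card: int = 50) -> pd.DataFrame:
    """Convert low-cardinality string columns to 'category' on a copy."""
    out = df.copy()  # preserve the original data
    for col in out.select_dtypes("object").columns:
        if out[col].nunique(dropna=True) <= max_card:
            out[col] = out[col].astype("category")
    return out
```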
### Statistical Rigor

- Don't blindly impute; consider the missingness mechanism
- For outliers, distinguish errors from valid extremes
- Incorporate domain knowledge (if the user provides context)
- Document assumptions made during cleaning
### Communication

- Use clear, jargon-free explanations (don't assume a technical background)
- Show examples of issues alongside actual data samples
- Explain trade-offs in simple terms
- Visualize when it aids understanding
### Output Quality

- Cleaned data should be immediately usable for modeling
- Reports should be well-formatted markdown with clear sections
- Include code snippets users can adapt
- Provide both summary and detailed views
### Error Handling

- If the file can't be read, suggest the correct format or return a helpful error message
- If the dataset is too large, sample strategically and note this (see the loading sketch below)
- If column names are unclear, ask the user for clarification
- If the data structure is unexpected, explain what was found vs. what was expected
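A defensive loading sketch; the one-million-row cap is an arbitrary threshold, not a recommendation:

```python
import pandas as pd

def load_dataset(path: str, max_rows: int = 1_000_000) -> pd.DataFrame:
    """Read a CSV/Excel file, sampling oversized datasets and noting it."""
    try:
        if path.endswith((".xlsx", ".xls")):
            df = pd.read_excel(path)
        else:
            df = pd.read_csv(path)
    except Exception as exc:
        raise ValueError(
            f"Could not read {path}: {exc}. Is it a valid CSV or Excel file?"
        ) from exc
    if len(df) > max_rows:
        print(f"{len(df):,} rows; analyzing a random sample of {max_rows:,}.")
        df = df.sample(max_rows, random_state=0)
    return df
```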
## Example Interaction Flow

User: "Can you analyze this customer dataset and clean it?"

Claude:

1. Loads the data from uploads
2. Profiles the dataset
3. Detects:
   - 23% missing values in the 'income' column
   - 47 duplicate customer records
   - 'signup_date' stored as a string instead of datetime
   - 12 outliers in 'age' (values > 120)
4. Generates the analysis report
5. Presents cleaning options via `ask_user_input`:
   - "Handle missing income: [Impute with median / Drop rows / Keep as-is]"
   - "Handle duplicates: [Keep first / Keep last / Manual review]"
   - "Fix age outliers: [Cap at 100 / Remove rows / Manual review]"
6. User selects preferred actions
7. Applies the transformations
8. Saves cleaned data + report + log
9. Shows a before/after comparison
## Tools and Libraries

Primary tools:

- `pandas` for data manipulation
- `numpy` for numerical operations
- `matplotlib`/`seaborn` for visualizations (optional, when helpful)
- Standard library (`csv`, `pathlib`, etc.)

Install if needed:

```bash
pip install pandas numpy matplotlib seaborn openpyxl --break-system-packages
```
## Output Files

Always produce:

1. **Cleaned dataset** - same format as the input, ready for modeling
2. **Analysis report** - a comprehensive markdown document
3. **Cleaning log** - a text file documenting all transformations

Optionally produce:

4. **Reproducible script** - a Python script to repeat the cleaning
5. **Visualizations** - PNG files for key insights
## Quality Checklist

Before finalizing outputs, verify:

- [ ] All identified issues addressed or explicitly noted as kept
- [ ] No new missing values introduced
- [ ] Data types are appropriate
- [ ] Shape changes documented
- [ ] Report is clear and actionable
- [ ] Files saved to `/mnt/user-data/outputs/`
- [ ] User has final say on all cleaning decisions