AgentSkillsCN

Data Analysis

数据分析

SKILL.md

Data Analysis

Profile, explore, validate, and visualize datasets. Systematic methodology for understanding data quality, discovering patterns, and producing reliable analysis.

Usage

Share a dataset or point me to data files, and I will help you explore, validate, and analyze them.

Capabilities

  • Profile datasets to understand shape, quality, and patterns
  • Detect data quality issues (nulls, duplicates, inconsistencies)
  • Statistical analysis and distribution characterization
  • Validate analysis results before sharing with stakeholders
  • Create visualizations with Python (matplotlib, seaborn, plotly)

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:

Table-level questions:

  • How many rows and columns?
  • What is the grain (one row per what)?
  • What is the primary key? Is it unique?
  • When was the data last updated?

Column classification:

  • Identifier: Unique keys, foreign keys, entity IDs
  • Dimension: Categorical attributes for grouping/filtering
  • Metric: Quantitative values for measurement
  • Temporal: Dates and timestamps
  • Text: Free-form text fields
  • Boolean: True/false flags

Phase 2: Column-Level Profiling

For each column, compute:

All columns: Null count/rate, distinct count, most common values (top 5-10)

Numeric columns: min, max, mean, median, standard deviation, percentiles (p1, p5, p25, p75, p95, p99)

String columns: min/max/avg length, empty string count, pattern analysis

Date columns: min/max date, gaps in time series, distribution by period

Phase 3: Pattern Discovery

  • Distribution analysis: Normal, skewed, bimodal, power law, uniform
  • Temporal patterns: Trend, seasonality, day-of-week effects, change points
  • Correlations: Flag strong correlations (|r| > 0.7) for investigation

Quality Assessment

Completeness Score

  • Complete (>99% non-null): Good to use
  • Mostly complete (95-99%): Investigate the nulls
  • Incomplete (80-95%): Understand why, decide if usable
  • Sparse (<80%): May need imputation or exclusion

Common Pitfalls to Check

  • Join explosion: Many-to-many joins silently multiplying rows
  • Survivorship bias: Analyzing only entities that exist today
  • Incomplete period comparison: Comparing partial to full periods
  • Average of averages: Wrong when group sizes differ
  • Timezone mismatches: Different sources using different timezones

Pre-Delivery QA Checklist

Before sharing any analysis:

  • Source tables verified and data is fresh enough
  • No unexpected gaps or missing segments
  • Nulls handled appropriately
  • No double-counting from bad joins
  • Aggregation logic matches the analysis grain
  • Rates use correct denominators
  • Numbers are in a plausible range
  • Key numbers cross-referenced against known sources
  • Charts start at zero (bar charts), axes labeled
  • Assumptions and limitations stated explicitly

Example Prompts

  • "Analyze this CSV file and tell me what you find"
  • "Check this data for quality issues before I share it"
  • "Create a visualization showing the trend over time"
  • "Profile this dataset and document the schema"

Tools

  • read_file: Read data files (CSV, JSON, etc.)
  • write_file: Save analysis results and documentation
  • exec: Run Python scripts for statistical analysis and visualization
  • list_dir: Explore data file locations