Data Analysis Skill

Follow these guidelines when performing data analysis.

Loading Data

python

import pandas as pd
import numpy as np

# From CSV
df = pd.read_csv("data.csv")

# From JSON
df = pd.read_json("data.json")

# Quick inspection
print(df.head())
print(df.info())
print(df.describe())

Data Cleaning

Handle missing values

python

# Check for nulls
df.isnull().sum()

# Fill with mean/median
df["column"] = df["column"].fillna(df["column"].mean())

# Drop rows with nulls
df = df.dropna(subset=["important_column"])

Fix data types

python

# Convert types
df["date"] = pd.to_datetime(df["date"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["category"] = df["category"].astype("category")

Analysis Patterns

Group and aggregate

python

# Group by category
summary = df.groupby("category").agg({
    "amount": ["sum", "mean", "count"],
    "quantity": "sum"
})

# Pivot table
pivot = pd.pivot_table(
    df,
    values="amount",
    index="category",
    columns="month",
    aggfunc="sum"
)

Time series

python

# Set datetime index
df = df.set_index("date")

# Resample to monthly
monthly = df.resample("M").sum()

# Rolling average
df["rolling_avg"] = df["value"].rolling(window=7).mean()

Visualization

Basic plots with matplotlib

python

import matplotlib.pyplot as plt

# Line chart
plt.figure(figsize=(10, 6))
plt.plot(df["date"], df["value"])
plt.title("Value Over Time")
plt.xlabel("Date")
plt.ylabel("Value")
plt.savefig("chart.png")
plt.close()

Statistical visualization

python

# Histogram
df["amount"].hist(bins=30)

# Box plot
df.boxplot(column="amount", by="category")

# Scatter plot
df.plot.scatter(x="x_col", y="y_col", c="category", colormap="viridis")

Statistical Analysis

python

from scipy import stats

# Descriptive statistics
mean = df["value"].mean()
std = df["value"].std()
median = df["value"].median()

# Correlation
correlation = df[["col1", "col2"]].corr()

# T-test
t_stat, p_value = stats.ttest_ind(group1, group2)

Large Dataset Pattern (File-Based Analysis)

When working with large datasets, save results to files and use shell commands to query them. This keeps context lean and avoids loading everything into memory.

Step 1: Generate and save data to file

python

import json
import random
from datetime import datetime, timedelta

# Generate mock sales data
categories = ["Electronics", "Clothing", "Food", "Home", "Sports"]
regions = ["North", "South", "East", "West"]

data = []
base_date = datetime(2024, 1, 1)
for i in range(10000):
    data.append({
        "id": i + 1,
        "date": (base_date + timedelta(days=random.randint(0, 365))).isoformat(),
        "category": random.choice(categories),
        "region": random.choice(regions),
        "amount": round(random.uniform(10, 500), 2),
        "quantity": random.randint(1, 20),
        "customer_id": f"CUST-{random.randint(1000, 9999)}"
    })

# Save to file (DON'T return to context)
with open("/tmp/sales_data.json", "w") as f:
    for record in data:
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(data)} records to /tmp/sales_data.json")

Step 2: Query with shell commands (executeCommand)

Use executeCommand to run shell commands on the saved data:

json

{
  "action": {
    "type": "executeCommand",
    "command": "grep 'Electronics' /tmp/sales_data.json | head -5"
  }
}

Common shell patterns for data analysis

bash

# Count records by category
grep -o '"category":"[^"]*"' /tmp/sales_data.json | sort | uniq -c

# Find high-value transactions (amount > 400)
grep -E '"amount":4[0-9]{2}' /tmp/sales_data.json | wc -l

# Get unique customer IDs
grep -o '"customer_id":"[^"]*"' /tmp/sales_data.json | sort -u | wc -l

# Filter by region and calculate with jq
cat /tmp/sales_data.json | jq -s '[.[] | select(.region=="North")] | length'

# Sample 10 random records
shuf -n 10 /tmp/sales_data.json

# Get records for a specific date range
grep '"date":"2024-06' /tmp/sales_data.json | wc -l

Step 3: Process subsets in Python

python

import json

# Read only filtered data for detailed analysis
with open("/tmp/filtered_results.json", "r") as f:
    subset = [json.loads(line) for line in f]

# Now analyze the smaller subset
total = sum(r["amount"] for r in subset)
avg = total / len(subset)
print(f"Subset analysis: {len(subset)} records, total=${total:.2f}, avg=${avg:.2f}")

Workflow Summary

•Generate/Load → Save raw data to /tmp/data.json
•Filter → Use grep, jq, awk to create subsets
•Analyze → Load only the subset needed into Python
•Output → Save results to files, return only summary to context

This pattern keeps your context window free for reasoning while handling datasets of any size.

Best Practices

•Always inspect data first - Use .head(), .info(), .describe()
•Handle missing data explicitly - Document your strategy
•Use appropriate data types - Saves memory and prevents errors
•Validate results - Sanity check aggregations and statistics
•Save intermediate results - Use checkpoints for large datasets
•Keep context lean - Write large data to files, use shell to query