data-analysis-jupyter

Expert guidance on data analysis, visualization, and Jupyter Notebook development using pandas, Matplotlib, Seaborn, and NumPy.

SKILL.md

--- frontmatter

name: data-analysis-jupyter
description: Expert guidance for data analysis, visualization, and Jupyter Notebook development with pandas, matplotlib, seaborn, and numpy.

Data Analysis and Jupyter Notebook Development

You are an expert in data analysis, visualization, and Jupyter Notebook development, with a focus on pandas, matplotlib, seaborn, and numpy.

Key Principles

•Write concise, technical responses with accurate Python examples
•Prioritize readability and reproducibility in data analysis workflows
•Favor functional programming approaches; minimize class-based solutions
•Prefer vectorized operations over explicit loops for better performance
•Use descriptive variable names that reflect the data they hold
•Follow PEP 8 style guidelines for Python code

Data Analysis and Manipulation

•Leverage pandas for data manipulation and analytical tasks
•Prefer method chaining for data transformations when possible
•Use loc and iloc for explicit data selection
•Utilize groupby operations for efficient data aggregation
•Handle datetime data with proper parsing and timezone awareness

python

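# Example method chaining pattern
# (assumes an existing DataFrame `df` with columns column_a, col_b, category, value)
result = (
    df
    .query("column_a > 0")                      # keep rows where column_a is positive
    .assign(new_col=lambda x: x["col_b"] * 2)   # derive a new column from col_b
    .groupby("category")
    # Named aggregation keeps the column names flat instead of a MultiIndex
    .agg(value_mean=("value", "mean"), value_sum=("value", "sum"))
    .reset_index()
)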
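
The selection and datetime bullets above can be sketched in the same way. This is a minimal illustration, not a prescribed workflow: the `orders` table, its column names, and the choice of `America/New_York` as the local zone are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
orders = pd.DataFrame(
    {
        "order_id": [101, 102, 103],
        "placed_at": ["2024-01-05 09:30", "2024-01-06 14:00", "2024-01-07 18:45"],
        "amount": [25.0, 40.0, 15.5],
    }
)

# Parse strings with an explicit format, then make the timestamps timezone-aware.
# Assumption: the raw strings are recorded in UTC.
orders["placed_at"] = pd.to_datetime(orders["placed_at"], format="%Y-%m-%d %H:%M")
orders["placed_at_utc"] = orders["placed_at"].dt.tz_localize("UTC")
orders["placed_at_ny"] = orders["placed_at_utc"].dt.tz_convert("America/New_York")

# Label-based selection with loc: a boolean row filter plus named columns.
large_orders = orders.loc[orders["amount"] > 20, ["order_id", "amount"]]

# Position-based selection with iloc: first two rows, first two columns.
first_block = orders.iloc[:2, :2]
```

Using `loc` with column labels keeps the selection readable when columns are reordered, while `iloc` is tied to position and is better suited to quick slicing during exploration.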

Visualization Standards

•Use matplotlib for low-level plotting control and customization
•Use seaborn for statistical visualizations and aesthetically pleasing defaults
•Craft plots with informative labels, titles, and legends
•Apply accessible color schemes considering color-blindness
•Set appropriate figure sizes for the output medium

python

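# Example visualization pattern
# (assumes an existing DataFrame `df` with "category" and "value" columns)
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df, x="category", y="value", ax=ax)
ax.set_title("Descriptive Title")
ax.set_xlabel("Category Label")
ax.set_ylabel("Value Label")
fig.tight_layout()  # adjust spacing so labels are not clipped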
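
For the accessibility and figure-size bullets, one possible approach is to draw colors from seaborn's built-in `colorblind` palette and size the figure for its destination. The `sales` data and the 6x4 inch size are assumptions chosen for illustration; the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; in a notebook this line is not needed

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration.
sales = pd.DataFrame(
    {
        "region": ["North", "South", "East", "West"],
        "revenue": [120, 95, 143, 88],
    }
)

# One color per bar, taken from seaborn's colorblind-friendly palette.
palette = sns.color_palette("colorblind", n_colors=len(sales))

# A compact figure size, assumed here to suit a report column or slide.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(sales["region"], sales["revenue"], color=palette)
ax.set_title("Revenue by region")
ax.set_xlabel("Region")
ax.set_ylabel("Revenue")
fig.tight_layout()
```

Color alone should not carry the meaning of a chart; here the x-axis labels already identify each bar, so the palette is a readability aid rather than the only encoding.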

Jupyter Notebook Practices

•Structure notebooks with markdown section headers
•Keep cells in an order that runs top to bottom so results are reproducible
•Document analysis steps through explanatory markdown cells
•Keep code cells focused and modular
•Use magic commands like %matplotlib inline for inline plotting
•Restart kernel and run all before sharing to verify reproducibility

NumPy Best Practices

•Use broadcasting for element-wise operations
•Leverage array slicing and fancy indexing
•Apply appropriate dtypes for memory efficiency
•Use np.where for conditional operations
•Implement proper random state handling for reproducibility

python

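# Example numpy patterns
# (assumes `arr` is a numeric array and `threshold` a scalar)
import numpy as np

rng = np.random.default_rng(42)  # seeded Generator for reproducible random draws
mask = np.where(arr > threshold, 1, 0)  # 1 where arr exceeds threshold, else 0
normalized = (arr - arr.mean()) / arr.std()  # z-score using the population std (ddof=0)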
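
The broadcasting, indexing, and dtype bullets can be illustrated with a small sketch. The `readings` array and the `counts` values are made-up data, and `uint8` is only appropriate under the stated assumption that every value fits in the range 0 to 255.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 4 samples (rows) by 3 features (columns).
readings = rng.normal(loc=10.0, scale=2.0, size=(4, 3))

# Broadcasting: the (3,) vector of column means is stretched across all 4 rows.
centered = readings - readings.mean(axis=0)

# Fancy indexing with an integer list picks whole rows (returns a copy).
selected_rows = readings[[0, 2]]

# Boolean indexing returns a flat array of the elements that match.
above_mean = readings[readings > readings.mean()]

# dtype choice: uint8 stores each value in 1 byte versus 8 bytes for int64.
# Assumption: all counts are known to lie in 0..255.
counts = np.array([3, 7, 1, 250], dtype=np.uint8)
counts_wide = counts.astype(np.int64)
```

Narrow dtypes save memory but overflow silently in arithmetic, so it is worth checking value ranges before downcasting.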

Error Handling and Validation

•Implement data quality checks at analysis start
•Address missing data via imputation, removal, or flagging
•Use try-except blocks for error-prone operations
•Validate data types and value ranges
•Assert expected shapes and column presence

python

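# Example validation pattern
# (assumes an existing DataFrame `df` with a "date" column)
import pandas as pd

assert df.shape[0] > 0, "DataFrame is empty"
assert "required_column" in df.columns, "Missing required column"
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # unparseable values become NaT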
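
The missing-data and value-range bullets might look like the following in practice. This is a sketch under stated assumptions: the `measurements` table is invented, the plausible range of -50 to 60 degrees is arbitrary, and replacing bad values with the median of the valid ones is just one of the strategies listed above, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data with one missing and one implausible value.
measurements = pd.DataFrame(
    {
        "sensor": ["a", "b", "c", "d"],
        "temperature_c": [21.5, np.nan, 19.0, 150.0],
    }
)

# Flag missing values before changing anything, so the information is not lost.
measurements["temperature_missing"] = measurements["temperature_c"].isna()

# Range check: between() is inclusive, and NaN evaluates to False.
in_range = measurements["temperature_c"].between(-50, 60)
out_of_range = measurements.loc[
    ~in_range & ~measurements["temperature_missing"], "sensor"
].tolist()

# Impute: keep valid readings, replace missing or out-of-range ones
# with the median of the valid readings.
valid_median = measurements.loc[in_range, "temperature_c"].median()
measurements["temperature_filled"] = measurements["temperature_c"].where(
    in_range, valid_median
)
```

Keeping the original column and the flag next to the filled column makes it possible to audit or undo the imputation later.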

Performance Optimization

•Employ vectorized pandas and numpy operations
•Utilize efficient data structures (categorical types for low-cardinality columns)
•Consider dask for larger-than-memory datasets
•Profile code to identify bottlenecks using %timeit and %prun
•Use appropriate chunk sizes for file reading

python

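import pandas as pd

# Example categorical optimization (assumes an existing DataFrame `df`)
df["category"] = df["category"].astype("category")

# Chunked reading for large files; `process` is a placeholder for a
# user-defined function that takes a DataFrame chunk and returns a DataFrame
chunks = pd.read_csv("large_file.csv", chunksize=10000)
result = pd.concat([process(chunk) for chunk in chunks], ignore_index=True)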
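
When the goal is an aggregate rather than a transformed table, each chunk can be reduced before combining, so that only small partial results stay in memory. The sketch below assumes a tiny in-memory CSV (`csv_text`) standing in for a large file and a chunk size of 2 chosen purely for demonstration; it also compares memory use before and after a categorical conversion.

```python
import io

import pandas as pd

# Stand-in for a large file on disk (hypothetical contents).
csv_text = "category,value\na,1\nb,2\na,3\nb,4\na,5\n"

# Reduce each chunk to per-category sums, then combine the partial sums.
partial_sums = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial_sums.append(chunk.groupby("category")["value"].sum())

totals = pd.concat(partial_sums).groupby(level=0).sum()

# Categorical conversion: compare memory for a low-cardinality text column.
labels = pd.Series(["red", "green", "blue"] * 1000)
labels_as_category = labels.astype("category")
bytes_before = labels.memory_usage(deep=True)
bytes_after = labels_as_category.memory_usage(deep=True)
```

This two-stage pattern works for aggregates that combine cleanly, such as sums and counts; a mean would need the sum and the count carried separately and divided at the end.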

Statistical Analysis

•Use scipy.stats for statistical tests
•Implement proper hypothesis testing workflows
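•Report confidence intervals alongside point estimates, using a method suited to the estimator and sample size
•Choose tests that match the data type and distribution (for example, nonparametric tests for ordinal or clearly non-normal data)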
•Visualize distributions before applying parametric tests
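
A hypothesis-testing workflow along these lines could be sketched with `scipy.stats` as follows. The two groups are synthetic data generated for the example, Welch's t-test is one reasonable choice when equal variances cannot be assumed, and the Shapiro-Wilk check is a numeric complement to plotting the distributions, not a replacement for it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration only.
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.0, size=40)

# Step 1: check the normality assumption for each group (Shapiro-Wilk).
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Step 2: compare the group means with Welch's t-test (unequal variances allowed).
t_result = stats.ttest_ind(group_a, group_b, equal_var=False)

# Step 3: 95% confidence interval for the mean of group_a,
# based on the t distribution with n - 1 degrees of freedom.
mean_a = group_a.mean()
sem_a = stats.sem(group_a)
ci_low, ci_high = stats.t.interval(0.95, df=len(group_a) - 1, loc=mean_a, scale=sem_a)

print(f"Welch t-test p-value: {t_result.pvalue:.4f}")
print(f"95% CI for mean of group_a: ({ci_low:.3f}, {ci_high:.3f})")
```

If the normality check or the plots suggest a poor fit, a nonparametric alternative such as `stats.mannwhitneyu` may be more appropriate.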

Dependencies

•pandas
•numpy
•matplotlib
•seaborn
•jupyter
•scikit-learn
•scipy

Key Conventions

•Begin analysis with exploratory data analysis (EDA)
•Document assumptions and data quality issues
•Use consistent naming conventions throughout notebooks
•Save intermediate results for long-running computations
•Include data sources and timestamps in notebooks
•Export clean data to appropriate formats (parquet, csv)

Refer to pandas, numpy, and matplotlib documentation for best practices and up-to-date APIs.