AgentSkillsCN

python-data-science-pro

运用高级 Python 数据科学,高效处理海量数据集与高性能机器学习任务。专注于 Polars、Dask、Ray,以及先进的模型优化技术(XGBoost、Optuna、HPO)。主动应用于大规模数据处理、分布式机器学习,或加速数据流水线。

SKILL.md
--- frontmatter
name: python-data-science-pro
description: Advanced Python data science for handling massive datasets and high-performance machine learning. Specializes in Polars, Dask, Ray, and advanced model optimization (XGBoost, Optuna, HPO). Use PROACTIVELY for large-scale data processing, distributed ML, or pipeline acceleration.

Use this skill when

  • Working with "medium to large" data that doesn't fit in standard Pandas.
  • Accelerating data pipelines using Polars or Dask.
  • Implementing high-performance machine learning models (GBMs, ensembles).
  • Performing complex hyperparameter optimization (HPO) with Optuna.
  • Building distributed data processing or training tasks with Ray.

Instructions

  • Prefer Polars over Pandas for performance-critical data manipulation.
  • Use Dask or Ray for distributed computing beyond a single core.
  • Implement Vectorization with NumPy for mathematical operations.
  • Use Optuna for efficient, Bayesian-based hyperparameter tuning.
  • Follow scikit-learn compatible patterns for custom transformers/estimators.

Capabilities

High-Performance Data Processing

  • Polars Mastery: Lazy-evaluation, multi-threaded expressions, and Apache Arrow integration.
  • NumPy Expert: Advanced indexing, broadcasting, and memory-efficient array operations.
  • Dask / Ray: Parallelizing Python code across clusters or multiple CPUs.
  • Data Schemas: Using Pydantic or Pandera for strict data validation and typing.

Advanced Machine Learning

  • Gradient Boosting: Master-level tuning of XGBoost, LightGBM, and CatBoost.
  • Hyperparameter Tuning: Optuna prune-and-search strategies (TPESampler).
  • AutoML Integration: Using H2O, AutoGluon, or TPOT for rapid prototyping.
  • Feature Stores: Integration with Feast or Tecton (conceptual or actual).

Scalability & Efficiency

  • Memory Optimization: Efficient dtypes, generator-based processing, and chunking.
  • Serialization: Using Parquet, Avro, or Feather for fast I/O.
  • Pipeline Orchestration: Best practices for Prefect, Dagster, or Airflow (modular logic).

Visualization for Data Science

  • Interactive Viz: Using Plotly, Boken, or Streamlit for data apps.
  • Statistical Viz: Advanced Seaborn and Matplotlib for publication-ready figures.

Example Interactions

  • "Convert this slow Pandas pipeline to Polars for 10x speedup."
  • "Implement an Optuna study to find the best LightGBM parameters for this dataset."
  • "Scale my data processing task across 8 cores using Dask."
  • "Design a memory-efficient pipeline to process 50GB of CSV files using chunking and Parquet."