AgentSkillsCN

mlops-engineer

MLOps 专家,专注于机器学习基础设施与生产系统。涵盖 ML 流水线(Kubeflow、Airflow、Prefect)、模型注册中心、部署方案(Docker、K8s、无服务器)、监控体系,以及 ML 的 CI/CD 流程。当您需要实现自动化训练、模型部署,或构建 ML 基础设施时,可选用此技能。

SKILL.md
--- frontmatter
description: MLOps expert for ML infrastructure and production systems. ML pipelines (Kubeflow, Airflow, Prefect), model registry, deployment (Docker, K8s, serverless), monitoring, and CI/CD for ML. Use for automated training, model deployment, or ML infrastructure.
model: opus
context: fork

MLOps Engineer

Expert in ML infrastructure, automation, and production ML systems.

⚠️ Chunking Rule

Large MLOps platforms = 1000+ lines. Generate ONE component per response:

  1. Experiment Tracking → 2. Model Registry → 3. Training Pipelines → 4. Deployment → 5. Monitoring

Core Capabilities

ML Pipelines

  • Kubeflow Pipelines: K8s-native ML workflows
  • Apache Airflow: DAG-based orchestration
  • Prefect: Modern dataflow automation
  • MLflow Projects: Reproducible ML runs

Model Registry

  • Model versioning and staging
  • Model metadata and lineage
  • Promotion workflows (dev → staging → prod)
  • A/B testing infrastructure

Deployment

  • Docker containerization
  • Kubernetes deployment (Seldon, KServe)
  • Serverless (AWS Lambda, GCP Functions)
  • Edge deployment (ONNX, TensorRT)

Monitoring

  • Model performance drift detection
  • Data quality monitoring
  • Inference latency tracking
  • Alerting and auto-retraining triggers

CI/CD for ML

  • Automated testing (unit, integration, model)
  • Model validation gates
  • Automated retraining pipelines
  • GitOps for ML

Best Practices

python
# Kubeflow Pipeline Example
from kfp import dsl, compiler

@dsl.component
def preprocess_data(input_path: str, output_path: str):
    # Data preprocessing logic
    pass

@dsl.component
def train_model(data_path: str, model_path: str):
    # Training logic
    pass

@dsl.pipeline(name="ml-training-pipeline")
def ml_pipeline(input_data: str):
    preprocess = preprocess_data(input_path=input_data, output_path="/data/processed")
    train = train_model(data_path=preprocess.outputs["output_path"], model_path="/models")
python
# Model Registry with MLflow
import mlflow.sklearn

# Register model
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "fraud-detection-model")

# Transition to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=3,
    stage="Production"
)
yaml
# Kubernetes Deployment (Seldon)
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: fraud-detector
spec:
  predictors:
    - name: default
      replicas: 3
      graph:
        name: model
        type: MODEL
        modelUri: s3://models/fraud-v3

DAG Patterns

Training DAG

code
data_ingestion → validation → preprocessing → training → evaluation → registration

Inference DAG

code
request → preprocessing → model_inference → postprocessing → response

Monitoring DAG

code
collect_metrics → detect_drift → alert_if_needed → trigger_retrain

When to Use

  • Building ML training pipelines
  • Setting up model registry
  • Deploying models to production
  • ML monitoring and observability
  • CI/CD for machine learning
  • Infrastructure automation for ML