MLOps Observability

Goal

To implement a "Glass Box" system where every result is Reproducible, every asset has Lineage, and system health is Monitored, Alerted on, and Explained.

Prerequisites

•Language: Python
•Context: Production monitoring and debugging.
•Platform Suggestion: MLflow, SHAP, Evidently, ...

Instructions

1. Guarantee Reproducibility

Consistency is key. For instance:

•Randomness: Set seeds for random, numpy, torch, tensorflow.
•Environment: Use docker and locked dependencies (uv.lock).
•Builds: Use justfile with uv build --build-constraint for deterministic wheels.
•Code: Track git commit hash for every run.
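
The seeding and commit-tracking bullets above can be sketched as follows. This is a minimal illustration, not a complete determinism recipe: the helper names set_seeds and current_git_commit are made up for this example, the optional libraries are seeded only if they are installed, and seeding alone may not make GPU kernels deterministic.

```python
import random
import subprocess


def set_seeds(seed: int = 42) -> None:
    """Seed every RNG that is available; skip libraries that are not installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass


def current_git_commit() -> str:
    """Return the HEAD commit hash, or "unknown" outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
```

Calling set_seeds once at the start of a run and storing current_git_commit() as a run tag (for example under a key such as git.commit, an assumed name) ties each result to the code that produced it.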

2. Track Data Lineage

Know the origin of your data. For instance:

•Datasets: Create MLflow Datasets with mlflow.data.from_pandas.
•Logging: Attach each dataset to the run with mlflow.log_input, setting a context such as "training".
•Versioning: Version data files (e.g., data/v1.csv) or use DVC.
•Transformations: Log preprocessing parameters so each model version maps back to the data version it was trained on.
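
One way the lineage bullets above might fit together is sketched below. The function names, the "prep." parameter prefix, and the "data.sha256" tag are assumptions made for illustration; the content hash is an extra, lightweight form of versioning and not a replacement for DVC. The MLflow calls assume MLflow 2.4 or later, where mlflow.data is available.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a data file, usable as a content-based version id."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_training_data(path: str, preprocessing: dict, target: str) -> None:
    """Attach a CSV dataset and its preprocessing parameters to the MLflow run."""
    # Imported lazily so file_fingerprint works without these packages.
    import mlflow
    import pandas as pd

    df = pd.read_csv(path)
    dataset = mlflow.data.from_pandas(
        df, source=path, name=Path(path).stem, targets=target
    )
    mlflow.log_input(dataset, context="training")
    # Hypothetical naming scheme: prefix preprocessing parameters with "prep.".
    mlflow.log_params({f"prep.{key}": value for key, value in preprocessing.items()})
    mlflow.set_tag("data.sha256", file_fingerprint(path))
```

A call such as log_training_data("data/v1.csv", {"scaler": "standard"}, target="label") would then record which file, which content hash, and which preprocessing settings fed the run; the column name "label" is only an example.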

3. Monitoring & Drift Detection

Watch for silent failures. For instance:

•Validation: Use MLflow Evaluate to gate models against quality thresholds.
•Drift: Use evidently to compare reference (training) vs current (production) data.
  •Detect Data Drift (input distribution changes) and Concept Drift (relationship changes).
•System: Enable MLflow System Metrics (log_system_metrics=True) to record CPU/GPU usage during runs.
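
To make the reference-versus-current comparison concrete, here is a dependency-free sketch of one common data drift statistic, the Population Stability Index (PSI). This is not Evidently's API and is not how Evidently computes its reports; it only illustrates the idea for a single numeric feature. The function name, the equal-width binning, and the 0.2 alert level (a widely used rule of thumb) are assumptions. It covers data drift only: detecting concept drift also requires labels or model-quality metrics.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp outliers into the edge bins
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when PSI exceeds the (assumed) rule-of-thumb threshold."""
    return psi(reference, current) > threshold
```

In practice this check would run per feature on a schedule, with the training set as reference and a recent window of production inputs as current.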

4. Alerting

Don't stare at dashboards. For instance:

•Local: Use plyer for desktop notifications during long training runs.
•Production: Use PagerDuty (critical) or Slack (warnings).
•Thresholds: Use Static (fixed value) or Dynamic (anomaly detection) rules.
•Action: Alerts must link to a dashboard or playbook.
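
The difference between static and dynamic thresholds can be sketched as follows. The Alert class, the rule functions, and the 3-sigma limit are illustrative assumptions, and delivery to PagerDuty or Slack is deliberately left out; the sketch only decides whether an alert should fire and which link it carries.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional, Sequence


@dataclass
class Alert:
    metric: str
    value: float
    severity: str      # "critical" would go to a pager, "warning" to a chat channel
    playbook_url: str  # every alert carries a link the responder can act on


def static_rule(metric: str, value: float, limit: float,
                playbook_url: str) -> Optional[Alert]:
    """Static threshold: fire when the value exceeds a fixed limit."""
    if value > limit:
        return Alert(metric, value, "critical", playbook_url)
    return None


def dynamic_rule(metric: str, value: float, history: Sequence[float],
                 playbook_url: str, z_limit: float = 3.0) -> Optional[Alert]:
    """Dynamic threshold: fire when the value is a z-score outlier vs. history."""
    if len(history) < 2:
        return None  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return None  # flat history: no spread to measure against
    if abs(value - mu) / sigma > z_limit:
        return Alert(metric, value, "warning", playbook_url)
    return None
```

Mapping the static rule to "critical" and the dynamic rule to "warning" is a design choice for this sketch; a real setup would choose severity per metric.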

5. Explainability (XAI)

Trust but verify. For instance:

•Global: Use feature importance (e.g., from a Random Forest) to understand the model's overall logic.
•Local: Use SHAP values to explain individual predictions.
•Artifacts: Save explanations (plots/tables) as MLflow artifacts.
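
To show what local and global explanations contain, the sketch below computes SHAP values by hand for the one case where they have a simple closed form: a linear model with features treated as independent, where each contribution is weight * (value - feature mean). This does not use the shap library, and the function names are made up for this example; for real models, the shap package's explainers would be used instead.

```python
from statistics import mean


def linear_shap(weights, bias, background, x):
    """Local explanation: return (base_value, contributions) for one row x.

    background is a list of rows used to estimate the feature means.
    """
    feature_means = [mean(column) for column in zip(*background)]
    # Base value: the model's prediction at the average input.
    base_value = bias + sum(w * m for w, m in zip(weights, feature_means))
    contributions = [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]
    return base_value, contributions


def global_importance(weights, bias, background):
    """Global view: mean absolute contribution per feature over the background."""
    per_row = [linear_shap(weights, bias, background, row)[1] for row in background]
    return [mean(abs(value) for value in column) for column in zip(*per_row)]
```

The contributions always sum, together with the base value, to the model's prediction for that row, which is the property that makes SHAP values readable as a per-feature breakdown. Either result can be placed in a dictionary and stored with the run, for example through mlflow.log_dict.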

6. Infrastructure & Costs

Optimize resources. For instance:

•Tags: Tag runs with project, env, user.
•Costs: Log run_time and instance type to estimate cost per run and, from there, ROI.
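
A rough sketch of how tags and a cost estimate might be assembled before logging. The instance names and hourly prices are placeholders, not real quotes, and the function names are assumptions made for this example.

```python
# Placeholder hourly prices in USD; replace with the actual rates you pay.
HOURLY_RATE_USD = {"cpu-small": 0.10, "gpu-large": 3.00}


def estimate_cost(run_time_s: float, instance_type: str,
                  rates: dict = HOURLY_RATE_USD) -> float:
    """Estimated cost of one run: hours used times the hourly rate."""
    return run_time_s / 3600 * rates[instance_type]


def run_metadata(project: str, env: str, user: str,
                 instance_type: str, run_time_s: float):
    """Build the tags and metrics to attach to a run."""
    tags = {
        "project": project,
        "env": env,
        "user": user,
        "instance_type": instance_type,
    }
    metrics = {
        "run_time": run_time_s,
        "estimated_cost_usd": estimate_cost(run_time_s, instance_type),
    }
    return tags, metrics
```

The two dictionaries can then be passed to mlflow.set_tags and mlflow.log_metrics inside an active run, which makes cost filterable by project, environment, and user.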

Self-Correction Checklist

• Seeds: Are random seeds fixed?
• Inputs: Are input datasets logged to MLflow?
• System Metrics: Is log_system_metrics enabled?
• Explanations: Are SHAP values generated?
• Alerts: Are thresholds defined for failures?