MLOps Observability
Goal
To implement a "Glass Box" system where every result is Reproducible, every asset has Lineage, and system health is Monitored, Alerted on, and Explained.
Prerequisites
- •Language: Python
- •Context: Production monitoring and debugging.
- •Platform Suggestion: MLflow, SHAP, Evidently, ...
Instructions
1. Guarantee Reproducibility
Consistency is key. For instance:
- •Randomness: Set seeds for random, numpy, torch, tensorflow.
- •Environment: Use docker and locked dependencies (uv.lock).
- •Builds: Use justfile with uv build --build-constraint for deterministic wheels.
- •Code: Track git commit hash for every run.
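The seeding and commit-tracking points above can be sketched as follows. This is a minimal sketch, not a complete reproducibility setup: the helper names set_seeds and git_commit_hash are hypothetical, and numpy, torch and tensorflow are seeded only if they happen to be installed.

```python
import random
import subprocess
from typing import Optional


def set_seeds(seed: int = 42) -> None:
    """Seed the stdlib RNG plus any installed ML library (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional in this sketch
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # tensorflow is optional in this sketch


def git_commit_hash() -> Optional[str]:
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()
```

Calling set_seeds once at the start of a run and storing the value of git_commit_hash() alongside the run's parameters covers the Randomness and Code bullets; seeding alone does not make GPU kernels deterministic.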
2. Track Data Lineage
Know the origin of your data. For instance:
- •Datasets: Create MLflow Datasets with mlflow.data.from_pandas.
- •Logging: Log inputs to MLflow context with mlflow.log_input.
- •Versioning: Version data files (e.g., data/v1.csv) or use DVC.
- •Transformations: Log preprocessing parameters mapping data versions to model versions.
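One way the lineage steps above might fit together is shown below. It is a sketch under assumptions: file_sha256 and log_training_data are hypothetical helper names, the data_sha256 parameter name is made up for illustration, and mlflow and pandas are imported lazily so the hashing helper works without them.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str) -> str:
    """Content hash of a data file, usable as a lightweight version id."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_training_data(csv_path: str, target: str) -> None:
    """Log a CSV file as an MLflow dataset input (hypothetical helper)."""
    import mlflow        # imported lazily: only needed when logging
    import pandas as pd

    df = pd.read_csv(csv_path)
    dataset = mlflow.data.from_pandas(
        df, source=csv_path, name=Path(csv_path).stem, targets=target
    )
    with mlflow.start_run():
        # context labels how the dataset was used in this run
        mlflow.log_input(dataset, context="training")
        # assumed parameter name; ties the run to the exact file contents
        mlflow.log_param("data_sha256", file_sha256(csv_path))
```

Logging the hash next to the dataset means two runs can be compared even if a file such as data/v1.csv is overwritten in place.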
3. Monitoring & Drift Detection
Watch for silent failures. For instance:
- •Validation: Use MLflow Evaluate to gate models against quality thresholds.
- •Drift: Use evidently to compare reference (training) vs current (production) data.
  - •Detect Data Drift (input distribution changes) and Concept Drift (relationship changes).
- •System: Enable MLflow System Metrics (log_system_metrics=True) for CPU/GPU.
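To make the reference-versus-current comparison concrete, here is a hand-rolled two-sample Kolmogorov-Smirnov statistic for a single numeric feature. This is not Evidently's API, only an illustration of the kind of test a drift tool runs per column; the function names and the 0.2 threshold are assumptions chosen for the example.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def has_drifted(reference: Sequence[float], current: Sequence[float],
                threshold: float = 0.2) -> bool:
    """Flag data drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold
```

This only covers data drift on inputs. Concept drift needs labels or a proxy for them, since it concerns the relationship between inputs and the target rather than the input distribution.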
4. Alerting
Don't stare at dashboards. For instance:
- •Local: Use plyer for desktop notifications during long training runs.
- •Production: Use PagerDuty (critical) or Slack (warnings).
- •Thresholds: Use Static (fixed value) or Dynamic (anomaly detection) rules.
- •Action: Alerts must link to a dashboard or playbook.
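The difference between static and dynamic thresholds, and the routing by severity, could look roughly like this. Everything here is illustrative: the function names, the z-score rule used as the "dynamic" check, and the playbook URL are assumptions, and no notification is actually sent.

```python
from statistics import mean, stdev
from typing import Optional, Sequence

# Hypothetical link; every alert should point to a dashboard or playbook.
PLAYBOOK_URL = "https://example.com/playbooks/model-error-rate"


def breaches_static(value: float, limit: float) -> bool:
    """Static rule: the value exceeds a fixed limit."""
    return value > limit


def breaches_dynamic(value: float, history: Sequence[float],
                     z_limit: float = 3.0) -> bool:
    """Dynamic rule: the value is far from recent history (simple z-score)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    spread = stdev(history)
    if spread == 0:
        return value != history[0]
    return abs(value - mean(history)) / spread > z_limit


def build_alert(metric: str, value: float, history: Sequence[float],
                critical_limit: float) -> Optional[dict]:
    """Return an alert payload, or None when nothing needs attention."""
    if breaches_static(value, critical_limit):
        severity, channel = "critical", "pagerduty"
    elif breaches_dynamic(value, history):
        severity, channel = "warning", "slack"
    else:
        return None
    return {"metric": metric, "value": value, "severity": severity,
            "channel": channel, "playbook": PLAYBOOK_URL}
```

The returned dictionary is meant to be handed to whatever client posts to PagerDuty or Slack; keeping the decision logic separate from delivery makes the thresholds easy to unit test.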
5. Explainability (XAI)
Trust but verify. For instance:
- •Global: Use Feature Importance (e.g., Random Forest) to understand overall logic.
- •Local: Use SHAP values to explain individual predictions.
- •Artifacts: Save explanations (plots/tables) as MLflow artifacts.
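To show what a local explanation contains, the sketch below computes SHAP values by hand for a linear model, where (assuming independent features) the contribution of feature i is weight_i * (x_i - mean_i). It does not use the shap library, which would be the tool for non-linear models; all function names and the JSON layout are assumptions for illustration.

```python
import json
from typing import Sequence


def predict(weights: Sequence[float], bias: float, row: Sequence[float]) -> float:
    """Prediction of a plain linear model."""
    return bias + sum(w * x for w, x in zip(weights, row))


def linear_shap_values(weights: Sequence[float], feature_means: Sequence[float],
                       row: Sequence[float]) -> list:
    """Per-feature contribution relative to the average input."""
    return [w * (x - m) for w, x, m in zip(weights, row, feature_means)]


def save_explanation(path: str, feature_names: Sequence[str],
                     contributions: Sequence[float], base_value: float) -> None:
    """Write the explanation as a JSON table that can be stored as an artifact."""
    payload = {
        "base_value": base_value,
        "contributions": dict(zip(feature_names, contributions)),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
```

The contributions plus the base value (the prediction at the feature means) add up to the model's prediction for that row, which is the property that makes SHAP values readable as "how much each feature moved this prediction". The saved file can then be attached to a run with mlflow.log_artifact.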
6. Infrastructure & Costs
Optimize resources. For instance:
- •Tags: Tag runs with project, env, user.
- •Costs: Log run_time and instance type to estimate ROI.
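A rough sketch of the tagging and cost bullets follows. The helper names are hypothetical, the hourly rate is supplied by the caller (no real pricing is assumed), and mlflow is imported lazily so the cost arithmetic can be used on its own.

```python
def run_tags(project: str, env: str, user: str, instance_type: str) -> dict:
    """Tags that make runs filterable by project, environment and owner."""
    return {"project": project, "env": env, "user": user,
            "instance_type": instance_type}


def estimate_cost(run_time_seconds: float, hourly_rate: float) -> float:
    """Run cost as elapsed hours times the instance's hourly rate."""
    return run_time_seconds / 3600 * hourly_rate


def log_run_cost(tags: dict, run_time_seconds: float, hourly_rate: float) -> None:
    """Attach tags and cost figures to the active MLflow run (hypothetical helper)."""
    import mlflow  # imported lazily: only needed when logging

    mlflow.set_tags(tags)
    mlflow.log_metric("run_time", run_time_seconds)
    # assumed metric name; the rate must come from your own billing data
    mlflow.log_metric("estimated_cost", estimate_cost(run_time_seconds, hourly_rate))
```

For example, a 30-minute run on an instance billed at 2.0 per hour gives estimate_cost(1800, 2.0) == 1.0 in whatever currency the rate is expressed in.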
Self-Correction Checklist
- • Seeds: Are random seeds fixed?
- • Inputs: Are input datasets logged to MLflow?
- • System Metrics: Is log_system_metrics enabled?
- • Explanations: Are SHAP values generated?
- • Alerts: Are thresholds defined for failures?