OCI Monitoring and Observability - Expert Knowledge
🏗️ Use OCI Landing Zone Terraform Modules
Don't reinvent the wheel. Use oracle-terraform-modules/landing-zone for observability stack.
Landing Zone solves:
- •❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
- •❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)
This skill provides: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.
⚠️ OCI CLI/API Knowledge Gap
You don't know OCI CLI commands or OCI API structure.
Your training data has limited and outdated knowledge of:
- •OCI CLI syntax and parameters (updates monthly)
- •OCI API endpoints and request/response formats
- •Monitoring service CLI operations (
oci monitoring alarm,oci monitoring metric) - •Metric namespaces and MQL (Monitoring Query Language)
- •Latest Logging and Service Connector features
When OCI operations are needed:
- •Use exact CLI commands from this skill's references
- •Do NOT guess metric namespace names
- •Do NOT assume AWS CloudWatch patterns work in OCI
- •Load reference files for detailed MQL documentation
What you DO know:
- •General observability concepts
- •Alerting and threshold design principles
- •Log aggregation patterns
This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.
NEVER Do This
❌ NEVER assume metrics are instant (10-15 minute lag)
- •Metrics published every 1-5 minutes
- •Processing delay: 5-10 minutes
- •Total lag: 10-15 minutes from event to visible metric
- •Don't debug "missing metrics" within first 15 minutes of resource creation
❌ NEVER use = for alarm thresholds with sparse metrics
# WRONG - alarm never fires if metric has gaps
MetricName[1m].mean() = 0
# RIGHT - handle missing data
MetricName[1m]{dataMissing=zero}.mean() > 0
❌ NEVER forget metric dimensions (causes "no data")
# WRONG - missing required dimension
CPUUtilization[1m].mean()
# RIGHT - include resourceId dimension
CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
❌ NEVER set alarm thresholds without trigger delay (alert fatigue)
# BAD - fires on every CPU spike CPUUtilization[1m].mean() > 80 # BETTER - sustained high CPU CPUUtilization[5m].mean() > 80 Trigger delay: 5 minutes (fires after 5 consecutive breaches)
❌ NEVER create alarms without notification channels
# WRONG - alarm fires but nobody knows oci monitoring alarm create ... --destinations '[]' # RIGHT - always link to notification topic oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
Cost impact: Undetected outages cost $5,000-50,000/hour in production
❌ NEVER ignore Cloud Guard findings (security audit failure)
- •Cloud Guard detects misconfigurations BEFORE they become incidents
- •Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
- •Cost impact: $100,000+ per security breach vs $0 for proactive remediation
Metric Namespace Gotchas
OCI Metrics Use Service-Specific Namespaces:
| Service | Namespace | Example Metric |
|---|---|---|
| Compute | oci_computeagent | CPUUtilization, MemoryUtilization |
| Autonomous DB | oci_autonomous_database | CpuUtilization, StorageUtilization |
| Load Balancer | oci_lbaas | HttpRequests, UnHealthyBackendServers |
| Object Storage | oci_objectstorage | ObjectCount, BytesUploaded |
Common Mistake: Using wrong namespace (oci_compute vs oci_computeagent)
Alarm Missing Data Handling
| Setting | Behavior | Use When |
|---|---|---|
treatMissingDataAsBreaching | Alarm fires if no data | Critical services (outage = breach) |
treatMissingDataAsNotBreaching | Alarm silent if no data | Optional monitoring |
{dataMissing=zero} | Treat missing as 0 | Counters (requests/sec) |
Log Collection Common Gaps
Problem: Logs not showing in Log Analytics
Logs not appearing? ├─ Is log enabled on resource? │ └─ Compute: oci-compute-agent must be running │ └─ Function: Logging enabled in function config │ ├─ Is Service Connector configured? │ └─ Source: Log Group → Target: Log Analytics │ └─ Check: Service Connector status = ACTIVE │ ├─ IAM policy for Service Connector? │ └─ "Allow any-user to use log-content in tenancy" │ └─ "Allow service loganalytics to READ logcontent in tenancy" │ └─ 10-15 minute ingestion lag? └─ Wait before debugging
Metric Query Optimization
Expensive (slow):
# Queries ALL instances CPUUtilization[1m].mean()
Optimized (filter by dimension):
# Query specific instance
CPUUtilization[1m]{resourceId='<instance-ocid>'}.mean()
Cost: Queries free, but rate limited (1000 req/min)
Progressive Loading References
OCI Monitoring Reference (Official Oracle Documentation)
WHEN TO LOAD oci-monitoring-reference.md:
- •Need comprehensive list of all OCI service metrics
- •Understanding MQL (Monitoring Query Language) in depth
- •Implementing complex alarm conditions and composites
- •Need official Oracle guidance on Logging and Service Connector
- •Setting up Log Analytics and APM integration
Do NOT load for:
- •Quick alarm setup (examples in this skill)
- •Common metric patterns (tables above)
- •Troubleshooting decision trees (covered above)
When to Use This Skill
- •Alarms: threshold configuration, missing data handling, trigger delay
- •Troubleshooting: metrics not showing, alarms not firing, namespace errors
- •Log collection: Service Connector, IAM policies, missing logs
- •Performance: query optimization, dimension filtering