GCP Observability Best Practices
Structured Logging
JSON Log Format
Use structured JSON logging for better queryability:
json
{
  "severity": "ERROR",
  "message": "Payment failed",
  "httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
  "labels": { "user_id": "123", "transaction_id": "abc" },
  "timestamp": "2025-01-15T10:30:00Z"
}
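A minimal sketch of emitting an entry like the one above from Python, assuming the service runs on a platform (for example Cloud Run) where JSON lines written to stdout are parsed into structured log entries. The helper name `log_structured` is hypothetical and only the standard library is used.

```python
import json
import sys
from datetime import datetime, timezone


def log_structured(severity, message, **fields):
    """Build one JSON log line, write it to stdout, and return it (hypothetical helper)."""
    entry = {
        "severity": severity,
        "message": message,
        # RFC 3339 UTC timestamp, matching the format used in the example entry.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    entry.update(fields)  # extra fields such as httpRequest or labels
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line


line = log_structured(
    "ERROR",
    "Payment failed",
    httpRequest={"requestMethod": "POST", "requestUrl": "/api/payment"},
    labels={"user_id": "123", "transaction_id": "abc"},
)
```

Writing one complete JSON object per line keeps each event in a single log entry, so every field stays individually queryable.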
Severity Levels
Use appropriate severity for filtering:
- •DEBUG: Detailed diagnostic info
- •INFO: Normal operations, milestones
- •NOTICE: Normal but significant events
- •WARNING: Potential issues, degraded performance
- •ERROR: Failures that don't stop the service
- •CRITICAL: Severe failures or outages
- •ALERT: A person must take action immediately
- •EMERGENCY: System is unusable
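The levels are ordered, which is why a comparison such as `severity >= WARNING` works as a filter. The sketch below illustrates that ordering using the numeric values of the `LogSeverity` enum; the `at_least` helper and the sample entries are made up for illustration, and `DEFAULT` (no assigned severity) is included although it is not in the list above.

```python
# Numeric values of the LogSeverity enum; a higher number means more severe.
SEVERITY = {
    "DEFAULT": 0,
    "DEBUG": 100,
    "INFO": 200,
    "NOTICE": 300,
    "WARNING": 400,
    "ERROR": 500,
    "CRITICAL": 600,
    "ALERT": 700,
    "EMERGENCY": 800,
}


def at_least(entries, threshold):
    """Keep entries at or above the threshold, mimicking `severity >= X`."""
    floor = SEVERITY[threshold]
    return [
        e for e in entries
        if SEVERITY.get(e.get("severity", "DEFAULT"), 0) >= floor
    ]


# Illustrative entries only.
sample = [
    {"severity": "DEBUG", "message": "cache miss"},
    {"severity": "WARNING", "message": "slow upstream"},
    {"severity": "ERROR", "message": "Payment failed"},
    {"message": "no severity set"},
]
noisy = at_least(sample, "WARNING")
```

Entries without a severity fall into `DEFAULT` and drop out of any threshold filter, which is one reason to set the field explicitly.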
Log Filtering Queries
Common Filters
code
-- By severity
severity >= WARNING
-- By resource
resource.type="cloud_run_revision"
resource.labels.service_name="my-service"
-- By time
timestamp >= "2025-01-15T00:00:00Z"
-- By text content
textPayload =~ "error.*timeout"
-- By JSON field
jsonPayload.user_id = "123"
-- Combined
severity >= ERROR AND resource.labels.service_name="api"
Advanced Queries
code
-- Regex matching
textPayload =~ "status=[45][0-9]{2}"
-- Substring search
textPayload : "connection refused"
-- Multiple values
severity = (ERROR OR CRITICAL)
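When filters like these are generated by scripts, assembling them from parts avoids quoting mistakes. The following is a rough sketch, not an official client API: `quote` and `build_filter` are hypothetical helpers that join conditions with `AND`.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter(min_severity=None, service=None, since=None, extra=()):
    """Join the supplied conditions with AND (hypothetical helper).

    Clauses passed in `extra` are used verbatim; wrap any clause that
    contains OR in parentheses before passing it in.
    """
    clauses = []
    if min_severity:
        clauses.append(f"severity >= {min_severity}")
    if service:
        clauses.append(f"resource.labels.service_name={quote(service)}")
    if since:
        clauses.append(f"timestamp >= {quote(since)}")
    clauses.extend(extra)
    return " AND ".join(clauses)


query = build_filter(
    min_severity="ERROR",
    service="api",
    extra=['textPayload : "connection refused"'],
)
```

The resulting string can be pasted into Logs Explorer. If the `google-cloud-logging` Python client is in use, it could likely be passed as the `filter_` argument of `Client.list_entries`.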
Metrics vs Logs vs Traces
When to Use Each
Metrics: Aggregated numeric data over time
- •Request counts, latency percentiles
- •Resource utilization (CPU, memory)
- •Business KPIs (orders/minute)
Logs: Detailed event records
- •Error details and stack traces
- •Audit trails
- •Debugging specific requests
Traces: Request flow across services
- •Latency breakdown by service
- •Identifying bottlenecks
- •Distributed system debugging
Alert Policy Design
Alert Best Practices
- •Avoid alert fatigue: Only alert on actionable issues
- •Use multi-condition alerts: Reduce noise from transient spikes
- •Set appropriate windows: 5-15 min for most metrics
- •Include runbook links: Help responders act quickly
Common Alert Patterns
Error rate:
- •Condition: Error rate > 1% for 5 minutes
- •Good for: Service health monitoring
Latency:
- •Condition: P99 latency > 2s for 10 minutes
- •Good for: Performance degradation detection
Resource exhaustion:
- •Condition: Memory > 90% for 5 minutes
- •Good for: Capacity planning triggers
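The "for N minutes" part of each condition is what filters out transient spikes. The sketch below is a simplified model of that idea, not how Cloud Monitoring evaluates policies internally; the function name and the sample data are illustrative.

```python
def breaches_for_duration(samples, threshold, duration_s):
    """Return True if the latest samples have stayed above threshold for duration_s.

    samples: list of (unix_seconds, value) pairs sorted by time.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the value recovers.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    return breach_start is not None and latest - breach_start >= duration_s


# Error rate sampled once per minute (illustrative numbers).
spike = [(0, 0.002), (60, 0.003), (120, 0.05)]
sustained = [(0, 0.002), (60, 0.02), (120, 0.03), (180, 0.02),
             (240, 0.04), (300, 0.02), (360, 0.03)]

spike_fires = breaches_for_duration(spike, threshold=0.01, duration_s=300)
sustained_fires = breaches_for_duration(sustained, threshold=0.01, duration_s=300)
```

With a 1% threshold and a 5-minute duration, the single-sample spike does not fire while the sustained breach does.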
Cost Optimization
Reducing Log Costs
- •Exclusion filters: Drop verbose logs at ingestion
- •Sampling: Log only a percentage of high-volume events
- •Shorter retention: Reduce retention below the 30-day default where policy allows
- •Downgrade logs: Route low-value logs to cheaper storage, such as Cloud Storage
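Sampling can be done in the application before a log line is ever written. A minimal sketch, assuming each event carries a stable key such as a request ID: hashing the key makes the decision deterministic, so all log lines for one request are either kept or dropped together. The names `should_log` and `should_emit` are hypothetical.

```python
import zlib


def should_log(key, sample_rate):
    """Deterministically keep roughly `sample_rate` (0.0 to 1.0) of keys."""
    bucket = zlib.crc32(key.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000


def should_emit(severity, key, sample_rate):
    """Sample only low-severity events; always keep warnings and above."""
    if severity in ("DEBUG", "INFO", "NOTICE"):
        return should_log(key, sample_rate)
    return True


keep_error = should_emit("ERROR", "req-42", sample_rate=0.0)
keep_info = should_emit("INFO", "req-42", sample_rate=0.0)
```

Exempting `WARNING` and above keeps the cost saving focused on routine events without losing the logs needed during an incident.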
Exclusion Filter Examples
code
-- Exclude health checks
resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"
-- Exclude debug logs in production
severity = DEBUG
Debugging Workflow
- •Start with metrics: Identify when issues started
- •Correlate with logs: Filter logs around problem time
- •Use traces: Follow specific requests across services
- •Check resource logs: Look for infrastructure issues
- •Compare baselines: Check against known-good periods
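The "correlate with logs" and "compare baselines" steps both come down to querying a time window. The sketch below builds such a filter around an incident time and reuses it for a known-good period one week earlier; `window_filter`, the 15-minute margins, and the one-week offset are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def window_filter(center, before_min=15, after_min=15, min_severity="WARNING"):
    """Build a log filter covering a time window around `center` (hypothetical helper)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    # Normalize to UTC so the literal "Z" suffix is accurate.
    start = (center - timedelta(minutes=before_min)).astimezone(timezone.utc)
    end = (center + timedelta(minutes=after_min)).astimezone(timezone.utc)
    return (
        f'timestamp >= "{start.strftime(fmt)}" AND '
        f'timestamp <= "{end.strftime(fmt)}" AND '
        f"severity >= {min_severity}"
    )


incident = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)
incident_query = window_filter(incident)
# Same window one week earlier, as a known-good baseline.
baseline_query = window_filter(incident - timedelta(days=7))
```

Running both queries side by side makes it easier to tell which warnings are new and which were already present before the incident.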