Grafana Loki Log Management
Instructions
Use this skill to interact with Grafana Loki using the logcli command-line tool for log aggregation, querying, and analysis. Loki provides a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus.
Configuration
The preferred method of interaction is the logcli CLI tool (installed on the host). Loki runs as part of the observability stack in the Home Lab, collecting logs from various services and applications.
Smart Authentication: Before running queries, verify logcli can reach Loki and authentication is configured.
# Check if Loki is accessible (logcli labels is a cheap metadata call;
# logcli stats requires a query argument, so it is not used as a probe)
if ! logcli labels > /dev/null 2>&1; then
  echo "Loki not accessible. Checking configuration..."
  export LOKI_ADDR=https://loki.bjzy.me
  # Retrieve credentials from Vault if required
  export LOKI_USERNAME=$(vault kv get -field=read_user kvProd_v2/Loki/Application-Prod)
  export LOKI_PASSWORD=$(vault kv get -field=read_password kvProd_v2/Loki/Application-Prod)
fi
# Verify connectivity
logcli labels > /dev/null || echo "Loki connectivity issue"
Certificate Handling (Homelab)
Loki may use self-signed certificates in the homelab environment. Always configure the CA certificate rather than disabling TLS verification.
# Configure logcli with homelab CA certificate (RECOMMENDED):
export LOKI_ADDR=https://loki.bjzy.me
# logcli reads the CA path from LOKI_CA_CERT_PATH (or the --ca-cert flag)
export LOKI_CA_CERT_PATH=/etc/ssl/certs/homelab-ca.pem
# Test connection
logcli labels
# Alternative: Add homelab CA to system trust store
# macOS: sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /path/to/homelab-ca.pem
# Linux: sudo cp /path/to/homelab-ca.pem /usr/local/share/ca-certificates/homelab-ca.crt && sudo update-ca-certificates
# ⚠️ SECURITY: Never use LOKI_TLS_SKIP_VERIFY=true in production
# This disables certificate verification and enables man-in-the-middle attacks
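The CA setup can be guarded so that a missing or malformed file is caught before any query runs. The following is a minimal sketch, not part of the original setup: the function name loki_use_ca is hypothetical, and it assumes logcli picks up the CA path from the LOKI_CA_CERT_PATH environment variable (the variable documented for the --ca-cert flag).

```shell
# Hypothetical helper: export the CA path only if the file is readable
# and contains a PEM certificate header.
loki_use_ca() {
  local ca_file="${1:-}"
  if [ ! -r "$ca_file" ]; then
    echo "CA file not readable: $ca_file" >&2
    return 1
  fi
  if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$ca_file"; then
    echo "Not a PEM certificate: $ca_file" >&2
    return 1
  fi
  export LOKI_CA_CERT_PATH="$ca_file"
}

# Usage (path taken from the example above):
#   loki_use_ca /etc/ssl/certs/homelab-ca.pem && logcli labels
```

Failing early here keeps the temptation to fall back to LOKI_TLS_SKIP_VERIFY out of the workflow.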
Bjzy Labs defaults
- Loki Cluster: Part of the observability stack with Alloy for log collection
- Common use cases:
  - Log Querying: Run LogQL queries to search through aggregated logs
  - Live Tailing: Stream logs in real-time from services and applications
  - Batch Export: Download large volumes of logs for analysis and auditing
  - Label Discovery: Explore available labels and log streams
  - Volume Analysis: Understand log volume and storage patterns
- Integration Points:
  - Alloy: Log collection and forwarding to Loki
  - Grafana: Primary visualization and query interface
  - Docker Swarm: Container log collection
  - Kubernetes: Pod and service log aggregation
Environment and Guardrails (Bjzy Labs)
- Cluster Access:
  - Loki server accessible via HTTP API at the configured Loki endpoint
  - Local CLI access available from management workstations
- Security Rules:
  - Verify Loki connectivity before running queries
  - Use time ranges to limit query scope and prevent resource exhaustion
  - Be cautious with broad queries that may return large result sets
  - Never run delete operations without proper verification
- CLI Availability:
  - The logcli CLI is installed on management workstations
  - Environment variable LOKI_ADDR should point to the Loki endpoint
  - Verify access with logcli labels
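The time-range and result-size guardrails can be enforced by routing queries through a small wrapper. This is a sketch under stated assumptions: loki_query, LOKI_SINCE, and LOKI_LIMIT are hypothetical names chosen for illustration, and the defaults of 1h and 100 entries are arbitrary starting points rather than values from this environment.

```shell
# Hypothetical wrapper: every query gets a bounded time range and a result limit.
# LOKI_SINCE and LOKI_LIMIT are illustrative shell variables, not logcli settings.
loki_query() {
  local query="${1:-}"
  if [ -z "$query" ]; then
    echo "usage: loki_query '<logql>' [since] [limit]" >&2
    return 2
  fi
  local since="${2:-${LOKI_SINCE:-1h}}"
  local limit="${3:-${LOKI_LIMIT:-100}}"
  logcli query "$query" --since="$since" --limit="$limit"
}

# Usage:
#   loki_query '{job="varlogs"} |= "error"'          # last hour, 100 entries
#   loki_query '{job="varlogs"} |= "error"' 15m 20   # last 15 minutes, 20 entries
```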
Standard Operating Procedure (SOP)
When asked to "Query logs," "Check Loki," or "Analyze log data":
- Verify Connectivity: Run logcli labels to check the Loki connection
- Identify Scope: Determine the time range, services, and log patterns needed
- Construct Query: Build an appropriate LogQL query with proper filters
- Execute Query: Use logcli with an appropriate output format and limits
- Analyze Results: Review log entries, patterns, and anomalies
- Document Findings: Record relevant insights and recommendations
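The steps above can be sketched as a single shell function. This is an illustrative outline only: loki_check_and_query is a hypothetical name, the limit of 500 is an arbitrary choice, and the connectivity probe uses logcli labels on the assumption that a label listing is the cheapest call that still exercises the endpoint and credentials.

```shell
# Hypothetical helper following the procedure: verify, query with limits, summarize.
loki_check_and_query() {
  local query="${1:-}"
  local since="${2:-1h}"
  if [ -z "$query" ]; then
    echo "usage: loki_check_and_query '<logql>' [since]" >&2
    return 2
  fi
  # Verify connectivity with a cheap metadata call.
  if ! logcli labels > /dev/null 2>&1; then
    echo "Loki not reachable (LOKI_ADDR=${LOKI_ADDR:-unset})" >&2
    return 1
  fi
  # Execute a scoped query; jsonl gives one JSON object per log entry.
  local out
  out=$(logcli query "$query" --since="$since" --limit=500 --output=jsonl --quiet) || return 1
  # Summarize on stderr so stdout stays machine-readable.
  local count
  count=$(printf '%s\n' "$out" | grep -c . || true)
  echo "matched ${count} entries in the last ${since}" >&2
  if [ -n "$out" ]; then
    printf '%s\n' "$out"
  fi
}

# Usage:
#   loki_check_and_query '{job="varlogs"} |= "error"' 30m > errors.jsonl
```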
Examples
1. Check Loki Status
Verify Loki is operational and accessible.
- Method: logcli CLI
- Command Pattern:
# Check Loki connection
logcli labels
# Basic index stats for a selector (stats requires a query argument)
logcli stats '{job="varlogs"}'
# Show the logcli client version
logcli --version
# Test basic query
logcli query '{job="varlogs"}' --limit=10
2. Discover Available Labels
Explore the label space to understand available log streams.
- Method: logcli CLI
- Command Pattern:
# List all labels
logcli labels
# List label values for specific label
logcli labels job
# Get series information
logcli series '{job="varlogs"}'
# Show label cardinality
logcli series '{job="varlogs"}' | wc -l
3. Basic Log Queries
Search through logs using LogQL filtering.
- Method: logcli CLI
- Command Pattern:
# Query logs from specific job
logcli query '{job="varlogs"}'
# Query logs with specific text
logcli query '{job="varlogs"} |= "error"'
# Query logs with regex pattern (|~ is the regex line filter)
logcli query '{job="varlogs"} |~ "error.*failed"'
# Query logs from specific time range
logcli query '{job="varlogs"}' --since=1h
# Query logs with output format options
logcli query '{job="varlogs"}' --output=jsonl
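When the search text comes from a variable, quotes and backslashes inside it need escaping before it is placed in a LogQL string literal. The sketch below shows one way to build a selector plus line filter safely; logql_escape and logql_contains are hypothetical helper names, and only the |= (contains) filter is covered.

```shell
# Hypothetical helper: escape backslashes and double quotes for a
# double-quoted LogQL string literal.
logql_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build '{label="value"} |= "text"' from plain arguments.
logql_contains() {
  printf '{%s="%s"} |= "%s"' "$1" "$(logql_escape "$2")" "$(logql_escape "$3")"
}

# Usage:
#   logcli query "$(logql_contains job varlogs 'connection "reset"')" --since=1h
```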
4. Service-Specific Log Queries
Target logs from specific services or applications.
- Method: logcli CLI
- Command Pattern:
# Docker container logs
logcli query '{container_name="nginx"}'
# Kubernetes pod logs
logcli query '{namespace="default",pod="my-app"}'
# Application logs by level
logcli query '{app="my-service"} |= "ERROR"'
# System service logs
logcli query '{job="systemd"} |= "service"'
5. Time-Based Analysis
Analyze logs within specific time windows.
- Method: logcli CLI
- Command Pattern:
# Recent logs (last hour)
logcli query '{job="varlogs"}' --since=1h
# Logs from specific time range
logcli query '{job="varlogs"}' --from="2024-01-01T00:00:00Z" --to="2024-01-01T23:59:59Z"
# Live tail logs (similar to tail -f)
logcli query '{job="varlogs"}' --tail
# Metric query with an explicit step interval
logcli query 'rate({job="varlogs"}[5m])' --since=1h --step=1m
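The --from and --to flags expect full RFC3339 timestamps, which are tedious to type by hand. A small sketch of generating them from epoch seconds follows; rfc3339_utc and loki_window_flags are hypothetical names, and the date invocation assumes GNU date (with a fallback to the BSD/macOS form).

```shell
# Hypothetical helper: format epoch seconds as an RFC3339 UTC timestamp.
# Tries GNU date first, then the BSD/macOS syntax.
rfc3339_utc() {
  date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null ||
    date -u -r "$1" '+%Y-%m-%dT%H:%M:%SZ'
}

# Print --from/--to flags for a window of <length> seconds ending at <end>.
loki_window_flags() {
  local end="$1"
  local length="$2"
  local start=$((end - length))
  printf '%s %s\n' "--from=$(rfc3339_utc "$start")" "--to=$(rfc3339_utc "$end")"
}

# Usage (word splitting of the flags is intentional):
#   logcli query '{job="varlogs"}' $(loki_window_flags "$(date +%s)" 3600)
```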
6. Error and Exception Analysis
Focus on error patterns and exceptions.
- Method: logcli CLI
- Command Pattern:
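# Find all error logs (Loki rejects an empty {} selector, so match any job)
logcli query '{job=~".+"} |~ "ERROR|FATAL"'
# HTTP error analysis (|~ is the regex filter; |= matches literal text only)
logcli query '{job="nginx"} |~ "5[0-9][0-9]"'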
# Application stack traces
logcli query '{app="my-service"} |= "Exception"' --output=raw
# Error rate calculation
logcli instant-query 'rate({job="varlogs"} |= "error"[5m])'
7. Performance and Resource Analysis
Analyze log volumes and patterns.
- Method: logcli CLI
- Command Pattern:
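# Get volume statistics across all jobs (volume takes a stream selector)
logcli volume '{job=~".+"}'
# Volume for specific query
logcli volume '{job="varlogs"}'
# Volume range analysis (--from and --to take RFC3339 timestamps)
logcli volume_range '{job="varlogs"}' --from="2024-01-01T00:00:00Z" --to="2024-01-02T00:00:00Z"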
# Detected fields in logs
logcli detected-fields '{job="varlogs"}'
8. Log Export and Batch Operations
Export logs for offline analysis.
- Method: logcli CLI
- Command Pattern:
# Export logs to file
logcli query '{job="varlogs"}' --output=raw > logs.txt
# Export JSON format
logcli query '{job="varlogs"}' --output=jsonl > logs.jsonl
# Export with large limit
logcli query '{job="varlogs"}' --limit=10000 --output=raw > large_logs.txt
# Export specific time window
logcli query '{job="varlogs"}' --from="2024-01-01T00:00:00Z" --to="2024-01-02T00:00:00Z" > daily_logs.txt
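A plain redirect leaves a truncated file behind if the query fails part-way through. The sketch below writes to a temporary file first and renames it only on success; loki_export is a hypothetical name, and the 24h default window and 10000-entry limit are illustrative values.

```shell
# Hypothetical helper: export query results without leaving a partial file
# behind if logcli fails.
loki_export() {
  local query="${1:-}"
  local outfile="${2:-}"
  local since="${3:-24h}"
  if [ -z "$query" ] || [ -z "$outfile" ]; then
    echo "usage: loki_export '<logql>' <outfile> [since]" >&2
    return 2
  fi
  local tmp="${outfile}.partial"
  if logcli query "$query" --since="$since" --limit=10000 --output=jsonl --quiet > "$tmp"; then
    mv "$tmp" "$outfile"
    # Report how many entries were written.
    echo "wrote $(wc -l < "$outfile" | tr -d ' ') lines to $outfile" >&2
  else
    rm -f "$tmp"
    echo "export failed, nothing written to $outfile" >&2
    return 1
  fi
}

# Usage:
#   loki_export '{job="varlogs"}' daily_logs.jsonl 24h
```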
9. Advanced LogQL Queries
Use complex LogQL expressions for sophisticated analysis.
- Method: logcli CLI
- Command Pattern:
# Log aggregation with count (aggregations need a range function such as count_over_time)
logcli query 'sum by (level) (count_over_time({job="varlogs"}[5m]))'
# Rate calculations
logcli query 'rate({job="varlogs"}[5m])'
# Log filtering with multiple conditions
logcli query '{job="varlogs", level="error"} |= "database"'
# Log parsing and extraction
logcli query '{job="varlogs"} | logfmt | level="error"'
# Time series from logs
logcli instant-query 'sum(rate({job="varlogs"}[5m])) by (job)'
10. Integration with Grafana
Understand how queries relate to Grafana dashboards.
- Method: logcli CLI
- Command Pattern:
# Test dashboard queries
logcli query '{job="varlogs"} |= "error"' --limit=100
# Validate label selectors
logcli labels job
# Check how much data a query would touch (stats takes the query as a positional argument)
logcli stats '{job="varlogs"}'
# Explore available metrics
logcli query 'rate({job="varlogs"}[1h])'
11. Advanced LogQL Query Patterns
Complex LogQL queries for log analysis and troubleshooting.
- Method: logcli CLI
- Purpose: Extract insights from logs using LogQL syntax
- Command Pattern:
# Count log lines by label
logcli query 'count_over_time({job="varlogs"}[1h])'
# Rate of log entries per second
logcli query 'rate({app="api"}[5m])'
# Filter logs with regex
logcli query '{job="varlogs"} |~ "error|ERROR|failed"' --since=1h
# Extract JSON fields
logcli query '{app="api"} | json | level="error"' --since=30m
# Aggregate by extracted field
logcli query 'sum by (status_code) (count_over_time({app="nginx"} | json [5m]))'
# Multi-line log parsing (flags such as --context are not part of LogQL and must not appear inside the query string)
logcli query '{job="app"} |= "stack trace"' --since=1h
# Calculate percentiles (requires unwrap for numeric field extraction)
logcli query 'quantile_over_time(0.95, {app="api"} | json | unwrap response_time [5m])'
# Combine multiple labels
logcli query '{app="api", environment="production"} | json | status_code >= 500' --since=24h --limit=100
Troubleshooting
Connection Issues
# Verify Loki endpoint
echo "$LOKI_ADDR"
# Test basic connectivity
curl -I "$LOKI_ADDR/ready"
# Check logcli configuration
logcli help
# Verify authentication (if required)
logcli labels --username=<user> --password=<pass>
Query Performance Issues
# Check query stats (the query is a positional argument)
logcli stats '{job="varlogs"}'
# Limit time range for better performance
logcli query '{job="varlogs"}' --since=10m
# Use specific labels to reduce scope
logcli query '{job="varlogs", instance="host1"}'
# Check volume before running large queries
logcli volume '{job="varlogs"}'
No Results Found
# Verify label names and values
logcli labels
# Check if job exists
logcli series '{job="varlogs"}'
# Expand time range
logcli query '{job="varlogs"}' --since=24h
# Try a broader query (an empty {} selector is rejected; match any job instead)
logcli query '{job=~".+"}' --since=1h --limit=20
Large Result Sets
# Use limit to prevent overwhelming output
logcli query '{job="varlogs"}' --limit=100
# Use output formatting for better parsing
logcli query '{job="varlogs"}' --output=jsonl
# Export to file for large datasets
logcli query '{job="varlogs"}' --limit=10000 > large_output.txt
# Use tail for live monitoring
logcli query '{job="varlogs"}' --tail
Label Cardinality Issues
# Check label cardinality
logcli series '{job="varlogs"}' | wc -l
# Find high-cardinality labels (summarizes distinct values per label)
logcli series '{job="varlogs"}' --analyze-labels
# Use more specific queries
logcli query '{job="varlogs", instance="specific-host"}'
# Monitor index usage for a selector (stats requires a query argument)
logcli stats '{job="varlogs"}'
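Counting distinct values per label shows which label is driving the stream count. The following sketch does this with standard text tools on the output of logcli series; label_cardinality is a hypothetical name, and it assumes each series is printed as one line of the form {name="value", ...} with no escaped quotes inside values.

```shell
# Hypothetical helper: read `logcli series` output on stdin and print
# "<distinct values> <label>" with the highest-cardinality label first.
label_cardinality() {
  grep -o '[A-Za-z_][A-Za-z0-9_]*="[^"]*"' |
    sort -u |
    cut -d= -f1 |
    sort |
    uniq -c |
    sort -rn |
    awk '{print $1, $2}'
}

# Usage:
#   logcli series '{job="varlogs"}' --since=1h | label_cardinality
```

A label near the top of this list with hundreds of values is usually a candidate for removal from the stream labels.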
Common Service Patterns
Docker Swarm Log Collection
Docker Swarm services forward logs to Loki via Alloy:
# Query specific service logs
logcli query '{com_docker_swarm_service="my-app"}'
# Query by container name
logcli query '{container_name="my-app"}'
# Query Swarm node logs
logcli query '{com_docker_swarm_node="worker1"}'
System Log Analysis
System logs are collected and forwarded to Loki:
# Systemd service logs
logcli query '{job="systemd"} |= "nginx"'
# Authentication logs
logcli query '{job="auth"} |= "failed"'
# Kernel messages
logcli query '{job="kernel"} |= "error"'
Application Log Patterns
Application logs follow structured formats:
# Structured JSON logs
logcli query '{app="my-service"} | json'
# Logfmt formatted logs
logcli query '{app="my-service"} | logfmt'
# Error tracking
logcli query '{app="my-service"} | logfmt | level="error"'
Security and Audit Logs
Security events are captured and analyzed:
# Failed authentication attempts
logcli query '{job="auth"} |= "failed"'
# Suspicious activity patterns
logcli query '{job="security"} |= "suspicious"'
# Access log analysis
logcli query '{job="nginx"} |= "404"'
Performance Monitoring
Use logs for performance insights:
# Response time analysis
logcli query '{app="api"} |= "response_time"'
# Database query logs
logcli query '{app="database"} |= "slow_query"'
# Resource utilization
logcli query '{job="system"} |= "memory"'
LogQL Reference
Basic Selectors
# Label selector
{job="varlogs"}
# Multiple labels
{job="varlogs", instance="localhost"}
# Regex label matching
{job=~"varlogs|system"}
# Text filtering
{job="varlogs"} |= "error"
# Regex text filtering
{job="varlogs"} |~ "error.*failed"
Pipeline Operations
# Logfmt parsing
{job="varlogs"} | logfmt
# JSON parsing
{job="varlogs"} | json
# Rewrite the log line from an extracted field (run a parser first)
{job="varlogs"} | json | line_format "{{.message}}"
# Rename a label (label_format new=old)
{job="varlogs"} | logfmt | label_format severity=level
Aggregation Operations
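# Count by labels
sum by (level) (count_over_time({job="varlogs"}[5m]))
# Rate calculation
rate({job="varlogs"}[5m])
# Sum over time (needs an unwrapped numeric field; response_time is a placeholder field name)
sum_over_time({job="varlogs"} | logfmt | unwrap response_time [5m])
# Top K
topk(10, sum by (instance) (rate({job="varlogs"}[5m])))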
Best Practices
Query Optimization
- Use specific time ranges to limit data scanned
- Include relevant labels to reduce query scope
- Use --limit to prevent excessive result sets
- Check query volume before running large queries
Label Management
- Monitor label cardinality to avoid performance issues
- Use consistent labeling strategies
- Avoid high-cardinality labels like user IDs
- Regularly review and clean up unused labels
Security Considerations
- Be mindful of sensitive data in logs
- Use appropriate authentication for Loki access
- Consider log retention policies
- Audit access to log data
Performance Monitoring
- Monitor Loki resource usage
- Track query performance metrics
- Set up alerts for high query latency
- Regularly review log volume patterns