Azure Resource Health & Issue Diagnosis
This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.
Prerequisites
- •Azure MCP server configured and authenticated
- •Target Azure resource identified (name and optionally resource group/subscription)
- •Resource must be deployed and running to generate logs/telemetry
- •Prefer Azure MCP tools (
azmcp-*) over direct Azure CLI when available
Workflow Steps
Step 1: Get Azure Best Practices
Action: Retrieve diagnostic and troubleshooting best practices Tools: Azure MCP best practices tool Process:
- •Load Best Practices:
- •Execute Azure best practices tool to get diagnostic guidelines
- •Focus on health monitoring, log analysis, and issue resolution patterns
- •Use these practices to inform diagnostic approach and remediation recommendations
Step 2: Resource Discovery & Identification
Action: Locate and identify the target Azure resource Tools: Azure MCP tools + Azure CLI fallback Process:
- •
Resource Lookup:
- •If only resource name provided: Search across subscriptions using
azmcp-subscription-list - •Use
az resource list --name <resource-name>to find matching resources - •If multiple matches found, prompt user to specify subscription/resource group
- •Gather detailed resource information:
- •Resource type and current status
- •Location, tags, and configuration
- •Associated services and dependencies
- •If only resource name provided: Search across subscriptions using
- •
Resource Type Detection:
- •Identify resource type to determine appropriate diagnostic approach:
- •Web Apps/Function Apps: Application logs, performance metrics, dependency tracking
- •Virtual Machines: System logs, performance counters, boot diagnostics
- •Cosmos DB: Request metrics, throttling, partition statistics
- •Storage Accounts: Access logs, performance metrics, availability
- •SQL Database: Query performance, connection logs, resource utilization
- •Application Insights: Application telemetry, exceptions, dependencies
- •Key Vault: Access logs, certificate status, secret usage
- •Service Bus: Message metrics, dead letter queues, throughput
- •Identify resource type to determine appropriate diagnostic approach:
Step 3: Health Status Assessment
Action: Evaluate current resource health and availability Tools: Azure MCP monitoring tools + Azure CLI Process:
- •
Basic Health Check:
- •Check resource provisioning state and operational status
- •Verify service availability and responsiveness
- •Review recent deployment or configuration changes
- •Assess current resource utilization (CPU, memory, storage, etc.)
- •
Service-Specific Health Indicators:
- •Web Apps: HTTP response codes, response times, uptime
- •Databases: Connection success rate, query performance, deadlocks
- •Storage: Availability percentage, request success rate, latency
- •VMs: Boot diagnostics, guest OS metrics, network connectivity
- •Functions: Execution success rate, duration, error frequency
Step 4: Log & Telemetry Analysis
Action: Analyze logs and telemetry to identify issues and patterns Tools: Azure MCP monitoring tools for Log Analytics queries Process:
- •
Find Monitoring Sources:
- •Use
azmcp-monitor-workspace-listto identify Log Analytics workspaces - •Locate Application Insights instances associated with the resource
- •Identify relevant log tables using
azmcp-monitor-table-list
- •Use
- •
Execute Diagnostic Queries: Use
azmcp-monitor-log-querywith targeted KQL queries based on resource type:General Error Analysis:
kql// Recent errors and exceptions union isfuzzy=true AzureDiagnostics, AppServiceHTTPLogs, AppServiceAppLogs, AzureActivity | where TimeGenerated > ago(24h) | where Level == "Error" or ResultType != "Success" | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h) | order by TimeGenerated descPerformance Analysis:
kql// Performance degradation patterns Perf | where TimeGenerated > ago(7d) | where ObjectName == "Processor" and CounterName == "% Processor Time" | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h) | where avg_CounterValue > 80
Application-Specific Queries:
kql// Application Insights - Failed requests requests | where timestamp > ago(24h) | where success == false | summarize FailureCount=count() by resultCode, bin(timestamp, 1h) | order by timestamp desc // Database - Connection failures AzureDiagnostics | where ResourceProvider == "MICROSOFT.SQL" | where Category == "SQLSecurityAuditEvents" | where action_name_s == "CONNECTION_FAILED" | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
- •
Pattern Recognition:
- •Identify recurring error patterns or anomalies
- •Correlate errors with deployment times or configuration changes
- •Analyze performance trends and degradation patterns
- •Look for dependency failures or external service issues
Step 5: Issue Classification & Root Cause Analysis
Action: Categorize identified issues and determine root causes Process:
- •
Issue Classification:
- •Critical: Service unavailable, data loss, security breaches
- •High: Performance degradation, intermittent failures, high error rates
- •Medium: Warnings, suboptimal configuration, minor performance issues
- •Low: Informational alerts, optimization opportunities
- •
Root Cause Analysis:
- •Configuration Issues: Incorrect settings, missing dependencies
- •Resource Constraints: CPU/memory/disk limitations, throttling
- •Network Issues: Connectivity problems, DNS resolution, firewall rules
- •Application Issues: Code bugs, memory leaks, inefficient queries
- •External Dependencies: Third-party service failures, API limits
- •Security Issues: Authentication failures, certificate expiration
- •
Impact Assessment:
- •Determine business impact and affected users/systems
- •Evaluate data integrity and security implications
- •Assess recovery time objectives and priorities
Step 6: Generate Remediation Plan
Action: Create a comprehensive plan to address identified issues Process:
- •
Immediate Actions (Critical issues):
- •Emergency fixes to restore service availability
- •Temporary workarounds to mitigate impact
- •Escalation procedures for complex issues
- •
Short-term Fixes (High/Medium issues):
- •Configuration adjustments and resource scaling
- •Application updates and patches
- •Monitoring and alerting improvements
- •
Long-term Improvements (All issues):
- •Architectural changes for better resilience
- •Preventive measures and monitoring enhancements
- •Documentation and process improvements
- •
Implementation Steps:
- •Prioritized action items with specific Azure CLI commands
- •Testing and validation procedures
- •Rollback plans for each change
- •Monitoring to verify issue resolution
Step 7: User Confirmation & Report Generation
Action: Present findings and get approval for remediation actions Process:
- •
Display Health Assessment Summary:
code🏥 Azure Resource Health Assessment 📊 Resource Overview: • Resource: [Name] ([Type]) • Status: [Healthy/Warning/Critical] • Location: [Region] • Last Analyzed: [Timestamp] 🚨 Issues Identified: • Critical: X issues requiring immediate attention • High: Y issues affecting performance/reliability • Medium: Z issues for optimization • Low: N informational items 🔍 Top Issues: 1. [Issue Type]: [Description] - Impact: [High/Medium/Low] 2. [Issue Type]: [Description] - Impact: [High/Medium/Low] 3. [Issue Type]: [Description] - Impact: [High/Medium/Low] 🛠️ Remediation Plan: • Immediate Actions: X items • Short-term Fixes: Y items • Long-term Improvements: Z items • Estimated Resolution Time: [Timeline] ❓ Proceed with detailed remediation plan? (y/n)
- •
Generate Detailed Report:
markdown# Azure Resource Health Report: [Resource Name] **Generated**: [Timestamp] **Resource**: [Full Resource ID] **Overall Health**: [Status with color indicator] ## 🔍 Executive Summary [Brief overview of health status and key findings] ## 📊 Health Metrics - **Availability**: X% over last 24h - **Performance**: [Average response time/throughput] - **Error Rate**: X% over last 24h - **Resource Utilization**: [CPU/Memory/Storage percentages] ## 🚨 Issues Identified ### Critical Issues - **[Issue 1]**: [Description] - **Root Cause**: [Analysis] - **Impact**: [Business impact] - **Immediate Action**: [Required steps] ### High Priority Issues - **[Issue 2]**: [Description] - **Root Cause**: [Analysis] - **Impact**: [Performance/reliability impact] - **Recommended Fix**: [Solution steps] ## 🛠️ Remediation Plan ### Phase 1: Immediate Actions (0-2 hours) ```bash # Critical fixes to restore service [Azure CLI commands with explanations]
Phase 2: Short-term Fixes (2-24 hours)
bash# Performance and reliability improvements [Azure CLI commands with explanations]
Phase 3: Long-term Improvements (1-4 weeks)
bash# Architectural and preventive measures [Azure CLI commands and configuration changes]
📈 Monitoring Recommendations
- •Alerts to Configure: [List of recommended alerts]
- •Dashboards to Create: [Monitoring dashboard suggestions]
- •Regular Health Checks: [Recommended frequency and scope]
✅ Validation Steps
- • Verify issue resolution through logs
- • Confirm performance improvements
- • Test application functionality
- • Update monitoring and alerting
- • Document lessons learned
📝 Prevention Measures
- •[Recommendations to prevent similar issues]
- •[Process improvements]
- •[Monitoring enhancements]
code
Error Handling
- •Resource Not Found: Provide guidance on resource name/location specification
- •Authentication Issues: Guide user through Azure authentication setup
- •Insufficient Permissions: List required RBAC roles for resource access
- •No Logs Available: Suggest enabling diagnostic settings and waiting for data
- •Query Timeouts: Break down analysis into smaller time windows
- •Service-Specific Issues: Provide generic health assessment with limitations noted
Success Criteria
- •✅ Resource health status accurately assessed
- •✅ All significant issues identified and categorized
- •✅ Root cause analysis completed for major problems
- •✅ Actionable remediation plan with specific steps provided
- •✅ Monitoring and prevention recommendations included
- •✅ Clear prioritization of issues by business impact
- •✅ Implementation steps include validation and rollback procedures