AWS HyperShift Debug Skill
This skill helps debug AWS resources related to HyperShift clusters, identifying resources that may be blocking cluster deletion or orphaned infrastructure.
When to Use
- •Cluster deletion is stuck (finalizer not removed)
- •Need to identify orphaned AWS resources
- •Debugging IAM permission issues during cluster operations
- •Verifying cleanup after cluster destruction
- •User asks "debug hypershift" or "why is cluster stuck"
Arguments
- •
cluster-name: The HyperShift cluster name or suffix to debug (optional, defaults to local) - •
--check: Quiet mode - returns exit code only (0=no resources, 1=resources exist)
Quick Debug Commands
Prerequisites
Ensure credentials are loaded:
bash
# Load HyperShift credentials (from kagenti repo root) source .env.kagenti-hypershift-custom # or .env.hypershift-ci
Run Full Debug Script
bash
# Run comprehensive AWS debug ./.github/scripts/hypershift/debug-aws-hypershift.sh [cluster-name] # Examples: ./.github/scripts/hypershift/debug-aws-hypershift.sh # defaults to username ./.github/scripts/hypershift/debug-aws-hypershift.sh ladas # suffix only ./.github/scripts/hypershift/debug-aws-hypershift.sh kagenti-hypershift-custom-ladas # full name ./.github/scripts/hypershift/debug-aws-hypershift.sh pr529 # Check mode (quiet, returns exit code only) ./.github/scripts/hypershift/debug-aws-hypershift.sh --check ladas echo $? # 0 = no resources, 1 = resources exist
Manual Debug Commands
Check HostedCluster Status
bash
# Set management cluster kubeconfig
source .env.kagenti-hypershift-custom # loads KUBECONFIG
# Get HostedCluster status
oc get hostedcluster -n clusters <cluster-name> -o yaml
# Check deletion timestamp and finalizers
oc get hostedcluster -n clusters <cluster-name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}'
# Check conditions
oc get hostedcluster -n clusters <cluster-name> -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.status}{" - "}{.message}{"\n"}{end}'
Check HyperShift Operator Logs
bash
# Operator logs (filtered for cluster) oc logs -n hypershift -l app=operator --tail=200 | grep -i "<cluster-name>\|error\|failed\|denied" # All operator logs oc logs -n hypershift -l app=operator --tail=500
EC2 Resources
bash
# EC2 Instances
aws ec2 describe-instances \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name,InstanceType,Tags[?Key==`Name`].Value|[0]]' \
--output table
# VPCs
aws ec2 describe-vpcs \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Vpcs[*].[VpcId,State,CidrBlock]' \
--output table
# Subnets
aws ec2 describe-subnets \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Subnets[*].[SubnetId,State,CidrBlock,AvailabilityZone]' \
--output table
# Security Groups
aws ec2 describe-security-groups \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'SecurityGroups[*].[GroupId,GroupName,VpcId]' \
--output table
# NAT Gateways
aws ec2 describe-nat-gateways \
--filter "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'NatGateways[*].[NatGatewayId,State,VpcId]' \
--output table
# Internet Gateways
aws ec2 describe-internet-gateways \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'InternetGateways[*].[InternetGatewayId,Attachments[0].VpcId]' \
--output table
# Elastic IPs
aws ec2 describe-addresses \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Addresses[*].[AllocationId,PublicIp]' \
--output table
Load Balancers
bash
# Classic ELBs
aws elb describe-load-balancers \
--query "LoadBalancerDescriptions[?contains(LoadBalancerName, '<cluster-name>')].[LoadBalancerName,Scheme]" \
--output table
# ALB/NLBs
aws elbv2 describe-load-balancers \
--query "LoadBalancers[?contains(LoadBalancerName, '<cluster-name>')].[LoadBalancerName,Type,State.Code]" \
--output table
S3 Buckets
bash
# List S3 buckets for cluster aws s3api list-buckets --query "Buckets[?contains(Name, '<cluster-name>')].Name" --output text # Check bucket contents (if exists) aws s3 ls s3://<bucket-name>/ --recursive
IAM Resources
bash
# IAM Roles aws iam list-roles --query "Roles[?contains(RoleName, '<cluster-name>')].RoleName" --output text # Instance Profiles aws iam list-instance-profiles --query "InstanceProfiles[?contains(InstanceProfileName, '<cluster-name>')].InstanceProfileName" --output text # OIDC Providers aws iam list-open-id-connect-providers --query 'OpenIDConnectProviderList[*].Arn' --output text | tr '\t' '\n' | grep "<cluster-name>"
Route53
bash
# Hosted zones aws route53 list-hosted-zones --query "HostedZones[?contains(Name, '<cluster-name>')].[Name,Id,Config.PrivateZone]" --output table # All zones (for reference) aws route53 list-hosted-zones --query 'HostedZones[*].[Name,Id,Config.PrivateZone]' --output table
Cleanup Commands (Use with Caution)
Force Remove HostedCluster Finalizer
Warning: This orphans AWS resources. Only use if resources are already deleted or you will clean them manually.
bash
oc patch hostedcluster -n clusters <cluster-name> \
-p '{"metadata":{"finalizers":null}}' --type=merge
Manual Resource Cleanup Order
If you need to manually delete AWS resources, delete in this order:
- •EC2 Instances - Terminate worker nodes
- •NAT Gateways - Delete and wait for deletion
- •Elastic IPs - Release after NAT gateways deleted
- •Load Balancers - Delete ELBs/ALBs/NLBs
- •Security Groups - Delete (may need multiple passes for dependencies)
- •Subnets - Delete all subnets
- •Internet Gateways - Detach and delete
- •VPCs - Delete VPC
- •Route53 - Delete private hosted zone
- •S3 Buckets - Empty and delete OIDC bucket
- •IAM Roles - Delete roles and instance profiles
- •OIDC Provider - Delete OIDC provider
bash
# Example: Delete all EC2 instances for a cluster
INSTANCE_IDS=$(aws ec2 describe-instances \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Reservations[*].Instances[*].InstanceId' \
--output text)
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS
Find Orphaned Resources
bash
# Find orphaned resources across all clusters ./.github/scripts/hypershift/find-orphaned-resources.sh
Common Issues
Issue: Cluster stuck in deletion
Symptoms: HostedCluster has deletionTimestamp but finalizer remains
Debug Steps:
- •Check HyperShift operator logs for errors
- •Check AWS resources still exist (VPCs, subnets, etc.)
- •Verify IAM permissions for resource deletion
- •Check if resources have dependencies blocking deletion
Common Causes:
- •Security group has dependencies
- •ENIs attached to resources
- •Load balancer still has targets
- •IAM policy doesn't allow deletion
Issue: Permission denied during deletion
Symptoms: Operator logs show "Access Denied" or "UnauthorizedOperation"
Debug Steps:
- •Check which resource deletion is failing
- •Verify IAM policy has permission for that resource type
- •Check if tag conditions match (kubernetes.io/cluster/<cluster-name>)
Fix: Update IAM policies and re-run setup script:
bash
./.github/scripts/hypershift/setup-hypershift-ci-credentials.sh
Issue: AWS resources orphaned
Symptoms: HostedCluster deleted but AWS resources remain
Debug Steps:
- •Run this debug skill to find remaining resources
- •Note all resource IDs
- •Follow manual cleanup order above
Integration with Other Skills
- •hypershift:cluster: Create and destroy clusters
- •hypershift:quotas: Check AWS quotas before debugging
- •k8s:health: Check overall cluster health after debugging
- •k8s:logs: Examine HyperShift operator logs in detail
- •k8s:pods: Debug control plane pods on management cluster
Related Documentation
- •
.github/scripts/local-setup/README.md- Local setup documentation