multi-cloud-architecture

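Design multi-cloud and hybrid cloud architectures across AWS, Azure, and GCP, covering service selection, network configuration, cost optimization, and migration strategy. Use when building multi-cloud systems, planning hybrid connectivity, optimizing cloud spend, or avoiding vendor lock-in.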

SKILL.md
--- frontmatter
name: multi-cloud-architecture
description: Design multi-cloud and hybrid architectures across AWS, Azure, and GCP — covering service selection, networking, cost optimization, and migration strategy. Use when building multi-cloud systems, planning hybrid connectivity, optimizing cloud spend, or avoiding vendor lock-in.

Multi-Cloud Architecture

Service Comparison

Compute

| AWS | Azure | GCP | Use Case |
| --- | --- | --- | --- |
| EC2 | Virtual Machines | Compute Engine | IaaS VMs |
| ECS | Container Instances | Cloud Run | Containers |
| EKS | AKS | GKE | Kubernetes |
| Lambda | Functions | Cloud Functions | Serverless |

Storage

| AWS | Azure | GCP | Use Case |
| --- | --- | --- | --- |
| S3 | Blob Storage | Cloud Storage | Object |
| EBS | Managed Disks | Persistent Disk | Block |
| EFS | Azure Files | Filestore | File |

Database

| AWS | Azure | GCP | Use Case |
| --- | --- | --- | --- |
| RDS | SQL Database | Cloud SQL | Managed SQL |
| DynamoDB | Cosmos DB | Firestore | NoSQL |
| Aurora | PostgreSQL/MySQL | Cloud Spanner | Distributed SQL |
| ElastiCache | Cache for Redis | Memorystore | Caching |

Architecture Patterns

Single Provider with DR -- Primary workload in one cloud, DR in another. Database replication + automated failover.

Best-of-Breed -- AI/ML on GCP, Enterprise apps on Azure, General compute on AWS. Pick strengths per provider.

Geographic Distribution -- Serve from nearest region, data sovereignty compliance, global load balancing.

Cloud-Agnostic Abstraction -- Portable stack to reduce lock-in:

| Layer | Portable Choice |
| --- | --- |
| Compute | Kubernetes (EKS/AKS/GKE) |
| Database | PostgreSQL/MySQL |
| Messaging | Apache Kafka |
| Cache | Redis |
| Object Storage | S3-compatible API (MinIO) |
| Monitoring | Prometheus/Grafana |
| Service Mesh | Istio/Linkerd |
| IaC | Terraform/OpenTofu |
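The same idea applies inside application code: depend on a narrow interface rather than a provider SDK. The following is a minimal sketch, not a definitive design. `ObjectStore`, `InMemoryStore`, and `archive_report` are hypothetical names introduced for illustration; a real backend would wrap an S3-compatible client behind the same two methods.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical provider-neutral interface for object storage."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend for tests; a real one would wrap an S3-compatible client."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: ObjectStore, name: str, body: str) -> str:
    """Application code sees only the interface, never a provider SDK."""
    key = f"reports/{name}.txt"
    store.put(key, body.encode("utf-8"))
    return key
```

Swapping providers then means writing one new class that satisfies `ObjectStore`, while callers such as `archive_report` stay unchanged.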

Networking

Connection Options

| Provider | VPN | Dedicated |
| --- | --- | --- |
| AWS | Site-to-Site VPN (1.25 Gbps/tunnel) | Direct Connect (1-100 Gbps) |
| Azure | VPN Gateway (varies by SKU) | ExpressRoute (up to 100 Gbps) |
| GCP | Cloud VPN HA (3 Gbps/tunnel, 99.99% SLA) | Cloud Interconnect (10-100 Gbps) |

VPN vs Dedicated Connection Decision

| Factor | VPN | Dedicated (DC/ER/Interconnect) |
| --- | --- | --- |
| Bandwidth need | Up to about 1.25 Gbps per tunnel | Above about 1.25 Gbps, or consistent throughput |
| Latency tolerance | Variable OK | Predictable required |
| Setup time | Hours | Weeks-months |
| Cost | Low (pay per hour) | Higher (port + data) |
| Encryption | Built-in IPsec | Must add if needed (MACsec or overlay) |

Default: Start with VPN, upgrade to dedicated when bandwidth or latency demands it.
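The default rule can be written down as a small helper. This is a rough sketch under stated assumptions: the function name and the single per-tunnel threshold (taken from the AWS figure in the connection table) are illustrative, and a real decision would also weigh cost and lead time.

```python
# Per-tunnel VPN ceiling in Gbps; assumption borrowed from the AWS row above.
VPN_TUNNEL_LIMIT_GBPS = 1.25


def choose_connection(peak_gbps: float, needs_predictable_latency: bool) -> str:
    """Return "vpn" or "dedicated" following the start-with-VPN default."""
    if needs_predictable_latency or peak_gbps > VPN_TUNNEL_LIMIT_GBPS:
        return "dedicated"
    return "vpn"
```

For example, `choose_connection(0.3, False)` stays on VPN, while either a 2 Gbps peak or a strict latency requirement moves the answer to a dedicated link.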

Hub-and-Spoke Topology

code
On-Premises Datacenter
         |
    VPN / Direct Connect
         |
    Transit Gateway (AWS) / vWAN (Azure) / Cloud Router (GCP)
    +-- Production VPC/VNet
    +-- Staging VPC/VNet
    +-- Development VPC/VNet

BGP Essentials

  • On-prem router advertises internal CIDRs (e.g., 10.0.0.0/8) using a private ASN (64512-65534)
  • Cloud-side ASNs: AWS default 64512, Azure default 65515 (fixed for Virtual WAN hubs), GCP configurable
  • Always run dual tunnels for HA -- active/active with ECMP or active/passive
  • Monitor: tunnel status, BGP session state, packet loss, latency, bytes in/out
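A pre-flight check for these rules might look like the sketch below. `BgpPlan` and `validate_plan` are hypothetical names; the check that both ASNs differ reflects the fact that eBGP peers sit in different autonomous systems, which matters when an on-prem router reuses a cloud default such as 64512.

```python
from dataclasses import dataclass
from typing import List

# 16-bit private ASN range quoted in the list above.
PRIVATE_ASN_MIN, PRIVATE_ASN_MAX = 64512, 65534


def is_private_asn(asn: int) -> bool:
    return PRIVATE_ASN_MIN <= asn <= PRIVATE_ASN_MAX


@dataclass
class BgpPlan:
    on_prem_asn: int
    cloud_asn: int
    tunnel_count: int


def validate_plan(plan: BgpPlan) -> List[str]:
    """Return a list of human-readable problems; empty means the plan passes."""
    problems = []
    if not is_private_asn(plan.on_prem_asn):
        problems.append("on-prem ASN is outside the private range")
    if plan.on_prem_asn == plan.cloud_asn:
        problems.append("on-prem and cloud ASNs must differ for eBGP")
    if plan.tunnel_count < 2:
        problems.append("fewer than two tunnels: no HA")
    return problems
```

Running `validate_plan(BgpPlan(64512, 64512, 1))` reports both the ASN clash and the missing second tunnel.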

Cost Optimization

Pricing Models

| Model | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Reserved | RI + Savings Plans (30-72%) | Reserved VMs (up to 72%) | Committed Use (up to 57%) |
| Spot/Preemptible | Spot (up to 90% off, 2-min notice) | Spot VMs | Preemptible (80% off, 24h max) |
| Auto-discount | None | Hybrid Benefit (existing licenses) | Sustained Use (automatic, up to 30%) |
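To see how these models combine, the sketch below estimates an hourly bill for a steady baseline on reserved capacity plus a burst tier on spot. The function name, the rates, and the discount values are illustrative assumptions, not provider quotes.

```python
def blended_hourly_cost(
    on_demand_rate: float,
    baseline_instances: int,
    burst_instances: int,
    reserved_discount: float,
    spot_discount: float,
) -> float:
    """Hourly cost with baseline on reserved pricing and burst on spot pricing."""
    reserved = baseline_instances * on_demand_rate * (1 - reserved_discount)
    spot = burst_instances * on_demand_rate * (1 - spot_discount)
    return reserved + spot


# Hypothetical inputs: $0.10/h on demand, 40% reserved and 70% spot discounts.
cost = blended_hourly_cost(0.10, 10, 5, reserved_discount=0.40, spot_discount=0.70)
all_on_demand = 15 * 0.10
print(f"blended: ${cost:.2f}/h vs on-demand: ${all_on_demand:.2f}/h")
```

With these made-up numbers the blended figure is half the all-on-demand figure; actual discounts depend on term, instance family, and region.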

Tagging Strategy (Required Tags)

| Tag | Purpose | Example |
| --- | --- | --- |
| Environment | Env separation | production, staging, dev |
| Project | Cost allocation | my-project |
| CostCenter | Chargeback | engineering |
| Owner | Accountability | team@example.com |
| ManagedBy | Drift detection | terraform |
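A tag policy is easiest to enforce when it is checked automatically, for example in CI against planned resources. The helper below is a minimal sketch; `missing_tags` is a hypothetical name and it assumes tags arrive as a plain string-to-string mapping.

```python
from typing import Dict, List

# Required tag keys from the table above.
REQUIRED_TAGS = ("Environment", "Project", "CostCenter", "Owner", "ManagedBy")


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return required tag keys that are absent or blank, in policy order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]
```

For a resource tagged only with `Environment` and `Project`, the function returns `CostCenter`, `Owner`, and `ManagedBy`, which a pipeline could turn into a failed check.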

Cost Optimization Checklist

  • Tag all resources with required tags above
  • Delete unused resources (unattached disks, idle LBs, old snapshots, unassociated EIPs)
  • Right-size instances based on utilization (use provider advisors/recommenders)
  • Reserved capacity for steady-state workloads; spot/preemptible for fault-tolerant
  • Implement auto-scaling with appropriate cooldowns
  • Storage lifecycle policies: hot -> warm -> cold -> archive
  • Set budget alerts at 50%, 80%, 100% thresholds
  • Enable cost anomaly detection
  • Optimize data transfer (same-AZ where possible, VPC endpoints, CDN)
  • Add caching layers to reduce compute/DB load
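The budget alert item from the checklist can be illustrated with a few lines of logic. This sketch assumes spend and budget are in the same currency and period; `crossed_thresholds` is a hypothetical helper, and real deployments would normally use each provider's native budget alerts instead.

```python
from typing import List

# Alert thresholds from the checklist: 50%, 80%, 100% of budget.
THRESHOLDS = (0.5, 0.8, 1.0)


def crossed_thresholds(spend: float, budget: float) -> List[float]:
    """Return every threshold that current spend has reached or passed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in THRESHOLDS if ratio >= t]
```

A spend of 850 against a budget of 1000 crosses the 50% and 80% thresholds but not the 100% one.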

Cost Tools

  • AWS: Cost Explorer, Compute Optimizer, Cost Anomaly Detection
  • Azure: Cost Management, Advisor
  • GCP: Cost Management, Recommender
  • Multi-cloud: CloudHealth, Cloudability, Kubecost

Migration Strategy

  1. Assessment -- Inventory workloads, map dependencies, estimate costs, identify compliance constraints
  2. Pilot -- Select low-risk workload, implement, validate, document lessons
  3. Migration -- Incremental moves, dual-run period, automated testing, rollback plan per workload
  4. Optimization -- Right-size, adopt cloud-native services, implement cost governance
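Once dependencies are mapped in the assessment phase, they can be used to group workloads into migration waves. The sketch below assumes that a workload should move only after everything it depends on has moved, which is one possible ordering rather than a rule from this guide; `migration_waves` and the sample workload names are hypothetical. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def migration_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group workloads into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical inventory: web -> api -> db, and batch -> db.
plan = migration_waves({"web": {"api"}, "api": {"db"}, "batch": {"db"}})
print(plan)
```

Workloads within the same wave have no dependencies on each other, so they are candidates for parallel moves, each with its own rollback plan.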

Gotchas and Anti-Patterns

  • Lift-and-shift everything -- Re-platform or re-architect where ROI justifies it
  • Ignoring egress costs -- Data transfer between clouds/regions adds up fast; design data gravity around primary provider
  • Multi-cloud for the sake of it -- Real multi-cloud adds operational complexity; have a concrete reason (DR, best-of-breed, compliance)
  • No abstraction layer -- Without Terraform/K8s, multi-cloud becomes multi-headache
  • Skipping tagging -- Impossible to optimize costs or enforce governance without consistent tags
  • Single tunnel -- Always deploy redundant VPN tunnels; single tunnel = single point of failure
  • Overlapping CIDRs -- Plan IP address space across all environments upfront; retrofitting is painful
  • No network monitoring -- Hybrid connectivity issues are invisible without proactive tunnel/BGP monitoring
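Overlapping CIDRs in particular are cheap to catch before anything is deployed. The following sketch uses the standard-library `ipaddress` module; `find_overlaps` and the environment names are hypothetical.

```python
from ipaddress import ip_network
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of environment names whose address ranges overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan with one deliberate clash.
clashes = find_overlaps({
    "on-prem": "10.0.0.0/16",
    "aws-prod": "10.1.0.0/16",
    "azure-prod": "10.1.128.0/20",
})
print(clashes)
```

Running a check like this against the full address plan whenever a new VPC or VNet is proposed keeps the retrofit problem from appearing later.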