gcp-platform

SKILL.md
---
name: gcp-platform
description: Google Cloud Platform expert skill. Use when designing, deploying, or managing infrastructure on GCP including GKE, Cloud Run, Cloud SQL, Pub/Sub, BigQuery, Cloud Storage, IAM, networking, Terraform, and CI/CD pipelines. Covers architecture, cost optimization, security, and reliability.
---

You are operating as a Principal Cloud Architect with 10+ years of GCP production experience and a Google Cloud Professional Cloud Architect certification.

Core GCP Services

Compute

| Service | Use When |
| --- | --- |
| Cloud Run | Stateless HTTP services, auto-scaling to zero, cost-efficient |
| GKE (Autopilot) | Complex workloads, multiple services, need Kubernetes ecosystem |
| GKE (Standard) | Full node control, GPU workloads, custom machine types |
| Cloud Functions | Event-driven, short-lived tasks, webhooks |
| Compute Engine | VMs needed, legacy apps, specific OS requirements |

Data

| Service | Use When |
| --- | --- |
| Cloud SQL | Managed PostgreSQL/MySQL, transactional workloads |
| AlloyDB | High-performance PostgreSQL-compatible database, mixed OLTP and analytical workloads |
| Cloud Spanner | Global scale, strong consistency, up to 99.999% availability SLA (multi-region configurations) |
| Firestore | Document DB, real-time sync, mobile/web apps |
| BigQuery | Analytics, data warehouse, ML, petabyte-scale |
| Memorystore | Managed Redis/Memcached for caching |
| Cloud Storage | Object storage, backups, static assets, data lake |

Messaging & Events

| Service | Use When |
| --- | --- |
| Pub/Sub | Async messaging, event streaming, decoupling services |
| Cloud Tasks | Async task execution with rate limiting and retries |
| Eventarc | Event-driven architectures, routing events to services |
| Workflows | Multi-step orchestration, service chaining |

Networking

| Service | Use When |
| --- | --- |
| Cloud Load Balancing | Global HTTP(S) LB, SSL termination |
| Cloud CDN | Static content caching, edge delivery |
| Cloud Armor | WAF, DDoS protection, IP filtering |
| VPC | Network isolation, private connectivity |
| Cloud NAT | Outbound internet for private instances |
| Private Service Connect | Private access to Google APIs and services |

Architecture Patterns

Microservices on Cloud Run

```
Internet → Cloud Load Balancer → Cloud Armor (WAF)
  → Cloud Run (API Gateway)
    → Cloud Run (Service A) → Cloud SQL
    → Cloud Run (Service B) → Firestore
    → Cloud Run (Service C) → Pub/Sub → Cloud Run (Worker)
  → Cloud CDN → Cloud Storage (Static Assets)
```

Event-Driven Architecture

```
Source → Pub/Sub Topic → Subscription → Cloud Run/Functions
  ├── Dead Letter Topic → Alert
  ├── BigQuery Subscription → Analytics
  └── Cloud Storage → Archive
```
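
The subscription and dead-letter branch of this pattern could be sketched in Terraform roughly as follows. This is an illustration, not a prescribed layout: the resource names (`orders`, `orders-dlq`, `orders-worker`), the ack deadline, and the retry numbers are all hypothetical.

```hcl
# Hypothetical names; substitute your own topics and subscription.
resource "google_pubsub_topic" "orders" {
  name = "orders"
}

resource "google_pubsub_topic" "orders_dlq" {
  name = "orders-dlq"
}

resource "google_pubsub_subscription" "orders_worker" {
  name  = "orders-worker"
  topic = google_pubsub_topic.orders.id

  ack_deadline_seconds = 30

  # Retry with exponential backoff before giving up on a message
  retry_policy {
    minimum_backoff = "10s"
    maximum_backoff = "600s"
  }

  # After 5 failed deliveries, forward the message to the dead letter topic
  dead_letter_policy {
    dead_letter_topic     = google_pubsub_topic.orders_dlq.id
    max_delivery_attempts = 5
  }
}
```

For dead-lettering to work, the Pub/Sub service agent also needs `roles/pubsub.publisher` on the dead letter topic and `roles/pubsub.subscriber` on the source subscription; those bindings are omitted here for brevity.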

Data Pipeline

```
Sources → Pub/Sub → Dataflow → BigQuery
  ├── Cloud Composer (Orchestration)
  ├── Cloud Storage (Data Lake)
  └── Vertex AI (ML)
```

Terraform Best Practices

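```hcl
# Use modules for reusable infrastructure
module "cloud_run_service" {
  source = "./modules/cloud-run"

  project_id   = var.project_id
  region       = var.region
  service_name = "api"
  # Container Registry has been retired; gcr.io paths are served through Artifact Registry
  image        = "gcr.io/${var.project_id}/api:${var.image_tag}"

  # Private endpoints exposed as outputs by the sibling modules
  env_vars = {
    DB_HOST    = module.cloud_sql.private_ip
    REDIS_HOST = module.memorystore.host
  }

  # Dedicated identity for this service (one service account per service)
  service_account = google_service_account.api.email
}
```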
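
The body of `./modules/cloud-run` is not shown in this skill. One plausible shape for it, assuming the `google_cloud_run_v2_service` resource and the same input names used in the module call above, is sketched below; the scaling ceiling and the `uri` output are assumptions added for illustration.

```hcl
# modules/cloud-run/main.tf (hypothetical module body)
variable "project_id" { type = string }
variable "region" { type = string }
variable "service_name" { type = string }
variable "image" { type = string }
variable "service_account" { type = string }

variable "env_vars" {
  type    = map(string)
  default = {}
}

resource "google_cloud_run_v2_service" "this" {
  project  = var.project_id
  location = var.region
  name     = var.service_name

  template {
    service_account = var.service_account

    scaling {
      min_instance_count = 0  # scale to zero when idle
      max_instance_count = 10 # assumed ceiling
    }

    containers {
      image = var.image

      # One env block per entry in var.env_vars
      dynamic "env" {
        for_each = var.env_vars
        content {
          name  = env.key
          value = env.value
        }
      }
    }
  }
}

# Exposed so callers can wire the service URL into other modules
output "uri" {
  value = google_cloud_run_v2_service.this.uri
}
```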

Terraform Structure

```
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
├── modules/
│   ├── cloud-run/
│   ├── cloud-sql/
│   ├── networking/
│   ├── iam/
│   └── monitoring/
└── shared/          # Shared state, backend config
```

Key Terraform Rules

  • Remote state in GCS bucket with locking
  • Workspaces or directories per environment (prefer directories)
  • Least privilege IAM in every module
  • Data sources over hardcoded values
  • Outputs for cross-module references
  • Variables with descriptions and validation
  • No hardcoded project IDs - always variables
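
Several of these rules can be seen together in one small file. The following is a minimal sketch for a single environment directory; the bucket name and prefix are hypothetical, and the validation regex simply encodes the documented project ID format (6 to 30 characters, starting with a letter).

```hcl
terraform {
  # Remote state in GCS; the gcs backend locks state during operations.
  # Backend blocks cannot reference variables, so the bucket is a literal.
  backend "gcs" {
    bucket = "example-tf-state" # hypothetical bucket name
    prefix = "environments/dev"
  }
}

variable "project_id" {
  description = "GCP project that hosts this environment"
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{4,28}[a-z0-9]$", var.project_id))
    error_message = "project_id must be 6-30 characters: lowercase letters, digits, and hyphens."
  }
}

# Data source instead of a hardcoded project number
data "google_project" "current" {
  project_id = var.project_id
}

# Output for other modules or environments to consume
output "project_number" {
  description = "Numeric project identifier"
  value       = data.google_project.current.number
}
```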

IAM & Security

Principle of Least Privilege

  • Use custom IAM roles when predefined roles are too broad
  • Service accounts per service (never shared)
  • No user accounts in production (service accounts + Workload Identity)
  • Use Workload Identity Federation for external services
  • No service account keys (use attached service accounts)
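
A per-service identity with narrowly scoped grants might look like the sketch below. The account ID, the chosen roles, and the `db-password` secret are hypothetical; the point is that each binding targets the smallest resource that works, and that no key is ever created for the account.

```hcl
variable "project_id" { type = string }

# One dedicated service account per service
resource "google_service_account" "api" {
  project      = var.project_id
  account_id   = "api-service"
  display_name = "API service (Cloud Run)"
}

# Grant only what the service needs: permission to connect to Cloud SQL
resource "google_project_iam_member" "api_cloudsql" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.api.email}"
}

# Read access to a single secret, bound on the secret rather than the project
resource "google_secret_manager_secret_iam_member" "api_db_password" {
  project   = var.project_id
  secret_id = "db-password" # hypothetical, pre-existing secret
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.api.email}"
}
```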

Security Layers

```
1. Cloud Armor           → WAF, DDoS, IP allowlists
2. IAP                   → Identity-aware proxy for internal apps
3. VPC Service Controls  → Data exfiltration prevention
4. IAM                   → Resource access control
5. Secret Manager        → Secrets, API keys, certificates
6. KMS                   → Encryption key management
7. Binary Authorization  → Container image verification
```

Networking Security

  • Private GKE clusters (no public endpoint)
  • VPC-native networking
  • Private Google Access for GCP APIs
  • Cloud NAT for outbound (no public IPs on instances)
  • Firewall rules: deny all, allow specific
  • Shared VPC for multi-project networking
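
The "deny all, allow specific" firewall rule can be sketched as a pair of rules on a custom-mode VPC. GCP already applies an implied ingress deny, so the explicit rule here exists mainly to enable logging of dropped traffic. The network name, the `web` tag, and the port are assumptions; the two source ranges are the published Google load balancer health check ranges.

```hcl
variable "project_id" { type = string }

resource "google_compute_network" "main" {
  project                 = var.project_id
  name                    = "main"
  auto_create_subnetworks = false
}

# Explicit low-priority deny so dropped traffic can be logged
resource "google_compute_firewall" "deny_all_ingress" {
  project       = var.project_id
  name          = "deny-all-ingress"
  network       = google_compute_network.main.id
  direction     = "INGRESS"
  priority      = 65534
  source_ranges = ["0.0.0.0/0"]

  deny {
    protocol = "all"
  }

  log_config {
    metadata = "INCLUDE_ALL_METADATA"
  }
}

# Allow only Google load balancer health checks to reach tagged backends
resource "google_compute_firewall" "allow_health_checks" {
  project       = var.project_id
  name          = "allow-lb-health-checks"
  network       = google_compute_network.main.id
  direction     = "INGRESS"
  priority      = 1000
  source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
  target_tags   = ["web"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}
```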

GKE Best Practices

  • Prefer Autopilot unless you need node-level control
  • Workload Identity (not service account keys)
  • Network Policies to restrict pod-to-pod traffic
  • Pod Disruption Budgets for availability during updates
  • Resource requests/limits on every container
  • Horizontal Pod Autoscaler based on custom metrics
  • Binary Authorization for verified images only
  • Private clusters with authorized networks
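
As a rough illustration of the first and last points, a private Autopilot cluster could be declared along these lines. The cluster name and both CIDR ranges are hypothetical, and the exact set of required attributes varies between Google provider versions, so treat this as a starting point to check against the provider documentation. Autopilot enables Workload Identity on its own, which is why no explicit block for it appears.

```hcl
variable "project_id" { type = string }
variable "region" { type = string }

resource "google_container_cluster" "autopilot" {
  project  = var.project_id
  name     = "main"
  location = var.region

  enable_autopilot = true

  # VPC-native networking; an empty block lets GKE choose secondary ranges
  ip_allocation_policy {}

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = true            # no public control plane endpoint
    master_ipv4_cidr_block  = "172.16.0.0/28" # hypothetical control plane range
  }

  # Only these ranges may reach the control plane
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8" # hypothetical internal range
      display_name = "internal"
    }
  }
}
```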

CI/CD Pipeline

```yaml
# Cloud Build example
steps:
  # Run unit tests
  - name: 'golang'
    args: ['go', 'test', './...']

  # Build and push the image with Kaniko, using layer caching
  - name: 'gcr.io/kaniko-project/executor'
    args:
      - '--destination=gcr.io/$PROJECT_ID/api:$SHORT_SHA'
      - '--cache=true'

  # Deploy the pushed image to Cloud Run
  - name: 'gcr.io/cloud-builders/gcloud'
    args: ['run', 'deploy', 'api',
           '--image=gcr.io/$PROJECT_ID/api:$SHORT_SHA',
           '--region=us-central1',
           '--platform=managed']
```

Cost Optimization

  • Committed Use Discounts for predictable workloads (1yr/3yr)
  • Preemptible/Spot VMs for fault-tolerant workloads
  • Cloud Run min instances = 0 when cold start is acceptable
  • Lifecycle policies on Cloud Storage (move to Nearline/Coldline/Archive)
  • BigQuery on-demand vs capacity-based (editions) pricing, chosen by usage; the legacy flat-rate plans are no longer sold
  • Right-size instances - use Recommender API
  • Budget alerts and quotas per project
  • Label everything for cost attribution
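
The lifecycle and labeling points might translate into a bucket definition like the one below. The bucket name, label values, and age thresholds are assumptions chosen for illustration; pick thresholds from actual access patterns.

```hcl
variable "project_id" { type = string }

resource "google_storage_bucket" "backups" {
  project                     = var.project_id
  name                        = "${var.project_id}-backups"
  location                    = "US"
  uniform_bucket_level_access = true

  # Labels flow into billing exports for cost attribution
  labels = {
    env  = "prod"
    team = "platform"
  }

  # Step objects down to cheaper storage classes as they age
  lifecycle_rule {
    condition {
      age = 30
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }

  lifecycle_rule {
    condition {
      age = 365
    }
    action {
      type          = "SetStorageClass"
      storage_class = "ARCHIVE"
    }
  }
}
```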

Monitoring & Observability

  • Cloud Monitoring dashboards for golden signals (latency, traffic, errors, saturation)
  • Cloud Logging with structured JSON logs
  • Cloud Trace for distributed tracing
  • Error Reporting for exception tracking
  • Uptime Checks for availability monitoring
  • Alerting Policies with notification channels
  • SLOs defined in Cloud Monitoring
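
An alerting policy with a notification channel could be sketched as follows, here covering the "errors" golden signal for Cloud Run. The threshold, duration, and `alert_email` variable are hypothetical; the filter assumes the standard `run.googleapis.com/request_count` metric, and the threshold is evaluated per time series.

```hcl
variable "project_id" { type = string }
variable "alert_email" { type = string }

resource "google_monitoring_notification_channel" "email" {
  project      = var.project_id
  display_name = "On-call email"
  type         = "email"

  labels = {
    email_address = var.alert_email
  }
}

resource "google_monitoring_alert_policy" "run_5xx" {
  project      = var.project_id
  display_name = "Cloud Run 5xx rate"
  combiner     = "OR"

  conditions {
    display_name = "5xx responses above 1 req/s for 5 minutes"

    condition_threshold {
      filter          = "resource.type = \"cloud_run_revision\" AND metric.type = \"run.googleapis.com/request_count\" AND metric.labels.response_code_class = \"5xx\""
      comparison      = "COMPARISON_GT"
      threshold_value = 1
      duration        = "300s"

      # Convert the request counter into a per-second rate
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]
}
```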

Reliability

  • Multi-zone deployments minimum
  • Multi-region for critical services
  • Automated backups with tested restore procedures
  • Chaos engineering practices
  • Runbooks for common incidents
  • Post-incident reviews
  • Load testing before launches
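
For the multi-zone and backup points, a Cloud SQL instance is a convenient example. The sketch below assumes PostgreSQL 15 and a small custom machine tier, both hypothetical; it does not replace restore drills, which still have to be exercised separately.

```hcl
variable "project_id" { type = string }
variable "region" { type = string }

resource "google_sql_database_instance" "main" {
  project          = var.project_id
  name             = "main-postgres"
  region           = var.region
  database_version = "POSTGRES_15"

  # Guard against an accidental terraform destroy
  deletion_protection = true

  settings {
    tier              = "db-custom-2-7680" # 2 vCPU, 7.5 GB; assumed size
    availability_type = "REGIONAL"         # standby in a second zone

    backup_configuration {
      enabled                        = true
      start_time                     = "03:00" # UTC
      point_in_time_recovery_enabled = true
    }
  }
}
```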

Architecture Review Format

```
## CRITICAL - Must fix before production
[Security gaps, single points of failure, data loss risks]

## HIGH - Address soon
[Cost inefficiencies, missing monitoring, scaling concerns]

## MEDIUM - Improve
[Architecture improvements, automation gaps]

## RECOMMENDATIONS
[Best practices, future-proofing, optimization opportunities]

## COST ANALYSIS
[Current spend, optimization opportunities, projected savings]
```

For detailed references, see references/services.md.