AgentSkillsCN

infrastructure-strategy

infrastructure-strategy

SKILL.md

Infrastructure Strategy for Engineering Leaders

For VPs, Directors, and Senior Managers setting multi-year infrastructure direction.

Infrastructure strategy is about making big bets that enable your business for years to come - cloud platform choices, build vs buy decisions, technology investments, and multi-year roadmaps.


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🎯 SKILL ACTIVATED: infrastructure-strategy ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When to Use This Skill

You need help with:

  • Cloud strategy (AWS vs Azure vs GCP, multi-cloud vs single-cloud)
  • Build vs buy decisions for infrastructure components
  • Platform investment ROI calculations
  • Multi-year infrastructure roadmapping
  • Technology evaluation and selection (technology radar)
  • Migration planning at executive level
  • Balancing innovation vs stability
  • Infrastructure budget prioritization

This skill does NOT cover:

  • Day-to-day technical decisions (see technical-leadership)
  • Hands-on implementation (see technical skills)
  • Operational management (see engineering-operations-management)

1. Cloud Strategy

Single Cloud vs Multi-Cloud

Single Cloud (Recommended for most)

code
Pros:
✅ Deep integration with platform services
✅ Team becomes expert in one platform
✅ Simpler operations and tooling
✅ Lower cost (volume discounts, reserved instances)
✅ Faster development (use platform services)

Cons:
❌ Vendor lock-in risk
❌ Less negotiating leverage
❌ Subject to platform outages
❌ Limited to platform capabilities

Best for:
- Startups and scale-ups
- Teams < 100 engineers
- Standard workloads
- Cost-sensitive orgs

Multi-Cloud (For specific use cases)

code
Pros:
✅ Avoid vendor lock-in
✅ Negotiating leverage
✅ Use best-of-breed services
✅ Geographic coverage (e.g., China requires local cloud)

Cons:
❌ Operational complexity (2-3x overhead)
❌ Team knowledge fragmentation
❌ Higher costs (no volume discounts)
❌ Integration challenges
❌ Security complexity

Best for:
- Large enterprises (500+ engineers)
- Regulatory requirements (data sovereignty)
- M&A integration (acquired companies on different clouds)
- Specific workload requirements

Decision Framework:

  1. Start with single cloud unless you have specific reason for multi-cloud
  2. Choose cloud based on:
    • Existing team skills
    • Services needed (ML, analytics, compute)
    • Geographic presence
    • Pricing for your workload
  3. Design for portability (Kubernetes, IaC) but don't pay multi-cloud tax

Which Cloud Provider?

AWSAzureGCPOracle Cloud (OCI)
Strengths: Largest ecosystem, most services, mature, global coverageStrengths: Enterprise sales, Microsoft integration, hybrid cloud (Arc)Strengths: Data/ML services, Kubernetes, developer experience, pricingStrengths: Oracle DB, enterprise support, government clouds
Weaknesses: Complexity, older UI, pricing opacityWeaknesses: Service quality inconsistency, documentation gapsWeaknesses: Smaller ecosystem, fewer enterprise featuresWeaknesses: Smaller ecosystem, fewer services
Best for: Startups, tech companies, most use casesBest for: Enterprises with Microsoft stack, hybrid cloudBest for: Data-heavy workloads, ML/AI, Kubernetes-firstBest for: Oracle workloads, government, highly regulated

Choosing strategy:

  • Startup/scale-up: AWS (ecosystem) or GCP (developer experience)
  • Enterprise: Azure (if Microsoft shop) or AWS (if tech-forward)
  • Regulated/government: AWS GovCloud, Azure Government, or OCI
  • Oracle DB heavy: OCI (database licensing savings)

Cloud Strategy Scenarios

Scenario: "Should we go all-in on AWS or stay flexible?"

  • All-in (Recommended): Use AWS-specific services (Lambda, DynamoDB, etc.) for faster development
  • Flexible: Use portable tech (Kubernetes, Postgres) but sacrifice AWS integration benefits
  • Reality: Portability is expensive. Most companies that plan for multi-cloud never actually migrate.
  • Decision: Go all-in unless you have specific multi-cloud requirement

Scenario: "Is multi-cloud worth the complexity?"

  • Answer: Usually NO. Multi-cloud costs 2-3x in operational overhead
  • Only do multi-cloud if:
    • Large enterprise (500+ engineers) with resources
    • Regulatory requirement (data must stay in specific regions/clouds)
    • M&A (acquired company on different cloud, temporary state)
  • Alternative: Design for cloud portability (Kubernetes, Terraform) but run on single cloud

Scenario: "Do we need disaster recovery in another cloud?"

  • Question: "What's the failure mode? Entire AWS region or all of AWS?"
  • Reality: Multi-region in same cloud is simpler and handles 99.9% of DR scenarios
  • Multi-cloud DR: Only for catastrophic cloud-wide failures (extremely rare)
  • Decision: Multi-region DR first, multi-cloud DR only if mandated by compliance

Scenario: "Serverless vs container strategy?"

  • Serverless (Lambda/Cloud Functions):
    • Best for: Event-driven, variable load, stateless functions
    • Not for: Long-running, stateful, complex orchestration
  • Containers (ECS/EKS/Cloud Run):
    • Best for: Always-on services, stateful apps, complex dependencies
    • Not for: Simple event handlers, variable load (without autoscaling)
  • Decision: Use both - serverless for events, containers for services

Scenario: "Moving from on-prem to cloud?"

  • Timeline: 12-36 months depending on complexity
  • Strategy:
    • Phase 1: Lift-and-shift (VMs) to derisk
    • Phase 2: Re-platform (containerize, use managed services)
    • Phase 3: Re-architect (cloud-native, serverless)
  • Don't: Big-bang migration. Do: Incremental, service by service

Scenario: "Cost difference between clouds?"

  • Reality: Pricing is similar for compute/storage (within 10-20%)
  • True cost differences:
    • Data egress (can be 3-5x different)
    • Managed services (varies widely)
    • Enterprise support (20% of spend)
    • Reserved instance discounts (negotiate these!)
  • Decision: Choose based on services/expertise, not just pricing

Scenario: "Should we use GCP for ML workloads and AWS for everything else?"

  • Sounds smart, but: Operational complexity of managing two clouds
  • Better: Use AWS SageMaker or GCP Vertex AI - both are excellent
  • Only split if: ML team is separate and has strong GCP preference
  • Reality: Integration complexity usually outweighs best-of-breed benefits

Scenario: "GovCloud requirement - what changes?"

  • Limited services: Not all AWS services available in GovCloud
  • Higher cost: Separate infrastructure, lower economies of scale
  • Compliance burden: STIG hardening, continuous monitoring, audit paperwork
  • Staffing: Need cleared personnel for some operations
  • Timeline: Add 3-6 months to normal cloud migration

Scenario: "Cloud-native vs cloud-agnostic?"

  • Cloud-native: Use cloud-specific services (managed databases, serverless)
    • Faster development, lower operational burden
    • Trade-off: Harder to migrate clouds
  • Cloud-agnostic: Use portable tech (Kubernetes, open source)
    • Flexibility to move clouds
    • Trade-off: More operational burden, slower development
  • Recommendation: Be pragmatic - use cloud services but document dependencies

Government and Cleared Clouds

For regulated industries:

  • FedRAMP: AWS GovCloud, Azure Government, GCP for Government, OCI Government
  • IL4/IL5: AWS Secret Region, Azure Government Secret, GCP Assured Workloads
  • IL6 (Top Secret): AWS Top Secret Region, Azure Government Top Secret

Considerations:

  • Limited service availability in government clouds
  • Higher costs (separate infrastructure)
  • Longer procurement cycles
  • Compliance overhead (STIG, NIST 800-53)

2. Build vs Buy Decisions

Framework for Deciding

code
BUILD when:
✅ Core differentiator for your business
✅ Existing solutions don't meet needs
✅ You have unique requirements
✅ Team has expertise and capacity
✅ Long-term cost justifies initial investment

BUY when:
✅ Not a differentiator (infrastructure, auth, payments)
✅ Commodity problem with good solutions
✅ Time to market is critical
✅ Team lacks expertise
✅ Ongoing maintenance would be burden

Decision Matrix

ComponentBuildBuyRationale
Authentication✅ Buy (Auth0, Okta)Commodity, security-critical, complex
CI/CD✅ Buy (GitHub Actions, CircleCI)Mature market, not differentiator
Observability✅ Buy (Datadog, New Relic)Complex to build, mature vendors
Internal Developer PlatformCore to productivity, unique needs
ML Platform❌ If ML is core businessDifferentiator, specific workflows
API GatewayMaybeMaybeDepends on customization needs

Total Cost of Ownership (TCO)

Build TCO:

code
Initial Development:
├── Engineering time (months × $150K/year avg)
├── Opportunity cost (what else could they build?)
└── Infrastructure costs

Ongoing:
├── Maintenance (20-30% of dev cost annually)
├── Operations (monitoring, on-call)
├── Updates and security patches
├── Documentation and training
└── Infrastructure costs

3-Year TCO = Initial + (3 × Annual Ongoing)

Buy TCO:

code
Year 1:
├── Vendor cost (licenses/seats)
├── Implementation/integration (1-3 months engineer time)
├── Training
└── Infrastructure (if self-hosted)

Years 2-3:
├── Annual license growth (plan for 20-30% growth)
├── Support/premium features
├── Minimal maintenance
└── Infrastructure

3-Year TCO = Y1 + Y2 + Y3

Example: Auth System

code
BUILD:
├── 6 months × 2 engineers = $150K initial
├── Ongoing: $60K/year maintenance
└── 3-year TCO: $150K + $180K = $330K

BUY (Auth0):
├── $2/MAU × 100K users = $200K/year
├── Integration: $30K one-time
└── 3-year TCO: $30K + $600K = $630K

Conclusion: Build seems cheaper BUT:
- Auth0 includes: MFA, SSO, compliance, security updates
- Building all that: 12+ months, $300K+
- Hidden costs: security incidents, compliance audits
- Decision: BUY unless auth is your core business

Build vs Buy Checklist

code
□ Is this a core differentiator for our business?
□ Do existing solutions meet 80%+ of our needs?
□ Do we have team expertise to build and maintain?
□ Have we calculated full 3-year TCO for both options?
□ Can we afford the opportunity cost of building?
□ Is vendor lock-in acceptable? (most cases: yes)
□ What's the risk if we choose wrong? Can we switch later?
□ Does "buy" option have enterprise SLA and support?

Build vs Buy Scenarios

Scenario: "Should we build an internal platform like Heroku?"

  • Build cost: 8-12 engineers × 12 months = $2M+ initial, $1.5M/year ongoing
  • Buy alternative: Heroku, Cloud Run, App Runner - $50-200K/year
  • Build if: 150+ engineers, unique workflows, platform is differentiator
  • Buy if: < 100 engineers, standard app deployment, want speed
  • Hidden costs of building: In-house support, documentation, feature requests, security updates

Scenario: "Payment processing - build or use Stripe?"

  • Build: PCI compliance alone costs $500K+/year
  • Stripe: 2.9% + $0.30 per transaction
  • Break-even: Only makes sense at $100M+ annual GMV with specialized needs
  • Decision: Almost always buy. Payments are not your core business.

Scenario: "APM - commercial (DataDog/New Relic) vs open source (Prometheus/Grafana)?"

  • Commercial ($200-500K/year):
    • Full-featured, hosted, 24/7 support
    • Fast time to value (days)
    • Best for teams < 50 engineers
  • Open Source ($100-200K/year in engineering time):
    • Self-hosted, requires dedicated team
    • Slower time to value (months)
    • Best for teams > 100 engineers with SRE expertise
  • Decision: Buy commercial until you have SRE team to run OSS

Scenario: "Service mesh - build custom vs buy Istio/Linkerd vs buy Consul?"

  • Build custom: 6-12 months, ongoing maintenance nightmare
  • Open source (Istio/Linkerd): Complex to operate, requires expertise
  • Commercial (Consul Enterprise, Gloo): Easier, supported, expensive
  • Reality: Most companies don't need service mesh. Use it if:
    • 50+ microservices
    • Need mTLS everywhere
    • Complex traffic routing requirements
  • Decision: Buy managed service mesh or don't use one

Scenario: "Managed Kubernetes (EKS/GKE) vs self-hosted?"

  • Managed ($150/cluster/month):
    • Control plane managed, auto-updates, integrated
    • Still need to manage worker nodes
  • Self-hosted (save $150/month, cost $10K/month in engineering time):
    • Full control, complex setup, manual updates
  • Decision: Always use managed unless you have 10+ dedicated Kubernetes experts

Scenario: "Observability - should we buy DataDog or build our own?"

  • Build cost: $500K-1M first year, $300K/year ongoing
  • DataDog: $100-300K/year depending on scale
  • Build if: > 500 engineers, unique observability needs, cost > $1M/year
  • Buy if: < 500 engineers, standard needs, want to focus on product
  • Hidden build costs: Integration with all services, alerting, dashboards, on-call for observability platform

Scenario: "Should finance approve this observability tooling?"

  • Cost: $200K/year for observability seems expensive
  • Value: Reduce MTTR from 2 hours to 15 minutes
    • 100 incidents/year × 1.75 hours saved × 3 engineers × $100/hour = $52.5K/year
    • Prevented outages: 10/year × $50K revenue impact = $500K/year saved
  • ROI: $752K value for $200K cost = 276% ROI
  • Decision: Approve - observability prevents costly outages

Scenario: "Terraform Cloud vs self-hosted Terraform?"

  • Terraform Cloud: $20/user/month = $24K/year for 100 engineers
  • Self-hosted: Free but requires CI/CD integration, state management, RBAC
    • Engineering cost: $50K/year
  • Decision: Use Terraform Cloud unless you already have robust CI/CD for state management

3. Platform Investment ROI

Calculating Platform ROI

Formula:

code
ROI = (Productivity Gains - Platform Cost) / Platform Cost × 100%

Productivity Gains = (Time Saved × Engineer Count × Avg Salary)
Platform Cost = (Team Cost + Infrastructure Cost)

Example: Internal Developer Platform

code
Investment:
├── Platform team: 8 engineers × $200K = $1.6M/year
├── Infrastructure: $400K/year
└── Total Cost: $2M/year

Productivity Gains:
├── Faster deployments: 2 hours/week saved × 50 engineers
├── Reduced incidents: 50% reduction = 10 hours/week saved
├── Faster onboarding: 2 weeks → 1 week for 20 new hires/year
├──Total time saved: ~5,000 hours/year
├── Value: 5,000 hours × $100/hour = $500K/year

Wait, that's negative ROI!

But indirect benefits:
├── Faster time to market: 2 week reduction × 12 features = 24 weeks
├── Value of shipping faster: $5M revenue brought forward
├── Reduced risk: Fewer outages = better customer retention
├── Improved hiring: Better developer experience attracts talent

True ROI: Hard to quantify, but likely 3-5x over 3 years

When to Invest in Platform

Invest when:

  • Team size > 30-50 engineers
  • Development velocity slowing down
  • High cognitive load on engineers
  • Inconsistent practices across teams
  • Frequent production incidents
  • Hard to hire/onboard engineers

Don't invest when:

  • Team < 30 engineers (not enough leverage)
  • Business model unproven (premature scaling)
  • Existential priorities (fundraising, shipping core product)

ROI Calculation Scenarios

Scenario: "How do we calculate platform team ROI?"

  • Direct metrics:
    • Deployment frequency: 1/week → 10/day
    • Lead time: 2 weeks → 2 days
    • MTTR: 4 hours → 30 minutes
    • Onboarding time: 4 weeks → 1 week
  • Value calculation:
    • 50 engineers × 5 hours/week saved = 250 hours/week
    • 250 hours × 50 weeks × $100/hour = $1.25M/year
  • Platform cost: 8 engineers × $200K = $1.6M
  • ROI: Breakeven year 1, positive thereafter
  • Intangibles: Better hiring, less burnout, faster innovation

Scenario: "Justifying Kubernetes migration"

  • Cost of migration: 6 months × 4 engineers = $400K
  • Benefits:
    • Better resource utilization: Save 30% on infrastructure = $150K/year
    • Faster deployments: 2 hours → 10 minutes = 100 hours/week saved = $250K/year
    • Multi-cloud optionality (intangible)
  • Payback period: 12-18 months
  • Decision: Worth it if infrastructure cost > $500K/year or scaling quickly

Scenario: "Platform team value - what should we measure?"

  • Avoid vanity metrics:
    • ❌ Number of deployments (more isn't always better)
    • ❌ Lines of code (meaningless)
    • ❌ Tickets closed (focuses on wrong thing)
  • Focus on impact metrics:
    • ✅ Developer survey scores (NPS for platform)
    • ✅ Time to first deployment (new engineer)
    • ✅ DORA metrics (deployment frequency, lead time, MTTR, change failure rate)
    • ✅ Time saved per engineer per week
    • ✅ Incident reduction (fewer production issues)

Scenario: "Infrastructure cost per developer?"

  • Calculate: Total infrastructure cost / number of engineers
  • Benchmarks:
    • Early stage: $2-5K per engineer/month
    • Scale-up: $5-10K per engineer/month
    • Enterprise: $10-20K per engineer/month
  • High cost reasons: Data-intensive, ML workloads, inefficient usage, overprovisioning
  • Optimization: Right-size instances, use spot/reserved, implement autoscaling

Scenario: "How do we measure developer velocity improvement?"

  • Lead Time for Changes:
    • Before: 2 weeks from commit to production
    • After platform investment: 2 days
    • Improvement: 10x faster
  • Developer satisfaction:
    • Survey: "How easy is it to deploy a new service?" 1-10
    • Target: Improve from 4 → 8
  • Time to productivity:
    • New engineer: Productive in 1 week vs 4 weeks
    • Value: 3 weeks × 20 new hires/year = 60 weeks saved

Scenario: "Service mesh cost-benefit analysis"

  • Cost:
    • 2 engineers × 6 months setup = $200K
    • Ongoing: 1 engineer × $200K/year
    • Overhead: 10% latency increase, 20% infrastructure increase = $100K/year
    • Total: $200K + $300K/year
  • Benefit:
    • mTLS everywhere (security win)
    • Traffic management (canary deploys)
    • Observability (better debugging)
    • Value: Hard to quantify - mainly security/compliance
  • Decision: Only do it if:
    • Security/compliance requirement
    • 50+ microservices
    • Sophisticated traffic management needs

Scenario: "Platform break-even point"

  • Question: "When does investing in platform pay off?"
  • Formula: Break-even when (Time Saved Value) > (Platform Cost)
  • Example:
    • Platform team cost: $2M/year (10 engineers)
    • Time saved: 100 engineers × 10 hours/week × $100/hour = $5M/year
    • Break-even: Immediate (2.5x return)
  • Reality: Benefits compound - velocity improvements enable more velocity

Scenario: "Opportunity cost of platform investment"

  • Question: "What else could these 8 engineers build instead of platform?"
  • Option A: Platform team → enables 100 engineers to be 20% more productive = 20 FTE equivalent
  • Option B: Product team → ship 2-3 more features/year
  • Trade-off: Short-term features vs long-term productivity
  • Decision: At 50+ engineers, platform investment usually wins

Investment Priorities by Stage

Startup (0-30 engineers):

code
Priority 1: Ship product, find product-market fit
Infrastructure: Use managed services, don't build platform
Investment: Observability, CI/CD (buy, don't build)

Scale-up (30-150 engineers):

code
Priority: Scale engineering productivity
Infrastructure: Start investing in platform
Investment:
├── Developer experience (CI/CD optimization, faster builds)
├── Observability (centralized logs, metrics, traces)
├── Self-service infrastructure (IaC templates, K8s)
└── SRE function (reliability, on-call)

Enterprise (150+ engineers):

code
Priority: Maintain velocity as org scales
Infrastructure: Platform as product
Investment:
├── Internal developer platform (self-service everything)
├── Platform teams (dedicated orgs)
├── SRE org (production excellence)
├── Security org (AppSec, compliance)
└── Data platform (analytics, ML)

4. Multi-Year Roadmapping

Infrastructure Roadmap Framework

Year 1: Foundation

code
Q1-Q2: Stabilize
├── Production reliability (reduce incidents)
├── Observability (visibility into systems)
├── CI/CD basics (automated deployments)
└── Security fundamentals (secrets management, scanning)

Q3-Q4: Optimize
├── Developer experience improvements
├── Performance optimization
├── Cost optimization
└── Team hiring and growth

Year 2: Scale

code
Q1-Q2: Platform Investment
├── Internal developer platform (IDP) foundation
├── Self-service infrastructure
├── Advanced observability (tracing, SLOs)
└── Expand platform team

Q3-Q4: Productivity
├── Faster deployments (reduce cycle time)
├── Better testing (reduce bugs)
├── Documentation and enablement
└── Platform adoption

Year 3: Excellence

code
Q1-Q2: Maturity
├── Platform as product mindset
├── Multi-region/global infrastructure
├── Advanced security and compliance
└── Disaster recovery and business continuity

Q3-Q4: Innovation
├── Emerging technologies (ML, edge computing)
├── Next-generation architecture
├── Strategic bets
└── Continuous improvement

Balancing Roadmap

The 70-20-10 Rule:

  • 70% Core Business: Keep the lights on, support product roadmap
  • 20% Platform Investment: Developer experience, reliability, security
  • 10% Innovation: Experiments, R&D, emerging tech

Adjust by maturity:

  • Early stage: 85% core, 10% platform, 5% innovation
  • Growth stage: 70% core, 20% platform, 10% innovation
  • Mature: 60% core, 25% platform, 15% innovation

Roadmap Communication

Quarterly Infrastructure Review (with leadership):

code
1. Last Quarter Recap (15 min)
   ├── What we shipped
   ├── Impact and metrics
   └── What we learned

2. This Quarter Plan (20 min)
   ├── Top 3-5 priorities
   ├── Resource allocation
   ├── Dependencies and risks
   └── Success criteria

3. Long-term Strategy (15 min)
   ├── Year-ahead preview
   ├── Strategic bets
   └── Investment needs

4. Q&A (10 min)

5. Technology Radar

What is a Technology Radar?

A framework for tracking and evaluating technologies.

Four Rings:

  1. Adopt: Proven, ready for production, recommended
  2. Trial: Worth exploring, pilot projects
  3. Assess: Interesting, but not ready yet
  4. Hold: Avoid for now, or phase out

Four Quadrants:

  1. Techniques: Development practices, architectures
  2. Tools: Software, frameworks, products
  3. Platforms: Infrastructure, cloud services
  4. Languages & Frameworks: Programming languages, libraries

Example Technology Radar (Infrastructure)

ADOPT (Use in production):

code
├── Kubernetes (Container orchestration)
├── Terraform (Infrastructure as Code)
├── GitHub Actions (CI/CD)
├── Datadog (Observability)
├── PostgreSQL (Relational database)
└── AWS (Cloud platform)

TRIAL (Pilot projects):

code
├── ArgoCD (GitOps)
├── Pulumi (IaC alternative to Terraform)
├── Temporal (Workflow orchestration)
├── ClickHouse (Analytics database)
└── OpenTelemetry (Observability standard)

ASSESS (Evaluate):

code
├── WebAssembly (Edge computing)
├── Serverless containers (AWS Fargate, Cloud Run)
├── Service mesh (Istio, Linkerd)
└── eBPF (Observability and security)

HOLD (Avoid or deprecate):

code
├── Monolithic architectures (favor microservices)
├── Manual deployments (automate everything)
├── Homegrown auth (use Auth0/Okta)
└── [Legacy tool you're migrating from]

Technology Evaluation Process

Before adopting new technology:

code
1. Problem Validation
   └── What problem does this solve?
   └── Do we actually have this problem?
   └── How are we solving it today?

2. Technology Research
   └── Maturity: Production-ready? Stable?
   └── Community: Active? Well-supported?
   └── Ecosystem: Good documentation? Libraries? Integrations?

3. Proof of Concept
   └── Build small prototype (2-4 weeks max)
   └── Test with real use case
   └── Assess developer experience

4. Team Assessment
   └── Do we have skills? Can we learn?
   └── Can we operate and maintain this?
   └── What's the training investment?

5. Decision
   └── Adopt: Roll out to production
   └── Trial: More POCs, pilot projects
   └── Assess: Keep watching, not ready
   └── Hold: Not right for us, pass

6. Review Annually
   └── Revisit decisions yearly
   └── Move technologies between rings
   └── Deprecate old choices

6. Migration Planning (Executive Level)

Types of Migrations

1. Cloud Migration (On-prem → Cloud)

code
Approaches:
├── Lift-and-shift (Rehost): Fast, minimal changes, technical debt
├── Replatform: Optimize for cloud (managed services, containers)
├── Refactor: Rewrite for cloud-native (microservices, serverless)
└── Recommended: Hybrid (replatform most, refactor critical)

Timeline: 12-36 months depending on scope
Investment: 20-40% of engineering capacity
Risk: Medium-High

2. Multi-Cloud (Single cloud → Multi-cloud)

code
Why:
├── Vendor negotiation leverage
├── Regulatory requirements (data sovereignty)
├── M&A integration
└── Avoid vendor lock-in

Cost: 2-3x operational overhead
Timeline: 18-36 months
Recommendation: Only if compelling business reason

3. Modernization (Monolith → Microservices)

code
Approach:
├── Strangler fig pattern (gradually extract services)
├── Don't rewrite everything at once
└── Extract highest-value services first

Timeline: 24-48 months
Investment: 30-50% of engineering capacity
Risk: High (many fail, scope creep)

Migration Planning Framework

Phase 1: Assessment (2-3 months)

code
├── Current state analysis
│   ├── Inventory of systems
│   ├── Dependencies mapped
│   └── Technical debt identified
├── Target state definition
│   ├── Architecture vision
│   ├── Technology choices
│   └── Success criteria
└── Migration strategy
    ├── Wave planning (which systems, what order)
    ├── Risk assessment
    └── Resource planning

Phase 2: Pilot (3-6 months)

code
├── Choose 1-2 non-critical systems
├── Migrate end-to-end
├── Learn and refine process
├── Build runbooks and automation
└── Validate costs and effort estimates

Phase 3: Execution (12-24 months)

code
├── Migrate in waves (monthly or quarterly)
│   ├── Wave 1: Easy wins (stateless apps)
│   ├── Wave 2: Medium complexity
│   └── Wave 3: Complex/critical systems
├── Decommission old systems
└── Continuous optimization

Phase 4: Optimization (Ongoing)

code
├── Cost optimization
├── Performance tuning
├── Security hardening
└── Team training

Migration Risks and Mitigations

RiskImpactMitigation
Cost overrunsBudget exceeded 2-3xDetailed estimation, quarterly reviews, kill switch
Timeline delaysMigration takes 2x longerConservative estimates, buffer time, phased approach
Data lossCritical data corrupted/lostBackups, dual-write, rollback plan
Performance issuesSystem slower after migrationLoad testing, gradual rollout, performance baseline
Team burnoutEngineers exhaustedLimit migration to 30-40% capacity, rotations
Vendor lock-inStuck with new vendorDesign for portability (Kubernetes, IaC)

7. Balancing Innovation vs Stability

The Innovation Spectrum

code
Bleeding Edge → Leading Edge → Mainstream → Legacy
     ↑              ↑              ↑            ↑
  High Risk    Medium Risk    Low Risk     High Risk
  High Reward  Medium Reward  Low Reward   Technical Debt

Where to be:

  • Core infrastructure: Mainstream (proven, stable)
  • Product features: Leading edge (competitive advantage)
  • Experiments: Bleeding edge (limited blast radius)
  • Legacy: Migrate to mainstream

Innovation Budget

Allocate engineering time:

code
├── 70% Mainstream: Proven technologies, low risk
├── 20% Leading Edge: 1-2 year old, early adopters
└── 10% Bleeding Edge: New, experimental, R&D

Example:

  • Mainstream: Kubernetes, Postgres, AWS
  • Leading Edge: ArgoCD (GitOps), OpenTelemetry
  • Bleeding Edge: WebAssembly at edge, new ML frameworks

Decision Framework: When to Adopt New Technology?

Adopt if:

  • ✅ Solves real problem we have today
  • ✅ Mature enough (1-2 years in production elsewhere)
  • ✅ Active community and support
  • ✅ Team excited and willing to learn
  • ✅ Can pilot with low risk

Wait if:

  • ❌ No clear problem it solves
  • ❌ Too new (< 1 year, frequent breaking changes)
  • ❌ Small community, unclear future
  • ❌ Team lacks bandwidth to learn
  • ❌ Can't fail safely

Key Takeaways for Leaders

  1. Cloud strategy: Single cloud for most, multi-cloud only if required
  2. Build vs buy: Buy unless it's your core differentiator
  3. Platform ROI: Invest when team > 30-50 engineers
  4. Roadmap balance: 70% core, 20% platform, 10% innovation
  5. Technology radar: Be deliberate about tech adoption
  6. Migration planning: 12-36 months, 20-40% capacity
  7. Innovation budget: 70% mainstream, 20% leading edge, 10% experimental
  8. Make reversible decisions: Avoid vendor lock-in where possible
  9. Measure everything: Track productivity, costs, reliability
  10. Think in years: Infrastructure strategy is long-term game

Remember: Infrastructure strategy is about enabling your business to move faster, scale efficiently, and compete effectively - not about using the coolest technology.


Templates

Technology Decision Template

markdown
# Technology Decision: [Technology Name]

## Problem
[What problem are we solving?]

## Proposed Solution
[Technology/approach we're evaluating]

## Alternatives Considered
1. [Alternative 1]
2. [Alternative 2]
3. Status quo

## Evaluation
| Criteria | Weight | Score (1-5) | Notes |
|----------|--------|-------------|-------|
| Solves problem | High | | |
| Maturity | High | | |
| Team skills | Medium | | |
| Cost | Medium | | |
| Vendor support | Low | | |

## Decision
[Adopt | Trial | Assess | Hold]

## Next Steps
- [ ] Prototype (if Trial)
- [ ] Training plan
- [ ] Migration plan
- [ ] Success metrics

## Review Date
[When we'll revisit this decision]

Integration with Other Skills

This skill works with:

  • technical-leadership - Evaluating technical proposals, architecture reviews
  • engineering-management - Resource planning, team organization
  • budget-and-cost-management - Infrastructure budgets, cost optimization
  • engineering-operations-management - SRE strategy, reliability

Your infrastructure strategy should enable your business strategy, not constrain it.