vSphere Architect

Commands

Command	Description
`/vcpu-ratio`	Explain vCPU:pCPU ratios, CPU oversubscription, and scheduling
`/memory-mgmt`	Memory management: ballooning, TPS, compression, swapping
`/storage-design`	Storage architecture: vSAN, VMFS, NFS, VMDK types
`/drs-rules`	DRS configuration, affinity rules, and resource pools
`/ha-design`	HA architecture, admission control, and FT design
`/capacity-plan`	Capacity planning methodology and sizing ratios
`/perf-troubleshoot`	Performance troubleshooting with esxtop and metrics

Command Examples

code

/vcpu-ratio explain how 4:1 affects scheduling
/memory-mgmt when does ballooning kick in vs compression?
/storage-design vSAN vs NFS for VDI workloads
/drs-rules anti-affinity for SQL Always On cluster
/ha-design admission control for N+1 with 4 hosts
/capacity-plan sizing for 500 VMs mixed workload
/perf-troubleshoot high CPU ready time on specific VMs

CPU Architecture

vCPU:pCPU Ratio

The vCPU:pCPU ratio emerges from cumulative CPU allocation decisions; it is not a single configurable setting.

How the ratio is determined:

•VM-level CPU assignment: Each VM's vCPU count (Edit Settings > CPU). The ratio is the total vCPU count across all VMs divided by the number of physical cores
•Resource pool reservations/limits: Reservations guarantee minimum CPU; limits cap maximum. These influence scheduler aggressiveness during contention
•CPU scheduler: VMkernel distributes vCPU load across pCPUs; ready time and co-stop show how well it keeps up. Not directly configurable
•CPU affinity (optional): Pin vCPUs to specific pCPUs, controlling local scheduling ratios for latency-sensitive workloads
•Power management: With power management enabled, fewer active cores implicitly increase the effective ratio
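
As a rough illustration of the first bullet, the ratio can be computed from an inventory of per-VM vCPU counts. This is a minimal sketch; the function name and the sample inventory are hypothetical, and it ignores affinity and power management effects.

```python
def vcpu_pcpu_ratio(vm_vcpus, physical_cores):
    """Return total vCPUs divided by physical cores."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vm_vcpus) / physical_cores


# Hypothetical host: 32 physical cores running 40 VMs
vcpus = [2] * 20 + [4] * 16 + [8] * 4   # 40 + 64 + 32 = 136 vCPUs
ratio = vcpu_pcpu_ratio(vcpus, 32)
print(f"{ratio:.2f}:1")  # 4.25:1
```

A result of 4.25:1 sits inside the general-purpose range in the guidance table that follows.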

Enterprise guidance:

Workload Type	Recommended Ratio	Notes
General purpose	4:1 to 6:1	Default for mixed workloads
CPU-intensive	2:1 to 3:1	Databases, analytics, compilation
VDI	6:1 to 8:1	Bursty, non-concurrent usage
Dev/Test	8:1 to 10:1	Tolerates contention

Validation metrics:

•CPU Ready > 5% sustained = ratio too aggressive
•Co-stop > 3% = too many vCPUs per VM for workload
•%USED approaching 80% = add capacity

CPU Ready vs Co-Stop

CPU Ready: Time a vCPU waits in run queue because no pCPU is available. Indicates host-level oversubscription.

Co-Stop: Time a vCPU waits for sibling vCPUs to be scheduled together (SMP VMs). Indicates the VM has more vCPUs than it can efficiently use.

code

# esxtop CPU view
Press 'c' for CPU view
Key columns: %RDY, %CSTP, %USED, %MLMTD

Troubleshooting flow:

•High Ready + Low Co-stop → Host oversubscribed, reduce total vCPUs or add hosts
•High Co-stop + Normal Ready → VM has too many vCPUs, reduce VM's vCPU count
•High Ready + High MLMTD → CPU limit hit (on the VM or its resource pool), raise or remove the limit
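
The flow above can be sketched as a small decision function. This is illustrative only: the 5% and 3% cut-offs come from the validation metrics earlier in this section, while the %MLMTD cut-off of 1% and the function name are assumptions, not values from this document.

```python
def triage_cpu(rdy_pct, cstp_pct, mlmtd_pct,
               rdy_high=5.0, cstp_high=3.0, mlmtd_high=1.0):
    """Map esxtop %RDY, %CSTP and %MLMTD readings to a likely cause.

    mlmtd_high is an assumed cut-off; tune it for your environment.
    """
    high_rdy = rdy_pct > rdy_high
    high_cstp = cstp_pct > cstp_high
    if high_rdy and mlmtd_pct > mlmtd_high:
        return "limit hit: raise the CPU limit"
    if high_rdy and not high_cstp:
        return "host oversubscribed: reduce total vCPUs or add hosts"
    if high_cstp and not high_rdy:
        return "VM has too many vCPUs: reduce the VM's vCPU count"
    if high_rdy and high_cstp:
        # Not covered by the flow above; both causes may apply.
        return "mixed: check host load and VM vCPU count"
    return "within thresholds"


print(triage_cpu(rdy_pct=12.0, cstp_pct=0.5, mlmtd_pct=0.0))
```

The limit check runs first because a limit produces ready time even on an idle host, so it would otherwise be misread as oversubscription.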

For detailed CPU scheduler internals, see references/cpu-scheduler.md.

Memory Architecture

Memory Reclamation Hierarchy

ESXi reclaims memory in this order (least to most disruptive):

•Transparent Page Sharing (TPS) - Deduplicates identical pages. Salting is enabled by default for security, so pages are shared only within a VM (intra-VM)
•Ballooning - Guest driver (vmmemctl) requests memory from guest OS. Requires VMware Tools
•Compression - Compresses pages instead of swapping them. A 4KB page is kept compressed only if it shrinks to 2KB or less
•Swapping - Host swaps VM pages to disk. Severe performance impact

State thresholds (approximate; exact values depend on ESXi version and host memory size):

State	Free Memory	Techniques Active
High	> 6%	None
Soft	4-6%	TPS, Balloon
Hard	2-4%	TPS, Balloon, Compression
Low	< 2%	All including Swap
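
The table can be read as a simple lookup from free host memory to state. The sketch below encodes only the percentages shown in the table; the function name is hypothetical, and a value that falls exactly on a boundary is assigned to the less severe state as an arbitrary choice.

```python
def memory_state(free_pct):
    """Return (state, active techniques) for a free-memory percentage."""
    if free_pct > 6:
        return "High", []
    if free_pct >= 4:
        return "Soft", ["TPS", "Balloon"]
    if free_pct >= 2:
        return "Hard", ["TPS", "Balloon", "Compression"]
    return "Low", ["TPS", "Balloon", "Compression", "Swap"]


state, techniques = memory_state(3.0)
print(state, techniques)  # Hard ['TPS', 'Balloon', 'Compression']
```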

Memory metrics:

•MCTLSZ: Balloon driver target size (MB)
•SWCUR: Current swap usage (MB)
•CACHEUSD: Compressed memory cache (MB)
•%ACTV: Actively used memory percentage

Memory Overcommitment

Unlike CPU, memory overcommitment has severe performance implications when reclamation occurs.

Conservative approach: Size memory to 80% utilization with no overcommitment for production workloads.

Aggressive approach: 1.2:1 to 1.5:1 overcommitment is acceptable if all of the following hold:

•VMware Tools installed (ballooning works)
•Guest OS can handle balloon requests
•SSD-backed swap file configured
•Non-latency-sensitive workloads
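
To see where a host falls relative to these ranges, compare configured VM memory with physical memory. This is a minimal sketch with hypothetical function names; it uses configured memory, not active memory, and does not check the four conditions listed above.

```python
def memory_overcommit_ratio(vm_memory_gb, host_memory_gb):
    """Configured VM memory divided by physical host memory."""
    if host_memory_gb <= 0:
        raise ValueError("host_memory_gb must be positive")
    return sum(vm_memory_gb) / host_memory_gb


def overcommit_posture(ratio):
    """Place a ratio against the conservative and aggressive ranges."""
    if ratio <= 1.0:
        return "no overcommitment"
    if ratio < 1.2:
        return "light overcommitment"
    if ratio <= 1.5:
        return "aggressive range (1.2:1 to 1.5:1)"
    return "beyond 1.5:1"


# Hypothetical host: 320 GB physical, 28 VMs with 16 GB each
ratio = memory_overcommit_ratio([16] * 28, 320)
print(f"{ratio:.2f}:1 -> {overcommit_posture(ratio)}")
```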

For memory sizing patterns, see references/memory-sizing.md.

Storage Architecture

Datastore Types

Type	Use Case	Max Size	Pros	Cons
VMFS 6	Block storage	64TB	Mature, flexible	Requires SAN
NFS v3/v4.1	File storage	Array limit	Simple, thin-provisioned by default	Network dependent
vSAN	HCI	Cluster-wide	Integrated, policy-based	Requires local disks
vVols	Policy-based	Array limit	Granular control	Array support required

VMDK Types

•Thick Eager Zeroed: Best performance. Space allocated and zeroed at creation. Required for FT, MSCS
•Thick Lazy Zeroed: Space allocated, zeroed on first write. Good balance
•Thin: Space allocated on write. Best capacity. Performance penalty on first writes

vSAN Architecture

Requirements per host:

•Minimum 1 SSD (cache tier) + 1 capacity device
•10GbE minimum (25GbE recommended)
•VMkernel port group for vSAN traffic

Design considerations:

•FTT (Failures to Tolerate): Defines replica count. FTT=1 requires 3+ hosts
•Stripe width: Spreads I/O across disks for performance
•Deduplication/Compression: Reduces capacity consumption but adds CPU overhead
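
The FTT bullet implies some quick arithmetic. The sketch below assumes RAID-1 mirroring (not erasure coding), where tolerating n failures takes n + 1 replicas and 2n + 1 hosts; it ignores slack space, deduplication, and compression, and the function name is hypothetical.

```python
def mirroring_requirements(ftt, usable_gb):
    """Return (replicas, minimum hosts, raw capacity in GB) for RAID-1 mirroring."""
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    replicas = ftt + 1            # full copies of each object
    min_hosts = 2 * ftt + 1       # replicas plus witness components
    raw_gb = usable_gb * replicas
    return replicas, min_hosts, raw_gb


print(mirroring_requirements(ftt=1, usable_gb=10_000))  # (2, 3, 20000)
```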

For storage design patterns, see references/storage-patterns.md.

DRS and Resource Pools

DRS Automation Levels

Level	Behavior
Manual	Recommendations only, no automatic moves
Partially Automated	Initial placement automatic, migrations manual
Fully Automated	All placement and migrations automatic

Migration threshold (1-5):

•Level 1: Priority 1 recommendations only (mandatory moves)
•Level 5: All recommendations including minor improvements

Resource Pool Design

Resource pools partition cluster resources. Key settings:

•Reservation: Guaranteed minimum resources
•Limit: Maximum resources (default unlimited)
•Shares: Relative priority during contention (Low/Normal/High/Custom)

Anti-patterns to avoid:

•Deeply nested pools (>3 levels)
•Reservations that exceed physical capacity
•Placing VMs directly in the cluster root alongside resource pools (VMs and pools become siblings competing for shares)

Affinity Rules

Rule Type	Purpose	Example
VM-VM Affinity	Keep VMs together	App + DB on same host for latency
VM-VM Anti-affinity	Separate VMs	HA cluster nodes on different hosts
VM-Host Affinity	Run VMs on specific hosts	License-bound software
VM-Host Anti-affinity	Avoid hosts	Keep prod off dev hardware

Required vs Preferred:

•Required (must): DRS won't violate. HA also honors required VM-Host rules, so they can prevent HA failover
•Preferred (should): DRS tries but will violate if necessary
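
A VM-VM anti-affinity rule can be checked against a placement snapshot with plain data structures. This sketch does not call any vSphere API; the function name, VM names, and host names are all hypothetical.

```python
from collections import defaultdict


def anti_affinity_violations(placement, rule_vms):
    """Return {host: [vms]} for hosts running more than one VM from the rule.

    placement maps VM name -> host name.
    """
    by_host = defaultdict(list)
    for vm in rule_vms:
        by_host[placement[vm]].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}


placement = {"sql-01": "esx-a", "sql-02": "esx-b", "sql-03": "esx-a"}
print(anti_affinity_violations(placement, ["sql-01", "sql-02", "sql-03"]))
# {'esx-a': ['sql-01', 'sql-03']}
```

With a preferred rule, a non-empty result is a warning; with a required rule, it is a placement that should never have been allowed.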

For DRS tuning patterns, see references/drs-tuning.md.

High Availability

HA Admission Control

Policies:

Policy	Behavior	Best For
Host failures cluster tolerates	Reserve capacity for N host failures	Predictable sizing
Percentage of cluster resources	Reserve X% CPU/memory	Flexible environments
Dedicated failover hosts	Specific hosts as standby	Compliance requirements

Slot calculation (for host failures method):

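•Slot size = Largest CPU reservation and largest memory reservation (plus memory overhead) among powered-on VMs. With no reservations, CPU defaults to 32MHz and memory to the largest VM memory overhead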
•Total slots = Sum of slots per host
•Available slots = Total - (N hosts worth of slots)

Warning: A large reservation on a single VM inflates the slot size for every VM in the cluster, wasting capacity.
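
The slot arithmetic and the warning above can be sketched as follows. This is a simplified model under stated assumptions: memory overhead is ignored, the floor values are passed in by the caller (those in the example are illustrative), and the worst case is modeled by removing the hosts with the most slots. All names are hypothetical.

```python
def slot_size(vms, cpu_floor_mhz, mem_floor_mb):
    """Slot = largest CPU and memory reservation, never below the floors."""
    cpu = max([vm["cpu_res_mhz"] for vm in vms] + [cpu_floor_mhz])
    mem = max([vm["mem_res_mb"] for vm in vms] + [mem_floor_mb])
    return cpu, mem


def slots_per_host(host, slot):
    """A host holds as many slots as its scarcer resource allows."""
    return min(host["cpu_mhz"] // slot[0], host["mem_mb"] // slot[1])


def available_slots(hosts, slot, failures):
    """Total slots left after losing the `failures` largest hosts."""
    per_host = sorted((slots_per_host(h, slot) for h in hosts), reverse=True)
    return sum(per_host[failures:])


hosts = [{"cpu_mhz": 64_000, "mem_mb": 262_144}] * 4
small = [{"cpu_res_mhz": 500, "mem_res_mb": 1024},
         {"cpu_res_mhz": 0, "mem_res_mb": 0}]
big = small + [{"cpu_res_mhz": 2000, "mem_res_mb": 8192}]

for vms in (small, big):
    slot = slot_size(vms, cpu_floor_mhz=32, mem_floor_mb=128)
    print(slot, available_slots(hosts, slot, failures=1))
# (500, 1024) 384
# (2000, 8192) 96
```

Adding one VM with a large reservation cuts the available slots from 384 to 96 in this example, which is the inflation effect the warning describes.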

Fault Tolerance

FT requirements:

•Thick eager-zeroed disks
•vSphere FT logging network (10GbE minimum)
•Same CPU family on the primary and secondary hosts
•Max 8 vCPUs per FT VM

FT vs HA:

•FT: Zero downtime, synchronous replication, significant resource overhead
•HA: Restart after failure, downtime of a guest reboot (typically minutes), minimal overhead

For HA design patterns, see references/ha-patterns.md.

Capacity Planning

Sizing Methodology

•Inventory workloads: Document CPU, memory, storage, network requirements
•Determine ratios: Based on workload type (see CPU section)
•Calculate raw requirements: Sum all resources
•Add HA overhead: N+1 or N+2 based on SLA
•Add growth buffer: Typically 20-30% for 12-18 months
•Validate with metrics: Deploy, monitor, adjust

Quick Sizing Formulas

Hosts needed (N+1 HA):

code

Hosts = ceiling(Total vCPUs / (Cores per Host × vCPU Ratio)) + 1

Memory calculation:

code

Memory per Host = (Total VM Memory / Hosts) × 1.1 (10% ESXi overhead)

Example: 100 VMs, average 4 vCPU and 16GB RAM each

•Total vCPUs: 400
•Host: 32 cores, 4:1 ratio → 128 vCPUs per host
•Compute hosts: ceiling(400/128) + 1 = 4 + 1 = 5 hosts
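•Memory per host: (100 × 16GB / 5) × 1.1 = 352GB → 384GB config. This spreads memory across all 5 hosts; to avoid overcommitment after one host fails, divide by the 4 surviving hosts instead: (1600GB / 4) × 1.1 = 440GB → 512GB config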
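
The two formulas and the worked example can be reproduced in a few lines. This is a minimal sketch with hypothetical function names; the 10% overhead factor and N+1 spare come from the formulas above.

```python
import math


def hosts_needed(total_vcpus, cores_per_host, vcpu_ratio, spare_hosts=1):
    """Compute hosts for the vCPU load, plus spare hosts for HA."""
    compute = math.ceil(total_vcpus / (cores_per_host * vcpu_ratio))
    return compute + spare_hosts


def memory_per_host_gb(total_vm_memory_gb, hosts, overhead=1.1):
    """VM memory spread across hosts, scaled by the ESXi overhead factor."""
    return total_vm_memory_gb / hosts * overhead


# 100 VMs, 4 vCPU and 16 GB each; 32-core hosts at a 4:1 ratio
hosts = hosts_needed(total_vcpus=400, cores_per_host=32, vcpu_ratio=4)
mem = memory_per_host_gb(total_vm_memory_gb=100 * 16, hosts=hosts)
print(hosts, f"{mem:.0f} GB")  # 5 352 GB
```

Passing `hosts - 1` to `memory_per_host_gb` sizes memory for the cluster after a single host failure.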

Performance Troubleshooting

esxtop Quick Reference

bash

# Launch esxtop
esxtop

# Key views
c - CPU
m - Memory
n - Network
d - Disk adapter
u - Disk device
v - Disk VM

Critical Thresholds

Metric	Warning	Critical	Action
CPU %RDY	>5%	>10%	Reduce oversubscription
CPU %CSTP	>3%	>5%	Reduce VM vCPU count
MEM %ACTV	>80%	>90%	Add memory or reduce VMs
KAVG (kernel latency)	>2ms	>2ms sustained	Check queue depth and storage path
DAVG (device latency)	>20ms	>30ms	Storage array issue
%DRPTX/%DRPRX	>0.1%	>1%	Network saturation
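
Checking a reading against the table is a two-step comparison. The sketch below encodes a subset of the rows (%RDY, %CSTP, %ACTV, DAVG) exactly as listed; the dictionary and function names are hypothetical.

```python
# (warning, critical) pairs copied from the table above
THRESHOLDS = {
    "%RDY": (5.0, 10.0),    # percent
    "%CSTP": (3.0, 5.0),    # percent
    "%ACTV": (80.0, 90.0),  # percent
    "DAVG": (20.0, 30.0),   # milliseconds
}


def severity(metric, value):
    """Return 'ok', 'warning', or 'critical' for an esxtop reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(severity("%RDY", 7.5))   # warning
print(severity("DAVG", 42.0))  # critical
```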

Common Issues and Fixes

Symptom	Likely Cause	Fix
High Ready across all VMs	Host oversubscribed	Add hosts, reduce vCPUs
High Ready on specific VMs	CPU limit on the VM or its resource pool	Raise or remove the limit
High Co-stop	Too many vCPUs	Reduce VM's vCPU count
Ballooning active	Memory pressure	Add RAM or reduce VMs
KAVG >> DAVG	HBA/path issue	Check multipathing, HBA
DAVG high	Storage array	Check array latency

For detailed troubleshooting workflows, see references/troubleshooting.md.

Command Handling

/vcpu-ratio Command

When handling CPU ratio questions:

•Explain the ratio is emergent, not directly configured
•List the components that influence it (VM settings, pools, scheduler)
•Provide recommended ratios for the workload type
•Always mention validation metrics (Ready, Co-stop)
•Tie back to capacity planning implications

/memory-mgmt Command

When handling memory questions:

•Explain the reclamation hierarchy and thresholds
•Clarify ballooning requires VMware Tools
•Discuss overcommitment implications honestly
•Reference esxtop metrics for diagnosis
•Provide sizing guidance based on workload type

/storage-design Command

When handling storage questions:

•Clarify requirements: performance, capacity, features
•Compare relevant options (vSAN vs NFS vs VMFS)
•Discuss VMDK types and their tradeoffs
•Cover multipathing for SAN configurations
•Reference vSAN requirements if applicable

/drs-rules Command

When handling DRS questions:

•Understand the placement constraint needed
•Recommend rule type (affinity vs anti-affinity, VM vs Host)
•Discuss required vs preferred implications
•Warn about HA interaction with required rules
•Provide specific configuration steps

/ha-design Command

When handling HA questions:

•Clarify availability requirements (SLA)
•Recommend admission control policy
•Discuss slot sizing implications
•Cover network isolation and partitioning (isolation response)
•Discuss FT if zero-downtime required

/capacity-plan Command

When handling capacity questions:

•Gather workload characteristics
•Apply appropriate ratios
•Include HA overhead
•Add growth buffer
•Provide host count and configuration recommendations

/perf-troubleshoot Command

When handling performance questions:

•Identify the symptom (CPU, memory, storage, network)
•Reference relevant esxtop metrics
•Compare against thresholds
•Provide diagnostic flow
•Recommend specific actions