OneLake Performance Remediation
Systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. Covers the full stack from capacity-level throttling down to individual Delta table file layout problems.
When to Use This Skill
- OneLake read or write operations are slow or timing out
- Lakehouse or warehouse queries have unexpectedly high latency
- Spark jobs are being throttled with HTTP 430 errors
- Delta tables have accumulated many small files (the small file problem)
- Direct Lake semantic models are falling back to DirectQuery
- Cold cache performance is significantly slower than warm cache
- Cross-region data access is adding network latency
- V-Order is not applied or needs to be enabled/disabled
- Table maintenance (OPTIMIZE, VACUUM) is failing or not improving performance
- Capacity utilization is high and jobs are queuing
Prerequisites
- Microsoft Fabric workspace with Contributor or higher role
- Access to the Monitoring Hub in the Fabric portal
- PowerShell 7+ with Az.Fabric module (for automation scripts)
- Familiarity with Spark SQL or T-SQL for diagnostic queries
Diagnostic Decision Tree
Follow this sequence to isolate the root cause:
1. Is the issue capacity-level? → Check Spark VCore utilization and queue depth
2. Is the issue cold cache? → Check data_scanned_remote_storage_mb
3. Is the issue file layout? → Check small file count and V-Order status
4. Is the issue cross-region? → Verify data and capacity are co-located
5. Is the issue query design? → Check string column widths, partition pruning
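The ordering matters: cheaper, broader checks come first so that a capacity problem is not misread as a file layout problem. A minimal sketch of that sequencing follows. The `first_suspect` helper, the observation keys, and the thresholds are all hypothetical and only illustrate the control flow; they are not part of any Fabric API.

```python
from typing import Callable, Optional, Sequence, Tuple

def first_suspect(checks: Sequence[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Return the first cause whose check reports a problem, in the given order."""
    for cause, is_problem in checks:
        if is_problem():
            return cause
    return None  # nothing flagged: move on to query design review (step 5)

# Hypothetical observations gathered by hand or by the diagnostic scripts.
observations = {
    "queued_jobs": 0,
    "data_scanned_remote_storage_mb": 0,
    "small_file_count": 1800,
    "same_region": True,
}

# Thresholds here are illustrative assumptions, not recommended values.
checks = [
    ("capacity", lambda: observations["queued_jobs"] > 0),
    ("cold cache", lambda: observations["data_scanned_remote_storage_mb"] > 0),
    ("file layout", lambda: observations["small_file_count"] > 100),
    ("cross-region", lambda: not observations["same_region"]),
]

suspect = first_suspect(checks)
print(suspect)  # file layout
```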
Step-by-Step Workflows
Workflow 1: Diagnose Capacity Throttling
When Spark jobs fail with HTTP 430 (TooManyRequestsForCapacity):
- Open the Monitoring Hub in the Fabric portal
- Check active Spark sessions against your SKU's VCore limit (1 CU = 2 Spark VCores)
- Review the queue depth against your SKU's queue limit (see capacity-sku-reference.md)
- Cancel unnecessary jobs or scale up the capacity SKU
- For burst workloads, use the spark-capacity-check.ps1 script to monitor utilization
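The VCore arithmetic in the second step can be sketched as below. Only the 1 CU = 2 Spark VCores ratio comes from this document; the function names are hypothetical, the example capacity of 64 CUs and the 96 active VCores are made-up inputs, and bursting is deliberately ignored.

```python
SPARK_VCORES_PER_CU = 2  # ratio stated in the workflow above

def spark_vcore_limit(capacity_units: int) -> int:
    """Base Spark VCores for a capacity, ignoring any bursting."""
    if capacity_units <= 0:
        raise ValueError("capacity_units must be positive")
    return capacity_units * SPARK_VCORES_PER_CU

def vcore_utilization(active_vcores: int, capacity_units: int) -> float:
    """Fraction of the base VCore limit currently in use."""
    return active_vcores / spark_vcore_limit(capacity_units)

limit = spark_vcore_limit(64)          # a capacity with 64 CUs
usage = vcore_utilization(96, 64)      # 96 VCores held by active sessions
print(limit)                           # 128
print(f"{usage:.0%}")                  # 75%
```

A utilization close to or above 100% of the base limit is the situation in which new jobs would be expected to queue or be rejected with HTTP 430.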
Workflow 2: Resolve Cold Cache Latency
When the first execution of a query is significantly slower than subsequent runs:
- Query the `queryinsights.exec_requests_history` view
- Check the `data_scanned_remote_storage_mb` column; a non-zero value indicates a cold start
- Do NOT judge performance on the first execution; measure subsequent runs
- For pre-warming strategies and diagnostic queries, see cold-cache-diagnostics.md
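Once rows from the history view have been exported, the cold and warm runs can be separated before comparing timings. The sketch below assumes each row is a dictionary keyed by column name. `data_scanned_remote_storage_mb` is the column named above; `total_elapsed_time_ms` is assumed to be the elapsed-time column and should be checked against the view's actual schema. The sample rows are invented for illustration.

```python
from statistics import median

REMOTE_COL = "data_scanned_remote_storage_mb"
ELAPSED_COL = "total_elapsed_time_ms"  # assumed column name; verify against the view

def split_cold_warm(rows):
    """Cold runs read from remote storage; warm runs did not."""
    cold = [r for r in rows if (r.get(REMOTE_COL) or 0) > 0]
    warm = [r for r in rows if (r.get(REMOTE_COL) or 0) == 0]
    return cold, warm

def median_elapsed_ms(rows):
    """Median elapsed time, or None when there are no rows."""
    return median(r[ELAPSED_COL] for r in rows) if rows else None

# Invented sample data: one cold run followed by two warm runs of the same query.
rows = [
    {REMOTE_COL: 512, ELAPSED_COL: 9000},
    {REMOTE_COL: 0, ELAPSED_COL: 1200},
    {REMOTE_COL: 0, ELAPSED_COL: 1000},
]

cold, warm = split_cold_warm(rows)
print(median_elapsed_ms(cold), median_elapsed_ms(warm))
```

Comparing the warm median against the performance target, rather than the cold run, follows the guidance above about not judging performance on the first execution.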
Workflow 3: Fix Small File Problem
When Delta tables have hundreds or thousands of small Parquet files:
- Run the table-health-check.ps1 script to assess file counts and sizes
- Apply OPTIMIZE to consolidate files (target: 128 MB–1 GB per file)
- Apply V-Order for read-optimized workloads
- Schedule recurring maintenance (see table-maintenance-workflow.md)
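The assessment step can be illustrated with a small file-size summary. This is a sketch and not the logic of table-health-check.ps1: the 128 MB lower bound comes from the target range above, while the function names, the table name `sales`, the sample sizes, and the 30% trigger ratio are assumptions chosen for illustration. The generated statement uses the `OPTIMIZE ... VORDER` Spark SQL form.

```python
MB = 1024 * 1024
TARGET_MIN_BYTES = 128 * MB  # lower bound of the target file size range

def summarize_files(sizes_bytes):
    """Count files below the target minimum and compute the average size."""
    total = len(sizes_bytes)
    small = sum(1 for s in sizes_bytes if s < TARGET_MIN_BYTES)
    return {
        "file_count": total,
        "small_file_count": small,
        "small_file_ratio": small / total if total else 0.0,
        "avg_size_mb": sum(sizes_bytes) / total / MB if total else 0.0,
    }

def maintenance_statements(table, report, max_small_ratio=0.3):
    """Suggest a compaction statement when too many files are small.

    max_small_ratio is an arbitrary illustrative threshold.
    """
    if report["small_file_ratio"] <= max_small_ratio:
        return []
    return [f"OPTIMIZE {table} VORDER"]

# Invented example: nine 4 MB files and one 256 MB file.
sizes = [4 * MB] * 9 + [256 * MB]
report = summarize_files(sizes)
statements = maintenance_statements("sales", report)
print(report["small_file_count"], statements)
```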
Workflow 4: Optimize V-Order Configuration
When choosing between read-heavy and write-heavy resource profiles:
- Identify your dominant workload pattern (ingestion vs. analytics)
- New Fabric workspaces default to the `writeHeavy` profile (V-Order disabled)
- For Power BI / interactive queries, switch to `readHeavyForSpark` or `readHeavyForPBI`
- Apply V-Order at the session, table, or OPTIMIZE command level
- See v-order-decision-guide.md for detailed configuration
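The mapping from workload pattern to resource profile can be written down explicitly. In the sketch below the three profile names come from this document. The workload labels are hypothetical, and the Spark configuration key `spark.fabric.resourceProfile` is an assumption that should be confirmed in v-order-decision-guide.md or the Fabric documentation before use.

```python
# Assumed Spark configuration key for resource profiles; verify before relying on it.
PROFILE_CONF_KEY = "spark.fabric.resourceProfile"

# Hypothetical workload labels mapped to the profile names listed above.
PROFILE_BY_WORKLOAD = {
    "ingestion": "writeHeavy",
    "spark_analytics": "readHeavyForSpark",
    "power_bi": "readHeavyForPBI",
}

def profile_conf(workload: str) -> dict:
    """Return the configuration entry for the dominant workload pattern."""
    try:
        profile = PROFILE_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload pattern: {workload!r}") from None
    return {PROFILE_CONF_KEY: profile}

conf = profile_conf("power_bi")
print(conf)
# In a notebook session each pair would typically be applied with
# spark.conf.set(key, value); that call is omitted so the sketch runs anywhere.
```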
Workflow 5: Diagnose Cross-Region Latency
When data in OneLake or external storage is in a different region than Fabric capacity:
- Verify the Fabric capacity region in the Admin portal
- Check shortcut destinations: are they in the same region?
- For ADLS Gen2 or S3 shortcuts, confirm the storage account region
- Keep large fact tables co-located; small dimension tables tolerate cross-region access
- Use the region-latency-test.ps1 script to measure impact
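Measuring impact amounts to timing the same read several times against each location and comparing the results. The sketch below is not the region-latency-test.ps1 script; it is a generic timing harness in which `probe` is a hypothetical callable that performs one read (for example, fetching a small file through a shortcut). A `time.sleep` call stands in for the real read here.

```python
import time
from statistics import median

def measure_latency_ms(probe, runs=5, clock=time.perf_counter):
    """Time `probe` several times and summarize the samples in milliseconds."""
    samples = []
    for _ in range(runs):
        start = clock()
        probe()
        samples.append((clock() - start) * 1000.0)
    return {"median_ms": median(samples), "max_ms": max(samples), "runs": runs}

# Stand-in probe; replace with a real read against each region under test.
result = measure_latency_ms(lambda: time.sleep(0.001), runs=3)
print(result["runs"], round(result["median_ms"], 1))
```

Using the median rather than the mean keeps a single cold first read from dominating the comparison between regions.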
Workflow 6: Direct Lake Fallback Investigation
When Direct Lake models fall back to DirectQuery instead of reading from OneLake:
- Check if the semantic model has been framed (refreshed) recently
- Verify Delta tables are V-Ordered for optimal transcoding
- Check table row counts against the SKU guardrails
- Review column data types; large string columns degrade performance
- See direct-lake-remediate.md
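The row-count check in the third step can be sketched as a simple comparison. The guardrail value is passed in by the caller because it depends on the SKU and is not stated in this document; the table names, row counts, and the limit used below are invented for illustration.

```python
def tables_over_guardrail(row_counts: dict, max_rows_per_table: int) -> list:
    """Return the names of tables whose row count exceeds the guardrail, sorted."""
    return sorted(
        name for name, rows in row_counts.items() if rows > max_rows_per_table
    )

# Invented row counts; look up the real per-table limit for your SKU.
row_counts = {
    "fact_sales": 2_400_000_000,
    "dim_customer": 3_000_000,
    "dim_date": 7_300,
}
hypothetical_limit = 1_500_000_000  # placeholder, not a documented guardrail

offenders = tables_over_guardrail(row_counts, hypothetical_limit)
print(offenders)  # ['fact_sales']
```

Any table returned here is a candidate explanation for fallback and would be the first one to aggregate, partition, or move to a larger SKU.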
Remediation Quick Reference
| Symptom | Likely Cause | First Action |
|---|---|---|
| HTTP 430 errors | Capacity VCores exhausted | Check Monitoring Hub, cancel idle sessions |
| First query very slow | Cold cache / node resume | Check data_scanned_remote_storage_mb |
| All queries slow | Small files / no V-Order | Run table health check script |
| Queries slow after migration | Wrong resource profile | Switch to appropriate read/write profile |
| Shortcuts slow | Cross-region data access | Verify region co-location |
| Direct Lake fallback | Table not framed / too large | Check framing status and SKU guardrails |
| VACUUM fails | Retention period too short | Set retention >= 7 days |
| Streaming ingestion slow | Schema enforcement overhead | Consider Eventhouse with OneLake availability |
References
- Capacity SKU Reference: VCore limits, queue limits, node configurations
- Cold Cache Diagnostics: T-SQL diagnostic queries and pre-warming
- Table Maintenance Workflow: OPTIMIZE, VACUUM, and scheduling
- V-Order Decision Guide: when to enable/disable, resource profiles
- Direct Lake Remediation: fallback investigation, framing, transcoding
Available Scripts
- spark-capacity-check.ps1: monitor Spark VCore utilization and queue depth
- table-health-check.ps1: assess Delta table file counts, sizes, and V-Order status
- region-latency-test.ps1: measure cross-region OneLake access latency
- run-table-maintenance.ps1: execute table maintenance via the Fabric REST API
Templates
- diagnostic-report.md: template for documenting performance investigation findings