AgentSkillsCN

fabric-onelake-perf-remediate

诊断并解决 Microsoft Fabric OneLake 的性能问题,包括查询缓慢、冷缓存延迟、小文件问题、Delta 表碎片化、V-Order 优化、Spark 限流、容量 SKU 规模设定,以及跨区域数据访问等问题。适用于在修复 OneLake 读写性能、湖仓查询缓慢、Direct Lake 回退、表维护失败、Spark 并发限制、仓库冷启动,或优化 Delta Parquet 文件布局时使用。支持 PowerShell、T-SQL 以及 Spark SQL 的诊断工作流。

SKILL.md
--- frontmatter
name: fabric-onelake-perf-remediate
description: Diagnose and resolve Microsoft Fabric OneLake performance issues including slow queries, cold cache latency, small file problems, Delta table fragmentation, V-Order optimization, Spark throttling, capacity SKU sizing, and cross-region data access. Use when remediate OneLake read/write performance, lakehouse query slowness, Direct Lake fallback, table maintenance failures, Spark concurrency limits, warehouse cold starts, or optimizing Delta parquet file layouts. Supports PowerShell, T-SQL, and Spark SQL diagnostic workflows.
license: Complete terms in LICENSE.txt

OneLake Performance remediate

Systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. Covers the full stack from capacity-level throttling down to individual Delta table file layout problems.

When to Use This Skill

  • OneLake read or write operations are slow or timing out
  • Lakehouse or warehouse queries have unexpectedly high latency
  • Spark jobs are being throttled with HTTP 430 errors
  • Delta tables have accumulated many small files (small file problem)
  • Direct Lake semantic models are falling back to DirectQuery
  • Cold cache performance is significantly slower than warm cache
  • Cross-region data access is adding network latency
  • V-Order is not applied or needs to be enabled/disabled
  • Table maintenance (OPTIMIZE, VACUUM) is failing or not improving performance
  • Capacity utilization is high and jobs are queuing

Prerequisites

  • Microsoft Fabric workspace with Contributor or higher role
  • Access to the Monitoring Hub in the Fabric portal
  • PowerShell 7+ with Az.Fabric module (for automation scripts)
  • Familiarity with Spark SQL or T-SQL for diagnostic queries

Diagnostic Decision Tree

Follow this sequence to isolate the root cause:

code
1. Is the issue capacity-level? → Check Spark VCore utilization and queue depth
2. Is the issue cold cache? → Check data_scanned_remote_storage_mb
3. Is the issue file layout? → Check small file count and V-Order status
4. Is the issue cross-region? → Verify data and capacity are co-located
5. Is the issue query design? → Check string column widths, partition pruning

Step-by-Step Workflows

Workflow 1: Diagnose Capacity Throttling

When Spark jobs fail with HTTP 430 (TooManyRequestsForCapacity):

  1. Open the Monitoring Hub in the Fabric portal
  2. Check active Spark sessions against your SKU's VCore limit (1 CU = 2 Spark VCores)
  3. Review the queue depth against your SKU's queue limit (see capacity-sku-reference.md)
  4. Cancel unnecessary jobs or scale up the capacity SKU
  5. For burst workloads, use the spark-capacity-check.ps1 script to monitor utilization

Workflow 2: Resolve Cold Cache Latency

When first query execution is significantly slower than subsequent runs:

  1. Query the queryinsights.exec_requests_history view
  2. Check the data_scanned_remote_storage_mb column — non-zero indicates cold start
  3. Do NOT judge performance on first execution; measure subsequent runs
  4. For pre-warming strategies and diagnostic queries, see cold-cache-diagnostics.md

Workflow 3: Fix Small File Problem

When Delta tables have hundreds or thousands of small Parquet files:

  1. Run the table-health-check.ps1 script to assess file counts and sizes
  2. Apply OPTIMIZE to consolidate files (target: 128 MB–1 GB per file)
  3. Apply V-Order for read-optimized workloads
  4. Schedule recurring maintenance — see table-maintenance-workflow.md

Workflow 4: Optimize V-Order Configuration

When choosing between read-heavy and write-heavy resource profiles:

  1. Identify your dominant workload pattern (ingestion vs. analytics)
  2. New Fabric workspaces default to writeHeavy profile (V-Order disabled)
  3. For Power BI / interactive queries, switch to readHeavyForSpark or readHeavyForPBI
  4. Apply V-Order at session, table, or OPTIMIZE command level
  5. See v-order-decision-guide.md for detailed configuration

Workflow 5: Diagnose Cross-Region Latency

When data in OneLake or external storage is in a different region than Fabric capacity:

  1. Verify the Fabric capacity region in the Admin portal
  2. Check shortcut destinations — are they in the same region?
  3. For ADLS Gen2 or S3 shortcuts, confirm storage account region
  4. Keep large fact tables co-located; small dimension tables tolerate cross-region
  5. Use the region-latency-test.ps1 script to measure impact

Workflow 6: Direct Lake Fallback Investigation

When Direct Lake models fall back to DirectQuery instead of reading from OneLake:

  1. Check if the semantic model has been framed (refreshed) recently
  2. Verify Delta tables are V-Ordered for optimal transcoding
  3. Check table row counts against the SKU guardrails
  4. Review column data types — large string columns degrade performance
  5. See direct-lake-remediate.md

remediate Quick Reference

SymptomLikely CauseFirst Action
HTTP 430 errorsCapacity VCores exhaustedCheck Monitoring Hub, cancel idle sessions
First query very slowCold cache / node resumeCheck data_scanned_remote_storage_mb
All queries slowSmall files / no V-OrderRun table health check script
Queries slow after migrationWrong resource profileSwitch to appropriate read/write profile
Shortcuts slowCross-region data accessVerify region co-location
Direct Lake fallbackTable not framed / too largeCheck framing status and SKU guardrails
VACUUM failsRetention period too shortSet retention >= 7 days
Streaming ingestion slowSchema enforcement overheadConsider Eventhouse with OneLake availability

References

Available Scripts

Templates