AgentSkillsCN

fabric-spark-compute-perf-remediate

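Diagnose and resolve performance issues in Microsoft Fabric Delta Lake and Spark workloads. Use when remediating slow Spark notebooks, long-running Spark jobs, Delta table query degradation, small file problems, VOrder configuration, resource profile tuning, autotune configuration, capacity throttling (HTTP 430), session startup delays, executor memory errors, skewed partitions, or streaming ingestion bottlenecks. Covers Lakehouse, SQL analytics endpoint, Spark compute, environment configuration, Delta OPTIMIZE, VACUUM, Z-Order, bin-compaction, and Fabric capacity SKU sizing.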

SKILL.md
---
name: fabric-spark-compute-perf-remediate
description: >-
  Diagnose and resolve performance issues in Microsoft Fabric Delta Lake and Spark workloads.
  Use when remediating slow Spark notebooks, long-running Spark jobs, Delta table query
  degradation, small file problems, VOrder configuration, resource profile tuning, autotune
  configuration, capacity throttling (HTTP 430), session startup delays, executor memory
  errors, skewed partitions, or streaming ingestion bottlenecks. Covers Lakehouse, SQL
  analytics endpoint, Spark compute, environment configuration, Delta OPTIMIZE, VACUUM,
  Z-Order, bin-compaction, and Fabric capacity SKU sizing.
license: Complete terms in LICENSE.txt
---

Fabric Delta Lake & Spark Performance Remediation

A systematic remediation toolkit for diagnosing and resolving performance issues across Microsoft Fabric Spark compute, Delta Lake tables, and Lakehouse workloads.

When to Use This Skill

  • Spark notebooks or jobs running slower than expected
  • Delta table queries degrading over time (small file problem)
  • Capacity throttling errors (HTTP 430 / TooManyRequestsForCapacity)
  • Session startup taking minutes instead of seconds
  • Out-of-memory errors on executors or drivers
  • Streaming ingestion falling behind or producing small files
  • VOrder or resource profile misconfiguration
  • Deciding between read-heavy vs. write-heavy workload profiles
  • Planning table maintenance (OPTIMIZE, VACUUM, Z-Order)
  • Tuning autoscaling, dynamic executor allocation, or autotune

Prerequisites

  • Access to a Microsoft Fabric workspace with Data Engineering enabled
  • Contributor or higher role on the workspace
  • Fabric capacity (F2+) or Trial capacity
  • PowerShell 7+ for diagnostic scripts
  • PySpark for notebook-based diagnostics

Triage: Identify the Symptom Category

Start by matching the observed symptom to one of these categories, then follow the linked reference for the detailed workflow.

| Symptom | Category | Reference |
| --- | --- | --- |
| Notebook takes minutes to start | Session Startup | session-startup.md |
| Queries slow, getting worse over time | Delta Table Health | delta-table-health.md |
| HTTP 430 or "TooManyRequestsForCapacity" | Capacity Throttling | capacity-throttling.md |
| OOM errors, executor lost, task failed | Memory & Compute | memory-compute.md |
| Streaming lag, checkpoint delays | Streaming Performance | streaming-perf.md |
| Power BI reports slow against Lakehouse | Read Optimization | read-optimization.md |
| Spark job slow but no errors | General Tuning | general-tuning.md |

Quick Diagnostic Checklist

Run through these checks before diving into detailed remediation.

1. Check Capacity and Concurrency

Verify your Fabric capacity SKU and current utilization. Each SKU has a queue limit for concurrent Spark jobs:

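| SKU | Spark VCores | Queue Limit |
| --- | --- | --- |
| F2 | 4 | 4 |
| F8 | 16 | 8 |
| F16 | 32 | 16 |
| F32 | 64 | 32 |
| F64 | 128 | 64 |
| F128 | 256 | 128 |
| Trial | 128 | No queueing |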

If jobs are hitting the queue limit, cancel idle jobs via the Monitoring Hub, scale up the SKU, or stagger job schedules.
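As a rough illustration of the check above, the sketch below encodes the SKU table as a lookup and reports how many queue slots remain. The function name `queue_headroom` and the idea of passing in a queued-job count are assumptions for illustration; the real count would come from the Monitoring Hub or a capacity report.

```python
# Spark VCores and queue limits per SKU, copied from the SKU table in this section.
# A queue limit of None means queueing is not supported (Trial capacity).
SKU_LIMITS = {
    "F2": (4, 4),
    "F8": (16, 8),
    "F16": (32, 16),
    "F32": (64, 32),
    "F64": (128, 64),
    "F128": (256, 128),
    "Trial": (128, None),
}


def queue_headroom(sku, queued_jobs):
    """Return remaining queue slots for a SKU, or None if the SKU has no queue.

    A negative result means more jobs are waiting than the queue can hold.
    """
    _, queue_limit = SKU_LIMITS[sku]
    if queue_limit is None:
        return None
    return queue_limit - queued_jobs


if __name__ == "__main__":
    # Hypothetical example: 60 jobs queued on an F64 capacity.
    print(queue_headroom("F64", 60))
```

A small or negative headroom suggests staggering schedules or scaling the SKU before new jobs start being rejected.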

2. Check Resource Profile

New Fabric workspaces default to the writeHeavy profile. Verify the active profile matches your workload:

| Profile | Best For | VOrder Default |
| --- | --- | --- |
| writeHeavy | ETL, ingestion, batch writes | Disabled |
| readHeavyForSpark | Interactive Spark queries, EDA | Enabled |
| readHeavyForPBI | Power BI Direct Lake, dashboards | Enabled |

Mismatch is one of the most common performance issues. Read-heavy workloads on the writeHeavy profile miss VOrder optimization entirely.
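The profile check can be sketched as a small lookup. The workload labels (`etl`, `interactive_spark`, `power_bi`) are hypothetical names invented for this example, and the `spark.fabric.resourceProfile` property shown in the trailing comments is an assumption that should be verified against your Fabric runtime before use.

```python
# Profile names and VOrder defaults, copied from the profile table in this section.
PROFILES = {
    "writeHeavy": {"vorder_default": False},
    "readHeavyForSpark": {"vorder_default": True},
    "readHeavyForPBI": {"vorder_default": True},
}

# Hypothetical workload labels used only for this sketch.
WORKLOAD_TO_PROFILE = {
    "etl": "writeHeavy",
    "interactive_spark": "readHeavyForSpark",
    "power_bi": "readHeavyForPBI",
}


def profile_mismatch(active_profile, workload):
    """Return the recommended profile if it differs from the active one, else None."""
    recommended = WORKLOAD_TO_PROFILE[workload]
    return None if active_profile == recommended else recommended


# In a Fabric notebook, where `spark` is predefined (assumed property name):
#   active = spark.conf.get("spark.fabric.resourceProfile", "writeHeavy")
#   fix = profile_mismatch(active, "power_bi")
#   if fix:
#       spark.conf.set("spark.fabric.resourceProfile", fix)
```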

3. Check Delta Table Health

Run the diagnostic script or execute in a notebook:

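```python
# Quick Delta table health check
from delta.tables import DeltaTable

# Replace "your_table_name" with the Lakehouse table to inspect
dt = DeltaTable.forName(spark, "your_table_name")

# history() returns one row per commit, newest first
history = dt.history()
history.select("version", "operation", "operationMetrics").show(20, truncate=False)
```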

Look for: long runs of commits with no OPTIMIZE among them, many small writes (check numFiles and numOutputBytes in operationMetrics), and no VACUUM entries.
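One way to turn that visual scan into numbers is sketched below. It assumes the history rows have been collected into plain dictionaries, newest first, which is the order `DeltaTable.history()` returns; the function name `summarize_history` and the returned keys are illustrative, not part of any Fabric or Delta API.

```python
def summarize_history(rows):
    """Summarize Delta history rows given newest-first.

    Each row is a dict with at least an "operation" key, for example
    "WRITE", "MERGE", "OPTIMIZE", "VACUUM START" or "VACUUM END".
    """
    commits_since_optimize = 0
    optimize_seen = False
    for row in rows:
        if row["operation"] == "OPTIMIZE":
            optimize_seen = True
            break
        commits_since_optimize += 1

    vacuum_seen = any(r["operation"].startswith("VACUUM") for r in rows)
    return {
        "commits_since_optimize": commits_since_optimize,
        "optimize_seen": optimize_seen,
        "vacuum_seen": vacuum_seen,
    }


# In a notebook, rows could be built from the history DataFrame:
#   rows = [r.asDict() for r in history.select("version", "operation").collect()]
#   print(summarize_history(rows))
```

A large `commits_since_optimize` value, or `optimize_seen` being false, points toward the small file problem; what counts as "large" depends on write size and frequency.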

4. Check Environment Configuration

Verify Spark compute settings in the attached environment:

```python
# Print active Spark configuration
for key, value in spark.sparkContext.getConf().getAll():
    print(f"{key} = {value}")
```

Key properties to verify:

  • spark.sql.parquet.vorder.default — true for read workloads
  • spark.microsoft.delta.optimizeWrite.enabled — true to reduce small files
  • spark.ms.autotune.enabled — true to enable ML-based query tuning
  • spark.sql.adaptive.enabled — true for adaptive query execution
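The property list above can be checked mechanically. The sketch below compares a dictionary of settings against the expected values for a read-oriented workload; `config_drift` and `EXPECTED` are names made up for this example. Note that a property missing from the dictionary only means it was not set explicitly, so the runtime default may still apply.

```python
# Expected values for a read-oriented workload, taken from the list above.
EXPECTED = {
    "spark.sql.parquet.vorder.default": "true",
    "spark.microsoft.delta.optimizeWrite.enabled": "true",
    "spark.ms.autotune.enabled": "true",
    "spark.sql.adaptive.enabled": "true",
}


def config_drift(actual, expected=EXPECTED):
    """Return {property: actual_value} for every property that differs.

    A value of None means the property was not present in `actual`.
    Comparison is case-insensitive because Spark accepts "True" and "true".
    """
    drift = {}
    for key, want in expected.items():
        got = actual.get(key)
        if got is None or got.lower() != want:
            drift[key] = got
    return drift


# In a notebook, where `spark` is predefined:
#   actual = dict(spark.sparkContext.getConf().getAll())
#   for key, value in config_drift(actual).items():
#       print(f"{key}: found {value!r}, expected {EXPECTED[key]!r}")
```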

5. Check Session Startup Time

Fabric Spark session startup depends on pool configuration:

| Scenario | Expected Startup |
| --- | --- |
| Default settings, no custom libraries | 5–10 seconds |
| Default + library dependencies | 5–10 sec + 30 sec–5 min |
| High regional traffic | 2–5 minutes |
| Private Links or Managed VNets | 2–5 minutes |
| High traffic + libraries + networking | 2–5 min + 30 sec–5 min |

If startup is consistently slow, check whether Private Links or Managed VNets are enabled. Starter Pools are not used with either, so each session waits for an on-demand cluster.
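To compare a measured startup time against the table, the ranges can be combined arithmetically. This is a minimal sketch of that arithmetic; the function name and flag names are hypothetical, and the numbers are simply the table's ranges converted to seconds.

```python
def expected_startup_seconds(custom_libraries=False, high_traffic=False,
                             private_networking=False):
    """Return (low, high) expected session startup time in seconds.

    Base range: 5-10 s with default settings, or 120-300 s under high
    regional traffic or with Private Links / Managed VNets.
    Custom libraries add a further 30-300 s on top of the base range.
    """
    if high_traffic or private_networking:
        low, high = 120, 300
    else:
        low, high = 5, 10
    if custom_libraries:
        low, high = low + 30, high + 300
    return low, high


if __name__ == "__main__":
    # Default pool with custom libraries installed.
    print(expected_startup_seconds(custom_libraries=True))
```

A measured startup well above the upper bound for your scenario is a reason to open the session startup reference rather than wait it out.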

Key Configuration Properties

VOrder (Read Optimization)

VOrder applies a special sort order to Parquet files at write time that dramatically improves read performance for Power BI Direct Lake and SQL analytics endpoint queries.

Enable: spark.sql.parquet.vorder.default = true

Trade-off: Write operations take slightly longer. Enable only when read performance matters more than write throughput.

Native Execution Engine

Vectorized C++ engine that runs Spark SQL and DataFrame operations significantly faster than the default JVM engine.

Enable: spark.native.enabled = true

Compatible with Fabric Runtime 1.3+. Not all operations supported; the engine falls back to JVM for unsupported operations transparently.

Autotune

ML-based automatic Spark configuration tuning. Builds a model per query pattern and optimizes settings iteratively over 20–25 runs.

Enable: spark.ms.autotune.enabled = true

Requirements: Fabric Runtime 1.1 or 1.2 only. Incompatible with high concurrency mode and private endpoints. Targets queries running 15+ seconds.
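The requirements above lend themselves to a quick pre-check before spending time wondering why Autotune has no effect. The sketch below only restates those requirements as code; `autotune_blockers` and its parameters are hypothetical names, and the inputs have to be supplied by you because this example does not query Fabric.

```python
def autotune_blockers(runtime, high_concurrency, private_endpoint, query_seconds):
    """Return a list of reasons Autotune would not apply; empty if none found."""
    blockers = []
    if runtime not in ("1.1", "1.2"):
        blockers.append(f"runtime {runtime} is not 1.1 or 1.2")
    if high_concurrency:
        blockers.append("high concurrency mode is enabled")
    if private_endpoint:
        blockers.append("private endpoint is enabled")
    if query_seconds < 15:
        blockers.append(f"query runs {query_seconds}s, below the 15s threshold")
    return blockers


if __name__ == "__main__":
    # Hypothetical session: Runtime 1.3, standard mode, 40-second query.
    for reason in autotune_blockers("1.3", False, False, 40):
        print(reason)
```

An empty list does not guarantee a speedup, since tuning still needs repeated runs of the same query pattern to converge.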

Optimized Write

Merges or splits partitions before writing to maximize file sizes and reduce small files.

Enable: spark.microsoft.delta.optimizeWrite.enabled = true

Trade-off: Adds a full shuffle before writes. Beneficial for append-heavy workloads; may degrade performance if data is already well-partitioned.

Available Scripts

| Script | Purpose |
| --- | --- |
| Diagnose-DeltaTableHealth.ps1 | Assess Delta table health via Fabric REST API |
| Get-SparkCapacityStatus.ps1 | Check capacity utilization and queue status |
| Invoke-TableMaintenance.ps1 | Run OPTIMIZE + VACUUM via REST API |

Available Templates

| Template | Purpose |
| --- | --- |
| perf-diagnostic-notebook.py | PySpark notebook for comprehensive diagnostics |

Remediation

| Problem | Likely Cause | Quick Fix |
| --- | --- | --- |
| Queries slow, no errors | Small files, missing VOrder | Run OPTIMIZE with V-Order |
| Write jobs slow | readHeavy profile on ETL workload | Switch to writeHeavy profile |
| HTTP 430 errors | Queue limit exceeded | Scale SKU or stagger jobs |
| OOM on executor | Skewed partition or undersized nodes | Increase executor memory, check for skew |
| Session takes 5+ minutes | Private Link/VNet or library install | Pre-install libraries, check networking |
| Autotune not activating | Wrong runtime or short queries | Verify Runtime 1.1/1.2, queries must run 15s+ |
| Streaming falling behind | Too-frequent microbatches | Increase trigger interval, use Optimized Write |
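For the first row, "Run OPTIMIZE with V-Order" can be issued from a notebook as Spark SQL. The helper below is a sketch that only builds the statements; `build_maintenance_sql` is a hypothetical name, the table and column names in the usage comment are placeholders, and the 168-hour retention is the usual Delta default of seven days, stated here as an assumption about what you want to keep.

```python
def build_maintenance_sql(table, zorder_cols=(), vorder=True, retain_hours=168):
    """Build OPTIMIZE and VACUUM statements for one Delta table.

    zorder_cols  : optional columns to co-locate with ZORDER BY
    vorder       : append the VORDER keyword to the OPTIMIZE statement
    retain_hours : history to keep when vacuuming (168 h = 7 days)
    """
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    if vorder:
        optimize += " VORDER"
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [optimize, vacuum]


# In a notebook, where `spark` is predefined (placeholder table and column):
#   for statement in build_maintenance_sql("sales", zorder_cols=["customer_id"]):
#       spark.sql(statement)
```

Running OPTIMIZE before VACUUM matters: compaction rewrites small files into larger ones, and VACUUM later removes the superseded files once they fall outside the retention window.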

References