AgentSkillsCN

fabric-delta-spark-perf

在 Microsoft Fabric 中排查并优化 Delta Lake 和 Apache Spark 的性能。适用于在诊断 Spark 作业运行缓慢、小文件问题、数据倾斜、shuffle 瓶颈、内存不足错误、V-Order 调优、OPTIMIZE/VACUUM 操作、分区策略、资源配置选择(writeHeavy、readHeavyForSpark、readHeavyForPBI)、自动调优配置、Native 执行引擎、广播连接、AQE(自适应查询执行),或在 Fabric Lakehouse 工作负载中发现 Spark 笔记本或 Spark 作业定义的运行速度低于预期时使用。

SKILL.md
--- frontmatter
name: fabric-delta-spark-perf
description: Troubleshoot and optimize Delta Lake and Apache Spark performance in Microsoft Fabric. Use when diagnosing slow Spark jobs, small file problems, data skew, shuffle bottlenecks, out-of-memory errors, V-Order tuning, OPTIMIZE/VACUUM operations, partition strategy, resource profile selection (writeHeavy, readHeavyForSpark, readHeavyForPBI), autotune configuration, Native Execution Engine, broadcast joins, AQE (Adaptive Query Execution), or when Spark notebooks or Spark Job Definitions run slower than expected in Fabric Lakehouse workloads.

Microsoft Fabric Delta Lake Spark Performance remediate

Systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments.

When to Use This Skill

Activate when the user mentions any of the following:

  • Spark job is slow, taking too long, or timing out
  • Small file problem, too many small files, file fragmentation
  • Data skew, straggler tasks, unbalanced partitions
  • Out of memory (OOM) errors on driver or executor
  • Shuffle spill, excessive shuffle read/write
  • OPTIMIZE, VACUUM, bin-compaction, or table maintenance
  • V-Order, Z-Order, or Parquet optimization
  • Resource profiles: writeHeavy, readHeavyForSpark, readHeavyForPBI
  • Autotune, Adaptive Query Execution (AQE), broadcast join thresholds
  • Native Execution Engine configuration
  • Streaming performance, microbatch tuning, checkpoint issues
  • Spark pool sizing, autoscale, dynamic executor allocation
  • Direct Lake performance tied to Delta table structure
  • Capacity throttling, TooManyRequestsForCapacity errors

Prerequisites

  • Microsoft Fabric workspace with Data Engineering or Data Science experience
  • Apache Spark notebooks or Spark Job Definitions
  • Lakehouse with Delta tables
  • Appropriate Fabric capacity SKU (F2 through F2048)

Quick Diagnostic Workflow

When a user reports slow Spark performance, follow this triage sequence:

Step 1: Identify the Symptom Category

SymptomLikely Root CauseJump To
Job runs much longer than expectedData skew or small filesStep 2
OOM error on drivercollect(), toPandas(), or large broadcastdiagnostic-checklist.md
OOM error on executorWide joins, large shuffles, insufficient memorydiagnostic-checklist.md
Many tasks, most finish fast, few stragglersData skewdiagnostic-checklist.md
High shuffle read/write in Spark UIMissing broadcast join or too many partitionsdiagnostic-checklist.md
Query reads thousands of small filesSmall file problem, needs OPTIMIZEStep 3
Capacity throttled (HTTP 430)Too many concurrent jobs for SKUspark-configurations.md
Streaming lag increasingMicrobatch interval or partition mismatchdiagnostic-checklist.md

Step 2: Check for Data Skew

Run the diagnostic script to detect skew in a target table:

python
# Quick skew detection
df = spark.read.format("delta").table("your_table")
df.groupBy("partition_column") \
  .count() \
  .orderBy(F.desc("count")) \
  .show(20)

If the top partition has 10x+ more rows than the median, apply skew mitigation. See diagnostic-checklist.md for remediation steps.

Step 3: Check File Health

python
# Check file count and sizes for a Delta table
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "your_table")
detail = spark.sql("DESCRIBE DETAIL your_table")
detail.select("numFiles", "sizeInBytes").show()

# Check for small files (< 32MB)
files_df = spark.sql("DESCRIBE DETAIL your_table")

If file count is high relative to data volume (e.g., >1000 files for <10GB), run OPTIMIZE. See table-maintenance-guide.md.

Step 4: Verify Spark Configuration

Check that the environment has an appropriate resource profile:

python
# Check current resource profile
print(spark.conf.get("spark.fabric.resourceProfile", "not set"))

# Check key write/read settings
configs = [
    "spark.sql.parquet.vorder.default",
    "spark.databricks.delta.optimizeWrite.enabled",
    "spark.databricks.delta.optimizeWrite.binSize",
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.files.maxPartitionBytes",
    "spark.databricks.optimizer.adaptive.enabled"
]
for c in configs:
    try:
        print(f"{c} = {spark.conf.get(c)}")
    except:
        print(f"{c} = (default)")

See spark-configurations.md for recommended values per workload type.

Step 5: Apply Fix and Validate

After identifying the root cause, apply the appropriate fix from the references:

Root CauseFixReference
Small filesRun OPTIMIZE with V-Ordertable-maintenance-guide.md
Stale files bloating storageRun VACUUM with 7+ day retentiontable-maintenance-guide.md
Data skew on joinsEnable AQE skew join + key saltingdiagnostic-checklist.md
Wrong resource profileSwitch to appropriate profilespark-configurations.md
Driver OOMAvoid collect(), increase driver memorydiagnostic-checklist.md
Executor OOMIncrease executor memory, reduce partition sizediagnostic-checklist.md
Excessive shufflingUse broadcast joins for small tablesdiagnostic-checklist.md
V-Order not appliedEnable at table or session leveltable-maintenance-guide.md

Available Scripts

ScriptPurpose
diagnose-delta-performance.pyAutomated diagnostic scan: file health, skew detection, config audit
table-maintenance.pyRun OPTIMIZE + VACUUM across all tables in a Lakehouse schema

Available Templates

TemplatePurpose
notebook-performance-template.pyStarter notebook with performance best practices baked in

Key Principles

  1. Profile before optimizing — Use Spark UI metrics and the diagnostic script before changing configurations. Most performance issues trace to data layout, not Spark settings.

  2. Match resource profile to workload — Write-heavy ETL should use writeHeavy (V-Order off, stats collection off). Read-heavy analytics should use readHeavyForSpark or readHeavyForPBI (V-Order on, Optimized Write on).

  3. Right-size files first — Target ~128MB–1GB per Parquet file. Use OPTIMIZE regularly for tables with frequent writes. Streaming tables need more frequent compaction.

  4. Partition wisely — Only partition tables over 1TB. Use low-cardinality columns (< 200 distinct values). Over-partitioning creates the small file problem.

  5. Enable AQE — Adaptive Query Execution handles skew joins, partition coalescing, and dynamic broadcast automatically. Keep it enabled.

  6. VACUUM safely — Always retain at least 7 days. Shorter retention can break time travel and concurrent readers.

References