fabric-spark-perf-remediate

SKILL.md
---
name: fabric-spark-perf-remediate
description: Diagnose and resolve Apache Spark performance issues in Microsoft Fabric. Use when asked to troubleshoot slow Spark notebooks, optimize Spark SQL queries, fix data skew or shuffle bottlenecks, tune spark.sql.shuffle.partitions or autoBroadcastJoinThreshold, configure resource profiles (writeHeavy, readHeavyForSpark, readHeavyForPBI), enable autotune, resolve HTTP 430 throttling errors, analyze Spark UI stages and executors, optimize Delta Lake writes with VOrder or Optimized Write, run table maintenance (bin-compaction, vacuum, Z-Order), fix small files problems, tune streaming throughput, or right-size Fabric Spark pools and capacity SKUs.
license: Complete terms in LICENSE.txt
---

Microsoft Fabric Apache Spark Performance Remediation

Systematic workflows for diagnosing, analyzing, and resolving Apache Spark performance problems in Microsoft Fabric Data Engineering and Data Science workloads.

When to Use This Skill

Activate when encountering any of the following scenarios:

  • Spark notebooks or jobs running slower than expected
  • Capacity throttling errors (HTTP 430 / TooManyRequestsForCapacity)
  • Data skew detected by Spark Advisor in notebook cells
  • Excessive shuffle read/write in Spark UI stages
  • Small files accumulation in Delta Lake tables
  • Streaming ingestion throughput degradation
  • Need to select or tune a Fabric Spark resource profile
  • VOrder vs. Optimized Write decision-making
  • Autotune configuration and validation
  • Right-sizing Spark pools, node counts, or Fabric capacity SKUs

Prerequisites

  • Access to a Microsoft Fabric workspace with Data Engineering enabled
  • Contributor or higher role on the workspace
  • Familiarity with PySpark or Spark SQL
  • PowerShell 7+ (for diagnostic scripts)
  • Fabric REST API access token (for API-based diagnostics)

Quick Diagnosis Decision Tree

Start here when a Spark job is slow:

  1. Is the job queued or throttled? Check Monitoring Hub for HTTP 430.

  2. Did the Spark Advisor flag warnings? Check notebook cell indicators.

  3. Is a single stage disproportionately slow? Open Spark UI → Stages tab.

  4. Are executors underutilized? Check Resources tab in monitoring detail.

  5. Is the issue write-related? Check write duration in Spark UI.
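
The five questions above can be walked in order programmatically. The sketch below is one hedged way to do that: the symptom keys (`throttled`, `advisor_warnings`, and so on) and the suggested next steps are hypothetical labels chosen for illustration, inferred from the rest of this skill rather than taken from any Fabric API.

```python
from typing import Dict, List

# Ordered checks mirroring the five decision-tree questions.
# Keys and next-step wording are assumptions made for this sketch.
CHECKS = [
    ("throttled", "Capacity: look for HTTP 430 in Monitoring Hub; cancel idle jobs or review the SKU"),
    ("advisor_warnings", "Spark Advisor: review the flagged notebook cells (for example, data skew)"),
    ("slow_stage", "Spark UI Stages tab: look for skewed tasks or heavy shuffle in the slow stage"),
    ("idle_executors", "Resources tab: revisit pool size and partition counts"),
    ("slow_writes", "Writes: consider Optimized Write or table maintenance"),
]


def triage(symptoms: Dict[str, bool]) -> List[str]:
    """Return next steps, in decision-tree order, for every observed symptom."""
    return [step for key, step in CHECKS if symptoms.get(key, False)]


if __name__ == "__main__":
    for step in triage({"slow_stage": True, "slow_writes": True}):
        print(step)
```

Keeping the checks in a fixed order means capacity problems are always surfaced before query-level tuning, which matches the order of the questions.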

Core Spark Configuration Quick Reference

These are the three settings Fabric Autotune manages automatically. If autotune is disabled, tune them manually:

| Setting | Default | Purpose | Tuning Guidance |
| --- | --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Partition count during joins/aggregations | Set to 2-3x total executor cores for your pool |
| spark.sql.autoBroadcastJoinThreshold | 10 MB | Max table size for broadcast joins | Increase to 100-256 MB for star-schema joins |
| spark.sql.files.maxPartitionBytes | 128 MB | Max bytes per file-read partition | Increase for large sequential scans, decrease for high parallelism |
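
As a rough illustration of the manual tuning guidance, the sketch below derives values for the three settings from a pool's shape. The function names and parameters are hypothetical; the defaults (10 MB, 128 MB) and the 2-3x multiplier come from the table above, and sizes are emitted as plain byte counts.

```python
from typing import Dict

MB = 1024 * 1024


def suggest_shuffle_partitions(executors: int, cores_per_executor: int, factor: int = 2) -> int:
    """Apply the '2-3x total executor cores' rule of thumb."""
    if executors < 1 or cores_per_executor < 1 or factor < 1:
        raise ValueError("executors, cores_per_executor and factor must be positive")
    return executors * cores_per_executor * factor


def manual_settings(executors: int, cores_per_executor: int,
                    broadcast_mb: int = 10, max_partition_mb: int = 128,
                    factor: int = 2) -> Dict[str, str]:
    """Build string-valued Spark settings for the three tunables."""
    return {
        "spark.sql.shuffle.partitions": str(
            suggest_shuffle_partitions(executors, cores_per_executor, factor)),
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_mb * MB),
        "spark.sql.files.maxPartitionBytes": str(max_partition_mb * MB),
    }


if __name__ == "__main__":
    settings = manual_settings(executors=4, cores_per_executor=8, broadcast_mb=100)
    for key, value in settings.items():
        print(f"{key} = {value}")
        # In a notebook session: spark.conf.set(key, value)
```

Treat the output as a starting point to validate in the Spark UI, not as a guaranteed optimum.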

Resource Profiles Quick Reference

Fabric provides predefined profiles that bundle optimized Spark settings:

| Profile | Best For | VOrder | Key Characteristics |
| --- | --- | --- | --- |
| writeHeavy | ETL, batch ingestion, streaming | Disabled | Default for new workspaces; optimized write throughput |
| readHeavyForSpark | Interactive Spark queries, analytics | Enabled | Optimized read paths for Spark workloads |
| readHeavyForPBI | Power BI dashboards, DW queries | Enabled | Optimized for DirectLake and cross-engine reads |

Apply a profile at the environment level or override per-session:

python
# Per-session override example
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")

Autotune Quick Start

Enable autotune to let Fabric automatically optimize shuffle partitions, broadcast thresholds, and partition bytes:

python
# Enable in a notebook session
spark.conf.set("spark.ms.autotune.enabled", "true")

# Or set in Environment > Spark Properties
# spark.ms.autotune.enabled = true

Requirements: autotune is supported on Fabric Runtime 1.1 and 1.2 only. It does not work in high concurrency mode or when private endpoints are enabled. Expect roughly 20-25 iterations of a query before it converges on optimal settings.
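
The requirements above can be turned into a quick pre-flight check. This is a minimal sketch: the function name and its inputs are hypothetical, and the caller is assumed to already know the runtime version and whether high concurrency mode or a private endpoint is in use.

```python
from typing import List

# Runtimes listed as supported in the requirements above.
SUPPORTED_RUNTIMES = {"1.1", "1.2"}


def autotune_blockers(runtime: str, high_concurrency: bool, private_endpoint: bool) -> List[str]:
    """Return the reasons autotune would not activate; an empty list means none found."""
    blockers = []
    if runtime not in SUPPORTED_RUNTIMES:
        blockers.append(f"Runtime {runtime} is not supported (needs 1.1 or 1.2)")
    if high_concurrency:
        blockers.append("High concurrency mode is enabled")
    if private_endpoint:
        blockers.append("A private endpoint is enabled")
    return blockers


if __name__ == "__main__":
    for reason in autotune_blockers("1.3", high_concurrency=True, private_endpoint=False):
        print("Blocked:", reason)
```

Returning every blocker at once, rather than stopping at the first, avoids fixing one issue only to hit the next.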

Check autotune status after a query:

python
# View autotune decisions in Spark UI SQL tab
# Status values: QUERY_TUNING_SUCCEED, QUERY_TUNING_DISABLED,
#                QUERY_PATTERN_NOT_MATCH, QUERY_DURATION_TOO_SHORT
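
When reviewing many queries, it can help to map each status value to a short hint. In the sketch below the status names come from the list above, while the hint wording is an interpretation inferred from those names and the `explain_status` helper is hypothetical.

```python
# Hints are interpretations of the status names, not official descriptions.
AUTOTUNE_STATUS_HINTS = {
    "QUERY_TUNING_SUCCEED": "Autotune applied tuned settings to this query.",
    "QUERY_TUNING_DISABLED": "Autotune is off; set spark.ms.autotune.enabled to true.",
    "QUERY_PATTERN_NOT_MATCH": "The query shape is not one autotune handles; tune manually.",
    "QUERY_DURATION_TOO_SHORT": "The query finished too quickly to be worth tuning.",
}


def explain_status(status: str) -> str:
    """Translate a status copied from the Spark UI SQL tab into a short hint."""
    return AUTOTUNE_STATUS_HINTS.get(
        status.strip().upper(),
        "Unknown status; check the Spark UI SQL tab.",
    )


if __name__ == "__main__":
    print(explain_status("QUERY_PATTERN_NOT_MATCH"))
```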

Common Error Patterns

| Error / Symptom | Root Cause | Quick Fix |
| --- | --- | --- |
| HTTP 430: TooManyRequestsForCapacity | All Spark VCores consumed | Cancel idle jobs in Monitoring Hub or upgrade SKU |
| Stage with 200 tasks, 1 task 100x slower | Data skew on join/group key | Add salting or use broadcast join |
| OOM on executor | Partition too large or broadcast too big | Increase partitions or lower broadcast threshold |
| Write takes >60% of total job time | Small files or missing optimization | Enable Optimized Write or run table maintenance |
| Streaming micro-batch latency increasing | Checkpoint overhead or partition mismatch | Tune trigger interval and Event Hub partitions |
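
For the data skew row, the sketch below shows one way to quantify skew from task durations and one common form of salting. Both helpers are hypothetical: `skew_ratio` expects durations copied by hand from the Spark UI, and `salted_join` assumes an inner join on a single key column and that neither DataFrame already has a column named `_salt`.

```python
from statistics import median
from typing import Sequence


def skew_ratio(task_seconds: Sequence[float]) -> float:
    """Slowest task duration divided by the median task duration."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    typical = median(task_seconds)
    return float("inf") if typical == 0 else max(task_seconds) / typical


def salted_join(spark, fact_df, dim_df, key: str, buckets: int = 16):
    """Inner-join a skewed fact DataFrame to a dimension DataFrame using salting."""
    # Imported lazily so skew_ratio stays usable without PySpark installed.
    from pyspark.sql import functions as F

    # Spread each key's fact rows across `buckets` random salt values.
    salted_fact = fact_df.withColumn("_salt", (F.rand() * buckets).cast("int"))
    # Replicate every dimension row once per salt value so each salted key still matches.
    salts = spark.range(buckets).select(F.col("id").cast("int").alias("_salt"))
    salted_dim = dim_df.crossJoin(salts)
    return salted_fact.join(salted_dim, on=[key, "_salt"], how="inner").drop("_salt")


if __name__ == "__main__":
    # 199 tasks at 2 s and one straggler at 200 s, like the symptom in the table.
    print(skew_ratio([2.0] * 199 + [200.0]))
```

Salting multiplies the dimension side by `buckets`, so it suits cases where that side is modest in size. If it is small enough to broadcast, a broadcast join is usually the simpler fix.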

Step-by-Step Workflows

For detailed procedures, see the reference guides:

Available Scripts

Run the Fabric Spark diagnostics script to collect Spark application metrics via the Fabric REST API:

powershell
./scripts/Get-FabricSparkDiagnostics.ps1 -WorkspaceId "<guid>" -Token "<bearer-token>"

Available Templates

Use the performance analysis notebook template as a starting point for in-session diagnostics:

python
# Paste into a Fabric notebook to analyze current session performance
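
As a starting point for in-session diagnostics, the sketch below snapshots the settings discussed in this skill. The helper name is hypothetical, and it takes a getter function so it can run outside Spark. In a notebook, `spark.conf.get` can be passed directly.

```python
from typing import Callable, Dict

# Settings covered by this skill: the three autotune-managed values plus the autotune switch.
TUNABLE_KEYS = (
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.files.maxPartitionBytes",
    "spark.ms.autotune.enabled",
)


def snapshot_settings(conf_get: Callable[[str, str], str]) -> Dict[str, str]:
    """Read each tunable through conf_get(key, default), marking unset keys."""
    return {key: conf_get(key, "<unset>") for key in TUNABLE_KEYS}


if __name__ == "__main__":
    # Stand-in for a live session; in a Fabric notebook use snapshot_settings(spark.conf.get).
    fake_conf = {"spark.sql.shuffle.partitions": "200"}
    for key, value in snapshot_settings(lambda k, d: fake_conf.get(k, d)).items():
        print(f"{key} = {value}")
```

Capturing a snapshot before and after a change makes it easier to confirm which settings actually took effect.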

Remediation

| Problem | Check | Resolution |
| --- | --- | --- |
| Autotune not activating | Runtime version, high concurrency (HC) mode, private endpoint | Switch to Runtime 1.1/1.2, disable HC mode |
| Resource profile not applying | Environment publish status | Republish environment after profile change |
| Pool autoscale not scaling up | Capacity SKU limits | Verify VCore headroom in Capacity Metrics app |
| Table maintenance job stuck | Concurrent maintenance on same table | Wait for previous job or cancel via API |
| Notebook cell shows no Spark Advisor | Spark version below 3.4 | Upgrade to a runtime based on Spark 3.4 or later |
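
Once a stuck or concurrent maintenance job has cleared, maintenance can be rerun one table at a time. The sketch below builds standard Delta Lake `OPTIMIZE` and `VACUUM` statements as strings. The helper name, table name, and column names are hypothetical, and the 168-hour retention is an assumed value to adjust to your own retention policy.

```python
import re
from typing import List, Sequence

# Accept simple, optionally dotted identifiers only, to avoid building unsafe SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def maintenance_statements(table: str, zorder_cols: Sequence[str] = (),
                           retain_hours: int = 168) -> List[str]:
    """Return OPTIMIZE (optionally with ZORDER BY) and VACUUM statements for one table."""
    for name in (table, *zorder_cols):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if retain_hours < 0:
        raise ValueError("retain_hours must not be negative")
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return [optimize, f"VACUUM {table} RETAIN {retain_hours} HOURS"]


if __name__ == "__main__":
    for statement in maintenance_statements("sales", ["region"]):
        print(statement)
        # In a notebook session: spark.sql(statement)
```

Running the statements sequentially from one session keeps only a single maintenance operation active on the table at a time.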

References