
computation-analysis

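Analyze compute-intensive operators and their performance on Ascend NPU. Use when examining model operations, performance bottlenecks, and CANN operator library support.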

SKILL.md
--- frontmatter
name: computation-analysis
description: Analyze computation-intensive operators and performance for Ascend NPU. Use when examining model operations, performance bottlenecks, and CANN operator library support.

Computation Analysis for Ascend NPU

You are analyzing computation patterns for Ascend NPU performance. This skill helps identify:

  1. Computation-intensive operators and their locations
  2. CANN operator library support status
  3. CPU fallback operations (performance impact)
  4. Optimization opportunities with torch_npu
  5. Performance profiling approach

When to Use

Invoke this skill when:

  • User asks about performance or computation
  • Analyzing model operators and operations
  • Looking for performance bottlenecks
  • Planning optimization strategies

Analysis Approach

1. Identify Heavy Operators

Search for compute-intensive patterns:

bash
# Matrix operations (a bare "@" or "mm" also matches decorators and unrelated words)
grep -rnE "torch\.(matmul|mm|bmm)\(|\w @ \w" <repo_path>
grep -rn "nn\.Linear" <repo_path>

# Convolutions
grep -rn "nn\.Conv" <repo_path>

# Attention mechanisms (case-insensitive; also covers F.scaled_dot_product_attention)
grep -rniE "attention|scaled_dot_product" <repo_path>

# Normalization
grep -rn "LayerNorm\|BatchNorm" <repo_path>

# Activation functions
grep -rniE "relu|gelu|silu|softmax" <repo_path>
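The searches above can also be run in one pass by a small script that counts hits per category and records their locations. This is a minimal sketch, assuming the repository consists of `.py` files; the category names, the regular expressions, and the helper `scan_repo` are illustrative and not part of any Ascend tooling.

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative patterns mirroring the grep searches; adjust to the codebase.
PATTERNS = {
    "matmul": re.compile(r"torch\.(?:matmul|mm|bmm)\(|nn\.Linear\("),
    "conv": re.compile(r"nn\.Conv[123]d\("),
    "attention": re.compile(r"scaled_dot_product_attention|MultiheadAttention"),
    "norm": re.compile(r"LayerNorm|BatchNorm"),
    "activation": re.compile(r"\b(?:relu|gelu|silu|softmax)\b", re.IGNORECASE),
}


def scan_repo(repo_path):
    """Count pattern hits per category and record file:line locations."""
    counts = Counter()
    locations = {name: [] for name in PATTERNS}
    for path in sorted(Path(repo_path).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
                    locations[name].append(f"{path.name}:{lineno}")
    return counts, locations
```

Calling `scan_repo("<repo_path>")` returns a per-category counter plus the matching locations, which map directly onto the Count and Locations columns of the output table.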

2. CANN Operator Support

CANN (Compute Architecture for Neural Networks) provides:

  • Automatic acceleration for standard PyTorch ops
  • TBE (Tensor Boost Engine) operator fusion
  • AI Core acceleration for matrix ops

Natively Supported (High Performance):

  • Matrix multiplication (MatMul, GEMM)
  • Convolutions (Conv1d, Conv2d, Conv3d)
  • Standard activations (ReLU, GELU, SiLU)
  • LayerNorm, BatchNorm
  • Standard attention (scaled_dot_product_attention)

CPU Fallback (Low Performance):

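  • Custom CUDA kernels (these do not run on NPU as written; they need an Ascend implementation or a replacement built from standard PyTorch ops)
  • Third-party library operations without an NPU implementation
  • Fusion operations that CANN does not support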

Check CANN documentation for operator support status.
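Once operators have been collected, a first-pass triage against the lists above can be sketched as follows. The set `NATIVE_OPS` is only a transcription of the families named in this section, not the authoritative CANN support matrix, and `classify_ops` is a hypothetical helper; anything it flags still has to be checked against the CANN documentation.

```python
# Operator families this section lists as natively supported.
# Illustrative subset only; the CANN documentation is the source of truth.
NATIVE_OPS = {
    "matmul", "linear", "conv1d", "conv2d", "conv3d",
    "relu", "gelu", "silu", "layernorm",
    "batchnorm1d", "batchnorm2d", "batchnorm3d",
    "scaled_dot_product_attention",
}


def classify_ops(op_names):
    """Split operator names into 'native' and 'needs_review' buckets."""
    report = {"native": [], "needs_review": []}
    for name in op_names:
        # Compare on the last dotted component, e.g. "nn.Linear" -> "linear".
        key = name.lower().rsplit(".", 1)[-1]
        bucket = "native" if key in NATIVE_OPS else "needs_review"
        report[bucket].append(name)
    return report
```

Names landing in `needs_review` are candidates for the "CPU Fallback (Performance Risk)" part of the report.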

3. Graph Optimization Opportunities

Operator Fusion:

  • Combines multiple operations into a single kernel
  • Reduces memory transfers between operations
  • The Ascend TBE compiler performs fusion automatically

Identify opportunities:

  • Sequential linear + activation
  • Conv + batch norm + activation
  • Multiple element-wise operations
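The patterns listed above can be located mechanically when the model's layers are available as an ordered list of type names. This is a rough sketch: `FUSION_PATTERNS` and `find_fusion_candidates` are hypothetical names, the pattern list is an example rather than the set of fusions the compiler actually applies, and the input order is assumed to reflect execution order.

```python
# Example fusion candidates, written as sequences of layer type names.
# Longer patterns come first so they take precedence over their prefixes.
FUSION_PATTERNS = [
    ("Conv2d", "BatchNorm2d", "ReLU"),
    ("Linear", "ReLU"),
    ("Linear", "GELU"),
]


def find_fusion_candidates(layer_types):
    """Return (start_index, pattern) for each non-overlapping pattern match."""
    matches = []
    i = 0
    while i < len(layer_types):
        for pattern in FUSION_PATTERNS:
            window = tuple(layer_types[i:i + len(pattern)])
            if window == pattern:
                matches.append((i, pattern))
                i += len(pattern)  # skip past the matched layers
                break
        else:
            i += 1  # no pattern starts here
    return matches
```

For a PyTorch model, the input could be built with `[type(m).__name__ for m in model.modules()]`, keeping in mind that this yields registration order, which may differ from the order used in `forward`.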

4. Automatic Mixed Precision (AMP)

Performance Benefits:

  • Up to 2-4x speedup on supported operations (workload-dependent; confirm by profiling)
  • Lower memory bandwidth requirements
  • Better AI Core utilization

Check for AMP usage:

python
# Existing CUDA usage and its NPU counterpart (needs `import torch_npu`)
torch.cuda.amp.autocast  # → torch.npu.amp.autocast

# Opportunities
# - FP32 models that can use FP16
# - Operations supporting FP16 acceleration
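Finding existing AMP usage and proposing the NPU equivalent can be automated with a short scan. This sketch assumes that torch_npu exposes `autocast` and `GradScaler` under `torch.npu.amp`, mirroring `torch.cuda.amp`; verify that against the installed torch_npu version. `suggest_npu_amp` is a hypothetical helper.

```python
import re

# Assumed mapping: torch.cuda.amp.<name> -> torch.npu.amp.<name>.
# Verify against your torch_npu version before applying it.
CUDA_AMP = re.compile(r"torch\.cuda\.amp\.(?:autocast|GradScaler)")


def suggest_npu_amp(source):
    """Return (line_number, original, suggestion) for each CUDA AMP reference."""
    suggestions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CUDA_AMP.finditer(line):
            original = match.group(0)
            replacement = original.replace("torch.cuda", "torch.npu")
            suggestions.append((lineno, original, replacement))
    return suggestions
```

An empty result for a training script is itself a finding: the model probably runs in FP32 and is a candidate for enabling AMP.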

5. Distributed Training

Communication Operations:

bash
grep -rn "all_reduce\|broadcast\|gather" <repo_path>

HCCL (Huawei Collective Communication Library):

  • Replaces NCCL on Ascend
  • Used for multi-NPU training
  • Requires a backend change: pass backend="hccl" instead of "nccl" to torch.distributed.init_process_group
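The backend change can be kept in one place so the same training script runs on either device type. A minimal sketch, where `pick_backend` is a hypothetical helper and the device type strings are assumptions about how the script labels its target:

```python
def pick_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    backends = {"npu": "hccl", "cuda": "nccl"}
    # "gloo" is the usual CPU backend in torch.distributed.
    return backends.get(device_type, "gloo")


# Intended use (not executed here; needs torch and, for "hccl", torch_npu):
#   import torch.distributed as dist
#   dist.init_process_group(backend=pick_backend("npu"))
```

Hard-coded `backend="nccl"` strings found by the grep above are the places where such a helper would be substituted.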

Output Format

Computation-Intensive Operators

List heavy operators with locations:

Operator Type   Count   Locations              Complexity
nn.Linear       50      model.py:23,45,67...   O(n²)
nn.Conv2d       20      model.py:10-30...      O(k²n²)
MatMul          30      attention.py:55...     O(n³)
Attention       5       attention.py:40-80     O(n²)
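The Complexity column can be backed by rough FLOP counts when layer shapes are known. The formulas below are the usual back-of-the-envelope estimates (one multiply-accumulate counted as 2 FLOPs, bias and softmax ignored); the function names are illustrative, and the results are estimates rather than measurements.

```python
def linear_flops(batch, in_features, out_features):
    """nn.Linear forward pass; multiply-accumulate = 2 FLOPs, bias ignored."""
    return 2 * batch * in_features * out_features


def conv2d_flops(batch, c_in, c_out, kernel, h_out, w_out):
    """Dense square-kernel convolution with groups=1; bias ignored."""
    return 2 * batch * c_out * h_out * w_out * c_in * kernel * kernel


def attention_flops(batch, heads, seq_len, head_dim):
    """QK^T plus the attention-weighted sum over V; softmax and projections ignored."""
    return 4 * batch * heads * seq_len * seq_len * head_dim
```

Note the quadratic term in `attention_flops`: doubling `seq_len` quadruples the estimate, which is what the O(n²) entry for Attention expresses.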

CANN Support Analysis

Natively Supported (High Performance):

  • List operators with CANN acceleration
  • Expected speedup vs CPU

CPU Fallback (Performance Risk):

  • List operators requiring CPU execution
  • Performance impact assessment
  • Suggested workarounds

Optimization Recommendations

torch_npu AMP:

  • Enable automatic mixed precision
  • Expected speedup: 2-4x
  • Operations supporting FP16

Graph Optimization:

  • Operator fusion opportunities
  • Expected performance gain
  • TBE compiler optimization

Distributed Training:

  • HCCL communication optimization
  • Gradient compression opportunities
  • Overlap compute and communication

Performance Profiling

Recommended Tools:

bash
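# NPU monitoring
npu-smi info  # Current NPU status
npu-smi info -t usages -i <npu_id>  # Memory and utilization for one NPU

# Profiling
msprof  # CANN command-line profiling tool
# torch_npu.npu.profile is a Python API, not a shell command; call it from the training script

# Python profiling
python -m torch_npu.testing  # Benchmark utilities (confirm this entry point exists in your torch_npu version)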

Profiling Approach:

  1. Run model with small batch
  2. Use npu-smi to monitor utilization
  3. Profile individual operations
  4. Identify bottlenecks
  5. Optimize hotspots
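Steps 3 to 5 come down to aggregating per-operator timings and sorting them by share of total time. The sketch below assumes the profiler output has already been reduced to `(op_name, duration_us)` pairs; that record format and the helper `rank_hotspots` are assumptions for illustration, not the output format of msprof or torch_npu.

```python
from collections import defaultdict


def rank_hotspots(records, top_k=3):
    """Aggregate (op_name, duration_us) records and rank by share of total time."""
    totals = defaultdict(float)
    for op_name, duration_us in records:
        totals[op_name] += duration_us
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, total / grand_total) for name, total in ranked[:top_k]]
```

The operators at the top of the returned list are the ones worth optimizing first; a long tail of small shares usually is not.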

Expected Performance on Ascend

Based on analysis:

  • Bottleneck identification: What limits performance
  • Optimization priority: Rank optimizations by impact
  • Expected speedup: Relative to the GPU baseline
  • Performance model: Operations/sec, memory bandwidth
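One common way to turn "operations/sec, memory bandwidth" into a performance model is a roofline estimate: attainable throughput is the smaller of the compute peak and the bandwidth multiplied by arithmetic intensity. The sketch below uses that model as an assumption. The peak and bandwidth arguments must come from the specification of the actual device, and no Ascend figures are implied here.

```python
def attainable_flops(peak_flops, mem_bandwidth, flops, bytes_moved):
    """Roofline estimate: min(compute roof, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved  # FLOPs per byte moved
    return min(peak_flops, mem_bandwidth * intensity)


def is_memory_bound(peak_flops, mem_bandwidth, flops, bytes_moved):
    """True when the bandwidth roof lies below the compute roof."""
    return mem_bandwidth * (flops / bytes_moved) < peak_flops
```

An operator classified as memory-bound gains more from fusion and lower-precision data types than from anything that only raises compute throughput.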

Tools to Use

Documentation First:

Computation Analysis:

  • Use Grep to search for computation patterns
  • Reference project PDFs:
    • knowledge/CANN商用版 8.5.0 算子库接口参考 01.pdf

Key Considerations

  • CANN automatically accelerates standard PyTorch ops
  • Custom ops require special handling
  • Profiling essential for optimization
  • AMP provides easy performance wins
  • HCCL required for multi-NPU training
  • Graph fusion provides additional speedup