Computation Analysis for Ascend NPU

You are analyzing computation patterns for Ascend NPU performance. This skill helps identify:

•Computation-intensive operators and their locations
•CANN operator library support status
•CPU fallback operations (performance impact)
•Optimization opportunities with torch_npu
•Performance profiling approach

When to Use

Invoke this skill when:

•User asks about performance or computation
•Analyzing model operators and operations
•Looking for performance bottlenecks
•Planning optimization strategies

Analysis Approach

1. Identify Heavy Operators

Search for compute-intensive patterns:

bash

# Matrix operations
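# (a bare "@" also matches decorators and a bare "mm" matches words like "comment")
grep -rn "torch\.matmul\|torch\.mm\|torch\.bmm\| @ " <repo_path>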
grep -rn "nn\.Linear" <repo_path>

# Convolutions
grep -rn "nn\.Conv" <repo_path>

# Attention mechanisms
grep -rni "attention" <repo_path>
grep -rn "scaled_dot_product_attention" <repo_path>

# Normalization
grep -rn "LayerNorm\|BatchNorm" <repo_path>

# Activation functions
grep -rni "relu\|gelu\|silu\|softmax" <repo_path>
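The same search can be scripted so that hits are grouped by operator type with file and line locations. The following is a minimal sketch using only the Python standard library; the pattern table, the `scan_source` helper, and the sample source are illustrative assumptions, not an exhaustive operator list.

```python
import re
from collections import defaultdict

# Illustrative patterns only; extend them to match the code base under analysis.
OPERATOR_PATTERNS = {
    "matmul": re.compile(r"torch\.(?:matmul|mm|bmm)\b|\s@\s"),
    "linear": re.compile(r"\bnn\.Linear\b"),
    "conv": re.compile(r"\bnn\.Conv[123]d\b"),
    "attention": re.compile(r"scaled_dot_product_attention"),
    "norm": re.compile(r"\b(?:LayerNorm|BatchNorm[123]d)\b"),
    "activation": re.compile(r"\b(?:relu|gelu|silu|softmax)\b", re.IGNORECASE),
}


def scan_source(source, filename="<memory>"):
    """Return {operator_type: ["file:line", ...]} for one source string."""
    hits = defaultdict(list)
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop trailing comments before matching
        for op_type, pattern in OPERATOR_PATTERNS.items():
            if pattern.search(code):
                hits[op_type].append(f"{filename}:{lineno}")
    return dict(hits)


# Hypothetical module used only to demonstrate the scan.
SAMPLE = """\
import torch.nn as nn
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)
        self.norm = nn.LayerNorm(64)
    def forward(self, x, w):
        y = self.proj(x) @ w  # project then mix
        return nn.functional.gelu(self.norm(y))
"""

RESULT = scan_source(SAMPLE, "sample.py")
for op_type, locations in RESULT.items():
    print(f"{op_type}\t{len(locations)}\t{', '.join(locations)}")
```

To scan a real repository, read each `.py` file and pass its text and path to `scan_source`.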

2. CANN Operator Support

CANN (Compute Architecture for Neural Networks) provides:

•Automatic acceleration for standard PyTorch ops
•TBE (Tensor Boost Engine) operator fusion
•AI Core acceleration for matrix ops

Natively Supported (High Performance):

•Matrix multiplication (MatMul, GEMM)
•Convolutions (Conv1d, Conv2d, Conv3d)
•Standard activations (ReLU, GELU, SiLU)
•LayerNorm, BatchNorm
•Standard attention (scaled_dot_product_attention)

CPU Fallback (Low Performance):

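•Custom CUDA kernels (these cannot run on the NPU as written; they need a rewrite using standard PyTorch ops or a custom Ascend operator)
•Third-party library operations that have no NPU implementation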
•Unsupported fusion operations

Check CANN documentation for operator support status.
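Before consulting the operator list, a quick scan can flag the risk categories named above. This is a hedged sketch: the `flag_risks` helper is hypothetical, and the package names (`flash_attn`, `xformers`, `apex`, `triton`) are examples of GPU-centric third-party libraries chosen for illustration, not a list taken from Ascend documentation.

```python
import re

# Illustrative indicators; a match means "review this line", not "unsupported".
RISK_PATTERNS = [
    (re.compile(r"\b(?:CUDAExtension|load_inline)\b|cpp_extension"),
     "custom CUDA/C++ extension"),
    (re.compile(r"^\s*(?:import|from)\s+(?:flash_attn|xformers|apex|triton)\b"),
     "third-party GPU library"),
]


def flag_risks(source):
    """Return [(line_number, reason), ...] for lines that need manual review."""
    flagged = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, reason in RISK_PATTERNS:
            if pattern.search(line):
                flagged.append((lineno, reason))
    return flagged


# Hypothetical source used only to demonstrate the scan.
SAMPLE = """\
import torch
from flash_attn import flash_attn_func
from torch.utils.cpp_extension import CUDAExtension
x = torch.nn.functional.relu(y)
"""

RISKS = flag_risks(SAMPLE)
for lineno, reason in RISKS:
    print(f"line {lineno}: {reason}")
```

Each flagged line still has to be checked against the CANN operator list; the scan only narrows down where to look.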

3. Graph Optimization Opportunities

Operator Fusion:

•Combine multiple operations into single kernel
•Reduces memory transfers
•Ascend TBE compiler does automatic fusion

Identify opportunities:

•Sequential linear + activation
•Conv + batch norm + activation
•Multiple element-wise operations
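The three patterns above can be searched for mechanically once the model is reduced to an ordered list of layer class names. The sketch below assumes such a list is already available (for example, collected from `type(m).__name__` over a model's children); the `find_fusion_candidates` helper and the name sets are illustrative, and a match only marks a candidate, not a guaranteed fusion.

```python
from __future__ import annotations

# Name sets are illustrative; extend them for the model under analysis.
ACTIVATIONS = {"ReLU", "GELU", "SiLU"}
ELEMENTWISE = ACTIVATIONS | {"Dropout", "Sigmoid", "Tanh"}


def find_fusion_candidates(layers: list[str]) -> list[tuple[int, str]]:
    """Return (start_index, description) for each candidate chain."""
    found = []
    i = 0
    while i < len(layers):
        a = layers[i]
        b = layers[i + 1] if i + 1 < len(layers) else ""
        c = layers[i + 2] if i + 2 < len(layers) else ""
        if a.startswith("Conv") and b.startswith("BatchNorm") and c in ACTIVATIONS:
            # Conv + batch norm + activation
            found.append((i, f"{a} + {b} + {c}"))
            i += 3
        elif a == "Linear" and b in ACTIVATIONS:
            # Sequential linear + activation
            found.append((i, f"{a} + {b}"))
            i += 2
        elif a in ELEMENTWISE and b in ELEMENTWISE:
            # Run of two or more element-wise operations
            j = i
            while j < len(layers) and layers[j] in ELEMENTWISE:
                j += 1
            found.append((i, " + ".join(layers[i:j])))
            i = j
        else:
            i += 1
    return found


# Hypothetical layer sequence used only for demonstration.
LAYERS = ["Conv2d", "BatchNorm2d", "ReLU", "Linear", "GELU", "Sigmoid", "Tanh", "Linear"]
CANDIDATES = find_fusion_candidates(LAYERS)
for index, description in CANDIDATES:
    print(index, description)
```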

4. Automatic Mixed Precision (AMP)

Performance Benefits:

•Roughly 2-4x speedup on supported operations (workload-dependent; confirm with profiling)
•Lower memory bandwidth requirements
•Better AI Core utilization

Check for AMP usage:

python

# Existing CUDA AMP call and its NPU equivalent (available after `import torch_npu`)
torch.cuda.amp.autocast  # → torch.npu.amp.autocast

# Opportunities
# - FP32 models that can use FP16
# - Operations supporting FP16 acceleration
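Finding existing AMP call sites can be automated. The sketch below lists each CUDA AMP reference together with a suggested NPU name. The `autocast` rename comes from the mapping above; the `GradScaler` rename is an assumption that torch_npu follows the same naming, and `suggest_amp_renames` is a hypothetical helper.

```python
import re

# autocast mapping is from this document; GradScaler is assumed to follow it.
AMP_RENAMES = {
    "torch.cuda.amp.autocast": "torch.npu.amp.autocast",
    "torch.cuda.amp.GradScaler": "torch.npu.amp.GradScaler",
}
AMP_PATTERN = re.compile("|".join(re.escape(name) for name in AMP_RENAMES))


def suggest_amp_renames(source):
    """Return [(line_number, found_name, suggested_name), ...]."""
    suggestions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in AMP_PATTERN.finditer(line):
            found = match.group(0)
            suggestions.append((lineno, found, AMP_RENAMES[found]))
    return suggestions


# Hypothetical training snippet used only for demonstration.
SAMPLE = """\
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    loss = model(x)
"""

SUGGESTIONS = suggest_amp_renames(SAMPLE)
for lineno, found, suggested in SUGGESTIONS:
    print(f"line {lineno}: {found} -> {suggested}")
```

A source file with no matches is itself a finding: it marks an FP32 model where enabling AMP is an open opportunity.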

5. Distributed Training

Communication Operations:

bash

grep -rn "init_process_group\|all_reduce\|broadcast\|gather" <repo_path>

HCCL (Huawei Collective Communication Library):

•Replaces NCCL for Ascend
•Used for multi-NPU training
•Backend change required: init_process_group(backend="nccl") becomes backend="hccl"
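One way to keep a training script portable is to derive the backend name from the device type instead of hard-coding it. This is a minimal sketch; `select_backend` is a hypothetical helper, and the `gloo` entry for CPU is an addition for completeness rather than something this document prescribes.

```python
def select_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    backends = {"npu": "hccl", "cuda": "nccl", "cpu": "gloo"}
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"unknown device type: {device_type!r}") from None


print(select_backend("npu"))

# Intended use inside a training script (not executed here; assumes torch and
# torch_npu are installed so that the "hccl" backend is registered):
#   import torch.distributed as dist
#   dist.init_process_group(backend=select_backend("npu"))
```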

Output Format

Computation-Intensive Operators

List heavy operators with locations (the rows below are illustrative placeholders):

Operator Type	Count	Locations	Complexity
nn.Linear	50	model.py:23,45,67...	O(n²)
nn.Conv2d	20	model.py:10-30...	O(k²n²)
MatMul	30	attention.py:55...	O(n³)
Attention	5	attention.py:40-80	O(n²)
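The Complexity column can be backed by concrete FLOP estimates. The sketch below uses the common convention of counting one multiply-add as 2 FLOPs and ignores bias, softmax, and projection layers; the function names and the example shapes are illustrative assumptions.

```python
def linear_flops(batch, in_features, out_features):
    """FLOPs for y = x @ W^T with x of shape (batch, in_features)."""
    return 2 * batch * in_features * out_features


def conv2d_flops(batch, in_ch, out_ch, kernel, out_h, out_w, groups=1):
    """FLOPs for a square-kernel Conv2d producing (batch, out_ch, out_h, out_w)."""
    return 2 * batch * out_ch * out_h * out_w * (in_ch // groups) * kernel * kernel


def attention_flops(batch, heads, seq_len, head_dim):
    """FLOPs for Q @ K^T plus (attention weights) @ V; quadratic in seq_len."""
    return 4 * batch * heads * seq_len * seq_len * head_dim


# Example shapes are hypothetical.
print(linear_flops(8, 1024, 4096))            # 67108864
print(conv2d_flops(1, 3, 64, 3, 112, 112))    # 43352064
print(attention_flops(1, 8, 128, 64))         # 33554432
```

Ranking operators by estimated FLOPs times call count gives a first-pass ordering before any profiling is done.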

CANN Support Analysis

Natively Supported (High Performance):

•List operators with CANN acceleration
•Expected speedup vs CPU

CPU Fallback (Performance Risk):

•List operators requiring CPU execution
•Performance impact assessment
•Suggested workarounds

Optimization Recommendations

torch_npu AMP:

•Enable automatic mixed precision
•Expected speedup: 2-4x (estimate; confirm with profiling)
•Operations supporting FP16

Graph Optimization:

•Operator fusion opportunities
•Expected performance gain
•TBE compiler optimization

Distributed Training:

•HCCL communication optimization
•Gradient compression opportunities
•Overlap compute and communication

Performance Profiling

Recommended Tools:

bash

# NPU monitoring
npu-smi info  # Snapshot of NPU status (wrap in `watch -n 1` for a live view)
npu-smi info -t usages -i <npu_id>  # Memory and utilization for one NPU

# Profiling
# torch_npu.profiler.profile is a Python API: call it inside the training script, not from the shell
msprof  # CANN profiling tool

# Python profiling
python -m cProfile -o profile.out <script.py>  # Host-side Python hotspots

Profiling Approach:

•Run model with small batch
•Use npu-smi to monitor utilization
•Profile individual operations
•Identify bottlenecks
•Optimize hotspots
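The "profile individual operations" step can be sketched as a small timing harness with warm-up iterations. The `benchmark` helper below is hypothetical and uses only the standard library. Because device execution is asynchronous, an NPU run would pass a synchronization callable (assumed here to be `torch.npu.synchronize`, available after importing torch_npu) as `sync` so the timer measures completed work.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Time fn() over several iterations, returning median and min seconds."""

    def run_once():
        fn()
        if sync is not None:
            sync()  # wait for queued device work before stopping the timer

    for _ in range(warmup):  # warm-up runs are executed but not timed
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "iters": iters,
    }


# CPU-only demonstration with a placeholder workload.
stats = benchmark(lambda: sum(range(10_000)))
print(f"median {stats['median_s']:.6f}s over {stats['iters']} iterations")
```

The median is reported rather than the mean so that a single slow iteration does not distort the result.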

Expected Performance on Ascend

Based on analysis:

•Bottleneck identification: What limits performance
•Optimization priority: Rank optimizations by impact
•Expected speedup: vs GPU baseline
•Performance model: Operations/sec, memory bandwidth
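A simple way to build the performance model is a roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below is illustrative; the `roofline` helper is hypothetical, and the peak and bandwidth figures are placeholders, not specifications of any Ascend device.

```python
def roofline(peak_flops, bandwidth_bytes_per_s, flops, bytes_moved):
    """Return (attainable FLOP/s, "compute" or "memory") for one operator."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    attainable = min(peak_flops, bandwidth_bytes_per_s * intensity)
    bound = "compute" if attainable >= peak_flops else "memory"
    return attainable, bound


# Placeholder hardware numbers; substitute measured or documented values.
PEAK_FLOPS = 1e14   # 100 TFLOP/s
BANDWIDTH = 1e12    # 1 TB/s

# Low-intensity operator (20 FLOPs/byte), e.g. an element-wise chain.
print(roofline(PEAK_FLOPS, BANDWIDTH, flops=2e9, bytes_moved=1e8))

# High-intensity operator (400 FLOPs/byte), e.g. a large matrix multiply.
print(roofline(PEAK_FLOPS, BANDWIDTH, flops=4e11, bytes_moved=1e9))
```

Memory-bound operators benefit most from fusion and lower precision, while compute-bound operators benefit most from better AI Core utilization, which helps rank the optimizations by impact.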

Tools to Use

Documentation First:

•Read official Ascend documentation before analysis:
  - https://www.hiascend.com/doc_center/source/zh/Pytorch/730/ptmoddevg/trainingmigrguide/PT_LMTMOG_0002.html
  - https://www.hiascend.com/doc_center/source/zh/canncommercial/850/API/aolapi/operatorlist_00001.html

Computation Analysis:

•Use Grep to search for computation patterns
•Reference project PDFs:
  - knowledge/CANN商用版 8.5.0 算子库接口参考 01.pdf

Key Considerations

•CANN automatically accelerates standard PyTorch ops
•Custom ops require special handling
•Profiling essential for optimization
•AMP provides easy performance wins
•HCCL required for multi-NPU training
•Graph fusion provides additional speedup