Computation Analysis for Ascend NPU
You are analyzing computation patterns for Ascend NPU performance. This skill helps identify:
- Computation-intensive operators and their locations
- CANN operator library support status
- CPU fallback operations and their performance impact
- Optimization opportunities with torch_npu
- A performance profiling approach
When to Use
Invoke this skill when:
- The user asks about performance or computation
- You are analyzing model operators and operations
- You are looking for performance bottlenecks
- You are planning optimization strategies
Analysis Approach
1. Identify Heavy Operators
Search for compute-intensive patterns:
    # Matrix operations (the bare "@" and "mm" patterns match far too much, so anchor them)
    grep -rn "torch\.matmul\|torch\.mm\|torch\.bmm\| @ " <repo_path>
    grep -rn "nn\.Linear" <repo_path>
    # Convolutions
    grep -rn "nn\.Conv" <repo_path>
    # Attention mechanisms (-i also catches class names such as MultiheadAttention)
    grep -rni "attention\|scaled_dot_product" <repo_path>
    grep -rn "F\.scaled_dot_product_attention" <repo_path>
    # Normalization
    grep -rn "LayerNorm\|BatchNorm" <repo_path>
    # Activation functions (-i also catches nn.ReLU, nn.GELU, nn.SiLU)
    grep -rni "relu\|gelu\|silu\|softmax" <repo_path>
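When the grep output needs to be aggregated into counts and locations, the same search can be sketched in Python. This is a minimal illustration, not part of the skill itself: the `OPERATOR_PATTERNS` table and the `scan_source` helper are hypothetical names, and the regexes only approximate the grep patterns above.

```python
import re
from collections import defaultdict

# Hypothetical category -> regex table that approximates the grep patterns.
OPERATOR_PATTERNS = {
    "matmul": re.compile(r"torch\.(?:matmul|mm|bmm)\b|\s@\s"),
    "linear": re.compile(r"nn\.Linear\b"),
    "conv": re.compile(r"nn\.Conv[123]d\b"),
    "attention": re.compile(r"scaled_dot_product_attention|MultiheadAttention"),
    "norm": re.compile(r"LayerNorm|BatchNorm"),
    "activation": re.compile(r"\b(?:relu|gelu|silu|softmax)\b", re.IGNORECASE),
}


def scan_source(text, filename="<memory>"):
    """Return {category: ["file:line", ...]} for every matching line."""
    hits = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # crude: drop trailing comments
        for category, pattern in OPERATOR_PATTERNS.items():
            if pattern.search(code):
                hits[category].append(f"{filename}:{lineno}")
    return dict(hits)


sample = """\
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)
        self.norm = nn.LayerNorm(64)

    def forward(self, q, k, v):
        scores = q @ k.transpose(-2, -1)
        return F.gelu(self.proj(self.norm(scores @ v)))
"""

report = scan_source(sample, "block.py")
for category, locations in sorted(report.items()):
    print(f"{category:<10} {len(locations):>3}  {', '.join(locations)}")
```

To scan a real repository, the helper could be called once per `.py` file and the per-file results merged. A text scan like this only finds call sites and says nothing about tensor shapes or how often each operator runs.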
2. CANN Operator Support
CANN (Compute Architecture for Neural Networks) provides:
- Automatic acceleration for standard PyTorch ops
- TBE (Tensor Boost Engine) operator fusion
- AI Core acceleration for matrix ops
Natively Supported (High Performance):
- Matrix multiplication (MatMul, GEMM)
- Convolutions (Conv1d, Conv2d, Conv3d)
- Standard activations (ReLU, GELU, SiLU)
- LayerNorm, BatchNorm
- Standard attention (scaled_dot_product_attention)
CPU Fallback (Low Performance):
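- Custom CUDA kernels (no NPU implementation exists for these, so they need a rewrite or a native PyTorch replacement)
- Third-party library operations
- Unsupported fusion operations
Check the CANN operator library documentation for the support status of each operator.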
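A quick way to surface likely fallback or porting risks is to scan for CUDA-specific constructs. The sketch below is illustrative only: `CUDA_MARKERS` and `find_porting_risks` are hypothetical names, the marker list is not exhaustive, and some of the listed libraries may have Ascend ports that make them safe to keep.

```python
import re

# Markers that often indicate CUDA-only code paths. Illustrative, not
# exhaustive; extend the table for the repository under review.
CUDA_MARKERS = {
    "custom CUDA extension": re.compile(r"CUDAExtension|cpp_extension|load_inline"),
    "hard-coded cuda device": re.compile(r"\.cuda\(|device\s*=\s*[\"']cuda"),
    "CUDA-oriented third-party library": re.compile(
        r"^\s*(?:import|from)\s+(?:flash_attn|apex|xformers|triton)\b"
    ),
}


def find_porting_risks(text):
    """Return a sorted list of (line_number, reason) pairs for risky lines."""
    risks = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for reason, pattern in CUDA_MARKERS.items():
            if pattern.search(line):
                risks.append((lineno, reason))
    return sorted(risks)


sample = """\
from flash_attn import flash_attn_func
import torch
model = Net().cuda()
x = torch.zeros(4, device="cuda")
y = torch.relu(x)
"""

risks = find_porting_risks(sample)
for lineno, reason in risks:
    print(f"line {lineno}: {reason}")
```

Each hit is a prompt to look the operator up in the CANN reference, not proof that it will fall back to CPU.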
3. Graph Optimization Opportunities
Operator Fusion:
- Combines multiple operations into a single kernel
- Reduces memory transfers between operations
- The Ascend TBE compiler performs fusion automatically
Identify opportunities:
- Sequential linear + activation
- Conv + batch norm + activation
- Multiple element-wise operations
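The patterns listed above can be matched against an ordered list of layer type names. The following is a minimal sketch under stated assumptions: `FUSION_PATTERNS` and `find_fusion_candidates` are hypothetical names, and the input list is assumed to reflect execution order, which holds for sequential models but not in general. With PyTorch, such a list could be built from `type(m).__name__` over `model.modules()`.

```python
# Each pattern is a tuple of layer class names that are common fusion
# candidates. The table is illustrative, not a list of what TBE fuses.
FUSION_PATTERNS = [
    ("Conv2d", "BatchNorm2d", "ReLU"),
    ("Linear", "GELU"),
    ("Linear", "ReLU"),
    ("Linear", "SiLU"),
]


def find_fusion_candidates(layer_types):
    """Return (start_index, pattern) for each non-overlapping match.

    Longer patterns are tried first so that Conv + BN + ReLU is reported
    as one candidate rather than several shorter ones.
    """
    patterns = sorted(FUSION_PATTERNS, key=len, reverse=True)
    found = []
    i = 0
    while i < len(layer_types):
        for pattern in patterns:
            if tuple(layer_types[i:i + len(pattern)]) == pattern:
                found.append((i, pattern))
                i += len(pattern)
                break
        else:
            i += 1  # no pattern starts here, move on
    return found


layers = ["Conv2d", "BatchNorm2d", "ReLU", "Conv2d",
          "Linear", "GELU", "Dropout", "Linear"]
candidates = find_fusion_candidates(layers)
for start, pattern in candidates:
    print(f"index {start}: {' + '.join(pattern)}")
```

Whether a candidate is actually fused depends on the compiler and the CANN version, so the result is a list of places to check in the profile.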
4. Automatic Mixed Precision (AMP)
Performance Benefits:
- Roughly 2-4x speedup on supported operations (workload dependent, so confirm by profiling)
- Lower memory bandwidth requirements
- Better AI Core utilization
Check for AMP usage:
    # Existing
    torch.cuda.amp.autocast   # → torch.npu.amp.autocast
    # Opportunities
    # - FP32 models that can use FP16
    # - Operations supporting FP16 acceleration
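The rename above can be applied mechanically to source text. The sketch below is a hedged illustration: `AMP_REWRITES` and `rewrite_amp_calls` are hypothetical names, and the `GradScaler` mapping assumes that the installed torch_npu exposes `torch.npu.amp.GradScaler` alongside `autocast`, which should be confirmed against the torch_npu documentation.

```python
import re

# CUDA AMP names and the NPU names they are assumed to map to once
# torch_npu has been imported. Verify both against your torch_npu version.
AMP_REWRITES = {
    "torch.cuda.amp.autocast": "torch.npu.amp.autocast",
    "torch.cuda.amp.GradScaler": "torch.npu.amp.GradScaler",
}


def rewrite_amp_calls(source):
    """Return (new_source, replacement_count) for fully qualified AMP names."""
    total = 0
    for old, new in AMP_REWRITES.items():
        source, count = re.subn(re.escape(old) + r"\b", new, source)
        total += count
    return source, total


before = (
    "scaler = torch.cuda.amp.GradScaler()\n"
    "with torch.cuda.amp.autocast():\n"
    "    loss = model(x)\n"
)
after, count = rewrite_amp_calls(before)
print(after)
print(f"{count} replacement(s)")
```

Only fully qualified names are rewritten. Imports such as `from torch.cuda.amp import autocast` are left untouched and need a manual pass.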
5. Distributed Training
Communication Operations:
grep -rn "all_reduce\|broadcast\|gather" <repo_path>
HCCL (Huawei Collective Communication Library):
- Replaces NCCL on Ascend
- Used for multi-NPU training
- Requires changing the process group backend from nccl to hccl
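The backend change usually comes down to the string passed when the process group is created. A minimal sketch, assuming a hypothetical helper named `pick_backend` and the conventional backend names `hccl`, `nccl`, and `gloo`:

```python
def pick_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    backends = {"npu": "hccl", "cuda": "nccl", "cpu": "gloo"}
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"unsupported device type: {device_type!r}") from None


# In a training script the result would feed init_process_group, e.g.:
#   import torch.distributed as dist
#   import torch_npu  # assumed installed; makes the "hccl" backend available
#   dist.init_process_group(backend=pick_backend("npu"))
for device in ("npu", "cuda", "cpu"):
    print(device, "->", pick_backend(device))
```

Code that hard-codes `backend="nccl"` is the main thing to look for in the grep results above.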
Output Format
Computation-Intensive Operators
List heavy operators with locations:
| Operator Type | Count | Locations | Complexity |
|---|---|---|---|
| nn.Linear | 50 | model.py:23,45,67... | O(n²) |
| nn.Conv2d | 20 | model.py:10-30... | O(k²n²) |
| MatMul | 30 | attention.py:55... | O(n³) |
| Attention | 5 | attention.py:40-80 | O(n²) |
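The Complexity column can be backed by rough FLOP estimates per operator. The formulas below are the usual back-of-the-envelope counts, with one multiply-add counted as 2 FLOPs and bias, softmax, and padding effects ignored. The function names and example shapes are illustrative assumptions.

```python
def linear_flops(batch, in_features, out_features):
    """FLOPs for y = x @ W.T with x of shape (batch, in_features)."""
    return 2 * batch * in_features * out_features


def conv2d_flops(batch, c_in, c_out, kernel, h_out, w_out):
    """FLOPs for a square-kernel Conv2d producing an h_out x w_out map."""
    return 2 * batch * c_in * c_out * kernel * kernel * h_out * w_out


def attention_flops(batch, heads, seq_len, head_dim):
    """FLOPs for Q @ K^T plus scores @ V (two matmuls of equal cost)."""
    return 2 * 2 * batch * heads * seq_len * seq_len * head_dim


print(f"Linear 1024->4096, batch 8:     {linear_flops(8, 1024, 4096):,}")
print(f"Conv2d 3->64, 3x3, 224x224 out: {conv2d_flops(1, 3, 64, 3, 224, 224):,}")
print(f"Attention 8 heads, seq 128:     {attention_flops(1, 8, 128, 64):,}")
```

Estimates like these help rank the table rows by expected cost before any profiling is done.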
CANN Support Analysis
Natively Supported (High Performance):
- List operators with CANN acceleration
- Expected speedup vs CPU
CPU Fallback (Performance Risk):
- List operators requiring CPU execution
- Performance impact assessment
- Suggested workarounds
Optimization Recommendations
torch_npu AMP:
- Enable automatic mixed precision
- Expected speedup: roughly 2-4x on supported operations, to be confirmed by profiling
- Operations supporting FP16
Graph Optimization:
- Operator fusion opportunities
- Expected performance gain
- TBE compiler optimization
Distributed Training:
- HCCL communication optimization
- Gradient compression opportunities
- Overlap of compute and communication
Performance Profiling
Recommended Tools:
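    # NPU monitoring
    npu-smi info                  # Real-time NPU status
    npu-smi info -t usages -i 0   # Memory and utilization for device 0
    # Profiling
    torch_npu.npu.profile         # Profile NPU operations (newer releases provide torch_npu.profiler.profile)
    msprof                        # CANN profiling tool
    # Python profiling
    python -m torch_npu.testing   # Benchmark utilities (confirm this entry point exists in your torch_npu version)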
Profiling Approach:
- Run the model with a small batch
- Use npu-smi to monitor utilization
- Profile individual operations
- Identify bottlenecks
- Optimize hotspots, then re-profile to confirm the gain
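For the "profile individual operations" step, a small timing harness is often enough before reaching for a full profiler. This is a sketch under stated assumptions: `benchmark` is a hypothetical helper, and on an NPU the `sync` argument would need to be a blocking call such as `torch.npu.synchronize` (assumed available once torch_npu is imported), because device work is queued asynchronously.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, iters=10, sync=None):
    """Return the median wall-clock seconds per call of fn().

    sync is an optional zero-argument callable that blocks until queued
    device work has finished. Without it, timings of asynchronous device
    ops only measure launch overhead.
    """
    for _ in range(warmup):  # warm-up runs absorb compilation and caching
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only stand-in workload so the sketch runs anywhere.
median_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_s * 1e6:.1f} us per call")
```

The median is used instead of the mean so that a single slow iteration does not skew the result.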
Expected Performance on Ascend
Based on analysis:
- Bottleneck identification: what limits performance
- Optimization priority: optimizations ranked by impact
- Expected speedup: relative to a GPU baseline
- Performance model: operations per second and memory bandwidth
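A simple roofline estimate is one way to express the performance model and to decide whether an operator is limited by compute or by memory bandwidth. In the sketch below, `roofline` is a hypothetical helper and the peak figures are placeholders, not Ascend specifications. Substitute the numbers for the target device.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (estimated_seconds, bound) under a simple roofline model.

    The estimate is the larger of the pure compute time and the pure
    memory-transfer time, assuming the two overlap perfectly.
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    if compute_time >= memory_time:
        return compute_time, "compute-bound"
    return memory_time, "memory-bound"


# Placeholder hardware numbers, NOT real Ascend specifications.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

n = 4096
fp16_bytes = 2

# Square matmul: 2*n^3 FLOPs, reads two matrices and writes one.
matmul = roofline(2 * n**3, 3 * n * n * fp16_bytes, PEAK_FLOPS, PEAK_BANDWIDTH)
# Element-wise add: n^2 FLOPs, same memory traffic as the matmul.
add = roofline(n * n, 3 * n * n * fp16_bytes, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"matmul: {matmul[0] * 1e3:.3f} ms, {matmul[1]}")
print(f"add:    {add[0] * 1e3:.3f} ms, {add[1]}")
```

Memory-bound operators benefit most from fusion and lower-precision data types, while compute-bound operators benefit most from keeping the AI Cores busy.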
Tools to Use
Documentation First:
- Read the official Ascend documentation before analysis.
Computation Analysis:
- Use Grep to search for computation patterns
- Reference project PDFs:
  - knowledge/CANN商用版 8.5.0 算子库接口参考 01.pdf (CANN Commercial Edition 8.5.0 Operator Library API Reference 01)
Key Considerations
- CANN automatically accelerates standard PyTorch ops
- Custom ops require special handling
- Profiling is essential for optimization
- AMP provides easy performance wins
- HCCL is required for multi-NPU training
- Graph fusion provides additional speedup