Architecture Analysis for Ascend NPU
You are analyzing model architecture for Ascend NPU compatibility. This skill helps identify:
- •CUDA-specific code patterns that need NPU migration
- •Model structure and Ascend AI Core compatibility
- •Distributed training requirements (DDP/FSDP → HCCL)
- •Custom operations that may lack NPU support
- •Automatic data migration compatibility
When to Use
Invoke this skill when:
- •User asks about model architecture and NPU compatibility
- •Examining a PyTorch codebase for CUDA usage
- •Identifying what needs to change for Ascend migration
- •Analyzing distributed training setup
Analysis Approach
1. Search for CUDA Patterns
bash
# Device placement
grep -rn "\.cuda()" <repo_path>
grep -rn "\.to('cuda')" <repo_path>
grep -rn "device.*['\"]cuda['\"]" <repo_path>
# CUDA APIs
grep -rn "torch\.cuda\." <repo_path>
grep -rn "torch\.cuda\.amp" <repo_path>
# Distributed training
grep -rn "DistributedDataParallel" <repo_path>
grep -rn "init_process_group" <repo_path>
2. Examine Model Architecture
Key aspects to analyze:
- •Layer types: Conv2d, Linear, Attention, Transformer blocks
- •Activation functions: Standard vs custom
- •Normalization: LayerNorm, BatchNorm usage
- •Attention mechanisms: Standard vs custom implementations
- •Special operations: Custom CUDA kernels, fused operations
3. Check for Custom Operations
Identify potential issues:
- •Flash Attention (needs NPU equivalent)
- •Custom CUDA kernels
- •Third-party libraries with CUDA extensions
- •Fused operations not standard in PyTorch
4. Distributed Training Analysis
Check for:
- •
torch.nn.parallel.DistributedDataParallel - •
torch.distributed.init_process_groupwith 'nccl' backend - •FSDP (Fully Sharded Data Parallel)
- •Model parallelism approaches
Output Format
Provide analysis in this structure:
Architecture Overview
- •Model type and characteristics
- •Key architectural components
- •Layer structure summary
CUDA Code Locations
- •List files with
.cuda()or.to('cuda')calls - •Files using
torch.cuda.*APIs - •Specific line numbers and functions
Required API Changes
- •
.cuda()→.npu()mappings with locations - •
torch.cuda.*→torch.npu.*mappings - •Distributed training: DDP → NPU DDP with HCCL
Custom Operations Analysis
- •Non-standard ops that may lack NPU support
- •Suggested NPU alternatives or workarounds
- •Operations requiring CPU fallback (performance impact)
Migration Complexity
- •Overall assessment: Low/Medium/High
- •High-risk components requiring special attention
- •Estimated effort
API Reference
Key CUDA → NPU mappings:
python
# Device
.cuda() → .npu() or .to('npu')
torch.device('cuda') → torch.device('npu')
# Properties
torch.cuda.device_count() → torch.npu.device_count()
torch.cuda.current_device() → torch.npu.current_device()
# Memory
torch.cuda.empty_cache() → torch.npu.empty_cache()
torch.cuda.memory_allocated() → torch.npu.memory_allocated()
# AMP
torch.cuda.amp.autocast → torch.npu.amp.autocast
Tools to Use
Documentation First:
- •Read official Ascend documentation before analysis:
Code Analysis:
- •Use
Readtool to read specific files - •Use
Greptool for pattern searching - •Use
Globtool to find files by patterns
Notes
- •Be thorough in identifying all CUDA references
- •Provide specific file paths and line numbers
- •Consider the impact of automatic data migration
- •Highlight any operations that may need special handling