Architecture Analysis for Ascend NPU

You are analyzing model architecture for Ascend NPU compatibility. This skill helps identify:

•CUDA-specific code patterns that need NPU migration
•Model structure and Ascend AI Core compatibility
•Distributed training requirements (DDP/FSDP → HCCL)
•Custom operations that may lack NPU support
•Automatic data migration compatibility

When to Use

Invoke this skill when:

•User asks about model architecture and NPU compatibility
•Examining a PyTorch codebase for CUDA usage
•Identifying what needs to change for Ascend migration
•Analyzing distributed training setup

Analysis Approach

1. Search for CUDA Patterns

bash

# Device placement
grep -rn "\.cuda()" <repo_path>
grep -rn "\.to('cuda')" <repo_path>
grep -rn "device.*['\"]cuda['\"]" <repo_path>

# CUDA APIs
grep -rn "torch\.cuda\." <repo_path>
grep -rn "torch\.cuda\.amp" <repo_path>

# Distributed training
grep -rn "DistributedDataParallel" <repo_path>
grep -rn "init_process_group" <repo_path>

2. Examine Model Architecture

Key aspects to analyze:

•Layer types: Conv2d, Linear, Attention, Transformer blocks
•Activation functions: Standard vs custom
•Normalization: LayerNorm, BatchNorm usage
•Attention mechanisms: Standard vs custom implementations
•Special operations: Custom CUDA kernels, fused operations

3. Check for Custom Operations

Identify potential issues:

•Flash Attention (needs NPU equivalent)
•Custom CUDA kernels
•Third-party libraries with CUDA extensions
•Fused operations not standard in PyTorch

4. Distributed Training Analysis

Check for:

•torch.nn.parallel.DistributedDataParallel
•torch.distributed.init_process_group with 'nccl' backend
•FSDP (Fully Sharded Data Parallel)
•Model parallelism approaches

Output Format

Provide analysis in this structure:

Architecture Overview

•Model type and characteristics
•Key architectural components
•Layer structure summary

CUDA Code Locations

•List files with .cuda() or .to('cuda') calls
•Files using torch.cuda.* APIs
•Specific line numbers and functions

Required API Changes

•.cuda() → .npu() mappings with locations
•torch.cuda.* → torch.npu.* mappings
•Distributed training: DDP → NPU DDP with HCCL

Custom Operations Analysis

•Non-standard ops that may lack NPU support
•Suggested NPU alternatives or workarounds
•Operations requiring CPU fallback (performance impact)

Migration Complexity

•Overall assessment: Low/Medium/High
•High-risk components requiring special attention
•Estimated effort

API Reference

Key CUDA → NPU mappings:

python

# Device
.cuda() → .npu() or .to('npu')
torch.device('cuda') → torch.device('npu')

# Properties
torch.cuda.device_count() → torch.npu.device_count()
torch.cuda.current_device() → torch.npu.current_device()

# Memory
torch.cuda.empty_cache() → torch.npu.empty_cache()
torch.cuda.memory_allocated() → torch.npu.memory_allocated()

# AMP
torch.cuda.amp.autocast → torch.npu.amp.autocast

Tools to Use

Documentation First:

•
Read official Ascend documentation before analysis:
- •https://www.hiascend.com/doc_center/source/zh/Pytorch/730/ptmoddevg/trainingmigrguide/PT_LMTMOG_0002.html
- •https://www.hiascend.com/doc_center/source/zh/canncommercial/850/API/aolapi/operatorlist_00001.html

Code Analysis:

•Use Read tool to read specific files
•Use Grep tool for pattern searching
•Use Glob tool to find files by patterns

Notes

•Be thorough in identifying all CUDA references
•Provide specific file paths and line numbers
•Consider the impact of automatic data migration
•Highlight any operations that may need special handling