Analyze SIMD Usage Opportunities

Name: analyze-simd-usage
Rating: 92
Author: HomericIntelligence

Identify loops and operations where SIMD (Single Instruction, Multiple Data) vectorization can improve performance.

When to Use

•Performance-critical tensor operations
•Element-wise operations on large arrays
•Vectorization of loops processing multiple elements
•Optimizing matrix/vector operations
•Finding performance bottlenecks in ML code

Quick Reference

bash

# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"
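
The last grep pipeline can also be approximated in a small script when the matches need post-processing. The following is a minimal sketch in Python (chosen so it runs without a Mojo toolchain); `find_candidate_loops` and its `window` parameter are hypothetical names, and the ten-line window simply mirrors `-A 10` above, so it can pick up indexing that belongs to code after the loop body.

```python
import re

# Matches headers such as "for i in range(n):" and captures the loop variable.
LOOP_RE = re.compile(r"^\s*for\s+(\w+)\s+in\s+range\(.*\):")


def find_candidate_loops(source: str, window: int = 10) -> list[tuple[int, str]]:
    """Return (line_number, loop_header) for range loops that are followed,
    within `window` lines, by an access indexed with the loop variable."""
    lines = source.splitlines()
    hits = []
    for idx, line in enumerate(lines):
        match = LOOP_RE.match(line)
        if not match:
            continue
        loop_var = re.escape(match.group(1))
        index_re = re.compile(r"\w+\[" + loop_var + r"\]")
        following = lines[idx + 1 : idx + 1 + window]
        if any(index_re.search(text) for text in following):
            hits.append((idx + 1, line.strip()))  # 1-based line number
    return hits
```

Reading each `.mojo` file and passing its text to this function yields the same kind of candidate list as the grep pipeline, but as data that can be sorted or filtered.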

SIMD Optimization Opportunities

Vectorizable Patterns:

•✅ Element-wise addition: a[i] + b[i] for all i
•✅ Scalar multiplication: a[i] * scalar for all i
•✅ Unary operations: sin(a[i]), exp(a[i]) for all i
•✅ Reduction operations: sum, max, min over array
•❌ Dependent iterations: a[i] = a[i-1] + value (sequential)
•❌ Conditional branches: if a[i] > threshold: (hard to vectorize unless rewritten as a compare and select)
•❌ Non-inlined function calls: unpredictable latency (avoid in tight loops)
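
A branch inside a loop can often be turned into a vectorizable form by computing a mask for a whole chunk and then selecting per lane. The sketch below models this in plain Python, with list slices standing in for SIMD lanes; it does not use SIMD hardware, and `clamp_scalar` and `clamp_select` are hypothetical names used only for illustration.

```python
def clamp_scalar(values, threshold):
    """Branching form: one decision per element."""
    out = []
    for v in values:
        if v > threshold:
            out.append(threshold)
        else:
            out.append(v)
    return out


def clamp_select(values, threshold, width=4):
    """Branch-free form: compare a whole chunk, then select lane by lane."""
    out = []
    for start in range(0, len(values), width):
        lanes = values[start:start + width]
        mask = [v > threshold for v in lanes]  # one compare per chunk
        out.extend(threshold if m else v for m, v in zip(mask, lanes))
    return out
```

Both functions return the same list; the second has the shape that maps onto a SIMD compare followed by a select.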

SIMD Width Selection:

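•fn kernel[simd_width: Int] - Generic over SIMD width (kernel is a placeholder name)
•simd_width=4 - float32 on 128-bit registers (SSE, NEON), or float64 on AVX2
•simd_width=8 - float32 on 256-bit AVX2, or float64 on AVX-512
•simd_width=16+ - float32/int32 on AVX-512, or narrower types such as int16 and int8
•Match hardware capabilities: lanes = register bits / element bits (AVX2: 4 float64 or 8 float32; AVX-512: 8 float64 or 16 float32)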
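
The width that matches the hardware follows from dividing the register size by the element size. A minimal sketch of that arithmetic in Python follows; the lookup tables cover only a few common instruction sets and types, and `lanes` is a hypothetical helper name.

```python
# Register width in bits for a few common SIMD instruction sets.
REGISTER_BITS = {"sse": 128, "neon": 128, "avx2": 256, "avx512": 512}

# Element width in bits for a few common element types.
ELEMENT_BITS = {"float64": 64, "float32": 32, "int32": 32, "int16": 16, "int8": 8}


def lanes(isa: str, dtype: str) -> int:
    """Number of elements of `dtype` that fit in one register of `isa`."""
    return REGISTER_BITS[isa] // ELEMENT_BITS[dtype]
```

For example, `lanes("avx2", "float32")` is 8 and `lanes("avx512", "float64")` is also 8, which is why a single fixed width rarely suits every element type.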

Vectorization Patterns:

•✅ vectorize (from the algorithm module) for simple loops
•✅ @unroll for small loops (2-4 iterations); newer Mojo releases use @parameter for instead
•✅ Manual SIMD with .load[] and .store[]
•✅ Tensor operations with SIMD dimensions

Analysis Workflow

•Profile code: Identify bottlenecks using time/memory metrics
•Find loops: Locate loops processing large amounts of data
•Check vectorizability: Verify no loop-carried dependencies
•Estimate speedup: The upper bound is the lane count (typically 4-16x); memory-bound loops and scalar remainders reduce it
•Implement SIMD: Use vectorize, @unroll, or manual SIMD
•Measure performance: Verify improvement with benchmarks
•Document changes: Note what was optimized and why
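
For the speedup estimate, one common approximation is Amdahl's law: only the vectorizable fraction of the runtime is divided by the lane count. This is an assumption about how to estimate, not a method the workflow prescribes, and it ignores memory bandwidth limits. `estimated_speedup` is a hypothetical helper name.

```python
def estimated_speedup(vector_fraction: float, lanes: int) -> float:
    """Amdahl-style estimate of overall speedup.

    vector_fraction: share of runtime spent in the vectorizable loop (0.0-1.0).
    lanes: number of SIMD lanes, treated as the ideal per-loop speedup.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)
```

A loop that accounts for half of the runtime and runs on 8 lanes gives an overall estimate below 2x, which is a useful check before spending time on the implementation step.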

Output Format

Report SIMD analysis with:

•Hotspots - Functions/loops using most CPU time
•Vectorization Potential - Operations that could use SIMD
•Estimated Speedup - Expected performance improvement
•Implementation Priority - High/medium/low impact
•Technical Approach - How to implement SIMD
•Risks - Potential issues with vectorization
•Recommendations - Which optimizations to pursue first
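
The report fields above can be kept as structured data so that recommendations fall out of the priorities. The following is one possible shape, sketched in Python; the class name, field names, and the `recommend` helper are all hypothetical and only mirror the bullet list.

```python
from dataclasses import dataclass, field


@dataclass
class SimdFinding:
    hotspot: str                  # function or loop using the most CPU time
    vectorization_potential: str  # operation that could use SIMD
    estimated_speedup: str        # expected improvement, e.g. "about 4x"
    priority: str                 # "high", "medium", or "low"
    approach: str                 # how to implement SIMD
    risks: list[str] = field(default_factory=list)


def recommend(findings: list[SimdFinding]) -> list[str]:
    """Order hotspots so that high-priority findings come first."""
    order = {"high": 0, "medium": 1, "low": 2}
    ranked = sorted(findings, key=lambda f: order[f.priority])
    return [f.hotspot for f in ranked]
```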

Optimization Examples

Example 1: Element-wise Addition

mojo

# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result

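# After: vectorized
# The kernel is nested so that it can capture a, b, and result.
# Assumes `from algorithm import vectorize`; the order of vectorize's
# parameters has changed between Mojo releases, so check your toolchain.
fn add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)

    @parameter
    fn add_simd[simd_width: Int](i: Int):
        result._data.store[simd_width](i,
            a._data.load[simd_width](i) + b._data.load[simd_width](i))

    # Width 8 is a placeholder; vectorize also handles the tail elements.
    # Expect up to 4x-8x on compute-bound data; measure to confirm.
    vectorize[add_simd, 8](a.num_elements())
    return result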

Example 2: Reduction (Sum)

mojo

# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total

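# After: SIMD reduction
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    var n = tensor.num_elements()
    var acc = SIMD[DType.float32, simd_width](0)
    var i = 0
    # Process simd_width elements at a time
    while i + simd_width <= n:
        acc += tensor._data.load[simd_width](i)
        i += simd_width
    # Reduce the lanes, then add the scalar tail
    var total = acc.reduce_add()
    while i < n:
        total += tensor._data[i]
        i += 1
    # Summation order differs from the scalar loop, so results can differ by rounding
    return total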
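
A reference model is useful when checking a SIMD reduction against the scalar version, especially for lengths that are not a multiple of the width. The sketch below models lane-wise accumulators in plain Python; it illustrates the structure (main loop, horizontal reduce, scalar tail) and is not a performance implementation. `sum_lanes` is a hypothetical name.

```python
def sum_lanes(data, width=8):
    """Sum `data` using `width` independent accumulators plus a scalar tail."""
    acc = [0.0] * width
    main = len(data) - len(data) % width  # largest multiple of width
    for start in range(0, main, width):
        for lane in range(width):
            acc[lane] += data[start + lane]  # each lane accumulates independently
    total = sum(acc)       # horizontal reduction across lanes
    for x in data[main:]:  # leftover elements that do not fill a chunk
        total += x
    return total
```

Comparing this against the built-in `sum` with a tolerance (for example `math.isclose`) gives a quick correctness check, since floating-point results can differ slightly when the summation order changes.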

Error Handling

Problem	Solution
Vectorization causes wrong results	Check for loop-carried dependencies
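Segmentation fault with SIMD	Verify alignment and bounds, including the tail when the length is not a multiple of the width
Minimal speedup	Loop may be memory-bound or not actually vectorized; profile to confirm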
Complex logic	Break into simpler vectorizable operations
Type mismatches	Ensure SIMD width compatible with element type
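
The first row of the table can be reproduced on a small case. The sketch below, in plain Python with hypothetical function names, shows how processing a loop-carried recurrence in chunks gives wrong results: every lane in a chunk loads its predecessor before any lane has stored, so all but the first lane read stale values.

```python
def prefix_scalar(a, value):
    """Sequential recurrence: out[i] depends on the updated out[i - 1]."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + value
    return out


def prefix_naive_chunks(a, value, width=4):
    """Incorrect chunked version: loads for a whole chunk happen before stores."""
    out = list(a)
    for start in range(1, len(out), width):
        stop = min(start + width, len(out))
        prev = out[start - 1:stop - 1]  # stale values for all but the first lane
        out[start:stop] = [p + value for p in prev]
    return out
```

With `a = [0, 0, 0, 0, 0, 0]` and `value = 1`, the scalar version produces `[0, 1, 2, 3, 4, 5]` while the chunked version produces `[0, 1, 1, 1, 1, 2]`. A differential test of this kind is a cheap way to catch a missed dependency.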

SIMD Decision Tree

•Does loop process large arrays? → YES → Check vectorizability
•Loop-carried dependencies? → YES → Can't vectorize, optimize differently
•Simple operations on many elements? → YES → Use vectorize or @unroll
•Critical path (hot loop)? → YES → Worth optimizing
•Implement → Measure → Iterate
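
The tree can be written down as a small function, which makes the order of the checks explicit. The list above only gives the YES branches, so the NO outcomes in this Python sketch are assumptions, as are the function name and the wording of the returned strings.

```python
def simd_decision(large_arrays: bool, loop_carried_dependency: bool,
                  simple_elementwise: bool, hot_loop: bool) -> str:
    """Walk the decision tree and return a short recommendation."""
    if not large_arrays:
        return "skip: too little data to benefit"          # assumed NO branch
    if loop_carried_dependency:
        return "do not vectorize: optimize differently"
    if not hot_loop:
        return "low priority: not on the critical path"    # assumed NO branch
    if simple_elementwise:
        return "vectorize: implement, measure, iterate"
    return "investigate: break into simpler operations"    # assumed NO branch
```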

References

•See mojo-simd-optimize for implementation guidance
•See CLAUDE.md for SIMD code patterns
•See performance section in module documentation