Optimizing R
This skill covers profiling, benchmarking, parallelization, and performance best practices for R.
Core Principle
Profile before optimizing - Use profvis and bench to identify real bottlenecks. Write readable code first, optimize only when necessary.
Profiling Tools Decision Matrix
| Tool | Use When | Don't Use When | What It Shows |
|---|---|---|---|
profvis | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
bench::mark() | Comparing alternatives | Single approach | Relative performance, memory |
system.time() | Quick checks | Detailed analysis | Total runtime only |
Rprof() | Base R only environments | When profvis available | Raw profiling data |
Performance Workflow
- •Profile first - Find the actual bottlenecks
- •Focus on the slowest parts - 80/20 rule
- •Benchmark alternatives - For hot spots only
- •Consider tool trade-offs - Based on bottleneck type
See profiling-workflow.md for the complete workflow.
When Each Tool Helps vs Hurts
Parallel Processing (in_parallel())
Helps when:
- •CPU-intensive computations
- •Embarrassingly parallel problems
- •Large datasets with independent operations
- •I/O bound operations (file reading, API calls)
Hurts when:
- •Simple, fast operations (overhead > benefit)
- •Memory-intensive operations (may cause thrashing)
- •Operations requiring shared state
- •Small datasets
See parallel-examples.md for decision points.
Data Backend Selection
| Backend | Use When |
|---|---|
| data.table | Very large datasets (>1GB), complex grouping, maximum performance critical |
| dplyr | Readability priority, complex joins/window functions, moderate data (<100MB) |
| base R | No dependencies allowed, simple operations, teaching/learning |
See backend-selection.md for guidance.
Profiling Best Practices
- •Profile realistic data sizes - Not toy examples
- •Profile multiple runs - For stability
- •Check memory usage too - Not just time
- •Profile realistic usage patterns - Not isolated calls
See profiling-best-practices.md for examples.
Performance Anti-Patterns to Avoid
- •Don't optimize without measuring - Profile first
- •Don't over-engineer - Complex optimizations for 1% gains
- •Don't assume - "for loops are always slow" is a myth
- •Don't ignore readability costs - Readable code with targeted optimizations
See performance-anti-patterns.md for examples.
Modern purrr Patterns
Data Frame Binding (purrr 1.0+)
| Superseded | Modern Replacement |
|---|---|
map_dfr(x, f) | map(x, f) |> list_rbind() |
map_dfc(x, f) | map(x, f) |> list_cbind() |
map2_dfr(x, y, f) | map2(x, y, f) |> list_rbind() |
Side Effects with walk()
Use walk() and walk2() for side effects (file writing, plotting).
Parallel Processing (purrr 1.1.0+)
Use in_parallel() with mirai for scaling across cores.
See purrr-patterns.md for all patterns.
Backend Tools for Performance
When speed is critical, consider:
- •vctrs - Type-stable vector operations
- •rlang - Metaprogramming
- •data.table - Large data operations
Profile to identify whether these tools will help your specific bottleneck.
source: Sarah Johnson's gist https://gist.github.com/sj-io/3828d64d0969f2a0f05297e59e6c15ad