
optimizing-r

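Performance profiling, benchmarking, and optimization strategies for R. This skill helps when your code runs slowly, when you need to compare several implementations, when you are weighing dplyr, data.table, and base R against each other, or when you want to introduce parallel processing. It covers using profvis and bench, the performance optimization workflow, parallel computation with in_parallel(), data backend selection, modern purrr patterns (list_rbind, walk), and common performance anti-patterns to avoid.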

SKILL.md
---
name: optimizing-r
description: |
  R performance profiling, benchmarking, and optimization strategies. Use this skill when code is running slowly, when comparing alternative implementations, when deciding between dplyr, data.table, and base R, or when implementing parallel processing. Covers profvis and bench usage, performance workflow, parallel processing with in_parallel(), data backend selection, modern purrr patterns (list_rbind, walk), and common performance anti-patterns to avoid.
---

Optimizing R

This skill covers profiling, benchmarking, parallelization, and performance best practices for R.

Core Principle

Profile before optimizing - Use profvis and bench to identify real bottlenecks. Write readable code first, and optimize only when measurement shows it is necessary.

Profiling Tools Decision Matrix

| Tool | Use When | Don't Use When | What It Shows |
| --- | --- | --- | --- |
| profvis | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
| bench::mark() | Comparing alternatives | Single approach | Relative performance, memory |
| system.time() | Quick checks | Detailed analysis | Total runtime only |
| Rprof() | Base R only environments | When profvis available | Raw profiling data |
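
As a rough illustration of the two lightest tools in the matrix, the sketch below times one call with system.time() and then compares two implementations with bench::mark(). The functions sum_sq_loop and sum_sq_vec are hypothetical stand-ins, and the example assumes the bench package is installed.

```r
library(bench)

# Two hypothetical implementations of the same task: summing squares.
sum_sq_loop <- function(x) {
  total <- 0
  for (v in x) total <- total + v^2
  total
}
sum_sq_vec <- function(x) sum(x^2)

x <- runif(1e5)

# Quick check: total runtime of a single call, nothing more.
system.time(sum_sq_loop(x))

# Comparing alternatives: timing and memory allocation side by side.
# By default bench::mark() also checks that the results are equal.
results <- bench::mark(
  loop = sum_sq_loop(x),
  vectorized = sum_sq_vec(x)
)
results[, c("expression", "median", "mem_alloc")]
```

The median and mem_alloc columns are usually the most useful ones to read first; the exact numbers depend on the machine and R version.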

Performance Workflow

  1. Profile first - Find the actual bottlenecks
  2. Focus on the slowest parts - 80/20 rule
  3. Benchmark alternatives - For hot spots only
  4. Consider tool trade-offs - Based on bottleneck type

See profiling-workflow.md for the complete workflow.
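
The first two steps of the workflow can be sketched with base R's Rprof(), which needs no extra packages. The functions slow_step, fast_step, and pipeline are hypothetical and exist only to give the profiler something to find; with profvis installed, profvis::profvis(pipeline(3e4)) would show the same information interactively.

```r
# Hypothetical workload with one deliberately slow step.
slow_step <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, sqrt(i))  # grows the vector on every pass
  out
}
fast_step <- function(n) sqrt(seq_len(n))

pipeline <- function(n) {
  a <- slow_step(n)
  b <- fast_step(n)
  sum(a) + sum(b)
}

# 1. Profile first: sample the call stack every 10 ms while the code runs.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
value <- pipeline(3e4)
Rprof(NULL)

# 2. Focus on the slowest parts: by.total ranks functions by inclusive time.
prof <- summaryRprof(prof_file)
head(prof$by.total)
```

Only once the summary points at a specific function (here, most likely slow_step) is it worth benchmarking alternatives for it.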

When Each Tool Helps vs Hurts

Parallel Processing (in_parallel())

Helps when:

  • CPU-intensive computations
  • Embarrassingly parallel problems
  • Large datasets with independent operations
  • I/O bound operations (file reading, API calls)

Hurts when:

  • Simple, fast operations (overhead > benefit)
  • Memory-intensive operations (may cause thrashing)
  • Operations requiring shared state
  • Small datasets

See parallel-examples.md for decision points.

Data Backend Selection

| Backend | Use When |
| --- | --- |
| data.table | Very large datasets (>1GB), complex grouping, maximum performance critical |
| dplyr | Readability priority, complex joins/window functions, moderate data (<100MB) |
| base R | No dependencies allowed, simple operations, teaching/learning |

See backend-selection.md for guidance.
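
To make the trade-off concrete, here is the same grouped mean written against each backend, using the built-in mtcars data. This is only a sketch, assuming dplyr and data.table are installed; on data this small the three are indistinguishable in speed, so the point is the difference in syntax and dependencies.

```r
library(dplyr)
library(data.table)

# Base R: no dependencies.
base_res <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: reads as a pipeline of verbs.
dplyr_res <- mtcars |>
  group_by(cyl) |>
  summarise(mpg = mean(mpg))

# data.table: compact syntax aimed at large data and fast grouping.
dt_res <- as.data.table(mtcars)[, .(mpg = mean(mpg)), by = cyl][order(cyl)]
```

Benchmarking these on data of realistic size is what should decide the backend, not the example above.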

Profiling Best Practices

  1. Profile realistic data sizes - Not toy examples
  2. Profile multiple runs - For stability
  3. Check memory usage too - Not just time
  4. Profile realistic usage patterns - Not isolated calls

See profiling-best-practices.md for examples.
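
One way to apply the first two practices with base R alone is to time the same function at several input sizes and take the median of repeated runs. The function dedupe_sorted and the helper median_elapsed are hypothetical names used for illustration.

```r
# Hypothetical function under test.
dedupe_sorted <- function(x) sort(unique(x))

# Run a zero-argument function several times and keep the median elapsed
# time, which is less sensitive to one-off pauses than a single run.
median_elapsed <- function(fun, runs = 5) {
  times <- vapply(
    seq_len(runs),
    function(i) system.time(fun())[["elapsed"]],
    numeric(1)
  )
  median(times)
}

# Repeat at several input sizes rather than on a toy example only.
sizes <- c(1e3, 1e5, 1e6)
timings <- vapply(sizes, function(n) {
  x <- sample.int(n, n, replace = TRUE)
  median_elapsed(function() dedupe_sorted(x))
}, numeric(1))

data.frame(n = sizes, median_seconds = timings)
```

This covers time only; for memory, the mem_alloc column of bench::mark() or the memory view in profvis is the usual place to look.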

Performance Anti-Patterns to Avoid

  • Don't optimize without measuring - Profile first
  • Don't over-engineer - Avoid complex optimizations that yield 1% gains
  • Don't assume - "for loops are always slow" is a myth
  • Don't ignore readability costs - Prefer readable code with targeted optimizations

See performance-anti-patterns.md for examples.
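
The "for loops are always slow" myth is easy to test. In the sketch below (all function names are hypothetical), the slow version is slow because it grows a vector inside the loop, not because it uses a loop; the preallocated loop is typically comparable to vapply().

```r
n <- 1e4

# Slow: the vector is reallocated and copied on every iteration.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Fine: a for loop writing into a preallocated vector.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Also fine: vapply() with a declared return type.
functional <- function(n) vapply(seq_len(n), function(i) i^2, numeric(1))

# All three produce the same result; only the cost differs.
stopifnot(
  identical(grow(n), prealloc(n)),
  identical(prealloc(n), functional(n))
)

system.time(grow(n))
system.time(prealloc(n))
system.time(functional(n))
```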

Modern purrr Patterns

Data Frame Binding (purrr 1.0+)

| Superseded | Modern Replacement |
| --- | --- |
| `map_dfr(x, f)` | `map(x, f) \|> list_rbind()` |
| `map_dfc(x, f)` | `map(x, f) \|> list_cbind()` |
| `map2_dfr(x, y, f)` | `map2(x, y, f) \|> list_rbind()` |
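
A minimal sketch of the first replacement, assuming purrr 1.0 or later is installed. The helper summarise_group is hypothetical; the names_to argument of list_rbind() takes over the role that .id played in map_dfr().

```r
library(purrr)

# Hypothetical per-group summary that returns a one-row data frame.
summarise_group <- function(df) {
  data.frame(n = nrow(df), mean_mpg = mean(df$mpg))
}

groups <- split(mtcars, mtcars$cyl)

# Superseded: map_dfr(groups, summarise_group, .id = "cyl")
# Modern: map() first, then bind the list of data frames by row.
by_cyl <- map(groups, summarise_group) |>
  list_rbind(names_to = "cyl")

by_cyl
```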

Side Effects with walk()

Use walk() and walk2() for side effects (file writing, plotting).
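
For example, writing one CSV file per group is a pure side effect, so walk2() fits better than map2(). This sketch assumes purrr is installed and writes to a temporary directory; the file naming scheme is made up for illustration.

```r
library(purrr)

groups <- split(mtcars, mtcars$cyl)

# Write into a fresh temporary directory so nothing is overwritten.
out_dir <- tempfile("cyl_csv_")
dir.create(out_dir)
paths <- file.path(out_dir, paste0("cyl_", names(groups), ".csv"))

# walk2() calls the function for its side effect (writing a file) and
# returns its first input invisibly, so no list of results is collected.
walk2(groups, paths, \(df, path) write.csv(df, path, row.names = FALSE))

list.files(out_dir)
```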

Parallel Processing (purrr 1.1.0+)

Use in_parallel() with mirai for scaling across cores.
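
A minimal sketch of that pattern, assuming purrr 1.1.0 or later and mirai are installed. The function slow_square is a hypothetical stand-in for real work, and two workers is an arbitrary choice.

```r
library(purrr)
library(mirai)

# Hypothetical slow task; Sys.sleep() stands in for real computation.
slow_square <- function(x) {
  Sys.sleep(0.2)
  x^2
}

daemons(2)  # start two background worker processes

# The function runs in a separate process, so anything it depends on
# must be passed in explicitly; here slow_square is supplied by name.
squares <- map_dbl(
  1:4,
  in_parallel(\(x) slow_square(x), slow_square = slow_square)
)

daemons(0)  # shut the workers down again

squares
```

Starting workers and shipping data to them has a cost, so for fast per-element functions the sequential map() is often quicker.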

See purrr-patterns.md for all patterns.

Backend Tools for Performance

When speed is critical, consider:

  • vctrs - Type-stable vector operations
  • rlang - Metaprogramming
  • data.table - Large data operations

Profile to identify whether these tools will help your specific bottleneck.

source: Sarah Johnson's gist https://gist.github.com/sj-io/3828d64d0969f2a0f05297e59e6c15ad