Tidymodels Overview
The tidymodels ecosystem provides a consistent, modular framework for machine learning in R. Understanding the ecosystem context helps when working with any tidymodels pipeline before diving into package-specific details.
Core Principle: Recipes Are Plans, Not Actions
Critical: A recipe object is a specification of preprocessing steps. Adding steps like step_normalize() does not transform data immediately. Transformations execute only when:
- •
prep()estimates parameters from training data - •
bake()applies the prepped recipe to new data
# This does NOT transform data - it creates a plan rec <- recipe(outcome ~ ., data = train) |> step_normalize(all_numeric_predictors()) # This estimates parameters (means, sds) from training data prepped <- prep(rec, training = train) # This applies transformations to new data processed <- bake(prepped, new_data = test)
The Tidymodels Workflow
Follow this standard workflow for modeling projects:
1. Data Splitting (rsample)
Allocate data to training, validation, and test sets before any modeling:
set.seed(123) data_split <- initial_split(data, prop = 0.8, strata = outcome) train_data <- training(data_split) test_data <- testing(data_split) # For iterative evaluation during development resamples <- vfold_cv(train_data, v = 10)
2. Preprocessing (recipes)
Define feature engineering as a recipe specification:
rec_spec <- recipe(outcome ~ ., data = train_data) |> step_normalize(all_numeric_predictors()) |> step_dummy(all_factor_predictors()) |> step_zv(all_predictors())
Use tidyselect helpers for column selection:
- •
all_predictors(),all_outcomes()- by role - •
all_numeric_predictors(),all_nominal_predictors()- by type and role - •
has_role(),has_type()- explicit queries
3. Model Specification (parsnip)
Define the model type, engine, and mode separately from fitting:
model_spec <- rand_forest(mtry = tune(), trees = 1000) |>
set_engine("ranger") |>
set_mode("regression")
4. Bundling (workflows)
Combine preprocessing and model into a single object:
wflow <- workflow() |> add_recipe(rec_spec) |> add_model(model_spec)
5. Evaluation (tune + yardstick)
Use resampling or validation sets to assess performance:
# Define metrics metrics <- metric_set(rmse, rsq, mae) # Tune hyperparameters tuned <- tune_grid( wflow, resamples = resamples, grid = 10, metrics = metrics ) # Select best parameters best_params <- select_best(tuned, metric = "rmse")
6. Finalization
Finalize the workflow and fit to full training data:
final_wflow <- finalize_workflow(wflow, best_params) final_fit <- last_fit(final_wflow, split = data_split) # Extract test set metrics collect_metrics(final_fit)
Package Roles
| Package | Purpose | Key Functions |
|---|---|---|
| rsample | Data splitting and resampling | initial_split(), vfold_cv(), bootstraps() |
| recipes | Preprocessing specification | recipe(), step_*(), prep(), bake() |
| parsnip | Model specification | Model functions, set_engine(), set_mode() |
| workflows | Bundle recipe + model | workflow(), add_recipe(), add_model() |
| tune | Hyperparameter optimization | tune_grid(), tune_bayes(), select_best() |
| yardstick | Performance metrics | metric_set(), rmse(), accuracy() |
| workflowsets | Compare multiple pipelines | workflow_set(), workflow_map() |
| stacks | Model ensembling | stacks(), add_candidates(), blend_predictions() |
| hardhat | Internal infrastructure | mold(), forge(), blueprints |
Key Principles
Use Package Functions, Not Direct Access
Never directly modify tidymodels object internals. Always use provided functions:
# WRONG - directly modifying internals recipe_obj$steps[[1]]$means <- new_means # CORRECT - use proper functions rec <- recipe(...) |> step_normalize(...) |> prep()
Use Selectors, Not String Matching
Avoid constructing variable lists manually:
# WRONG - manual string matching numeric_cols <- names(data)[sapply(data, is.numeric)] rec |> step_normalize(all_of(numeric_cols)) # CORRECT - use tidyselect helpers rec |> step_normalize(all_numeric_predictors())
Understand Role Requirements
Custom roles are required at bake() time by default. When using step_rm() with custom roles, update requirements:
rec <- recipe(...) |>
update_role(id_column, new_role = "id") |>
update_role_requirements("id", bake = FALSE) |>
step_rm(has_role("id"))
workflowsets Require Same Outcome
All workflows in a workflow_set must predict the same outcome variable. For different outcomes, create separate workflow sets.
When to Use Each Package
- •Simple model: recipes + parsnip + workflows
- •Hyperparameter tuning: Add tune
- •Model comparison: Add workflowsets
- •Ensemble models: Add stacks (requires
save_pred = TRUE,save_workflow = TRUE) - •Custom preprocessing interfaces: Use hardhat
Additional Resources
Reference Files
For detailed information, consult:
- •
references/packages.md- Detailed package documentation including object structures, creation processes, and deep knowledge links - •
references/common-problems.md- Common pitfalls when working with tidymodels and how to avoid them
External Documentation
- •tidymodels.org - Official documentation and tutorials
- •recipes.tidymodels.org - Recipe step reference
- •parsnip.tidymodels.org - Model specifications