2  Models

Caution

This chapter is under active construction! You may be better off checking out other chapters in the meantime.

At a high level, this chapter aims to help readers build intuition about how long various model types and engines take to fit. Our first order of business, in Section 2.1, is demonstrating that fitting a model with tidymodels doesn’t take much longer than it would to fit the model without tidymodels’ unifying interface on top. That is, if I’d like to fit a boosted tree with XGBoost, will xgboost::xgb.train() fit notably more quickly than the analogous tidymodels interface boost_tree(engine = "xgboost")? Then, once you’re convinced the penalty in performance is negligible, we’ll move on to comparing fit times across modeling engines in Section 2.2. That is, does boost_tree(engine = "lightgbm") fit more quickly than boost_tree(engine = "xgboost")?

2.1 Tidymodels overhead

While the tidymodels team develops the infrastructure that users interact with directly, under the hood, we send calls out to other people’s modeling packages—or modeling engines—that provide the actual implementations that estimate parameters, generate predictions, etc. The process looks something like this:

A graphic representing the tidymodels interface. In order, step 1 “translate”, step 2 “call”, and step 3 “translate”, outline the process of translating from the standardized tidymodels interface to an engine’s specific interface, calling the modeling engine, and translating back to the standardized tidymodels interface. Step 1 and step 3 are in green, while step 2 is in orange.

When thinking about the time allotted to each of the three steps above, we refer to the “translate” steps in green as the tidymodels overhead. The time it takes to “translate” interfaces in steps 1) and 3) is within our control, while the time the modeling engine takes to do its thing in step 2) is not.

Let’s demonstrate with an example classification problem. Generating some data:

library(tidymodels)

# simulate_classification() is the book's in-house data simulation helper
set.seed(1)
d <- simulate_classification(n_rows = 100)

d
# A tibble: 100 × 18
   class   two_factor_1 two_factor_2 non_linear_1 non_linear_2 non_linear_3
   <fct>   <fct>               <dbl> <fct>               <dbl>        <dbl>
 1 class_2 level_1          -1.17    level_1             0.554        0.814
 2 class_1 level_1           0.261   level_2             0.688        0.929
 3 class_2 level_1          -1.61    level_1             0.658        0.147
 4 class_1 level_1           2.14    level_1             0.663        0.750
 5 class_2 level_1           0.0360  level_1             0.472        0.976
 6 class_1 level_2          -0.00837 level_1             0.970        0.975
 7 class_2 level_1           1.05    level_2             0.402        0.351
 8 class_1 level_1           1.49    level_1             0.850        0.394
 9 class_2 level_1           0.967   level_2             0.757        0.951
10 class_2 level_2           0.603   level_1             0.533        0.107
# ℹ 90 more rows
# ℹ 12 more variables: linear_01 <dbl>, linear_02 <dbl>, linear_03 <dbl>,
#   linear_04 <dbl>, linear_05 <dbl>, linear_06 <dbl>, linear_07 <dbl>,
#   linear_08 <dbl>, linear_09 <dbl>, linear_10 <fct>, linear_11 <fct>,
#   linear_12 <fct>

…we’d like to model class using the remaining variables in this dataset with a logistic regression. We can use the following code to do so:

fit(logistic_reg(), class ~ ., d)
parsnip model object


Call:  stats::glm(formula = class ~ ., family = stats::binomial, data = data)

Coefficients:
        (Intercept)  two_factor_1level_2         two_factor_2  
           7.080808            -7.463598            -2.821078  
non_linear_1level_2         non_linear_2         non_linear_3  
           0.550041            -0.878752            -2.039599  
          linear_01            linear_02            linear_03  
          -0.153653            -0.273809            -0.005318  
          linear_04            linear_05            linear_06  
           1.248566             0.880917            -0.112520  
          linear_07            linear_08            linear_09  
          -0.925063             1.261782            -1.251249  
   linear_10level_2     linear_10level_3     linear_10level_4  
         -18.020101            -3.242219            -1.336644  
   linear_11level_2     linear_11level_3  
          -0.640718            -4.402850  
 [ reached getOption("max.print") -- omitted 4 entries ]

Degrees of Freedom: 99 Total (i.e. Null);  76 Residual
Null Deviance:      134.6 
Residual Deviance: 54.75    AIC: 102.8

The default engine for a logistic regression in tidymodels is stats::glm(). So, in the style of the above graphic, this code:

  1. Translates the tidymodels code, which is consistent across engines, to the format that is specific to the chosen engine. In this case, there’s not a whole lot to do: it passes the preprocessor as formula, the data as data, and picks a family of stats::binomial.
  2. Calls stats::glm() and collects its output.
  3. Translates the output of stats::glm() back into a standardized model fit object.

Again, we can control what happens in steps 1) and 3), but step 2) belongs to the stats package.
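
To see what step 1) actually produces for this model, parsnip’s translate() prints the engine-specific call template that will eventually be evaluated:

# show the stats::glm() call template that parsnip constructs in step 1)
translate(logistic_reg())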

The time that steps 1) and 3) take is relatively independent of the dimensionality of the training data. That is, regardless of whether we train on one hundred or a million data points, our code (as in, the translation) takes about the same time to run. Regardless of training set size, our code pushes around small, relational data structures to determine how to correctly interface with a given engine. The time it takes to run step 2), though, depends almost completely on the size of the data. Depending on the modeling engine, modeling 10 times as much data could result in step 2) taking twice as long, or 10x as long, or 100x as long as the original fit.

So, while the absolute time allotted to steps 1) and 3) is fixed, the portion of total time to fit a model with tidymodels that is “overhead” depends on how quick the engine code itself is. How quick is a logistic regression with glm() on 100 data points?

bench::mark(
  fit = glm(class ~ ., family = binomial, data = d)
) %>% 
  select(expression, median)
# A tibble: 1 × 2
  expression   median
* <bch:expr> <bch:tm>
1 fit          2.28ms

Pretty dang fast. That means that, if the tidymodels overhead were a full second, we’d have made this model fit hundreds of times slower!

In practice, the overhead here has hovered around a millisecond or two for the last couple years, and machine learning practitioners usually fit much more computationally expensive models than a logistic regression on 100 data points. You’ll just have to believe me on that second point. Regarding the first:

bench::mark(
  parsnip = fit(logistic_reg(), class ~ ., d),
  stats = glm(class ~ ., family = binomial, data = d),
  check = FALSE
)
# A tibble: 2 × 3
  expression   median mem_alloc
* <bch:expr> <bch:tm> <bch:byt>
1 parsnip      3.19ms    1.22MB
2 stats        2.33ms    1.21MB

Remember that the first expression calls the second one, so the increase in time from the second to the first is the “overhead.” In this case, it’s 0.858 milliseconds, or 26.9% of the total elapsed time.

So, to fit a boosted tree model on 1,000,000 data points, step 2) might take a few seconds. Steps 1) and 3) don’t care about the size of the data, so they still take a few thousandths of a second. No biggie—the overhead is negligible. Let’s quickly back that up by fitting boosted tree models on simulated datasets of varying sizes, once with the XGBoost interface and once with parsnip’s wrapper around it.
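
The code behind that comparison isn’t shown here, but a minimal sketch of it, using the book’s simulate_classification() helper and pinning both interfaces to the same number of boosting rounds, might look something like this:

library(tidymodels)
library(xgboost)

time_fits <- function(n_rows) {
  d <- simulate_classification(n_rows)
  
  # xgboost's own interface wants a numeric matrix and a 0/1 label
  # rather than a data frame and a factor outcome
  x <- model.matrix(class ~ . - 1, data = d)
  y <- as.integer(d$class) - 1L
  
  bench::mark(
    xgboost = xgb.train(
      params = list(objective = "binary:logistic"),
      data = xgb.DMatrix(x, label = y),
      nrounds = 15
    ),
    parsnip = fit(
      boost_tree(mode = "classification", engine = "xgboost", trees = 15),
      class ~ ., d
    ),
    check = FALSE
  )
}

time_fits(100)
time_fits(100000)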

A ggplot line plot displaying numbers of rows, ranging from 100 to a million, on the x axis, and elapsed time, ranging from a millisecond to 10 seconds, on the y axis. One line shows the fit times for XGBoost itself and the other shows XGBoost with parsnip--fit times are only visually different for numbers of rows less than 10,000.

Elapsed fit times for XGBoost on its own versus XGBoost fit through parsnip. Fit times are non-negligibly different only for very small data.

In the left-most model fits on 100 rows, the model fit with XGBoost itself takes 4.7 milliseconds while parsnip takes 8.99, a 91.1% increase in total elapsed time. However, that increase shrinks to 28.7% with a 1,000-row dataset, and is fractions of a percent by the time there are hundreds of thousands of rows in the training data. This is the gist of tidymodels’ overhead for modeling engines: as dataset size and model complexity grow, the underlying model fit itself takes up an increasingly large proportion of the total evaluation time.

Section 1.1.3 showed a number of ways users can cut down on the evaluation time of their tidymodels code. Making use of parallelism, reducing the total number of model fits needed to search a given grid, and carefully constructing that grid to search over are all major parts of the story. However, the rest of this chapter will focus explicitly on choosing performant modeling engines.

2.2 Benchmarks

The following is a shiny app based on experimental benchmarks. For a given selection of model configurations, the app displays the time to resample various model configurations across a given number of rows of training data.

#| label: models-app
#| embed-resources: false
#| standalone: true
#| viewerHeight: 600
#| eval: true
#| echo: false
library(ggplot2)
library(bench)
library(qs)
library(shiny)
library(bslib)
library(shinylive)
library(scales)

options(
  ggplot2.discrete.colour = c(
    "#1a162d", "#42725c", "#cd6f3d", "#a8ab71",
    "#8b4b65", "#557088", "#d9b594", "#6b705c", "#956b4b", "#2d4041"
  )
)

n_rows <- round(10^seq(from = 2, to = 6, by = .5))
reference_mark <- 37002
footer_context <- paste0(collapse = "", c(
  "Timings estimate the time to evaluate an initial set of 10 models across 10 ",
  "resamples, resulting in 100 model fits on 9/10th of rows, 100 sets of ",
  "predictions on 1/10th of rows, and metric calculations on each set of predictions."
))

load(url("https://raw.githubusercontent.com/simonpcouch/emlwr/main/data/models/app/bm.rda"))
load(url("https://raw.githubusercontent.com/simonpcouch/emlwr/main/data/models/app/cpus.rda"))

# copying over r-lib/bench#144
bench_time_trans <- function(base = 10) {
  if (is.null(base)) {
    return(
      scales::trans_new("bch:tm", as.numeric, as_bench_time,
                        breaks = scales::pretty_breaks(), domain = c(1e-100, Inf)
      )
    )
  }
  trans <- function(x) log(as.numeric(x), base)
  inv <- function(x) bench::as_bench_time(base ^ as.numeric(x))
  
  trans_new(paste0("bch:tm-", format(base)), trans, inv, 
            breaks = log_breaks(base = base), domain = c(1e-100, Inf))
}

ui <- page_fillable(
  theme = bs_theme(
    bg = "#ffffff",
    fg = "#333333",
    primary = "#42725c",
  ),
  title = "Time To Tune",
  layout_columns(
    selectInput(
      "model", "Model:",
      choices = unique(bm$model),
      multiple = TRUE,
      selected = c(
        "linear_reg (glmnet)", 
        "boost_tree (xgboost)",
        "boost_tree (lightgbm)"
      )
    )
  ),
  layout_sidebar(
    sidebar = sidebar(
      open = "always",
      position = "right",
      selectInput(
        "task", "Task:",
        choices = unique(bm$task),
        selected = "regression"
      ),
      # sliderInput(
      #   "n_workers", "Number of Workers:",
      #   value = 1,
      #   min = 1,
      #   max = 10,
      #   step = 1
      # ),
      # selectInput(
      #   "tuning_fn", "Tuning Function:",
      #   choices = unique(bm$tuning_fn)
      # ),
      selectInput(
        "cpu", "CPU:", 
        choices = NULL
      ),
      markdown("Timings scaled according to [CPU benchmarks](https://www.cpubenchmark.net/cpu_list.php).")
    ),
    card(
      full_screen = TRUE,
      card_header("Time To Tune"),
      plotOutput("plot"),
      footer = footer_context
    )
  )
)

server <- function(input, output, session) {
  updateSelectizeInput(
    session,
    'cpu',
    choices = cpus$name,
    server = TRUE,
    selected = "Intel Core i7-13700"
  )
  
  output$plot <- renderPlot({
    new_data <- bm[
      bm$model %in% input$model &
        bm$task == input$task &
        bm$n_workers == 1 &
        bm$tuning_fn == "tune_grid",
    ]
    
    if (!identical(input$cpu, "")) {
      new_data$time_to_tune_float <-
        new_data$time_to_tune_float *
        (reference_mark / cpus$mark[cpus$name == input$cpu])
      new_data$time_to_tune <- as_bench_time(new_data$time_to_tune_float)
    }
    
    ggplot(new_data, aes(x = n_rows, y = time_to_tune, col = model, group = model)) +
      geom_point() +
      geom_line() +
      scale_x_log10(labels = comma) +
      scale_y_continuous(trans = bench_time_trans(base = 10)) +
      labs(x = "Number of Rows", y = "Time to Tune (seconds)", col = "Model") +
      theme(
        legend.position = "bottom"
      )
  })
}

app <- shinyApp(ui, server)

app

This app allows for quickly juxtaposing the time that it might take to evaluate performance across various modeling approaches. In Section 2.2.1, I’ll go into a bit more detail about what each data point in this app represents and why, and then analyze the data in further detail in Section 2.2.3.

2.2.1 One data point

When the app first starts, the left-most point labeled boost_tree (lightgbm) is the observed time to sequentially evaluate an initial set of 10 models across 10 cross-validation folds of a 1,000-row training set, resulting in 100 model fits on 900 rows, 100 sets of predictions on 100 rows, and metric calculations on each set of predictions. The actual benchmarking code is a bit more involved, but the code underlying that single data point looks something like the following.

First, we load core packages as well as the bonsai parsnip extension (for lightgbm support):

library(tidymodels)
library(bonsai)

Next, we’ll simulate a dataset with 1000 rows using simulate_regression(), Efficient Machine Learning with R’s in-house simulation function:

d <- simulate_regression(1000)
d
# A tibble: 1,000 × 16
   outcome predictor_01 predictor_02 predictor_03 predictor_04 predictor_05
     <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1  -13.8         2.28        -0.419      -0.0188       -0.104      -1.10  
 2   26.6        -6.86        -1.37        0.434         4.77       -1.19  
 3   14.9         1.55         2.95        1.60         -0.888       0.0702
 4   24.1        -4.04         2.06       -3.53          3.97       -1.17  
 5   11.0         1.09        -3.33        7.45         -3.61        1.45  
 6   39.0         3.94         1.62       -4.00         -2.54        2.28  
 7    8.82       -1.34         2.34       -4.08          0.980       0.512 
 8   15.5        -2.42        -3.58       -0.705        -4.38        5.27  
 9   27.5        -0.259       -0.742      -0.269         3.67       -3.87  
10   16.2         4.76         4.96        0.602        -0.513       4.05  
# ℹ 990 more rows
# ℹ 10 more variables: predictor_06 <dbl>, predictor_07 <dbl>,
#   predictor_08 <dbl>, predictor_09 <dbl>, predictor_10 <fct>,
#   predictor_11 <fct>, predictor_12 <fct>, predictor_13 <fct>,
#   predictor_14 <fct>, predictor_15 <fct>

This step would happen as-is for every regression task on 1000 rows.

Now, splitting the data into 10 folds using cross-validation:

d_folds <- vfold_cv(d, v = 10)
d_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits            id    
   <list>            <chr> 
 1 <split [900/100]> Fold01
 2 <split [900/100]> Fold02
 3 <split [900/100]> Fold03
 4 <split [900/100]> Fold04
 5 <split [900/100]> Fold05
 6 <split [900/100]> Fold06
 7 <split [900/100]> Fold07
 8 <split [900/100]> Fold08
 9 <split [900/100]> Fold09
10 <split [900/100]> Fold10

Now, we define a boosted tree model specification using the LightGBM engine. In this experiment, any tunable parameter (that is, any parameter for which tidymodels has a default definition that automatically kicks in when generating grids) is marked for tuning.

spec <- 
  boost_tree(
    tree_depth = tune(),
    trees = tune(),
    learn_rate = tune(),
    mtry = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    stop_iter = tune()
  ) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

Each of these model fits is carried out with a minimal preprocessor based on Tidy Modeling with R’s “Recommended Preprocessing” appendix (Max Kuhn and Silge 2022). In this case, Kuhn and Silge recommend that users impute missing values for both numeric predictors (we do so using the median for all of them) and categorical predictors (we do so using the mode) when working with boosted trees.

rec <- 
  recipe(outcome ~ ., d) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors())

With our data resampled and a modeling workflow defined, we’re ready to resample this model. The tuning process will propose 10 candidate combinations of values for the parameters tagged with tune(). This happens automatically under the hood of tune_grid(), but we can replicate it ourselves using dials:

extract_parameter_set_dials(spec) %>% 
  finalize(d) %>%
  grid_space_filling(size = 10)
# A tibble: 10 × 8
    mtry trees min_n tree_depth  learn_rate loss_reduction sample_size stop_iter
   <int> <int> <int>      <int>       <dbl>          <dbl>       <dbl>     <int>
 1     1   445    23         10       1e-10       4.64e- 3         0.7        18
 2     2  1555     2         15       1e- 3       2.45e- 4         0.5        12
 3     4   889    31          4       1e- 7       3.16e+ 1         0.3         3
 4     6   223    35          5       1e- 1       3.59e- 8         0.6        16
 5     7  1111    10          2       1e- 9       1   e-10         0.9         8
 6     9  2000    40         11       1e- 5       1.90e- 9         0.4         6
 7    11  1777    14          1       1e- 2       1.67e+ 0         0.8        14
 8    12     1     6          8       1e- 4       6.81e- 7         0.2         4
 9    14  1333    18          7       1e- 8       1.29e- 5         0.1        20
10    16   667    27         13       1e- 6       8.80e- 2         1          10

tune_grid() will evaluate those candidate combinations by fitting each of them on 10 different subsets of d.

set.seed(1)
res <- tune_grid(workflow(rec, spec), d_folds)

That data point in the app is the time that those 100 model fits (plus predictions and metric calculations) took altogether; in other words, the time we’d wait for tune_grid() to return.
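
The actual experiments involve a bit more bookkeeping, but a rough way to capture that single number yourself is to wrap the tune_grid() call above in a timer:

# wall-clock time for all 100 fits, 100 sets of predictions, and the
# associated metric calculations
timing <- system.time(
  res <- tune_grid(workflow(rec, spec), d_folds)
)
timing[["elapsed"]]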

The data underlying this app is generated from thousands of experiments that look just like the one above, but varying the structure of the training data, the number of rows in the training data, and the number of CPU cores utilized. In all, 585 such experiments form the underlying data for this app:

bm
# A tibble: 10 × 8
   model                          task           n_rows n_workers tuning_fn
   <fct>                          <fct>           <dbl>     <dbl> <fct>    
 1 decision_tree (partykit)       regression       2512         1 tune_grid
 2 rule_fit (xrf)                 regression       1000         1 tune_grid
 3 rand_forest (ranger)           classification   6310         2 tune_grid
 4 mars (earth)                   regression       1000         1 tune_grid
 5 logistic_reg (LiblineaR)       classification  15849         2 tune_grid
 6 discrim_linear (sparsediscrim) classification   6310         1 tune_grid
 7 mars (earth)                   regression       6310         1 tune_grid
 8 logistic_reg (LiblineaR)       classification   6310         1 tune_grid
 9 rand_forest (ranger)           regression      39811         2 tune_grid
10 multinom_reg (glmnet)          classification  15849         2 tune_grid
# ℹ 3 more variables: time_to_tune <bch:tm>, time_to_tune_float <dbl>,
#   strategy <fct>

2.2.2 Why not just model fits?

This seems a bit involved for the purposes of getting a rough sense of how long a given model may take to fit; why don’t we just use default parameter values from tidymodels and pass the model specification straight to fit()?

There are a few reasons for this. In general, though, resampling a model specification across an initial set of possible parameter values is a fundamental unit of interactive machine learning. This initial resampling process gives the practitioner a sense of the ballpark of predictive performance she can expect for a given model and task, and of how various parameter values may affect that performance. Does a higher learn_rate result in better predictive performance? How many trees are enough? A single model fit leaves all of these questions unanswered.
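
With the resampling results res from Section 2.2.1 in hand, answering those questions is a couple of function calls away:

# summarize performance metrics for each of the 10 candidate parameter combinations
collect_metrics(res)

# plot performance against the proposed values of each tuning parameter
autoplot(res)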

Importantly, too, the time that a given model specification takes to fit can vary greatly depending on parameter values. For example, only measuring the time to fit() once, rather than resampling across a set of values, would obscure the difference in these two elapsed times:

spec <- boost_tree(mode = "regression", engine = "lightgbm")

bench::mark(
  few_trees = fit(spec %>% set_args(trees = 10), outcome ~ ., d),
  many_trees = fit(spec %>% set_args(trees = 1000), outcome ~ ., d),
  check = FALSE
)
# A tibble: 2 × 3
  expression   median mem_alloc
* <bch:expr> <bch:tm> <bch:byt>
1 few_trees    13.7ms    5.06MB
2 many_trees  458.8ms    1.41MB

By summing across many model configurations that result in varied fit times, we can get a better sense for a typical time to fit across typical parameter values.

2.2.3 Analysis

2.2.3.1 Decision Trees

Decision trees recursively partition data by selecting values of predictor variables that, when split at that value, best predict the outcome at each node. The tree makes predictions by routing new samples through these splits to reach leaf nodes containing similar training examples. The tidymodels framework supports a number of modeling engines for fitting and predicting from decision trees, including C5.0, partykit, rpart, and spark, and decision trees can be used for either classification or regression (with some engine-specific exceptions).
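
As a reminder of what switching between those engines looks like, here’s a sketch that assumes the book’s simulate_classification() helper and that the C50 and partykit packages are installed (the latter supported via the bonsai extension):

library(tidymodels)
library(bonsai)  # provides the partykit engine for decision_tree()

set.seed(1)
d_example <- simulate_classification(1000)

# the same decision tree specification, pointed at three different engines
tree_spec <- decision_tree(mode = "classification")

fit(set_engine(tree_spec, "rpart"), class ~ ., d_example)
fit(set_engine(tree_spec, "C5.0"), class ~ ., d_example)
fit(set_engine(tree_spec, "partykit"), class ~ ., d_example)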

At the highest level, here’s what the tuning times for various decision_tree() engines supported by tidymodels look like:

A line plot comparing performance of C5.0, partykit, and rpart engines across different dataset sizes. The x-axis shows number of rows (from 1,000 to 100,000) on a log scale, while y-axis shows time to tune on a log scale. C5.0 shows the highest elapsed times, followed by rpart, while partykit consistently evaluates most quickly.
Figure 2.1: Elapsed times to generate preliminary tuning results for decision trees with tidymodels by modeling engine.

Figure 2.1 is a special case of the Time to Tune app from Section 2.2; each data point shows the total time to fit and predict from 100 decision trees of varying parameterizations. partykit consistently evaluates the fastest across all dataset sizes, while C5.0 is the slowest, with the performance gap between engines widening as the number of rows increases.

tidymodels supports tuning the decision tree hyperparameters min_n, tree_depth, and cost_complexity.1 We’ll discuss the implications for each on the time to fit decision trees in the rest of this section. In general, though, it’s helpful to keep in mind that the computationally intensive part of fitting decision trees is the search for optimal split points. Because of this, more complex trees generally take longer to fit than less complex trees, and parameter values allowing for more complex trees will tend to result in longer fit times. That said, the effect on fit time of changing individual parameter values is often relatively mild, as changes in one parameter value are mediated by other parameters.
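
Marking those three arguments for tuning and extracting the corresponding parameter set shows the default ranges that tidymodels will draw candidate values from:

decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  extract_parameter_set_dials()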

First, the Minimum Points Per Node, or min_n, is the minimum number of training set observations in a node required for the node to be split further. For example, if only 3 observations meet some split criterion some_variable < 2 and min_n is set to 3, then the tree will make a prediction at that node rather than considering whether to further break those 3 observations up into smaller buckets based on another split criterion before making predictions. Smaller values of min_n allow for more complex trees, and thus the time to fit a decision tree is inversely correlated with min_n when the complexity of the tree isn’t mediated by other parameter values. In CART-based settings such as engine = "rpart", min_n can usually be set to a reasonably high value without affecting predictive performance “since smaller nodes are almost always pruned away by cross-validation”; in other words, even if the tree is allowed to generate more splits via a small min_n value, those splits based on smaller numbers of observations are often pruned away via the cost_complexity parameter anyway (Therneau and Atkinson 2025).

Let’s see how this plays out in practice:

A ggplot2 dotplot faceted by engine showing elapsed computation time versus minimum points per node. The x-axis shows parameter values from 0 to 40, while the y-axis shows elapsed time on a log scale from 1ms to 10s. Each engine panel contains scattered points colored by dataset size (1,000 to 100,000 rows), showing that fit times don't tend to vary relative to minimum points per node.
Figure 2.2: Distributions of time-to-fit for various minimum points per node values, faceted by engine and colored according to numbers of rows. Generally, in practice, the minimum points per node doesn’t tend to affect fit times as tree complexity is mediated by other parameter values.

This plot disaggregates the information in Figure 2.1. Instead of summing across the time to fit and predict from each model to determine a data point (with varying values of min_n and other parameters), this plot shows one point per model fit. Generally, we see that min_n has little effect on the time to fit across all modeling engines supported by tidymodels.

The Tree Depth, or tree_depth in tidymodels, is the maximum depth of the tree. For example, if tree_depth = 2, the tree must reach a terminal node (i.e. a point at which a prediction is made rather than a further split) after at most two splits. Larger values of tree_depth allow for more complex trees, and thus the time to fit a decision tree increases as tree_depth does unless the complexity of the tree is mediated by some other parameter.

An identical plot to that above but plotting tree depth instead of minimum points per node. The C5.0 panel is empty, while partykit and rpart panels show positive correlation between tree depth and computation time, with larger datasets taking longer to compute across all depths.
Figure 2.3: Distributions of time-to-fit for various tree depth values, faceted by engine and colored according to numbers of rows. For the engines that support tree depth, higher values of tree depth tend to result in greater fit times, even as other parameter values vary. C5.0 doesn’t support the tree depth parameter (Max Kuhn and Quinlan 2023; Quinlan 2014).

The Cost of Complexity, or cost_complexity in tidymodels, is a penalization parameter (often referred to as \(C_p\)) on adding additional complexity to the tree. This parameter is only used by CART models, so it is available only for specific engines. When training CART-based decision trees, once splits are generated, each is evaluated according to how much it decreases error on out-of-sample data relative to the tree without that split. If the decrease in error doesn’t surpass some threshold \(C_p\), the split is “pruned” back, forming a less complex tree. Higher values of penalization mean that the decision tree will evaluate (and thus search for) fewer splits in total, ultimately saving “considerable computational effort” and decreasing the time to fit the model (Therneau and Atkinson 2025; M. Kuhn 2013).
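
To see that pruning machinery at work, one option (sketched here with the rpart engine and a fresh simulated dataset) is to fit a tree with no complexity penalty and inspect rpart’s complexity table, which records how the cross-validated error changes as the tree is pruned back at increasing values of \(C_p\):

set.seed(1)
d_cp <- simulate_classification(1000)

tree_fit <- fit(
  decision_tree(mode = "classification", cost_complexity = 0, engine = "rpart"),
  class ~ ., d_cp
)

# one row per candidate pruning of the full tree, with the cross-validated
# error ("xerror") at each threshold value of Cp
extract_fit_engine(tree_fit)$cptable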

An identical plot to that above but plotting cost of complexity instead of tree depth. The C5.0 and partykit panels are empty, while rpart panel shows scattered points across complexity values, with computation time primarily determined by dataset size rather than complexity parameter.
Figure 2.4: Distributions of time-to-fit for various values of cost of complexity, faceted by engine and colored according to numbers of rows. Neither C5.0 nor partykit make use of this parameter (Quinlan 2014; Hothorn, Hornik, and Zeileis 2006).
Note

C5.0 and partykit make use of related parameters to automatically prune trees, though they aren’t supported by tidymodels as a default tuning parameter as they’re engine-specific. C5.0’s confidence factor parameter is a very computationally efficient way to control complexity, though do note that the approach stands on “shaky statistical grounds” (M. Kuhn 2013). partykit determines whether a split is “worth it” based on hypothesis testing, and the significance level of that hypothesis test can be used to prune splits more aggressively (Hothorn, Hornik, and Zeileis 2006).

Again, the time-consuming portion of fitting decision trees is searching for optimal predictor values to split on. Because of this, more complex trees tend to take longer to fit than less complex trees. On its own, though, this doesn’t necessarily mean that setting an individual parameter value to allow for a more complex tree will result in a longer fit time, and we saw this effect above. Instead, the complex interplay of many parameter values simultaneously allowing for greater complexity is what drives longer fit times. We can quickly demonstrate this in-the-small with a brief set of model fits.

# generate a dataset with 100,000 rows
set.seed(1)
d <- simulate_classification(100000)

# define a decision tree model specification
spec_small <-  
  decision_tree(cost_complexity = .1, tree_depth = 2, min_n = 1000) %>%
  set_engine("rpart") %>%
  set_mode("classification")

In the above, spec_small defines a very minimal tree: this cost_complexity value heavily penalizes further splits, the tree_depth value allows only a very shallow tree, and min_n only allows for further splits when a given node contains many observations.

Now, we can examine the effect of setting individual values on fit time by benchmarking the time to fit a simple tree by all measures, then toggling each of the parameter values individually to allow for more complex trees, and finally by toggling all of them to allow for a very complex tree.

bench::mark(
  # small by all measures
  small = fit(object = spec_small, formula = class ~ ., data = d),
  
  # allowing for greater complexity with only one parameter value
  complex_cost_complexity = fit(
    set_args(spec_small, cost_complexity = 10e-9), class ~ ., d
  ),
  complex_tree_depth = fit(
    set_args(spec_small, tree_depth = 30), class ~ ., d
  ),
  complex_min_n = fit(
    set_args(spec_small, min_n = 1), class ~ ., d
  ),
  
  # allowing for greater complexity with all parameter values
  complex = fit(
    set_args(spec_small, cost_complexity = 10e-9, tree_depth = 30, min_n = 1),
    class ~ ., 
    d
  ),
  check = FALSE
)
# A tibble: 5 × 3
  expression                median mem_alloc
* <bch:expr>              <bch:tm> <bch:byt>
1 small                   750.08ms     102MB
2 complex_cost_complexity 814.43ms     102MB
3 complex_tree_depth          1.2s     102MB
4 complex_min_n           749.36ms     102MB
5 complex                    7.16s     124MB

Again, we see that individual parameter values allowing for greater complexity are mediated by other parameter values, but setting them all together allows for a complex tree and, thus, a significantly longer fit time.

Summary: Decision Trees

  • More complex decision trees take longer to fit.
  • Setting individual parameter values to allow for more complex fits doesn’t necessarily mean that a tree will take longer to fit.
  • Compared to many of the other methods described in this chapter, decision trees are quite quick-fitting.

  1. That is, tidymodels provides default distributions of parameter values to sample from for these parameters. Other parameters can be tuned by providing your own distributions using the dials package.↩︎