2  Models

Caution

This chapter still has a long way to go. I’d recommend exploring other portions of the draft in the meantime.

2.1 Tidymodels overhead


While the tidymodels team develops the infrastructure that users interact with directly, under the hood, we send calls out to other people’s modeling packages—or modeling engines—that provide the actual implementations that estimate parameters, generate predictions, etc. The process looks something like this:

Figure: the tidymodels interface. Step 1 (“translate”, in green) converts the standardized tidymodels interface into the engine-specific interface, step 2 (“call”, in orange) invokes the modeling engine, and step 3 (“translate”, in green) converts the engine’s results back into the standardized tidymodels interface.

When thinking about the time allotted to each of the three steps above, we refer to the “translate” steps in green as the tidymodels overhead. The time it takes to “translate” interfaces in steps 1) and 3) is within our control, while the time the modeling engine takes to do its thing in step 2) is not.

Let’s demonstrate with an example classification problem. Generating some random data:

library(tidymodels)

set.seed(1)

# generate 100 rows of simulated classification data
d <- simulate_classification(n_rows = 100)

d
# A tibble: 100 × 18
   class   two_factor_1 two_factor_2 non_linear_1 non_linear_2 non_linear_3
   <fct>   <fct>               <dbl> <fct>               <dbl>        <dbl>
 1 class_2 level_1          -1.17    level_1             0.554        0.814
 2 class_1 level_1           0.261   level_2             0.688        0.929
 3 class_2 level_1          -1.61    level_1             0.658        0.147
 4 class_1 level_1           2.14    level_1             0.663        0.750
 5 class_2 level_1           0.0360  level_1             0.472        0.976
 6 class_1 level_2          -0.00837 level_1             0.970        0.975
 7 class_2 level_1           1.05    level_2             0.402        0.351
 8 class_1 level_1           1.49    level_1             0.850        0.394
 9 class_2 level_1           0.967   level_2             0.757        0.951
10 class_2 level_2           0.603   level_1             0.533        0.107
# ℹ 90 more rows
# ℹ 12 more variables: linear_01 <dbl>, linear_02 <dbl>, linear_03 <dbl>,
#   linear_04 <dbl>, linear_05 <dbl>, linear_06 <dbl>, linear_07 <dbl>,
#   linear_08 <dbl>, linear_09 <dbl>, linear_10 <fct>, linear_11 <fct>,
#   linear_12 <fct>

…we’d like to model class as a function of the remaining variables in this dataset with a logistic regression. We can use the following code to do so:

fit(logistic_reg(), class ~ ., d)
parsnip model object


Call:  stats::glm(formula = class ~ ., family = stats::binomial, data = data)

Coefficients:
        (Intercept)  two_factor_1level_2         two_factor_2  
           7.080808            -7.463598            -2.821078  
non_linear_1level_2         non_linear_2         non_linear_3  
           0.550041            -0.878752            -2.039599  
          linear_01            linear_02            linear_03  
          -0.153653            -0.273809            -0.005318  
          linear_04            linear_05            linear_06  
           1.248566             0.880917            -0.112520  
          linear_07            linear_08            linear_09  
          -0.925063             1.261782            -1.251249  
   linear_10level_2     linear_10level_3     linear_10level_4  
         -18.020101            -3.242219            -1.336644  
   linear_11level_2     linear_11level_3  
          -0.640718            -4.402850  
 [ reached getOption("max.print") -- omitted 4 entries ]

Degrees of Freedom: 99 Total (i.e. Null);  76 Residual
Null Deviance:      134.6 
Residual Deviance: 54.75    AIC: 102.8

The default engine for a logistic regression in tidymodels is stats::glm(). So, in the style of the above graphic, this code:

  1. Translates the tidymodels code, which is consistent across engines, to the format that is specific to the chosen engine. In this case, there’s not a whole lot to do: it passes the preprocessor as formula, the data as data, and picks a family of stats::binomial.
  2. Calls stats::glm() and collects its output.
  3. Translates the output of stats::glm() back into a standardized model fit object.

Again, we can control what happens in steps 1) and 3), but step 2) belongs to the stats package.
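
You can inspect the result of step 1) directly: parsnip’s translate() prints the template of the call that will ultimately be sent to the engine, with placeholders where the formula and data are slotted in at fit time.

# Show the engine call template for a logistic regression with the default
# engine: a call to stats::glm() with family = stats::binomial.
translate(logistic_reg())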

The time that steps 1) and 3) take is relatively independent of the size of the training data. That is, regardless of whether we train on one hundred or a million data points, our code (as in, the translation) takes about the same time to run: whatever the training set size, it pushes around small, relational data structures to determine how to correctly interface with a given engine. The time it takes to run step 2), though, depends almost completely on the size of the data. Depending on the modeling engine, training on 10 times as much data could mean step 2) takes twice as long, or 10 times as long, or 100 times as long as the original fit.
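
To make that concrete, here’s a quick sketch (reusing simulate_classification() from above) that times the engine fit alone at two training set sizes:

# Time the engine fit on 100 rows versus 10,000 rows. The engine's time
# grows with the data; the translation steps would not.
set.seed(1)
d_small <- simulate_classification(n_rows = 100)
d_large <- simulate_classification(n_rows = 10000)

bench::mark(
  small = glm(class ~ ., family = binomial, data = d_small),
  large = glm(class ~ ., family = binomial, data = d_large),
  check = FALSE
)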

So, while the absolute time allotted to steps 1) and 3) is fixed, the portion of total time to fit a model with tidymodels that is “overhead” depends on how quick the engine code itself is. How quick is a logistic regression with glm() on 100 data points?

bench::mark(
  fit = glm(class ~ ., family = binomial, data = d)
) %>% 
  select(expression, median)
# A tibble: 1 × 2
  expression   median
* <bch:expr> <bch:tm>
1 fit          2.45ms

About two and a half milliseconds. That means that, if the tidymodels overhead were a full second, we’d have made this model fit roughly 400 times slower!

In practice, the overhead here has hovered around a millisecond or two for the last couple of years, and machine learning practitioners usually fit much more computationally expensive models than a logistic regression on 100 data points. You’ll just have to believe me on that second point. Regarding the first:

bm_logistic_reg <- 
  bench::mark(
    parsnip = fit(logistic_reg(), class ~ ., d),
    stats = glm(class ~ ., family = binomial, data = d),
    check = FALSE
  )

Remember that the first expression calls the second one, so the increase in time from the second to the first is the “overhead.” In this case, it’s 0.866125 milliseconds, or 27.3% of the total elapsed time.
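
Those two values come straight from the benchmark object; here’s a quick sketch of the arithmetic, using bench’s median timings:

# The gap between the parsnip and stats median timings is the overhead...
overhead <- bm_logistic_reg$median[1] - bm_logistic_reg$median[2]
overhead

# ...and dividing by the parsnip timing gives overhead as a proportion of
# the total elapsed time.
as.numeric(overhead) / as.numeric(bm_logistic_reg$median[1])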

So, to fit a boosted tree model on 1,000,000 data points, step 2) might take a few seconds. Steps 1) and 3) don’t care about the size of the data, so they still take a few thousandths of a second. No biggie—the overhead is negligible. Let’s quickly back that up by fitting boosted tree models on simulated datasets of varying sizes, once with the XGBoost interface and once with parsnip’s wrapper around it.
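
Here’s a sketch of that comparison. The sizes, the matrix conversion, and trees = 15 (set explicitly on both sides so the fits match) are illustrative choices:

# Fit the same boosted tree at increasing training set sizes, once through
# parsnip and once through xgboost directly.
sizes <- 10^(2:5)

bm_boost <- lapply(sizes, function(n) {
  d_n <- simulate_classification(n_rows = n)

  # xgboost's interface wants a numeric matrix and a 0/1 outcome
  x <- model.matrix(class ~ . - 1, data = d_n)
  y <- as.integer(d_n$class) - 1L

  bench::mark(
    parsnip = fit(boost_tree(mode = "classification", trees = 15), class ~ ., d_n),
    xgboost = xgboost::xgboost(
      data = x, label = y, nrounds = 15,
      objective = "binary:logistic", verbose = 0
    ),
    check = FALSE
  )
})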

TODO: write caption

This graph shows the gist of tidymodels’ overhead for modeling engines: as dataset size and model complexity grow larger, model fitting and prediction take up increasingly large proportions of the total evaluation time.

Section 1.1.3 showed a number of ways users can cut down on the evaluation time of their tidymodels code. Making use of parallelism, reducing the total number of model fits needed to search a given grid, and carefully constructing that grid are all major parts of the story.
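
As one example, here’s a minimal sketch of enabling parallelism, assuming a recent version of tune (which parallelizes via the future framework); wflow and folds are hypothetical stand-ins for a workflow and a set of resamples defined elsewhere:

# Set a parallel plan; tune functions will then distribute model fits
# across workers. Adjust `workers` to the cores available on your machine.
library(future)
plan(multisession, workers = 4)

# With the plan set, a grid search runs in parallel:
# res <- tune_grid(wflow, resamples = folds, grid = 20)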

2.2 Benchmarks

2.2.1 Linear models

2.2.2 Decision trees

2.2.3 Boosted trees

XGBoost and LightGBM – comparison timings for the same thing but from the Python interface?

2.2.4 Random forests

2.2.5 Support vector machines