Lab 10b - Boosting

Learning goals

Lab description

For this lab we will be working with the heart dataset, which is read directly from the course GitHub repository in the code below.

Setup packages

You should install and load gbm (gradient boosting), xgboost (extreme gradient boosting), and caret (for cross-validated tuning).

install.packages(c("gbm", "xgboost", "caret"))

Load packages and data

library(tidyverse)
library(gbm)
library(xgboost)
library(caret)

heart <- read.csv("https://raw.githubusercontent.com/JSC370/jsc370-2023/main/data/heart/heart.csv") |>
  mutate(
    # recode the outcome AHD to 0/1 (1 = "Yes")
    AHD = 1 * (AHD == "Yes"),
    ChestPain = factor(ChestPain),
    Thal = factor(Thal),
    # numeric codings of the categorical variables
    ChestPain_num = case_match(
      ChestPain,
      "asymptomatic" ~ 1,
      "nonanginal" ~ 2,
      "nontypical" ~ 3,
      .default = 0
    ),
    Thal_num = case_match(
      Thal,
      "fixed" ~ 1,
      "normal" ~ 2,
      .default = 0
    )
  ) |>
  # drop rows with missing values
  na.omit()
head(heart)
##   Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca
## 1  63   1      typical    145  233   1       2   150     0     2.3     3  0
## 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2  3
## 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2  2
## 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3  0
## 5  41   0   nontypical    130  204   0       2   172     0     1.4     1  0
## 6  56   1   nontypical    120  236   0       0   178     0     0.8     1  0
##         Thal AHD ChestPain_num Thal_num
## 1      fixed   0             0        1
## 2     normal   1             1        2
## 3 reversable   1             1        0
## 4     normal   0             2        2
## 5     normal   0             3        2
## 6     normal   0             3        2

Questions

Question 1: Gradient Boosting

Evaluate the effect of critical boosting parameters (number of boosting iterations, shrinkage/learning rate, and tree depth/interaction). In gbm the number of iterations is controlled by n.trees (default is 100), the shrinkage/learning rate by shrinkage (default is 0.1), and the interaction depth by interaction.depth (default is 1).

Note that boosting can overfit if the number of trees is too large. The shrinkage parameter \(\lambda\) controls the rate at which boosting learns; a very small \(\lambda\) can require a very large number of trees to achieve good performance. Finally, interaction.depth controls the interaction order of the boosted model: a value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, and so on.
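
To make these arguments concrete, here is a minimal sketch of a gbm() call on the heart data; the object names, parameter values, and number of CV folds are illustrative only and are not the settings asked for in the questions below.

fit_demo <- gbm(
  AHD ~ ., data = heart,
  distribution = "bernoulli",    # 0/1 outcome
  n.trees = 1000,                # number of boosting iterations
  shrinkage = 0.01,              # learning rate (lambda)
  interaction.depth = 1,         # 1 = additive, 2 = up to 2-way interactions
  cv.folds = 5
)
# gbm.perf() picks the iteration that minimizes the CV error,
# which protects against using too many trees (overfitting)
best_iter <- gbm.perf(fit_demo, method = "cv")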

  1. Split the heart data into training and testing sets. The character variables also need to be converted to numeric and the missing values removed (both were handled when the data were loaded above).
set.seed(370)
# 70/30 split of the row indices into training and test sets
train <- sample(1:nrow(heart), floor(nrow(heart) * 0.7))
test <- setdiff(1:nrow(heart), train)
  2. Set the seed and train a boosted classification model with gbm using 10-fold cross-validation (cv.folds = 10) on the training data, with n.trees = 5000, shrinkage = 0.001, and interaction.depth = 1. Plot the cross-validation errors as a function of the boosting iteration and calculate the test MSE (see the sketch after this list).

  3. Repeat step 2 using the same seed and n.trees = 5000 with the following three additional combinations of parameters: (a) shrinkage = 0.001, interaction.depth = 2; (b) shrinkage = 0.01, interaction.depth = 1; (c) shrinkage = 0.01, interaction.depth = 2.
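
One possible way to organize steps 2 and 3 is sketched below. The object names, the loop over the four shrinkage/interaction.depth combinations, and the use of predicted probabilities when computing the test MSE are assumptions made for illustration, not a prescribed solution.

set.seed(370)
heart_boost <- gbm(
  AHD ~ ., data = heart[train, ],
  distribution = "bernoulli",
  n.trees = 5000, shrinkage = 0.001, interaction.depth = 1,
  cv.folds = 10
)

# cross-validation error as a function of the boosting iteration
plot(heart_boost$cv.error, type = "l",
     xlab = "Boosting iteration", ylab = "10-fold CV error")

# test MSE from the predicted probabilities on the held-out rows
yhat <- predict(heart_boost, newdata = heart[test, ], n.trees = 5000, type = "response")
mean((heart$AHD[test] - yhat)^2)

# step 3: the same fit for each shrinkage / interaction.depth combination
params <- expand.grid(shrinkage = c(0.001, 0.01), interaction.depth = c(1, 2))
for (k in seq_len(nrow(params))) {
  set.seed(370)
  fit_k <- gbm(
    AHD ~ ., data = heart[train, ],
    distribution = "bernoulli",
    n.trees = 5000,
    shrinkage = params$shrinkage[k],
    interaction.depth = params$interaction.depth[k],
    cv.folds = 10
  )
  yhat_k <- predict(fit_k, newdata = heart[test, ], n.trees = 5000, type = "response")
  cat("shrinkage =", params$shrinkage[k],
      "interaction.depth =", params$interaction.depth[k],
      "test MSE =", mean((heart$AHD[test] - yhat_k)^2), "\n")
}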

Question 2: Extreme Gradient Boosting

Train an xgboost model with xgboost and perform a grid search to tune the number of trees and the maximum depth of the trees. Also perform 10-fold cross-validation and determine the variable importance. Finally, compute the test MSE.
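
One way to set this up, sketched below, uses caret's xgbTree interface to xgboost (caret was loaded above). The grid values, object names, and the choice to keep the 0/1 outcome numeric so that a test MSE is natural are assumptions made for illustration, not a prescribed solution.

# 10-fold CV with a grid over the number of boosting rounds and the tree depth;
# the remaining xgbTree tuning parameters must be supplied but are held fixed here
train_control <- trainControl(method = "cv", number = 10, search = "grid")

tune_grid <- expand.grid(
  nrounds = c(50, 150, 250),   # number of trees
  max_depth = c(1, 3, 5),      # maximum tree depth
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

set.seed(370)
heart_xgb <- caret::train(
  AHD ~ Age + Sex + ChestPain_num + RestBP + Chol + Fbs + RestECG +
    MaxHR + ExAng + Oldpeak + Slope + Ca + Thal_num,
  data = heart[train, ],
  method = "xgbTree",
  trControl = train_control,
  tuneGrid = tune_grid
)

# variable importance from the best model found by the grid search
plot(varImp(heart_xgb, scale = FALSE))

# test MSE on the held-out rows
yhat_xgb <- predict(heart_xgb, newdata = heart[test, ])
mean((heart$AHD[test] - yhat_xgb)^2)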