Heart data

For this lab we will be working with the heart dataset, which you can download from here. You should install and load gbm (gradient boosting) and xgboost (extreme gradient boosting); we will also use caret for tuning and tidyverse for data manipulation.
install.packages(c("gbm", "xgboost", "caret"))
library(tidyverse)
library(gbm)
library(xgboost)
library(caret)
heart<-read.csv("https://raw.githubusercontent.com/JSC370/jsc370-2023/main/data/heart/heart.csv") |>
mutate(
AHD = 1 * (AHD == "Yes"),
ChestPain = factor(ChestPain),
Thal = factor(Thal),
ChestPain_num = case_match(
ChestPain,
"asymptomatic" ~ 1,
"nonanginal" ~ 2,
"nontypical" ~ 3,
.default = 0
),
Thal_num = case_match(
Thal,
"fixed" ~ 1,
"normal" ~ 2,
.default = 0
)
) |>
na.omit()
head(heart)
## Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca
## 1 63 1 typical 145 233 1 2 150 0 2.3 3 0
## 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3
## 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2
## 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0
## 5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0
## 6 56 1 nontypical 120 236 0 0 178 0 0.8 1 0
## Thal AHD ChestPain_num Thal_num
## 1 fixed 0 0 1
## 2 normal 1 1 2
## 3 reversable 1 1 0
## 4 normal 0 2 2
## 5 normal 0 3 2
## 6 normal 0 3 2
i. Evaluate the effect of critical boosting parameters (number of boosting iterations, shrinkage/learning rate, and tree depth/interaction). In gbm, the number of iterations is controlled by n.trees (default is 100), the shrinkage/learning rate is controlled by shrinkage (default is 0.1), and the interaction depth by interaction.depth (default is 1).
Note that boosting can overfit if the number of trees is too large. The shrinkage parameter controls the rate at which the boosting learns; a very small \(\lambda\) can require a very large number of trees to achieve good performance. Finally, the interaction depth controls the interaction order of the boosted model: a value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, and so on; the default is 1.
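For reference, a minimal gbm() call that sets all three of these parameters explicitly might look like the sketch below (using this lab's heart data; the values shown are just the defaults discussed above, not tuned choices).
# Illustrative gbm() fit (a sketch; values are the package defaults)
fit <- gbm(
  AHD ~ .,                    # AHD was recoded to 0/1 above
  data = heart,
  distribution = "bernoulli", # binary outcome
  n.trees = 100,              # number of boosting iterations
  shrinkage = 0.1,            # learning rate
  interaction.depth = 1       # 1 = additive model
)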
set.seed(370)
# 70/30 train/test split by row index
train <- sample(1:nrow(heart), floor(nrow(heart) * 0.7))
test <- setdiff(1:nrow(heart), train)
ii. Set the seed and train a boosting classification model with gbm using 10-fold cross-validation (cv.folds = 10) on the training data with n.trees = 5000, shrinkage = 0.001, and interaction.depth = 1. Plot the cross-validation errors as a function of the boosting iteration and calculate the test MSE.
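A possible solution sketch (the exact error values can vary with the gbm version and platform, even with the seed set):
set.seed(370)
heart_boost <- gbm(
  AHD ~ .,
  data = heart[train, ],
  distribution = "bernoulli",
  n.trees = 5000,
  shrinkage = 0.001,
  interaction.depth = 1,
  cv.folds = 10
)

# cross-validation error at each boosting iteration
plot(heart_boost$cv.error, type = "l",
  xlab = "Boosting iteration", ylab = "10-fold CV error")

# test MSE on the held-out 30%
yhat <- predict(heart_boost, newdata = heart[test, ], n.trees = 5000, type = "response")
mean((yhat - heart$AHD[test])^2)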
iii. Repeat ii. using the same seed and n.trees = 5000 with the following 3 additional combinations of parameters: a) shrinkage = 0.001, interaction.depth = 2; b) shrinkage = 0.01, interaction.depth = 1; c) shrinkage = 0.01, interaction.depth = 2.
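One way to organize this is to loop over all four parameter combinations (the one from ii. plus the three above) and record each test MSE; a sketch:
params <- expand.grid(shrinkage = c(0.001, 0.01), interaction.depth = c(1, 2))
for (i in seq_len(nrow(params))) {
  set.seed(370)  # same seed for every fit
  fit <- gbm(
    AHD ~ .,
    data = heart[train, ],
    distribution = "bernoulli",
    n.trees = 5000,
    shrinkage = params$shrinkage[i],
    interaction.depth = params$interaction.depth[i],
    cv.folds = 10
  )
  yhat <- predict(fit, newdata = heart[test, ], n.trees = 5000, type = "response")
  cat(sprintf("shrinkage = %g, interaction.depth = %g: test MSE = %.4f\n",
    params$shrinkage[i], params$interaction.depth[i],
    mean((yhat - heart$AHD[test])^2)))
}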
iv. Train an xgboost model with xgboost and perform a grid search to tune the number of trees and the maximum depth of the tree. Also perform 10-fold cross-validation and determine the variable importance. Finally, compute the test MSE.
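A sketch using caret to run the grid search; note that caret's "xgbTree" method requires a value for all seven of its tuning parameters, so the ones we are not tuning are held fixed at illustrative values:
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV
tune_grid <- expand.grid(
  nrounds = c(50, 100, 200, 500),  # number of trees (tuned)
  max_depth = c(1, 2, 4, 6),       # maximum tree depth (tuned)
  eta = 0.3,                       # held fixed (illustrative)
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
set.seed(370)
heart_xgb <- caret::train(
  AHD ~ . - ChestPain - Thal,  # use the numeric recodes; xgboost needs numeric inputs
  data = heart[train, ],
  method = "xgbTree",
  trControl = train_control,
  tuneGrid = tune_grid
)
heart_xgb$bestTune  # best (nrounds, max_depth) found by CV

# variable importance from the winning model
plot(varImp(heart_xgb, scale = FALSE))

# test MSE
yhat <- predict(heart_xgb, newdata = heart[test, ])
mean((yhat - heart$AHD[test])^2)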