For this lab we will be working with simulated data and the heart dataset, which you can download from here. You should install and load rpart (trees), rpart.plot (tree plots), randomForest (random forests), gbm (gradient boosting), and xgboost (extreme gradient boosting).
install.packages(c("rpart", "rpart.plot", "randomForest", "gbm", "xgboost"))
library(tidyverse)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(xgboost)
heart <- read.csv("https://raw.githubusercontent.com/JSC370/jsc370-2023/main/data/heart/heart.csv") |>
  mutate(
    AHD = 1 * (AHD == "Yes"),
    ChestPain = factor(ChestPain),
    Thal = factor(Thal)
  )
head(heart)
## Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca
## 1 63 1 typical 145 233 1 2 150 0 2.3 3 0
## 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3
## 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2
## 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0
## 5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0
## 6 56 1 nontypical 120 236 0 0 178 0 0.8 1 0
## Thal AHD
## 1 fixed 0
## 2 normal 1
## 3 reversable 1
## 4 normal 0
## 5 normal 0
## 6 normal 0
set.seed(1984)
n <- 1000
x <- runif(n, -5, 5)
error <- rnorm(n, sd = 0.5)
y <- sin(x) + error
nonlin <- data.frame(y = y, x = x)
train_size <- sample(1:1000, size = 500)
nonlin_train <- nonlin[train_size,]
nonlin_test <- nonlin[-train_size,]
ggplot(nonlin, aes(y = y, x = x)) +
  geom_point() +
  theme_minimal()
1. Fit a regression tree using the training set, and plot it.
2. Determine the optimal complexity parameter (cp) to prune the tree.
3. Plot the pruned tree and summarize it.
4. Based on the plot and/or summary of the pruned tree, create a vector of the (ordered) split points for variable x, and a vector of fitted values for the intervals determined by the split points of x.
5. Fit a linear model to the training data and plot the regression line.
6. Contrast the quality of fit of the tree model vs. linear regression by inspecting the plot.
7. Compute the test MSE of the pruned tree and of the linear regression model.
8. Is the linear model or the regression tree better at fitting a non-linear function?
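One possible sketch of the tree-vs-linear comparison above (growing the tree with cp = 0 and then selecting the cp with the smallest cross-validated error is one reasonable approach, not the only one — inspect the cptable yourself):

```r
# Fit a deliberately deep regression tree on the training set and plot it
tree_fit <- rpart(y ~ x, data = nonlin_train, method = "anova",
                  control = rpart.control(cp = 0))
rpart.plot(tree_fit)

# Choose the cp with the smallest cross-validated error, then prune
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)
rpart.plot(tree_pruned)
summary(tree_pruned)

# Linear model for comparison
lm_fit <- lm(y ~ x, data = nonlin_train)

# Test MSE for both models
mse_tree <- mean((predict(tree_pruned, nonlin_test) - nonlin_test$y)^2)
mse_lm   <- mean((predict(lm_fit, nonlin_test) - nonlin_test$y)^2)
c(tree = mse_tree, lm = mse_lm)
```

Because the true regression function is sin(x), expect the step-function fit of the tree to track the curve far better than a single straight line.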
Split the heart data into training and testing sets (70-30%):

train <- sample(1:nrow(heart), round(0.7 * nrow(heart)))
heart_train <- heart[train,]
heart_test <- heart[-train,]
Now fit tree-based models to the heart data.

1. Bagging: fit a model using randomForest with mtry equal to the number of features (all other parameters at their default values). Generate the variable importance plot using varImpPlot and extract variable importance from the randomForest fitted object using the importance function.
2. Random forest: fit a model using randomForest with the default parameters. Generate the variable importance plot using varImpPlot and extract variable importance from the randomForest fitted object using the importance function.
3. Boosting: fit a model using gbm with cv.folds = 5 to perform 5-fold cross-validation, and set class.stratify.cv to TRUE so that cross-validation is stratified by AHD (the heart disease outcome). Plot the cross-validation error as a function of the boosting iterations/trees (the $cv.error component of the object returned by gbm) and determine whether additional boosting iterations are warranted. If so, run additional iterations with gbm.more (use the R help to check its syntax). Choose the optimal number of iterations. Use the summary.gbm function to generate the variable importance plot and to extract variable importance/influence (summary.gbm does both). Generate 1D and 2D marginal plots with plot.gbm to assess the effect of the top three variables and their 2-way interactions.
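A minimal sketch of the three fits (hyperparameters beyond those stated above, such as n.trees = 3000, are illustrative assumptions, and the variables shown in the marginal plots are placeholders — substitute the top three from your own importance table):

```r
p <- ncol(heart_train) - 1  # number of predictors

# Bagging: a random forest with mtry = number of features
bag_fit <- randomForest(as.factor(AHD) ~ ., data = heart_train,
                        mtry = p, na.action = na.omit)
varImpPlot(bag_fit)
importance(bag_fit)

# Random forest with default parameters (mtry = sqrt(p) for classification)
rf_fit <- randomForest(as.factor(AHD) ~ ., data = heart_train,
                       na.action = na.omit)
varImpPlot(rf_fit)
importance(rf_fit)

# Gradient boosting with 5-fold cross-validation, stratified by AHD
gbm_fit <- gbm(AHD ~ ., data = heart_train, distribution = "bernoulli",
               n.trees = 3000, cv.folds = 5, class.stratify.cv = TRUE)
plot(gbm_fit$cv.error, type = "l",
     xlab = "boosting iteration", ylab = "CV error")
best_iter <- gbm.perf(gbm_fit, method = "cv")  # optimal number of trees

summary(gbm_fit, n.trees = best_iter)  # importance plot + influence table
plot(gbm_fit, i.var = "Age", n.trees = best_iter)              # 1D marginal
plot(gbm_fit, i.var = c("Age", "MaxHR"), n.trees = best_iter)  # 2D marginal
```

Note that randomForest does not accept missing values, hence na.action = na.omit; gbm handles them natively.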