For this lab we will be working with simulated data and the heart dataset, which you can download from here. You should install and load rpart (trees), rpart.plot (tree plots), randomForest (random forests), gbm (gradient boosting), and xgboost (extreme gradient boosting).
install.packages(c("rpart", "rpart.plot", "randomForest", "gbm", "xgboost"))
library(tidyverse)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(xgboost)
heart <- read.csv("https://raw.githubusercontent.com/JSC370/jsc370-2023/main/data/heart/heart.csv") |>
  mutate(
    AHD = 1 * (AHD == "Yes"),
    ChestPain = factor(ChestPain),
    Thal = factor(Thal)
  )
head(heart)
## Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca
## 1 63 1 typical 145 233 1 2 150 0 2.3 3 0
## 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3
## 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2
## 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0
## 5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0
## 6 56 1 nontypical 120 236 0 0 178 0 0.8 1 0
## Thal AHD
## 1 fixed 0
## 2 normal 1
## 3 reversable 1
## 4 normal 0
## 5 normal 0
## 6 normal 0
set.seed(1984)
n <- 1000
x <- runif(n, -5, 5)
error <- rnorm(n, sd = 0.5)
y <- sin(x) + error
nonlin <- data.frame(y = y, x = x)
train_size <- sample(1:1000, size = 500)
nonlin_train <- nonlin[train_size,]
nonlin_test <- nonlin[-train_size,]
ggplot(nonlin, aes(y = y, x = x)) +
  geom_point() +
  theme_minimal()
1. Fit a regression tree using the training set, and plot it.
2. Determine the optimal complexity parameter (cp) to prune the tree.
3. Plot the pruned tree and summarize it.
4. Based on the plot and/or summary of the pruned tree, create a vector of the (ordered) split points for variable x, and a vector of fitted values for the intervals determined by the split points of x.
5. Fit a linear model to the training data and plot the regression line.
6. Contrast the quality of fit of the tree model vs. linear regression by inspecting the plot.
7. Compute the test MSE of the pruned tree and of the linear regression model.
8. Is the linear model or the regression tree better at fitting a non-linear function?
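One possible sketch of the tree-vs-linear comparison above (growing the tree with cp = 0 and then selecting the cp with the smallest cross-validated error is one reasonable approach, not the only one — inspect the cptable yourself):

```r
# Fit a deliberately deep regression tree on the training set and plot it
tree_fit <- rpart(y ~ x, data = nonlin_train, method = "anova",
                  control = rpart.control(cp = 0))
rpart.plot(tree_fit)

# Choose the cp with the smallest cross-validated error, then prune
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)
rpart.plot(tree_pruned)
summary(tree_pruned)

# Linear model for comparison
lm_fit <- lm(y ~ x, data = nonlin_train)

# Test MSE for both models
mse_tree <- mean((predict(tree_pruned, nonlin_test) - nonlin_test$y)^2)
mse_lm   <- mean((predict(lm_fit, nonlin_test) - nonlin_test$y)^2)
c(tree = mse_tree, lm = mse_lm)
```

Because the true regression function is sin(x), expect the step-function fit of the tree to track the curve far better than a single straight line.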
Split the heart data into training and testing sets (70-30%):

train <- sample(1:nrow(heart), round(0.7 * nrow(heart)))
heart_train <- heart[train,]
heart_test <- heart[-train,]
Now fit tree-based models to the heart data.

1. Bagging: fit a model using randomForest with mtry equal to the number of features (all other parameters at their default values). Generate the variable importance plot using varImpPlot and extract variable importance from the randomForest fitted object using the importance function.
2. Random forest: fit a model using randomForest with the default parameters. Generate the variable importance plot using varImpPlot and extract variable importance from the randomForest fitted object using the importance function.
3. Boosting: fit a model using gbm with cv.folds = 5 to perform 5-fold cross-validation, and set class.stratify.cv to TRUE so that cross-validation is stratified by AHD (the heart disease outcome). Plot the cross-validation error as a function of the boosting iterations/trees (the $cv.error component of the object returned by gbm) and determine whether additional boosting iterations are warranted. If so, run additional iterations with gbm.more (use the R help to check its syntax). Choose the optimal number of iterations. Use the summary.gbm function to generate the variable importance plot and to extract variable importance/influence (summary.gbm does both). Generate 1D and 2D marginal plots with plot.gbm to assess the effect of the top three variables and their 2-way interactions.
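A minimal sketch of the three fits (hyperparameters beyond those stated above, such as n.trees = 3000, are illustrative assumptions, and the variables shown in the marginal plots are placeholders — substitute the top three from your own importance table):

```r
p <- ncol(heart_train) - 1  # number of predictors

# Bagging: a random forest with mtry = number of features
bag_fit <- randomForest(as.factor(AHD) ~ ., data = heart_train,
                        mtry = p, na.action = na.omit)
varImpPlot(bag_fit)
importance(bag_fit)

# Random forest with default parameters (mtry = sqrt(p) for classification)
rf_fit <- randomForest(as.factor(AHD) ~ ., data = heart_train,
                       na.action = na.omit)
varImpPlot(rf_fit)
importance(rf_fit)

# Gradient boosting with 5-fold cross-validation, stratified by AHD
gbm_fit <- gbm(AHD ~ ., data = heart_train, distribution = "bernoulli",
               n.trees = 3000, cv.folds = 5, class.stratify.cv = TRUE)
plot(gbm_fit$cv.error, type = "l",
     xlab = "boosting iteration", ylab = "CV error")
best_iter <- gbm.perf(gbm_fit, method = "cv")  # optimal number of trees

summary(gbm_fit, n.trees = best_iter)  # importance plot + influence table
plot(gbm_fit, i.var = "Age", n.trees = best_iter)              # 1D marginal
plot(gbm_fit, i.var = c("Age", "MaxHR"), n.trees = best_iter)  # 2D marginal
```

Note that randomForest does not accept missing values, hence na.action = na.omit; gbm handles them natively.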