Lab 10 - RF, XGBoost

Learning goals

Lab description

For this lab we will work with simulated data and the heart dataset, which the setup code below reads directly from GitHub.

Setup packages

You should install and load rpart (trees), rpart.plot (tree plots), randomForest (random forests), gbm (gradient boosting), and xgboost (extreme gradient boosting). The code below also uses tidyverse for data wrangling and plotting.

install.packages(c("rpart", "rpart.plot", "randomForest", "gbm", "xgboost"))

Load packages and data

library(tidyverse)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(xgboost)

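# Read the heart data and recode the outcome and categorical predictors
heart <- read.csv("https://raw.githubusercontent.com/JSC370/jsc370-2023/main/data/heart/heart.csv") |>
  mutate(
    AHD = 1 * (AHD == "Yes"),      # outcome as 0/1 (1 = "Yes")
    ChestPain = factor(ChestPain),
    Thal = factor(Thal)
  )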
head(heart)
##   Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca
## 1  63   1      typical    145  233   1       2   150     0     2.3     3  0
## 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2  3
## 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2  2
## 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3  0
## 5  41   0   nontypical    130  204   0       2   172     0     1.4     1  0
## 6  56   1   nontypical    120  236   0       0   178     0     0.8     1  0
##         Thal AHD
## 1      fixed   0
## 2     normal   1
## 3 reversable   1
## 4     normal   0
## 5     normal   0
## 6     normal   0

Questions

Question 1: Trees with simulated data

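# Simulate a noisy sine curve: y = sin(x) + N(0, 0.5^2) noise
set.seed(1984)
n <- 1000
x <- runif(n, -5, 5)
error <- rnorm(n, sd = 0.5)
y <- sin(x) + error
nonlin <- data.frame(y = y, x = x)

# train_size holds the row indices of the 500 training observations
train_size <- sample(1:1000, size = 500)
nonlin_train <- nonlin[train_size, ]
nonlin_test <- nonlin[-train_size, ]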

ggplot(nonlin, aes(y = y, x = x)) +
  geom_point() +
  theme_minimal()
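
As one possible starting point for this question, the sketch below fits a regression tree to the simulated training data with rpart, prunes it at the complexity parameter with the smallest cross-validated error, and compares its test-set MSE with a mean-only baseline. The names train_idx, tree_full, and tree_pruned and the control settings (cp = 0, minsplit = 10) are illustrative assumptions, not part of the lab. The data are re-simulated inside the block so that it runs on its own.

```r
library(rpart)

# Re-create the simulated data so this block is self-contained
set.seed(1984)
n <- 1000
x <- runif(n, -5, 5)
y <- sin(x) + rnorm(n, sd = 0.5)
nonlin <- data.frame(y = y, x = x)
train_idx <- sample(1:n, size = 500)   # assumed name for the training row indices
nonlin_train <- nonlin[train_idx, ]
nonlin_test  <- nonlin[-train_idx, ]

# Grow a deliberately large tree; cp = 0 switches off the complexity penalty
tree_full <- rpart(y ~ x, data = nonlin_train, method = "anova",
                   control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# Choose the cp value with the smallest cross-validated error and prune to it
cp_table <- tree_full$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
tree_pruned <- prune(tree_full, cp = best_cp)

# Test-set MSE of the pruned tree versus predicting the training mean
pred_tree <- predict(tree_pruned, newdata = nonlin_test)
mse_tree <- mean((nonlin_test$y - pred_tree)^2)
mse_baseline <- mean((nonlin_test$y - mean(nonlin_train$y))^2)
c(tree = mse_tree, baseline = mse_baseline)
```

Calling rpart.plot(tree_pruned) afterwards draws the pruned tree, and plotcp(tree_full) shows how the cross-validated error changes with cp.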


Question 2: Analysis of Real Data

# Hold out 30% of the rows as a test set; train holds the training row indices
train <- sample(1:nrow(heart), round(0.7*nrow(heart)))
heart_train <- heart[train,]
heart_test <- heart[-train,]

Question 3: Bagging, Random Forest
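
To make the idea behind bagging concrete, here is a minimal hand-rolled sketch on the simulated data from Question 1. It grows one deep rpart tree per bootstrap resample and averages their predictions. It is an illustration only and not the randomForest implementation; n_trees = 100 and the tree settings are assumed values. A random forest additionally restricts each split to a random subset of mtry predictors, so with the single predictor used here the two methods coincide.

```r
library(rpart)

# Re-create the simulated data so this block is self-contained
set.seed(1984)
n <- 1000
x <- runif(n, -5, 5)
y <- sin(x) + rnorm(n, sd = 0.5)
nonlin <- data.frame(y = y, x = x)
train_idx <- sample(1:n, size = 500)
nonlin_train <- nonlin[train_idx, ]
nonlin_test  <- nonlin[-train_idx, ]

n_trees <- 100                      # assumed number of bootstrap trees
deep <- rpart.control(cp = 0, minsplit = 10, xval = 0)  # unpruned trees

# Each column holds the test predictions of one tree grown on a bootstrap resample
boot_preds <- sapply(seq_len(n_trees), function(b) {
  boot_idx <- sample(nrow(nonlin_train), replace = TRUE)
  fit <- rpart(y ~ x, data = nonlin_train[boot_idx, ], control = deep)
  predict(fit, newdata = nonlin_test)
})

# The bagged prediction is the average over all trees
pred_bag <- rowMeans(boot_preds)
mse_bag <- mean((nonlin_test$y - pred_bag)^2)

# A single deep tree on the full training set, for comparison
single_tree <- rpart(y ~ x, data = nonlin_train, control = deep)
mse_single <- mean((nonlin_test$y - predict(single_tree, newdata = nonlin_test))^2)
c(single_tree = mse_single, bagged = mse_bag)
```

With the randomForest package, bagging corresponds to setting mtry equal to the number of predictors, while a smaller mtry gives a random forest.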


Question 4: Boosting
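
The mechanics of boosting with squared-error loss can be sketched by hand: start from the training mean, repeatedly fit a small tree to the current residuals, and add a shrunken copy of that tree to the running prediction. The block below does this with rpart stumps on the simulated data from Question 1. It is a simplified illustration, not what gbm or xgboost do internally, and n_trees = 200 and shrinkage = 0.1 are assumed values.

```r
library(rpart)

# Re-create the simulated data so this block is self-contained
set.seed(1984)
n <- 1000
x <- runif(n, -5, 5)
y <- sin(x) + rnorm(n, sd = 0.5)
nonlin <- data.frame(y = y, x = x)
train_idx <- sample(1:n, size = 500)
nonlin_train <- nonlin[train_idx, ]
nonlin_test  <- nonlin[-train_idx, ]

n_trees   <- 200   # assumed number of boosting iterations
shrinkage <- 0.1   # assumed learning rate
stump <- rpart.control(maxdepth = 1, cp = 0, xval = 0)  # one split per tree

# Initial prediction: the training mean
pred_train <- rep(mean(nonlin_train$y), nrow(nonlin_train))
pred_test  <- rep(mean(nonlin_train$y), nrow(nonlin_test))
test_mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # For squared-error loss the negative gradient is the current residual
  work <- data.frame(x = nonlin_train$x, r = nonlin_train$y - pred_train)
  fit <- rpart(r ~ x, data = work, control = stump)

  # Take a small step in the direction of the fitted residuals
  pred_train <- pred_train + shrinkage * predict(fit, newdata = work)
  pred_test  <- pred_test  + shrinkage * predict(fit, newdata = nonlin_test)
  test_mse[m] <- mean((nonlin_test$y - pred_test)^2)
}

# Test MSE after the first and the last iteration
c(first = test_mse[1], last = test_mse[n_trees])
```

In gbm, the arguments n.trees, shrinkage, and interaction.depth play roles similar to n_trees, shrinkage, and the stump depth in this sketch. Plotting test_mse against the iteration number shows how the error changes as trees are added.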


Deliverables

  1. Questions 1-4 answered, with the PDF or HTML output uploaded to Quercus