Due: March 31, 2023 by 11:59pm.
Rewrite the following R functions to make them faster. It is OK (and recommended) to take a look at Stack Overflow and Google.
# Total row sums
fun1 <- function(mat) {
  n <- nrow(mat)
  ans <- double(n)
  for (i in 1:n) {
    ans[i] <- sum(mat[i, ])
  }
  ans
}

fun1alt <- function(mat) {
  # YOUR CODE HERE
}
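One possible approach (a sketch, not necessarily the intended solution): base R's vectorized rowSums() computes the same quantity without an explicit loop.

fun1alt <- function(mat) {
  # Vectorized row sums; equivalent to looping over rows with sum()
  rowSums(mat)
}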
# Cumulative sum by row
fun2 <- function(mat) {
  n <- nrow(mat)
  k <- ncol(mat)
  ans <- mat
  for (i in 1:n) {
    for (j in 2:k) {
      ans[i, j] <- mat[i, j] + ans[i, j - 1]
    }
  }
  ans
}

fun2alt <- function(mat) {
  # YOUR CODE HERE
}
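One possible sketch: apply() with cumsum() over the rows. Note that apply() returns each row's cumulative sum as a column, so the result must be transposed back.

fun2alt <- function(mat) {
  # cumsum() over each row; t() restores the original n-by-k orientation
  t(apply(mat, 1, cumsum))
}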
# Use this data to test your functions
set.seed(2315)
dat <- matrix(rnorm(200 * 100), nrow = 200)
# Test for the first
microbenchmark::microbenchmark(
  fun1(dat),
  fun1alt(dat),
  unit = "relative", check = "equivalent"
)

# Test for the second
microbenchmark::microbenchmark(
  fun2(dat),
  fun2alt(dat),
  unit = "relative", check = "equivalent"
)
The last argument, check = "equivalent", is included to make sure that the functions return the same result.
The following function simulates pi by sampling points uniformly in the unit square:

sim_pi <- function(n = 1000, i = NULL) {
  # The fraction of points inside the quarter circle estimates pi/4;
  # the unused i argument lets lapply() pass an iteration index
  p <- matrix(runif(n * 2), ncol = 2)
  mean(rowSums(p^2) < 1) * 4
}
# Here is an example run
set.seed(156)
sim_pi(1000) # 3.132
To get accurate estimates, we can run this function multiple times with the following code:

# This runs the simulation 4,000 times, each with 10,000 points
set.seed(1231)
system.time({
  ans <- unlist(lapply(1:4000, sim_pi, n = 10000))
  print(mean(ans))
})
Rewrite the previous code using parLapply() to make it run faster. Make sure you set the seed using clusterSetRNGStream():
# YOUR CODE HERE
system.time({
  # YOUR CODE HERE
  ans <- # YOUR CODE HERE
  print(mean(ans))
  # YOUR CODE HERE
})
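A minimal sketch of one way to fill this in (the worker count of 4 is an assumption; set it to the number of cores on your machine):

library(parallel)

cl <- makeCluster(4)           # assumed 4 workers; adjust to your core count
clusterSetRNGStream(cl, 1231)  # reproducible parallel RNG streams
clusterExport(cl, "sim_pi")    # make sim_pi available on the workers
system.time({
  ans <- unlist(parLapply(cl, 1:4000, sim_pi, n = 10000))
  print(mean(ans))
})
stopCluster(cl)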
For this question we will use the hitters dataset, which consists of data for 332 major league baseball players. The data are here: https://github.com/JSC370/jsc370-2023/tree/main/data/hitters.csv. The main goal is to predict players' salaries (variable Salary) based on the features in the data. To do so you will replicate many of the concepts in labs 11 and 12 (trees, bagging, random forest, boosting, and xgboost). Please split the data into training and testing sets (70-30) for all questions.
1. Fit a regression tree to predict Salary, and appropriately prune it based on the optimal complexity parameter.
2. Predict Salary using bagging, and construct a variable importance plot.
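A minimal sketch of one possible approach to these two steps (the raw-file URL, the rpart and randomForest packages, and the na.omit() handling of missing salaries are assumptions, not part of the assignment):

library(rpart)
library(randomForest)

# Assumed raw-file form of the linked CSV; factors needed for randomForest
hitters <- read.csv(
  "https://raw.githubusercontent.com/JSC370/jsc370-2023/main/data/hitters.csv",
  stringsAsFactors = TRUE
)
hitters <- na.omit(hitters)  # assumed handling of rows with missing values

# 70-30 train/test split
set.seed(370)
idx <- sample(nrow(hitters), floor(0.7 * nrow(hitters)))
train <- hitters[idx, ]
test <- hitters[-idx, ]

# Regression tree on Salary, pruned at the cp with minimum cross-validated error
tree <- rpart(Salary ~ ., data = train, method = "anova")
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)

# Bagging: a random forest with mtry equal to the number of predictors
bag <- randomForest(Salary ~ ., data = train,
                    mtry = ncol(train) - 1, importance = TRUE)
varImpPlot(bag)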