In this lab, you are expected to learn/put in practice the following skills:
Give yourself a few minutes to think about what you learned about parallelization. List three examples of problems that you believe may be solved using parallel computing, and check for packages on the HPC CRAN task view that may be related to it.
carat
,
mlr
boot
parallel
parallel
The following functions can be written to be more efficient without
using parallel
:
n x k
dataset with all its
entries having a Poisson distribution with mean
lambda
.fun1 <- function(n = 100, k = 4, lambda = 4) {
x <- NULL
for (i in 1:n)
x <- rbind(x, rpois(k, lambda))
return(x)
}
fun1alt <- function(n = 100, k = 4, lambda = 4) {
matrix(rpois(n * k, lambda = lambda), ncol = k)
}
# Benchmarking
microbenchmark::microbenchmark(
fun1(100),
fun1alt(100), unit = "ns" # can also try other units such as ms (milliseconds)
)
## Warning in microbenchmark::microbenchmark(fun1(100), fun1alt(100), unit =
## "ns"): less accurate nanosecond times to avoid potential integer overflows
## Unit: nanoseconds
## expr min lq mean median uq max neval
## fun1(100) 167444 182921.5 283257.93 186775.5 193950.5 9733154 100
## fun1alt(100) 12997 13735.0 21748.45 14432.0 15088.0 724306 100
max.col()
).# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow=10)
# Find each column's max value
fun2 <- function(x) {
apply(x, 2, max)
}
fun2alt <- function(x) {
x[cbind(max.col(t(x)), 1:ncol(x))]
}
# Benchmarking
bench <- microbenchmark::microbenchmark(
fun2(x),
fun2alt(x),
unit = "us"
)
plot(bench)
ggplot2::autoplot(bench) +
ggplot2::theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
Create a plot of the processing time
We will now turn our attention to non-parametric bootstrapping. Among its many uses, non-parametric bootstrapping allow us to obtain confidence intervals for parameter estimates without relying on parametric assumptions.
The main assumption is that we can approximate many experiments by resampling observations from our original dataset, which reflects the population.
This function implements the non-parametric bootstrap:
library(parallel)
my_boot <- function(dat, stat, R, ncpus = 1L) {
# Getting the random indices
n <- nrow(dat)
idx <- matrix(sample.int(n, n * R, TRUE), nrow = n, ncol = R)
# Making the cluster using `ncpus`
# STEP 1: GOES HERE
cl <- makePSOCKcluster(ncpus)
# on.exit(stopCluster(cl))
# STEP 2: GOES HERE
clusterExport(cl, varlist = c("idx", "dat", "stat"), envir = environment())
# STEP 3: THIS FUNCTION NEEDS TO BE REPLACED WITH parLapply
#ans <- lapply(seq_len(R), function(i) {
# stat(dat[idx[,i], , drop=FALSE])
#})
ans <- parLapply(cl, seq_len(R), function(i) {
stat(dat[idx[,i], , drop=FALSE])
})
# Coercing the list into a matrix
ans <- do.call(rbind, ans)
# STEP 4: GOES HERE
stopCluster(cl)
ans
}
# Bootstrap of a linear regression model
my_stat <- function(d) coef(lm(y ~ x, data = d))
# DATA SIM
set.seed(1)
n <- 500
R <- 1e4
x <- cbind(rnorm(n))
y <- x * 5 + rnorm(n)
# Checking if we get something similar as lm
ans0 <- confint(lm(y ~ x))
cat("OLS CIs -----------\n")
## OLS CIs -----------
ans0
## 2.5 % 97.5 %
## (Intercept) -0.1379033 0.04797344
## x 4.8650100 5.04883353
ans1 <- my_boot(dat = data.frame(x, y), my_stat, R = R, ncpus = 4)
qs <- c(.025, .975)
cat("Bootstrap CIs -----------\n")
## Bootstrap CIs -----------
t(apply(ans1, 2, quantile, probs = qs))
## 2.5% 97.5%
## (Intercept) -0.1386903 0.04856752
## x 4.8685162 5.04351239
parallel::detectCores() # number of available cores
## [1] 10
# 1 core => non-parallel
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 1L))
## user system elapsed
## 0.037 0.007 1.343
# 8 cores => parallel
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 8L))
## user system elapsed
## 0.137 0.033 0.570
Once you have saved this Rmd file, try running the following command in your terminal:
Rscript --vanilla -e 'rmarkdown::render("[full-path-to-your-Rmd-file.Rmd]")' &
Where [full-path-to-your-Rmd-file.Rmd]
should be replace
with the full path to your Rmd fileā¦ :).