Learning goals

In this lab, you are expected to learn/put in practice the following skills:

Evaluate whether a problem can be parallelized or not.
Practice with the parallel package.
Use Rscript to submit jobs.

Problem 1: Think

Give yourself a few minutes to think about what you learned about parallelization. List three examples of problems that you believe may be solved using parallel computing, and check for packages on the HPC CRAN task view that may be related to it.

Cross validation in machine learning: carat, mlr
Bootstrapping: boot
MCMC: parallel
Cellular automaton: parallel
Others??

Problem 2: Pre-parallelization

The following functions can be written to be more efficient without using parallel:

This function generates a n x k dataset with all its entries having a Poisson distribution with mean lambda.

fun1 <- function(n = 100, k = 4, lambda = 4) {
  x <- NULL
  
  for (i in 1:n)
    x <- rbind(x, rpois(k, lambda))
  
  return(x)
}

fun1alt <- function(n = 100, k = 4, lambda = 4) {
  matrix(rpois(n * k, lambda = lambda), ncol = k)
}

# Benchmarking
microbenchmark::microbenchmark(
  fun1(100),
  fun1alt(100), unit = "ns" # can also try other units such as ms (milliseconds)
)

## Warning in microbenchmark::microbenchmark(fun1(100), fun1alt(100), unit =
## "ns"): less accurate nanosecond times to avoid potential integer overflows

## Unit: nanoseconds
##          expr    min       lq      mean   median       uq     max neval
##     fun1(100) 167444 182921.5 283257.93 186775.5 193950.5 9733154   100
##  fun1alt(100)  12997  13735.0  21748.45  14432.0  15088.0  724306   100

Find the column max (hint: Checkout the function max.col()).

# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow=10)

# Find each column's max value
fun2 <- function(x) {
  apply(x, 2, max)
}

fun2alt <- function(x) {
  x[cbind(max.col(t(x)), 1:ncol(x))]
}

# Benchmarking
bench <- microbenchmark::microbenchmark(
  fun2(x),
  fun2alt(x), 
  unit = "us"
)
plot(bench)
ggplot2::autoplot(bench) +
  ggplot2::theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

Create a plot of the processing time

Problem 3: Parallelize everything

We will now turn our attention to non-parametric bootstrapping. Among its many uses, non-parametric bootstrapping allow us to obtain confidence intervals for parameter estimates without relying on parametric assumptions.

The main assumption is that we can approximate many experiments by resampling observations from our original dataset, which reflects the population.

This function implements the non-parametric bootstrap:

library(parallel)
my_boot <- function(dat, stat, R, ncpus = 1L) {
  
  # Getting the random indices
  n <- nrow(dat)
  idx <- matrix(sample.int(n, n * R, TRUE), nrow = n, ncol = R)
 
  # Making the cluster using `ncpus`
  # STEP 1: GOES HERE
  cl <- makePSOCKcluster(ncpus)
  # on.exit(stopCluster(cl))
  # STEP 2: GOES HERE
  clusterExport(cl, varlist = c("idx", "dat", "stat"), envir = environment())
  
  # STEP 3: THIS FUNCTION NEEDS TO BE REPLACED WITH parLapply
  #ans <- lapply(seq_len(R), function(i) {
  #  stat(dat[idx[,i], , drop=FALSE])
  #})
   ans <- parLapply(cl, seq_len(R), function(i) {
    stat(dat[idx[,i], , drop=FALSE])
  })
   
  # Coercing the list into a matrix
  ans <- do.call(rbind, ans)
  
  # STEP 4: GOES HERE
   stopCluster(cl)
  
  ans
  
}

Use the previous pseudocode, and make it work with parallel. Here is just an example for you to try:

# Bootstrap of a linear regression model
my_stat <- function(d) coef(lm(y ~ x, data = d))

# DATA SIM
set.seed(1)
n <- 500 
R <- 1e4
x <- cbind(rnorm(n)) 
y <- x * 5 + rnorm(n)

# Checking if we get something similar as lm
ans0 <- confint(lm(y ~ x))
cat("OLS CIs -----------\n")

## OLS CIs -----------

ans0

##                  2.5 %     97.5 %
## (Intercept) -0.1379033 0.04797344
## x            4.8650100 5.04883353

ans1 <- my_boot(dat = data.frame(x, y), my_stat, R = R, ncpus = 4)
qs <- c(.025, .975)
cat("Bootstrap CIs -----------\n")

## Bootstrap CIs -----------

t(apply(ans1, 2, quantile, probs = qs))

##                   2.5%      97.5%
## (Intercept) -0.1386903 0.04856752
## x            4.8685162 5.04351239

Check whether your version actually goes faster than the non-parallel version:

parallel::detectCores() # number of available cores

## [1] 10

# 1 core => non-parallel
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 1L))

##    user  system elapsed 
##   0.037   0.007   1.343

# 8 cores => parallel
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 8L))

##    user  system elapsed 
##   0.137   0.033   0.570

Problem 4: Compile this markdown document using Rscript

Once you have saved this Rmd file, try running the following command in your terminal:

Rscript --vanilla -e 'rmarkdown::render("[full-path-to-your-Rmd-file.Rmd]")' &

Where [full-path-to-your-Rmd-file.Rmd] should be replace with the full path to your Rmd file… :).

Lab 9 - HPC

Learning goals

Problem 1: Think

Problem 2: Pre-parallelization

Problem 3: Parallelize everything

Problem 4: Compile this markdown document using Rscript