Learning goals

In this lab, you are expected to learn/put in practice the following skills:

Evaluate whether a problem can be parallelized or not.
Practice with the parallel package.
Use Rscript to submit jobs.

Problem 1: Think

Give yourself a few minutes to think about what you learned about parallelization. List three examples of problems that you believe may be solved using parallel computing, and check for packages on the HPC CRAN task view that may be related to it.

Answer here.

Problem 2: Pre-parallelization

The following functions can be written to be more efficient without using parallel:

This function generates a n x k dataset with all its entries having a Poisson distribution with mean lambda.

fun1 <- function(n = 100, k = 4, lambda = 4) {
  x <- NULL
  
  for (i in 1:n)
    x <- rbind(x, rpois(k, lambda))
  
  return(x)
}

fun1alt <- function(n = 100, k = 4, lambda = 4) {
  # YOUR CODE HERE
}

# Benchmarking
microbenchmark::microbenchmark(
  fun1(),
  fun1alt()
)

How much faster?

Answer here.

Find the column max (hint: Checkout the function max.col()).

# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow=10)

# Find each column's max value
fun2 <- function(x) {
  apply(x, 2, max)
}

fun2alt <- function(x) {
  # YOUR CODE HERE
}

# Benchmarking
microbenchmark::microbenchmark(
  fun2(),
  fun2alt()
)

Answer here with a plot.

Problem 3: Parallelize everything

We will now turn our attention to non-parametric bootstrapping. Among its many uses, non-parametric bootstrapping allow us to obtain confidence intervals for parameter estimates without relying on parametric assumptions.

The main assumption is that we can approximate many experiments by resampling observations from our original dataset, which reflects the population.

This function implements the non-parametric bootstrap:

my_boot <- function(dat, stat, R, ncpus = 1L) {
  
  # Getting the random indices
  n <- nrow(dat)
  idx <- matrix(sample.int(n, n*R, TRUE), nrow=n, ncol=R)
 
  # Making the cluster using `ncpus`
  # STEP 1: GOES HERE
  # STEP 2: GOES HERE
  
  # STEP 3: THIS FUNCTION NEEDS TO BE REPLACED WITH parLapply
  ans <- lapply(seq_len(R), function(i) {
    stat(dat[idx[,i], , drop=FALSE])
  })
  
  # Coercing the list into a matrix
  ans <- do.call(rbind, ans)
  
  # STEP 4: GOES HERE
  
  ans
  
}

Use the previous pseudocode, and make it work with parallel. Here is just an example for you to try:

# Bootstrap of a linear regression model
my_stat <- function_for_lm 

# DATA SIM
set.seed(1)
n <- 500 
R <- 1e4
x <- cbind(rnorm(n)) 
y <- x*5 + rnorm(n)

# Check if we get something similar as lm
ans0 <- confint()
ans1 <- my_boot()

Check whether your version actually goes faster than the non-parallel version:

# your code here

Answer here.

Problem 4: Compile this markdown document using Rscript

Once you have saved this Rmd file, try running the following command in your terminal:

Rscript --vanilla -e 'rmarkdown::render("[full-path-to-your-Rmd-file.Rmd]")' &

Where [full-path-to-your-Rmd-file.Rmd] should be replace with the full path to your Rmd file… :).

Lab 9 - HPC

Learning goals

Problem 1: Think

Problem 2: Pre-parallelization

Problem 3: Parallelize everything

Problem 4: Compile this markdown document using Rscript