Lab 11 - Job Submission on teach.cs Cluster

Download lab .qmd here

Learning goals

  • Connect to the teach.cs computing cluster via SSH.
  • Understand the SLURM workload manager and cluster hardware.
  • Write Python scripts suitable for non-interactive cluster execution.
  • Submit batch jobs with sbatch and monitor them with sinfo / squeue.
  • Request multiple CPU cores and measure parallel speedup on the cluster.
  • Retrieve and interpret job output files.

Lab description

In this lab you will move from running parallel Python code on your laptop (Lab 10) to submitting jobs on the University of Toronto CS Teaching Labs cluster. The cluster is managed by the SLURM workload manager, which queues and allocates resources across multiple nodes.

All commands in this lab are run in a terminal (not in a Quarto notebook). Use this .qmd file to document your answers and paste terminal output.

Deliverables

All questions answered, with required terminal output pasted in. Upload your .qmd and rendered .html to Quercus.


Background: The teach.cs Cluster

The cluster has three node types, each available as a SLURM partition:

| Partition | Nodes      | CPU cores/node                      | RAM    | GPUs                         |
|-----------|------------|-------------------------------------|--------|------------------------------|
| squid     | squid01–07 | 24 (dual 12-core Xeon Silver 4310)  | 256 GB | 4× NVIDIA RTX A4500 (20 GB)  |
| coral     | coral01–08 | 8 (single 8-core Xeon Silver 4208)  | 96 GB  | 2× NVIDIA RTX A4000 (16 GB)  |
| prawn     | prawn01–12 | 8 (single 8-core Xeon Silver 4208)  | 96 GB  | 2× NVIDIA RTX 2080 Ti (11 GB)|

jsc370 is a separate partition set up for this course; through it we have access to all of the nodes listed above.

For the lab:

  • Access the login node via ssh <utorid>@teach.cs.toronto.edu.
  • You must submit jobs through SLURM — direct SSH login to compute nodes is not allowed.
  • You may work in VS Code (preferred) or directly in the terminal.
  • Get comfortable with a terminal editor such as nano or vim for editing files on the cluster.

Problem 1: Connecting and exploring the cluster

1a. Open a terminal and SSH into the teach.cs login node:

ssh <your-utorid>@teach.cs.toronto.edu

Once logged in, check the cluster status:

sinfo
sinfo -p jsc370       # show only the jsc370 partition
squeue                # all running/pending jobs
squeue -u $USER       # your jobs only

Paste the output of sinfo here.

1b. Check how many CPU cores are available right now and what Python version is installed:

python3 --version
which python3
nproc              # cores on the login node (don't run compute-heavy work here!)

Paste the output here.

1c. Make a working directory for this lab on the cluster:

mkdir -p ~/jsc370/lab11
cd ~/jsc370/lab11

Problem 2: Your first batch job

We will start with the simplest possible job: print some system information from a compute node.

2a. Copy the script below and save it as lab11.py on the cluster. You can use nano lab11.py, vim lab11.py, or create it locally and scp it over.

#!/usr/bin/env python3
import sys, platform, multiprocessing, os

print("=== System Info from Compute Node ===")
print(f"Hostname:       {platform.node()}")
print(f"Python version: {sys.version}")
print(f"OS:             {platform.platform()}")
print(f"CPU cores:      {multiprocessing.cpu_count()}")
print(f"SLURM job ID:   {os.environ.get('SLURM_JOB_ID', 'N/A')}")
print(f"SLURM node:     {os.environ.get('SLURM_NODELIST', 'N/A')}")
print(f"CPUs allocated: {os.environ.get('SLURM_CPUS_PER_TASK', 'N/A')}")

2b. Write a SLURM batch script lab11.sh to run it. Fill in the blanks:

#!/bin/bash
#SBATCH --job-name=lab11
#SBATCH --partition=____________       # use the jsc370 partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH --output=lab11_%j.out    # %j is replaced by the job ID
#SBATCH --error=lab11_%j.err

python3 lab11.py

2c. Submit the job and monitor it:

sbatch lab11.sh            # submit
squeue -u $USER            # check status (PD = pending, R = running, CG = completing)

Once it completes, view the output:

cat lab11_*.out

Paste the output here. How many CPU cores does the compute node report? Is it different from the 1 core you requested?
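One way to probe this further: `multiprocessing.cpu_count()` reports every core on the machine, while on Linux `os.sched_getaffinity(0)` reports only the cores your process is actually allowed to run on, which SLURM typically restricts to your allocation. A minimal sketch you could add to lab11.py (whether the two numbers differ depends on how the cluster enforces allocations):

```python
import os
import multiprocessing

# All cores physically present on the node, regardless of the SLURM allocation.
total = multiprocessing.cpu_count()

# Cores this process may actually run on (Linux only); SLURM usually
# restricts this set to the CPUs you requested with --cpus-per-task.
usable = len(os.sched_getaffinity(0))

print(f"cpu_count():             {total}")
print(f"len(sched_getaffinity):  {usable}")
```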


Problem 3: Requesting multiple cores — parallel flu simulation

Now we submit the flu simulation from Lecture 10. The goal is to compare serial vs parallel execution on the cluster and see how speedup changes with core count.

3a. Save the following script as flu_parallel.py on the cluster. Read through it — it accepts two command-line arguments: n_vaccinated and n_workers.

#!/usr/bin/env python3
"""
flu_parallel.py — parallel influenza simulation on the cluster.
Usage: python3 flu_parallel.py <n_vaccinated> <n_workers>
"""
import sys
import numpy as np
import multiprocessing
import time
import os

# --- Simulation constants ---
SUSCEPTIBLE, INFECTED, RECOVERED, DECEASED = 0, 1, 2, 3
probs_transmit = [0.60, 0.25, 0.40, 0.08]
prob_death = 0.05

def simulate_flu(pop_size=900, n_sick=10, n_vaccinated=0, n_steps=20, seed=None):
    rng = np.random.default_rng(seed)
    n = int(np.sqrt(pop_size))
    status = np.zeros((n, n), dtype=int)
    sick_idx = rng.choice(pop_size, n_sick, replace=False)
    status.flat[sick_idx] = INFECTED
    vaccinated = np.zeros(pop_size, dtype=bool)
    if n_vaccinated > 0:
        vacc_idx = rng.choice(pop_size, n_vaccinated, replace=False)
        vaccinated[vacc_idx] = True
    vaccinated = vaccinated.reshape(n, n)
    deceased_counts = []
    for _ in range(n_steps):
        new_status = status.copy()
        for i in range(n):
            for j in range(n):
                if status[i, j] == SUSCEPTIBLE:
                    for di in [-1, 0, 1]:
                        for dj in [-1, 0, 1]:
                            if di == 0 and dj == 0:
                                continue
                            ni_, nj_ = (i + di) % n, (j + dj) % n
                            if status[ni_, nj_] == INFECTED:
                                both      = vaccinated[i, j] and vaccinated[ni_, nj_]
                                only_self = vaccinated[i, j] and not vaccinated[ni_, nj_]
                                only_nb   = not vaccinated[i, j] and vaccinated[ni_, nj_]
                                p = probs_transmit[3*both + 2*only_self + only_nb]
                                if rng.random() < p:
                                    new_status[i, j] = INFECTED
                                    break
                        else:
                            continue
                        break
                elif status[i, j] == INFECTED:
                    if rng.random() < prob_death:
                        new_status[i, j] = DECEASED
                    else:
                        new_status[i, j] = RECOVERED
        status = new_status
        deceased_counts.append((status == DECEASED).sum())
    return np.array(deceased_counts)

# Top-level so it can be pickled by multiprocessing
_n_vaccinated = 0
def run_one(seed):
    return simulate_flu(pop_size=900, n_sick=10,
                        n_vaccinated=_n_vaccinated,
                        n_steps=20, seed=seed)

if __name__ == "__main__":
    n_vaccinated = int(sys.argv[1]) if len(sys.argv) > 1 else 450
    n_workers    = int(sys.argv[2]) if len(sys.argv) > 2 else 1
    seeds = list(range(100))

    # Make n_vaccinated available to workers via the module-level variable
    _n_vaccinated = n_vaccinated

    # Serial baseline
    t0 = time.time()
    results_serial = [run_one(s) for s in seeds]
    t_serial = time.time() - t0

    # Parallel run
    ctx = multiprocessing.get_context("fork")
    t0 = time.time()
    with ctx.Pool(processes=n_workers) as pool:
        results_parallel = pool.map(run_one, seeds)
    t_parallel = time.time() - t0

    arr = np.array(results_parallel)
    job_id  = os.environ.get("SLURM_JOB_ID", "local")
    node    = os.environ.get("SLURM_NODELIST", "local")
    alloc   = os.environ.get("SLURM_CPUS_PER_TASK", str(n_workers))

    print(f"Job {job_id} on {node} | CPUs allocated: {alloc}")
    print(f"n_vaccinated={n_vaccinated}, n_workers={n_workers}, n_sims=100")
    print(f"Serial time:   {t_serial:.2f}s")
    print(f"Parallel time: {t_parallel:.2f}s")
    print(f"Speedup:       {t_serial/t_parallel:.2f}x")
    print(f"Mean deceased at step 20: {arr[:, -1].mean():.1f} "
          f"± {arr[:, -1].std():.1f}")

3b. Write a batch script flu_1core.sh that requests 1 core and runs the simulation with 450 vaccinated individuals:

#!/bin/bash
#SBATCH --job-name=flu_1core
#SBATCH --partition=jsc370
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:15:00
#SBATCH --output=flu_1core_%j.out
#SBATCH --error=flu_1core_%j.err

python3 flu_parallel.py 450 1

Submit it:

sbatch flu_1core.sh

3c. Write a second batch script flu_4core.sh that requests 4 cores and runs the same simulation with 4 workers. Only two lines need to change from 3b — which ones?

#!/bin/bash
#SBATCH --job-name=flu_4core
#SBATCH --partition=jsc370
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=____________   # fill in
#SBATCH --mem=4G
#SBATCH --time=00:15:00
#SBATCH --output=flu_4core_%j.out
#SBATCH --error=flu_4core_%j.err

python3 flu_parallel.py 450 ____________   # fill in

Submit it:

sbatch flu_4core.sh

3d. Monitor both jobs until they finish:

squeue -u $USER

Then collect the results:

cat flu_1core_*.out
cat flu_4core_*.out

Paste both outputs. Fill in the table:

| Script       | Cores requested | Serial time (s) | Parallel time (s) | Speedup |
|--------------|-----------------|-----------------|-------------------|---------|
| flu_1core.sh | 1               |                 |                   |         |
| flu_4core.sh | 4               |                 |                   |         |

Is the speedup on the cluster similar to what you observed on your laptop in Lab 10? What factors might cause differences?
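When thinking about why the speedup falls short of the core count, Amdahl's law is a useful model: if a fraction s of the work is inherently serial (process startup, pickling results, collecting output), the best possible speedup on n workers is 1 / (s + (1 − s)/n). A small sketch; the serial fractions below are illustrative, not measured:

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Best-case speedup when a fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# Even a modest serial fraction pulls the 4-worker speedup well below 4x.
for s in [0.0, 0.05, 0.10, 0.25]:
    print(f"serial fraction {s:.2f} -> predicted speedup {amdahl_speedup(s, 4):.2f}x")
```

Comparing these predictions with your measured speedup gives a rough estimate of the serial overhead in your job.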


Problem 4: Sweeping vaccination coverage

Now submit a job array — one task per vaccination level — so all five vaccination levels can run concurrently, potentially on different nodes. This demonstrates how cluster computing enables large-scale parameter sweeps.

4a. Save the following script as flu_sweep.py:

#!/usr/bin/env python3
"""
flu_sweep.py — one vaccination level per SLURM array task.
SLURM_ARRAY_TASK_ID controls which vaccination level is used.
Usage: automatically called by SLURM array job.
"""
import os, time
import numpy as np
import multiprocessing

SUSCEPTIBLE, INFECTED, RECOVERED, DECEASED = 0, 1, 2, 3
probs_transmit = [0.60, 0.25, 0.40, 0.08]
prob_death = 0.05

def simulate_flu(pop_size=900, n_sick=10, n_vaccinated=0, n_steps=20, seed=None):
    rng = np.random.default_rng(seed)
    n = int(np.sqrt(pop_size))
    status = np.zeros((n, n), dtype=int)
    sick_idx = rng.choice(pop_size, n_sick, replace=False)
    status.flat[sick_idx] = INFECTED
    vaccinated = np.zeros(pop_size, dtype=bool)
    if n_vaccinated > 0:
        vacc_idx = rng.choice(pop_size, n_vaccinated, replace=False)
        vaccinated[vacc_idx] = True
    vaccinated = vaccinated.reshape(n, n)
    deceased_counts = []
    for _ in range(n_steps):
        new_status = status.copy()
        for i in range(n):
            for j in range(n):
                if status[i, j] == SUSCEPTIBLE:
                    for di in [-1, 0, 1]:
                        for dj in [-1, 0, 1]:
                            if di == 0 and dj == 0:
                                continue
                            ni_, nj_ = (i + di) % n, (j + dj) % n
                            if status[ni_, nj_] == INFECTED:
                                both      = vaccinated[i, j] and vaccinated[ni_, nj_]
                                only_self = vaccinated[i, j] and not vaccinated[ni_, nj_]
                                only_nb   = not vaccinated[i, j] and vaccinated[ni_, nj_]
                                p = probs_transmit[3*both + 2*only_self + only_nb]
                                if rng.random() < p:
                                    new_status[i, j] = INFECTED
                                    break
                        else:
                            continue
                        break
                elif status[i, j] == INFECTED:
                    if rng.random() < prob_death:
                        new_status[i, j] = DECEASED
                    else:
                        new_status[i, j] = RECOVERED
        status = new_status
        deceased_counts.append((status == DECEASED).sum())
    return np.array(deceased_counts)

_n_vaccinated = 0
def run_one(seed):
    return simulate_flu(pop_size=900, n_sick=10,
                        n_vaccinated=_n_vaccinated,
                        n_steps=20, seed=seed)

if __name__ == "__main__":
    # Each array task gets a different vaccination level
    vacc_levels = [0, 225, 450, 675, 900]   # 0%, 25%, 50%, 75%, 100%
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    n_vaccinated = vacc_levels[task_id]
    n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", 4))

    _n_vaccinated = n_vaccinated

    ctx = multiprocessing.get_context("fork")
    t0 = time.time()
    with ctx.Pool(processes=n_workers) as pool:
        results = pool.map(run_one, range(100))
    elapsed = time.time() - t0

    arr = np.array(results)
    vacc_pct = 100 * n_vaccinated // 900
    print(f"task={task_id}  vacc={n_vaccinated} ({vacc_pct}%)  "
          f"workers={n_workers}  time={elapsed:.2f}s")
    print(f"Mean deceased (step 20): {arr[:, -1].mean():.2f} "
          f"± {arr[:, -1].std():.2f}")
    # Save per-task results for later aggregation
    np.save(f"results_vacc{n_vaccinated}.npy", arr)
    print(f"Saved results_vacc{n_vaccinated}.npy")

4b. Write the job array batch script flu_array.sh. A job array uses --array=0-4 to launch 5 tasks (IDs 0–4), one per vaccination level. Each task uses 4 cores.

#!/bin/bash
#SBATCH --job-name=flu_sweep
#SBATCH --partition=jsc370
#SBATCH --array=____________          # fill in: launch tasks 0 through 4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --time=00:15:00
#SBATCH --output=flu_sweep_%A_%a.out  # %A = array job ID, %a = task ID
#SBATCH --error=flu_sweep_%A_%a.err

python3 flu_sweep.py

Submit and monitor:

sbatch flu_array.sh
squeue -u $USER          # you should see 5 tasks (or some running, some pending)

4c. Once all tasks finish, collect the results:

cat flu_sweep_*_0.out   # 0% vaccination
cat flu_sweep_*_1.out   # 25% vaccination
cat flu_sweep_*_2.out   # 50% vaccination
cat flu_sweep_*_3.out   # 75% vaccination
cat flu_sweep_*_4.out   # 100% vaccination

Paste the output for all 5 tasks. Fill in the table:

| Task ID | Vaccination    | Mean deceased (step 20) | Std |
|---------|----------------|-------------------------|-----|
| 0       | 0% (0/900)     |                         |     |
| 1       | 25% (225/900)  |                         |     |
| 2       | 50% (450/900)  |                         |     |
| 3       | 75% (675/900)  |                         |     |
| 4       | 100% (900/900) |                         |     |

Describe the trend. At what vaccination level do deaths drop most sharply?


Problem 5: Aggregating and plotting results locally

After the array job completes, download the .npy result files to your laptop and make a plot.

5a. Copy the files from the cluster to your laptop (run this in a local terminal, not on the cluster):

scp "<utorid>@teach.cs.toronto.edu:~/jsc370/lab11/results_vacc*.npy" <path-to-a-directory>

If scp does not work, you can download the .npy files directly from GitHub:

git clone https://github.com/BullDF/lab11-npy-files
cd lab11-npy-files

5b. Run the following Python code locally to produce a summary plot:

import numpy as np
import matplotlib.pyplot as plt

vacc_levels = [0, 225, 450, 675, 900]
vacc_labels = ["0%", "25%", "50%", "75%", "100%"]
colors = ["tomato", "orange", "gold", "steelblue", "mediumseagreen"]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left: boxplot of final deceased count across 100 simulations
data = []
for v in vacc_levels:
    arr = np.load(f"results_vacc{v}.npy")
    data.append(arr[:, -1])   # final time step

axes[0].boxplot(data, labels=vacc_labels, patch_artist=True,
                boxprops=dict(facecolor="lightblue"),
                medianprops=dict(color="black"), showfliers=False)
axes[0].set_xlabel("Vaccination coverage")
axes[0].set_ylabel("Deceased at step 20")
axes[0].set_title("Distribution of deaths by vaccination level\n(100 simulations each)")

# Right: mean contagion curve over time
steps = np.arange(1, 21)
for v, label, color in zip(vacc_levels, vacc_labels, colors):
    arr = np.load(f"results_vacc{v}.npy")
    axes[1].plot(steps, arr.mean(axis=0), label=label, color=color)
axes[1].set_xlabel("Time step")
axes[1].set_ylabel("Mean cumulative deceased")
axes[1].set_title("Contagion curve by vaccination coverage")
axes[1].legend(title="Vaccination")

plt.tight_layout()
plt.savefig("flu_cluster_results.png", dpi=150)
print("Saved flu_cluster_results.png")

Include the plot in your submission (paste the image or attach the file). What does the plot reveal about herd immunity?


Problem 6: Interactive session (bonus)

Sometimes you want to test code interactively on a compute node before submitting a batch job. SLURM’s srun command opens an interactive shell on a node.

6a. Request an interactive session on the jsc370 partition with 2 cores for 30 minutes:

srun --partition=jsc370 --ntasks=1 --cpus-per-task=2 --mem=2G \
     --time=00:30:00 --pty bash

Once the shell opens, you are on a compute node. Verify:

hostname
nproc
python3 -c "import multiprocessing; print(multiprocessing.cpu_count())"

Paste the output. Does nproc match --cpus-per-task=2 or the physical core count of the node?

6b. Run a quick parallel test directly in the interactive session:

python3 -c "
import multiprocessing, time
ctx = multiprocessing.get_context('fork')
def sq(x): return x * x
t0 = time.time()
with ctx.Pool(2) as p:
    r = p.map(sq, range(1000000))
print(f'Done in {time.time()-t0:.2f}s, sum={sum(r)}')
"

Paste the output.

Exit the interactive session when done:

exit

Summary

Answer the following questions in your own words:

  1. What is the role of SLURM on the teach.cs cluster? Why can’t you just SSH directly to a compute node?

  2. What is a SLURM job array, and when is it more convenient than writing separate batch scripts?

  3. On the cluster you requested --cpus-per-task=4. On your laptop in Lab 10 you also used 4 workers. Was the speedup similar? What hardware differences might explain any gap?

  4. You ran 100 simulations × 5 vaccination levels = 500 simulations total. Roughly how long would this take serially on a single core (extrapolate from your timing results)? How long did it take with the job array?
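For question 4, the extrapolation is simple arithmetic. The sketch below uses a placeholder per-simulation time; substitute the timings you actually measured in Problems 3 and 4:

```python
# All timings here are placeholders; plug in your own measured values.
t_per_sim = 0.5            # seconds per simulation on one core (placeholder)
n_levels = 5
n_sims_per_level = 100

# Fully serial: every simulation runs one after another on a single core.
t_serial = t_per_sim * n_levels * n_sims_per_level
print(f"Estimated fully serial time: {t_serial:.0f}s ({t_serial / 60:.1f} min)")

# Job array: the 5 tasks run concurrently, each splitting its 100
# simulations across 4 workers, so wall time is roughly one task's share.
t_array = t_per_sim * n_sims_per_level / 4
print(f"Rough job-array wall time:   {t_array:.0f}s")
```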

Your answers here.