Basics of Interacting with SLURM Scheduler

Overview

Teaching: 5 min
Exercises: 20 min
Questions
Objectives
  • Query the SLURM job queue

  • Submit a job to the queue

  • Cancel a submitted job

Using sinfo to see what resources are available

To view the state of the available nodes (organised in partitions)

sinfo
PARTITION AVAIL JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES STATE      NODELIST
work*     up    1-infini 1-00:00:00    256 2:64:2       3 planned    nid[001137,001401-001402]
work*     up    1-infini 1-00:00:00    256 2:64:2      38 down$      nid[001011,001046,001071,001086,001104-001111,001126,001135,001187,001190,001243,001273,001282-001283,001303,001307,001420,001563,001689,002038,002049,002304,002321,002348,002442,002562,002564-002567,002585,002805]
work*     up    1-infini 1-00:00:00    256 2:64:2       5 completing nid[001017,001036,001227,001412,002818]
work*     up    1-infini 1-00:00:00    256 2:64:2       3 down*      nid[001338-001339,002433]
work*     up    1-infini 1-00:00:00    256 2:64:2       1 completing nid002312
work*     up    1-infini 1-00:00:00    256 2:64:2       9 drained    nid[001146-001147,001152-001155,001159,001204,001512]
long      up    1        4-00:00:00    256 2:64:2       2 mixed      nid[002596,002603]
long      up    1        4-00:00:00    256 2:64:2       6 allocated  nid[002597-002602]
copy      up    1-infini 2-00:00:00     64 1:32:2       3 mixed      setonix-dm[01-03]
copy      up    1-infini 2-00:00:00     64 1:32:2       1 down       setonix-dm04
askaprt   up    1-infini 1-00:00:00    256 2:64:2       9 down$      nid[001803,001854,001869,001885,001973,001986-001987,002641,002643]
askaprt   up    1-infini 1-00:00:00    256 2:64:2       2 mixed$     nid[001768,001831]
askaprt   up    1-infini 1-00:00:00    256 2:64:2       1 completing nid002751
askaprt   up    1-infini 1-00:00:00    256 2:64:2       1 reserved   nid002636
askaprt   up    1-infini 1-00:00:00    256 2:64:2      11 mixed      nid[002615-002616,002619,002621,002625,002628-002629,002633,002635,002644,002675]
debug     up    1-4         1:00:00    256 2:64:2       8 maint      nid[002604-002611]
highmem   up    1        4-00:00:00    256 2:64:2       1 mixed      nid001505
highmem   up    1        4-00:00:00    256 2:64:2       6 allocated  nid[001504,001506-001509,001511]
highmem   up    1        4-00:00:00    256 2:64:2       1 idle       nid001510
gpu       up    1-infini 1-00:00:00    128  8:8:2       5 maint      nid[002836,002856,002864,002866,002882]
gpu       up    1-infini 1-00:00:00    128  8:8:2       1 down*      nid002834
gpu       up    1-infini 1-00:00:00    128  8:8:2       2 idle       nid[002932,002938]
gpu-dev   up    1-infini    4:00:00    128  8:8:2       4 down$      nid[002944,002946,003008,003010]
gpu-dev   up    1-infini    4:00:00    128  8:8:2      16 idle       nid[002948,002950,002984,002986,002988,002990,002992,002994,002996,002998,003000,003002,003004,003006,003012,003014]
gpu-highm up    1-infini 1-00:00:00    128  8:8:2       4 down$      nid[002888,002908,002968,002976]
gpu-highm up    1-infini 1-00:00:00    128  8:8:2       6 maint      nid[002890,002900,002902,002910,002970,002978]
gpu-highm up    1-infini 1-00:00:00    128  8:8:2       5 mixed      nid[002956,002958,002962,002964,002966]
casda     up    1-infini 1-00:00:00     64 1:32:2       1 idle       casda-an01
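
The full listing can be long. If you only need a condensed, per-partition summary, sinfo also accepts a summary flag (a useful extra, not part of the exercise):

sinfo -s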

Using squeue to check running jobs

To see what jobs are currently in the queue (running or waiting to run) on the cluster

squeue
 JOBID  PARTITION     USER ACCOUNT                    NAME  ST REASON    START_TIME                TIME  TIME_LEFT NODES CPUS PRIORITY NODELIST
363117       long      *** pawsey0399      str_detect_chr3   R None      09:07:19               9:30:05 3-14:29:55     1   34 90037 nid002596
363117       long      *** pawsey0399           str_detect   R None      09:07:19               9:30:05 3-14:29:55     1   34 90037 nid002596
356825       long   ****** d71              70_update_long   R None      08:29:18              10:08:07 3-13:51:53     1  256 120309 nid002597
348465       long  ******* pawsey0386         M2s60TD_smLO   R None      08:29:18              10:08:07 3-13:51:53     1  256 90579 nid002601
362041    highmem   ****** pawsey0263      asm_Hexaprotodo   R None      05:38:49              12:58:36 3-11:01:24     1  128 75289 nid001505
347976       long ******** pawsey0106         Coupling_eta   R None      08:29:18              10:08:07 2-13:51:53     1   48 120819 nid002596

NODELIST refers to the node(s) on which the job is running.
ST refers to the state of the job. ‘PD’ means pending, ‘R’ means running.
REASON refers to why a job is not yet running. ReqNodeNotAvail = the requested nodes are not available, Priority = a higher-priority job is ahead in the queue, Resources = the job is waiting for the requested resources to become free.
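
For a pending job, you can also ask SLURM for an estimated start time (replace jobID with the ID of one of your jobs):

squeue -j jobID --start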

To refine the listing to just jobs from a certain partition, use the -p flag

squeue -p work
 JOBID  PARTITION     USER ACCOUNT                    NAME  ST REASON    START_TIME                TIME  TIME_LEFT NODES CPUS PRIORITY NODELIST
357595       work     **** pawsey0380      S13_RA110_S320_   R None      18:39:15                  0:23   23:59:37     1  256 75282 nid001568
363718       work   ****** pawsey0382               lammps   R None      18:37:14                  2:24   23:57:36     1  128 75119 nid001608
363718       work   ****** pawsey0382               lammps   R None      18:35:42                  3:56   23:56:04     1  128 75119 nid001137
357595       work     **** pawsey0380      S13_RA110_S320_   R None      18:35:08                  4:30   23:55:30     1  256 75282 nid002352
363718       work   ****** pawsey0382               lammps   R None      18:30:34                  9:04   23:50:56     1  128 75118 nid001200
357595       work     **** pawsey0380      S13_RA110_S320_   R None      18:28:35                 11:03   23:48:57     1  256 75281 nid001398
363826       work      *** m49                  CuO-111-Vo   R None      18:22:22                 17:16   23:42:44     1  144 75054 nid001516

To refine the listing to a certain user (usually yourself), use the -u flag

squeue -u $USER
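
On most systems you can also wrap the command in watch to refresh the listing automatically (press Ctrl+C to exit):

watch -n 30 squeue -u $USER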

Submitting a job to the queue using sbatch

sbatch test.sh

Each job gets a unique identifier (Job ID)
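
When the job is accepted, sbatch confirms the submission by printing that job ID. The number below is just illustrative:

Submitted batch job 363999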

Can you see your job running in the queue? What is the job ID?

squeue -u $USER

Cancelling a submitted job using scancel

Sometimes you will want to cancel a job. Maybe you were just testing the script, or maybe you realised you made a mistake!

To cancel a specific job, pass its job ID to scancel

scancel jobID
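
For example, using the illustrative job ID from above:

scancel 363999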

You can also cancel all jobs under your name with

scancel -u $USER
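
If you only want to remove jobs that have not started yet, scancel can also filter by job state:

scancel -u $USER --state=PENDING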

Understanding how to allocate resources to a job

This was touched on in today’s lecture, but let’s revisit it by taking a closer look at the test.sh script we just ran.

cat test.sh
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:05:00
#SBATCH --export=NONE

echo 'I am a test job'
echo 'sleeping for 5 minutes'
sleep 5m

The #SBATCH lines tell SLURM which computational resources and specifications we want for our job. It is also important to note that a SLURM job script starts with #!/bin/bash because it is essentially a bash script.
Two flags worth knowing, although they do not appear in the script above, are --reservation and --account. The --reservation flag selects a special reservation (for example, one set aside for a training session); you typically would not need to specify it. The --account flag tells the system which project allocation to ‘charge’ for the compute time. Both can be added as extra #SBATCH lines or given on the sbatch command line.
The --nodes flag specifies how many nodes you want to use.
The --ntasks-per-node flag specifies how many tasks to run on each node.
The --cpus-per-task flag specifies how many CPUs (cores) each task needs.
The --mem flag specifies how much memory the job needs on each node.
The --time flag sets the maximum time your job is allowed to run (the wall-clock limit). This job will be cut off by SLURM at the 5 minute mark.
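
Most #SBATCH directives can also be supplied (or overridden) on the sbatch command line. For example, to give a single run a 10 minute wall-time limit without editing the script:

sbatch --time=00:10:00 test.sh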

Key Points

  • SLURM manages the allocation and resourcing of all submitted jobs

  • Being able to check the status of your job is useful