Job Manager


When users log in to the system, they connect to the frontends, which give access to the system resources. Users should not run computational work on these machines: the frontends are shared by all users, and processes executed on them slow down everyone else's work.
All HPC jobs running on the system must be executed on the compute nodes by submitting a script to the SCAYLE job manager.
The job manager, or queue manager, is the system that dispatches jobs to the compute nodes each user has access to, controls their execution, and prevents several jobs from sharing the same resources, which would increase their execution times. The manager used by SCAYLE is SLURM.

In the case of SLURM, the most commonly used commands are the following (a brief example session is shown after the list):

  • sbatch: submits a job script
  • squeue: shows the status of submitted jobs
  • scancel: cancels a job
  • salloc: creates an interactive session on the compute nodes, for example to compile programs.
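
As a quick orientation, a typical session combining these commands might look as follows (the script name job.sh and the job identifier 123456 are illustrative):

  frontend> sbatch job.sh              # submit the job script
  Submitted batch job 123456
  frontend> squeue                     # check the status of the job
  frontend> scancel 123456             # cancel the job if necessary
  frontend> salloc --ntasks=4          # open an interactive session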

The first step in submitting a job to the job manager is to write a submission script containing two types of lines: directives for the job manager and Linux commands.

The latter are the commands that will be interpreted by the Linux shell defined in the first line of the script (#!/bin/bash). The directives for the job manager are placed at the beginning of the script; in the case of SLURM, they are lines that begin with the string "#SBATCH" followed by one of the available options. These directives are processed by the manager when the script is submitted with the sbatch command, and they provide the information needed to run the job on the compute nodes as the user intends. For example, the following batch script:

#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --job-name=hello_world
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --output=hello_world_%A_%a.out
#SBATCH --error=hello_world_%A_%a.err
#SBATCH --partition=haswell
#SBATCH --qos=normal
source /soft/calendula2/intel/ipsxe_2018_u4/parallel_studio_xe_2018/psxevars.sh
srun -n $SLURM_NTASKS hello_world.ex

shows an example of a batch script that will run a basic program on 32 cores. Line 1, as already explained, specifies the shell that will interpret the Linux commands of the script. Lines 2 through 9 are the directives that the job manager will interpret.
In this example:

#SBATCH --ntasks=32; sets the number of cores requested for the execution of the script
#SBATCH --job-name=hello_world; name assigned to the job
#SBATCH --mail-user=; email address to which job-related notifications will be sent
#SBATCH --mail-type=ALL; defines the circumstances under which an e-mail is sent to the user. In this case, "ALL" means at the beginning of the execution, at the end of the execution, and if the job is cancelled.
#SBATCH --output=hello_world_%A_%a.out; the standard output file. If no separate file is defined for errors, by default the standard output and the error output are unified in a single file.
#SBATCH --error=hello_world_%A_%a.err; defines the error output file.
#SBATCH --partition=haswell; partition to which the job is sent.
#SBATCH --qos=normal; QOS with which the job is sent.
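
Once saved, the script is submitted with the sbatch command, which answers with the identifier assigned to the job (the script name hello_world.sh and the job identifier are illustrative):

  frontend> sbatch hello_world.sh
  Submitted batch job 123456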

To check the status of the jobs submitted by the user, the command is:

$ squeue
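
The output lists one line per job, with values similar to the following (the job identifier, user name and node names are illustrative). The ST column shows the job state, for example R for running and PD for pending, and squeue -u <username> restricts the listing to a single user's jobs:

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  123456   haswell hello_wo  user001  R       0:12      2 cn[001-002]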

Each QOS (Quality of Service) customizes various parameters, such as the maximum time a job can run, the maximum number of cores a user can request, or which users can submit jobs with that QOS. If nothing is specified, jobs run under the normal QOS.
By default, users have access to certain limits. To request access to a particular QOS, the user must contact the support staff.

  Name     Priority  MaxWall      MaxTRESPU  MaxJobsPU
  normal   100       5-00:00:00   cpu=512    50
  long     100       15-00:00:00  cpu=256
  xlong    100       30-00:00:00  cpu=128

These QOS may change depending on the needs of the system.
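
For example, a job expected to need up to ten days of wall time (more than the five-day MaxWall of the normal QOS, but within the limit of long) would request the long QOS explicitly; the ten-day value is illustrative:

  #SBATCH --qos=long
  #SBATCH --time=10-00:00:00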

When the same job has to be repeated a number of times, varying only the value of some parameter, the job manager can perform this task automatically. This type of job is called an array job.
To submit an array job you must use the --array option of the sbatch command, for example from the command line:

 frontend> sbatch ... --array 1-20 ... test.sh

would submit an array of 20 executions of the test.sh script. To include the option in our own script, we add it to the rest of the job manager directives:

  #SBATCH --output=hello_world_%A_%a.out
  #SBATCH --error=hello_world_%A_%a.err
  #SBATCH --partition=haswell
  #SBATCH --qos=normal
  #SBATCH --array=1-20

Given the limits imposed by the queuing system, SLURM also offers the option of restricting the number of array tasks that run simultaneously.

  #SBATCH --array=1-20%4

The previous line indicates that we want to launch an array job of 20 tasks, with at most 4 of them running simultaneously.

This does not guarantee that the tasks will start one right after another; start times depend on the load of the machine and on job priorities.
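
Inside the script, each task of the array can read its own index from the $SLURM_ARRAY_TASK_ID environment variable and use it to vary the parameter in question. A minimal sketch, assuming hypothetical input files input_1.dat to input_20.dat and a program my_program.ex:

  #!/bin/bash
  #SBATCH --job-name=param_sweep
  #SBATCH --output=sweep_%A_%a.out
  #SBATCH --array=1-20%4

  # Each array task processes the input file that matches its index
  srun ./my_program.ex input_${SLURM_ARRAY_TASK_ID}.dat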

A number of environment variables are defined in the work environment when the script is executed through the job manager, and they can be used inside the script. Among the most useful in everyday work are the following (see the sketch after this list):

  • $SLURM_JOB_ID: job identifier.
  • $SLURM_JOB_NAME: job name.
  • $SLURM_SUBMIT_DIR: directory from which the job was submitted.
  • $SLURM_JOB_NUM_NODES: number of nodes assigned to the job.
  • $SLURM_CPUS_ON_NODE: number of cores on the node.
  • $SLURM_NTASKS: total number of tasks (cores requested with --ntasks) in the job.
  • $SLURM_NODEID: index of the current node within the nodes assigned to the job.
  • $SLURM_PROCID: index of the current task within the job.
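
A minimal sketch that prints some of these variables from inside a job, useful for verifying the resources actually granted (the partition and task count are illustrative):

  #!/bin/bash
  #SBATCH --ntasks=4
  #SBATCH --partition=haswell
  #SBATCH --output=env_info_%j.out

  # Report the job context provided by SLURM
  echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) submitted from $SLURM_SUBMIT_DIR"
  echo "Nodes: $SLURM_JOB_NUM_NODES, cores on node: $SLURM_CPUS_ON_NODE, total tasks: $SLURM_NTASKS"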