When users log in to the system, they do so on the frontends, which provide access to the system resources. Users should not run computational work on these machines: all users share them, and executing processes there slows down the work of everyone else.
All HPC jobs running on the system must be executed on the calculation nodes by submitting a script to the SCAYLE job manager.
The job manager, or queue manager, is a system that sends jobs to the calculation nodes to which each user has access, controls their execution, and prevents several jobs from sharing the same resources, which would increase their execution times. The manager used by SCAYLE is SLURM.
In the case of SLURM, the most commonly used commands are:
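As a quick reference, these are the standard SLURM client commands (the exact set available may depend on the SCAYLE configuration; the job ID shown is hypothetical):

```shell
sbatch job.sh      # submit a batch script to the queue
squeue -u $USER    # list the status of your pending and running jobs
scancel 12345      # cancel the job with ID 12345
sinfo              # show the state of the partitions and nodes
```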
The first step in being able to submit a job to the job manager is to write a submission script containing two types of lines: directives for the job manager and Linux commands.
The latter are the commands interpreted by the Linux shell defined in the first line of the script (#!/bin/bash). The directives for the job manager are placed at the beginning of the script; in the case of SLURM, they are lines that begin with the string "#SBATCH" followed by one of the available options. These directives are processed by the manager when the script is submitted with the sbatch command, and they provide the information the manager needs so that the execution nodes perform the work as the user intends. For example, the following batch script:
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --job-name=hello_world
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --output=hello_world_%A_%a.out
#SBATCH --error=hello_world_%A_%a.err
#SBATCH --partition=haswell
#SBATCH --qos=normal
source /soft/calendula2/intel/ipsxe_2018_u4/parallel_studio_xe_2018/psxevars.sh
srun -n $SLURM_NTASKS hello_world.ex
shows an example of a batch script that will run a basic program on 32 cores. Line 1, as already detailed, specifies the shell that will execute the Linux commands of the script. The following lines, 2 to 9, are the directives that the job manager will interpret.
In this example:
#SBATCH --ntasks=32: sets the number of cores requested for the execution of the script.
#SBATCH --job-name=hello_world: name assigned to the job.
#SBATCH --mail-user=: email address to which job-related notifications will be sent.
#SBATCH --mail-type=ALL: defines the circumstances under which an email is sent to the user. In this case, "ALL" means at the start of the execution, at the end of the execution, and if the job is cancelled.
#SBATCH --output=hello_world_%A_%a.out: the standard output file. If no separate file is defined for the errors, by default the standard output of the execution and the error output are unified in a single file.
#SBATCH --error=hello_world_%A_%a.err: defines the error output file.
#SBATCH --partition=haswell: partition to which the job is submitted.
#SBATCH --qos=normal: QOS with which the job is submitted.
To check the status of the jobs submitted by the user, the command is:
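A minimal sketch, assuming the standard SLURM squeue command is available on the frontends (the job ID shown is hypothetical):

```shell
squeue -u $USER              # jobs belonging to the current user
squeue -j 12345              # a specific job, by job ID
squeue -u $USER -t RUNNING   # only the jobs that are currently running
```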
Each QOS (Quality of Service) allows various parameters to be customized, such as the maximum time a job can run, the maximum number of cores a user can request, or which users can submit jobs to a partition. If nothing is specified, jobs are submitted with the normal QOS.
By default, users have access to certain limits. To request access to a particular QOS, the user must contact the support staff.
These QOS may change depending on the needs of the system.
When the same job has to be repeated a number of times, varying only the value of some parameter, the job manager can perform this task automatically. These jobs are called array jobs.
To submit an array job you must use the --array option of the sbatch command, for example from the command line:
frontend> sbatch ... --array 1-20 ... test.sh
would submit 20 simultaneous executions of the test.sh script. If we want to include it in our own script, we should add it to the rest of the job manager options:
#SBATCH --output=hello_world_%A_%a.out
#SBATCH --error=hello_world_%A_%a.err
#SBATCH --partition=haswell
#SBATCH --qos=normal
#SBATCH --array=1-20
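For example, a minimal array-job script might use the task index to select a different input file for each execution. This is only a sketch: the input file naming is hypothetical, and a default index of 1 is supplied so the selection logic can also be followed outside the cluster.

```shell
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --output=hello_world_%A_%a.out
#SBATCH --error=hello_world_%A_%a.err
#SBATCH --array=1-20

# SLURM sets SLURM_ARRAY_TASK_ID to the index of this task (1..20);
# the fallback value of 1 lets the logic run outside SLURM as well.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}
INPUT_FILE="input_${TASK_ID}.dat"   # hypothetical per-task input file
echo "Task ${TASK_ID} processing ${INPUT_FILE}"
```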
Given the limits imposed by the queuing system, Slurm offers the option of determining the number of jobs of the array that we want to have running simultaneously, using the % separator:
#SBATCH --array=1-20%4
With the previous line we indicate that we want to launch an array job of 20 jobs, of which at most 4 run simultaneously.
This does not guarantee that the jobs start one right after another; it depends on the load of the machine and on priorities.
There are a number of environment variables that are defined in the work environment when the script is executed through the job manager, and these variables can be used inside the script. Among the most useful in everyday work are the following:
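A sketch of how some of the commonly documented SLURM variables can be read from inside a batch script. Fallback values are supplied so the snippet can also be inspected outside the cluster; when run under SLURM, each variable holds the real value.

```shell
#!/bin/bash
# Common SLURM environment variables, with fallbacks for use outside SLURM.
JOB_ID=${SLURM_JOB_ID:-unset}            # numeric ID assigned to the job
JOB_NAME=${SLURM_JOB_NAME:-unset}        # name given with --job-name
NTASKS=${SLURM_NTASKS:-unset}            # number of tasks requested with --ntasks
NODELIST=${SLURM_JOB_NODELIST:-unset}    # nodes allocated to the job
SUBMIT_DIR=${SLURM_SUBMIT_DIR:-unset}    # directory from which sbatch was invoked
ARRAY_ID=${SLURM_ARRAY_TASK_ID:-unset}   # index of this task in an array job

echo "Job ${JOB_ID} (${JOB_NAME}) with ${NTASKS} tasks on ${NODELIST}"
```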