...
- Slurm can run parallel calculations by allocating a set of nodes specifically for that task.
- Users running serial jobs with minimal input/output can start them directly from the command line without writing batch files.
- Slurm also handles batch jobs.
- Slurm allows users to run interactive command-line or X11 jobs.
Node Status
sinfo
sinfo reports the state of partitions and nodes managed by Slurm. Example output:
...
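As a sketch of common variations (these flags come from the standard sinfo command; the partition name `normal` is only an illustration, and the output depends on your cluster):

```shell
# Node-oriented, long-format listing (one line per node)
$ sinfo -N -l

# Show only a specific partition (partition name is hypothetical)
$ sinfo -p normal
```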
srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.
Load any environment modules before the srun/sbatch/salloc commands. These commands will copy your environment as it is at job submission time.
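For example (the module name `openmpi3` here is illustrative; load whatever modules your job actually needs):

```shell
# Load modules first; the environment is copied at submission time
$ module load openmpi3
$ sbatch job.mpi
```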
srun
You can start a calculation/job directly from the command prompt by using srun. This command submits jobs to the Slurm job submission system and can also be used to start the same command on multiple nodes. srun has a wide variety of options for specifying resource requirements, including minimum and maximum node counts, processor counts, specific nodes to use or avoid, and node characteristics such as memory and disk space.
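A sketch of a few of these options (the executable name `a.out` and all of the values are illustrative):

```shell
# Run 8 tasks across 2 nodes, requesting 4 GB of memory per node
# and a 10-minute time limit
$ srun -N 2 -n 8 --mem=4G --time=00:10:00 ./a.out
```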
...
```
$ prun -v a.out
[prun] Master compute host = dn1
[prun] Resource manager = slurm
[prun] Setting env variable: OMPI_MCA_mca_base_component_show_load_errors=0
[prun] Setting env variable: PMIX_MCA_mca_base_component_show_load_errors=0
[prun] Setting env variable: OMPI_MCA_ras=^tm
[prun] Setting env variable: OMPI_MCA_ess=^tm
[prun] Setting env variable: OMPI_MCA_plm=^tm
[prun] Setting env variable: OMPI_MCA_io=romio314
[prun] Launch cmd = mpirun a.out (family=openmpi3)

 Hello, world (4 procs total)
    --> Process # 0 of 4 is alive. -> dn1
    --> Process # 1 of 4 is alive. -> rb2u1
    --> Process # 2 of 4 is alive. -> rb2u2
    --> Process # 3 of 4 is alive. -> rb2u4
```
...
Interactive Shells
You can allocate an interactive shell on a node for running calculations by hand.
...
```
$ srun -n1 --pty bash
```
To allocate a shell with X11 forwarding (option 1) so that you can use the rappture GUI:
```
$ module load rappture
$ srun -n1 --x11 --pty bash
```
To allocate a shell with X11 forwarding (option 2, which may work better than option 1 depending on the software, but is slightly more involved):
```
# With this method, you end up in two subshells underneath your main nanolab login:
# nanolab main login -> salloc subshell on nanolab -> subshell via ssh -Y on the allocated node
# Do not load modules until you have connected to the allocated node, as exemplified below:
$ salloc -n1
# You will see output such as:
salloc: Granted job allocation 18820
salloc: Waiting for resource configuration
salloc: Nodes <nodename> are ready for job
$ ssh -Y <nodename_from_above>
# Now load your module
$ module load rappture
# Now run your software
# If you forget the -Y option to ssh, you will see an error such as:
xterm: Can't open display:
xterm: DISPLAY is not set
# When exiting, you will have to exit an extra time:
# first out of the node that was allocated to you,
# then out of the salloc command, which will print output such as
salloc: Relinquishing job allocation 18820
# and then a third time from nanolab itself
```
If you are using option 1 and did not connect to the head node with X11 forwarding enabled, you will see the following error:
```
$ srun -n1 --x11 --pty bash
srun: error: No DISPLAY variable set, cannot setup x11 forwarding.
```
sbatch Examples
Load any environment modules before using the sbatch command to submit your job.
```
#!/bin/bash
#SBATCH -J test           # job name
#SBATCH -o job.%j.out     # name of standard output file (%j expands to the job ID)
#SBATCH -N 2              # number of nodes requested
#SBATCH -n 16             # total number of MPI tasks requested
#SBATCH -t 01:30:00       # run time (hh:mm:ss)

# Launch MPI-based executable
prun ./a.out
```
...
```
$ sbatch job.mpi
Submitted batch job 339
```
...
Stopping a Job
scancel
scancel cancels a pending or running job or job step. To cancel a job, you need its job ID, which you can find with the squeue command described above. To cancel the job with ID 84, type:
```
$ scancel 84
```
If you rerun squeue, you will see that the job is gone.
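A related sketch: scancel also accepts filters, so you can cancel every job you own at once (use with care):

```shell
# Cancel all jobs belonging to the current user
$ scancel -u $USER
```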