...
- It can run parallel calculations by allocating a dedicated set of nodes to the task.
- Users running serial jobs with minimal input/output can start them directly from the command line without writing batch files.
- Slurm also handles batch jobs.
- Slurm allows users to run interactive command-line or X11 jobs.
Node Status
sinfo
sinfo reports the state of partitions and nodes managed by Slurm. Example output:
...
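As a rough sketch of working with this output, you can tally idle nodes by filtering the NODES and STATE columns. The sample lines in the here-doc below are hypothetical, not taken from this cluster; on a real system you would pipe `sinfo -h` instead:

```shell
# Hypothetical sinfo-style output (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST);
# on a real cluster, replace the variable with: sample=$(sinfo -h)
sample='normal* up infinite 2 idle dn[1-2]
normal* up infinite 1 alloc rb2u1
debug up 30:00 1 idle rb2u4'
# Sum the NODES column (4th) for rows whose STATE column (5th) is "idle"
echo "$sample" | awk '$5 == "idle" { n += $4 } END { print n }'
```

Here this prints `3` (two idle nodes in `normal` plus one in `debug`).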
```
$ prun -v a.out
[prun] Master compute host = dn1
[prun] Resource manager = slurm
[prun] Setting env variable: OMPI_MCA_mca_base_component_show_load_errors=0
[prun] Setting env variable: PMIX_MCA_mca_base_component_show_load_errors=0
[prun] Setting env variable: OMPI_MCA_ras=^tm
[prun] Setting env variable: OMPI_MCA_ess=^tm
[prun] Setting env variable: OMPI_MCA_plm=^tm
[prun] Setting env variable: OMPI_MCA_io=romio314
[prun] Launch cmd = mpirun a.out (family=openmpi3)

 Hello, world (4 procs total)
    --> Process # 0 of 4 is alive. -> dn1
    --> Process # 1 of 4 is alive. -> rb2u1
    --> Process # 2 of 4 is alive. -> rb2u2
    --> Process # 3 of 4 is alive. -> rb2u4
```
...
Interactive Shells
You can allocate an interactive shell on a node for running calculations by hand.
...
To allocate a shell with X11 forwarding (option 1) so that you can use the rappture GUI:
```
$ module load rappture
$ srun -n1 --x11 --pty bash
```
To allocate a shell with X11 forwarding (option 2), which may work better than option 1 depending on the software but is slightly more complicated to use:
```
# With this method, you end up in two subshells underneath your main nanolab login:
#   nanolab main login -> salloc subshell on nanolab -> subshell via ssh -Y on the allocated node
# Do not load modules until you have connected to the allocated node, as shown below:
$ salloc -n1
# You will see output such as:
salloc: Granted job allocation 18820
salloc: Waiting for resource configuration
salloc: Nodes <nodename> are ready for job
$ ssh -Y <nodename_from_above>
# Now load your module
$ module load rappture
# Now run your software
# If you forget the -Y option to ssh, you will see an error such as:
xterm: Can't open display:
xterm: DISPLAY is not set
# When exiting, you will have to exit an extra time:
#   first out of the node that was allocated to you,
#   then out of the salloc command, which will print output such as
#     salloc: Relinquishing job allocation 18820
#   then a third time from nanolab itself.
```
If you did not connect to the head node with X11 forwarding enabled and are using option 1, you will see the following error:
```
$ srun -n1 --x11 --pty bash
srun: error: No DISPLAY variable set, cannot setup x11 forwarding.
```
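You can check whether X11 forwarding is active in your current session before requesting an X11 job. A small sketch; the ssh command in the message is illustrative, so substitute your own user and head node:

```shell
# Check whether an X11 display is available in this session
if [ -n "${DISPLAY:-}" ]; then
  echo "X11 forwarding looks active (DISPLAY=$DISPLAY)"
else
  echo "No DISPLAY set; log in again with: ssh -Y <user>@<headnode>"
fi
```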
...
```
$ sbatch job.mpi
Submitted batch job 339
```
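For reference, a minimal batch script looks like the sketch below. The job name, task counts, and time limit are assumptions, not values from this cluster, so adjust them to your own job before submitting:

```shell
# Write a hypothetical minimal batch script; every #SBATCH value here is an example
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello       # name shown by squeue
#SBATCH --nodes=1              # number of nodes to allocate
#SBATCH --ntasks=4             # total number of tasks (MPI ranks)
#SBATCH --time=00:10:00        # wall-clock limit hh:mm:ss
#SBATCH --output=hello_%j.out  # %j expands to the job ID
echo "Hello from $(hostname)"
EOF
# Submit it with: sbatch job.sh
head -n 1 job.sh
```

The `#SBATCH` lines are comments to the shell but are read by sbatch as resource requests.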
...
Stopping a Job
scancel
scancel cancels a pending or running job or job step. It takes the job ID of the calculation, which you can find with the squeue command described above. To cancel the job with ID 84, type:
```
$ scancel 84
```
If you rerun squeue, you will see that the job is gone.
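To cancel several jobs at once, you can feed job IDs extracted from squeue output into scancel. The sketch below uses hypothetical squeue-style lines and a made-up user name; on a real cluster you would replace the sample variable with the output of `squeue -h -u $USER`:

```shell
# Hypothetical squeue-style output: JOBID PARTITION NAME USER ST
sample='84 normal a.out alice R
85 normal b.out alice PD
91 normal c.out bob R'
# Extract the job IDs belonging to user "alice" (4th column)
ids=$(echo "$sample" | awk '$4 == "alice" { print $1 }')
echo "$ids"
# On a real cluster you would then run: scancel $ids
```

Note that scancel also accepts `-u <username>` to cancel all of a user's jobs directly, without any parsing.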