Slurm » History » Version 21

Kerstin Paech, 10/09/2013 02:01 PM

{{toc}}

h1. How to run jobs on the euclides nodes

Use slurm to submit jobs to the euclides nodes (node1-8); ssh login access to those nodes will be restricted in the near future.

*Please read through this entire wiki page so everyone can make efficient use of this cluster.*

h2. alexandria

*Please do not use alexandria as a compute node* - its hardware is different from that of the nodes. It hosts our file server and other services that are important to us.

You should use alexandria to
- transfer files
- compile your code
- submit jobs to the nodes

If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm job manager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).

*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin.*

All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another job manager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm:
* *normal*, the default partition your jobs will be sent to if you do not specify otherwise. It currently has a time limit of two days, and jobs can only run on 1 node.
* *debug*, which is meant for debugging. You can only run one job at a time; other jobs submitted will remain in the queue. The time limit is 12 hours.

We have also set up a scheduler that goes beyond first come, first served - some jobs will be favoured over others depending on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many resources it will consume.

This serves as a starting point; we may have to adjust parameters once the slurm job manager is in use. Job scheduling is a complex issue and we still need to build expertise and gain experience with the user needs in our groups. Please feel free to speak up if there is something that can be improved without creating an unfair disadvantage for other users.

You can run interactive jobs on both partitions.
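
To get an idea of how the scheduler is currently ranking jobs and how much of your fair-share allocation you have used, slurm provides the sprio and sshare commands. A quick sketch (the exact columns shown depend on the slurm version and configuration):

<pre>
sprio -l   # per-job priority, broken down into age, fair-share and job-size factors
sshare     # recent usage and fair-share standing of your account
</pre>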

h3. Running an interactive job with slurm

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

In case you want to open X11 applications, use the --x11=first option, e.g.

<pre>
srun --x11=first -u --pty bash
</pre>

In case the 'normal' partition is overcrowded, use the 'debug' partition:

<pre>
srun --account cosmo_debug -p debug -u --pty bash  # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group
</pre>

As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run a batch job, create a file myjob.slurm containing the following:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit the batch job, use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use
<pre>
squeue
</pre>

* To kill a job, use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from the squeue output.

* For some more information on your job, use
<pre>
scontrol show job <jobid>
</pre>
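
By default squeue lists all users' jobs. Two standard options that may be handy (a sketch; see man squeue for the full list):

<pre>
squeue -u <yourusername>   # show only your own jobs
squeue -l                  # long format, including time limits
</pre>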

h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you are part of the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p debug
#SBATCH --account [cosmo_debug/euclid_debug]

/bin/hostname
</pre>
127 10 Kerstin Paech
128 10 Kerstin Paech
129 6 Kerstin Paech
h3. Batch script for running a multi-core job
130 6 Kerstin Paech
131 17 Kerstin Paech
mpi is installed on alexandria.
132 17 Kerstin Paech
133 18 Kerstin Paech
To run a 4 core job for an executable compiled with mpi you can use
134 6 Kerstin Paech
<pre>
135 6 Kerstin Paech
#!/bin/bash
136 6 Kerstin Paech
#SBATCH --output=slurm.out
137 6 Kerstin Paech
#SBATCH --error=slurm.err
138 6 Kerstin Paech
#SBATCH --mail-user <put your email address here>
139 6 Kerstin Paech
#SBATCH --mail-type=BEGIN
140 6 Kerstin Paech
#SBATCH -n 4
141 1 Kerstin Paech
142 18 Kerstin Paech
mpirun <programname>
143 1 Kerstin Paech
144 1 Kerstin Paech
</pre>
145 18 Kerstin Paech
and it will automatically start on the number of nodes specified.
146 1 Kerstin Paech
147 18 Kerstin Paech
To ensure that the job is being executed on only one node, add
148 18 Kerstin Paech
<pre>
149 18 Kerstin Paech
#SBATCH -n 4
150 18 Kerstin Paech
</pre>
151 18 Kerstin Paech
to the job script.
152 17 Kerstin Paech
153 19 Kerstin Paech
If you would like to run a program that itself starts processes, you can use the
154 19 Kerstin Paech
environment variable $SLURM_NPROCS that is automatically defined for slurm
155 19 Kerstin Paech
jobs to explicitly pass the number of cores the program can run on.
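
As a sketch, a job script for a multi-threaded program could look like the following; mythreadedprog and its --nthreads option are placeholders for your own program and whatever flag it uses to set its thread count:

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH -n 4

# $SLURM_NPROCS is set by slurm to the number of allocated cores;
# mythreadedprog/--nthreads are hypothetical placeholders
mythreadedprog --nthreads $SLURM_NPROCS
</pre>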

To check if your job is actually running on the specified number of cores, you can check the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your processes
</pre>