Slurm » History » Version 91

Martin Kuemmel, 10/19/2018 07:38 AM

{{toc}}

h1. Hardware overview

You access the Euclid cluster through cosmogw.kosmo.physik.uni-muenchen.de

* cosmogw is a gateway machine and should *not* be used for computing;
* there are 21 compute nodes, named euclides01-euclides11 and euclides12-euclides21;
* all nodes euclides01-21 are available via cosmogw;
* euclides01-euclides11 each have 32 logical CPUs and 64GB of RAM;
* euclides12-euclides21 each have 56 logical CPUs and 128GB of RAM.

h1. How to run jobs on the euclides nodes (using Slurm)

Use slurm to submit jobs or log in to the euclides nodes (euclides01-21).

*Please read through this entire wiki page so that everyone can make efficient use of this cluster.*

h2. Control node cosmogw

The machine cosmogw is the login node and the submit node for the slurm queue, so please do not use it as a compute node; its hardware is different from that of the compute nodes. It also hosts our file server and other services that are important to us.

You should use cosmogw to:
* transfer files
* compile your code
* submit jobs to the nodes via the slurm queues

If you need to debug and would like to log in to a node, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm job manager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system. Please contact the admin.*

All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another job manager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm:
* on cosmogw:
** *normal*, the default partition your jobs are sent to unless you specify otherwise. At this point it has a time limit of four days; this queue comprises the computing nodes euclides01-21;
** *lowpri*, which also comprises the computing nodes euclides01-21; it is a so-called preemptible queue that allows users more resources; however, jobs are re-queued (canceled and re-scheduled) if the resources are demanded on the normal queue.

The default memory per core is 2GB. If you need more or less, please specify it with the --mem or --mem-per-cpu option.
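
As a rough illustration of how the two options relate: --mem-per-cpu scales with the number of cores requested, whereas --mem sets a total per node. The helper below is only a sketch. The function name is hypothetical, and 2GB is written as 2000MB purely as an assumption, to match the 8000MB style used elsewhere on this page. It computes the total memory implied by a --mem-per-cpu request.

```shell
#!/bin/bash
# Hypothetical helper: total memory (in MB) implied by --mem-per-cpu.
# Usage: job_total_mem_mb <ntasks> [mem_per_cpu_mb]
job_total_mem_mb() {
    local ntasks=$1
    # Assumption: the 2GB default per core is treated as 2000 MB here.
    local mem_per_cpu_mb=${2:-2000}
    echo $(( ntasks * mem_per_cpu_mb ))
}

job_total_mem_mb 4         # 4 cores with the default memory per core
job_total_mem_mb 4 8000    # 4 cores with --mem-per-cpu=8000
```

If the total is what matters to you, requesting it directly with --mem may be simpler than working backwards from a per-core value.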

We have also set up a scheduler that goes beyond first come, first served: some jobs will be favoured over others depending
on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many
resources it will consume.

This serves as a starting point; we may have to adjust parameters once the slurm job manager is in use. Job scheduling is a complex
issue and we still need to build expertise and gain experience of what the user needs in our groups are. Please feel free to speak up if
there is something that can be improved without creating an unfair disadvantage for other users.

You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm (a.k.a. logging in)

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

If you need more memory per core (here 8000MB), use

<pre>
srun -u --mem-per-cpu=8000 --pty tcsh
</pre>

In case you want to open X11 applications, use the --x11=first option, e.g.
<pre>
srun --x11=first -u --pty bash
</pre>

h3. limited ssh access

If you have an active job (batch or interactive), you can log in to the node the job is running on. Your ssh session will be killed when the job terminates. Your ssh session will be restricted to the same resources as your job (so you cannot accidentally bypass the job scheduler and harm other users' jobs).

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run a job with slurm, create a file myjob.slurm containing the following information:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal
#SBATCH --ntasks=1

/bin/hostname
</pre>

* To submit a batch job, use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use:
<pre>
squeue
</pre>

* To kill a job, use:
<pre>
scancel <jobid>
</pre>
You can get the job ID from squeue.

* For some more information on your job, use:
<pre>
scontrol show job <jobid>
</pre>
You can get the job ID from squeue.

h3. Running a simple one-core batch job with slurm using the lowpri partition

Change the partition to lowpri and add the appropriate account, depending on whether you are part of
the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
# set the account to euclid_lowpri or cosmo_lowpri, depending on your group
#SBATCH --account=euclid_lowpri
#SBATCH --partition=lowpri
#SBATCH --ntasks=1

/bin/hostname
</pre>

h3. Accessing a node where a job is running or starting additional processes on a node

You can attach an srun command to an already existing job (batch or interactive). This
means you can start an interactive session on a node where a job of yours is running
or start an additional process.

First determine the jobid of the desired job using squeue, then use

<pre>
srun --jobid <jobid> [options] <executable>
</pre>
Or, more concretely:
<pre>
srun --jobid <jobid> -u --pty bash   # to start an interactive session
srun --jobid <jobid> ps -eaFAl       # to get detailed process information
</pre>

The processes will only run on cores that have been allocated to you. This works
for batch as well as interactive jobs.
*Important: When the original job that was submitted finishes, any process
attached in this fashion will be killed.*

h3. Batch script for running a multi-core job

MPI is installed on cosmofs1.

To run a 4-core job for an executable compiled with MPI, you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH --ntasks=4

mpirun <programname>

</pre>
and it will automatically start on the number of cores specified.

To ensure that the job is executed on only one node, add
<pre>
#SBATCH --nodes=1
</pre>
to the job script.

If you would like to run a program that itself starts processes, you can use the
environment variable $SLURM_NPROCS, which is automatically defined for slurm
jobs, to explicitly pass the number of cores the program can run on.
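
A minimal sketch of this pattern in a batch script is given here. The program name "my_program" and its "--threads" option are placeholders, not tools that exist on the cluster, and the fallback to 1 is only there so that the snippet also runs outside of slurm.

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --nodes=1

# SLURM_NPROCS is set by slurm inside a job; fall back to 1 elsewhere.
ncores=${SLURM_NPROCS:-1}

# "my_program" and "--threads" are placeholders for your own executable
# and whatever option it uses to set its number of worker processes.
cmd="my_program --threads ${ncores}"
echo "would run: ${cmd}"
```

Reading the value from the environment rather than hard-coding it keeps the program's process count in step with whatever is requested in the #SBATCH lines.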

To check if your job is actually running on the specified number of cores, you can check
the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | grep -E "<yourusername>|UID" if you just want to see your own processes
</pre>

h3. environment for jobs

By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...).

To use your usual system environment, add the following line to the submission script:
<pre>
#SBATCH --get-user-env
</pre>

h3. Slurm reporting and accounting

For information on job usage and cluster utilization, the slurm command "sreport" can be used. For example, the command
<pre>
sreport user topusage start=01/15/18 -t percent
</pre>
shows the top ten users, with usage in percent, since January 15th 2018. For more information please look at "man sreport".

For accounting on specific jobs, the slurm command "sacct" can be used. For example, the command
<pre>
sacct -j 18551 --format=JobID,JobName,MaxRSS,Elapsed
</pre>
displays information (elapsed time, memory usage, ...) on job number 18551. For more details please use "man sacct".

h3. Some points on the 'normal' versus 'lowpri' queue on cosmogw

The allowance for each user on the *normal* partition is 250 CPUs and 554700MB of RAM, which corresponds to 1/3 of the entire cluster (euclides06-21). In short, every user is allowed to use up to 1/3 of the cluster in the normal partition.
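
To get a feeling for these limits before submitting, a back-of-the-envelope check like the following can help. It is only a sketch: the function name is hypothetical, the two limits are copied from the paragraph above, and it assumes that every job requests the same number of cores and the same memory per core.

```shell
#!/bin/bash
# Per-user limits on the normal partition, as quoted above.
MAX_CPUS=250
MAX_MEM_MB=554700

# Hypothetical helper: succeeds if <njobs> jobs with <cpus> cores each and
# <mem_per_cpu_mb> MB per core could all run at once within the allowance.
fits_normal_allowance() {
    local njobs=$1 cpus=$2 mem_per_cpu_mb=$3
    local total_cpus=$(( njobs * cpus ))
    local total_mem=$(( total_cpus * mem_per_cpu_mb ))
    [ "$total_cpus" -le "$MAX_CPUS" ] && [ "$total_mem" -le "$MAX_MEM_MB" ]
}

if fits_normal_allowance 50 4 2000; then
    echo "50 jobs x 4 cores x 2000 MB fit within the allowance"
fi
if ! fits_normal_allowance 100 4 2000; then
    echo "100 jobs x 4 cores x 2000 MB exceed the allowance"
fi
```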

On the *lowpri* (low priority) partition there are no limits on the number of CPUs or on RAM consumption, meaning a user can take all available resources, up to the *entire* cluster! However, jobs on the "lowpri" partition have a lower priority through the so-called preemption mechanism. This means that if all nodes are busy (partially through the lowpri queue) and an additional job is submitted to the "normal" partition, slurm will re-queue (meaning cancel and re-schedule to the lowpri queue) one or more jobs on the "lowpri" partition to get the job on the "normal" partition running.

Here is an example scenario to illustrate the opportunities the "lowpri" partition offers:
I want to submit a number of jobs requiring 752 CPUs in total. The entire cluster has 752 CPUs, so in the optimal case I get 1/3 of the cluster on the "normal" partition, and it takes at least three cycles to get all my jobs finished. However, if I submit to the "lowpri" partition and the cluster is empty, I can use the *entire* cluster and finish in only one cycle. Of course it may happen that other users submit lots of jobs to the "normal" partition afterwards and many of my jobs are re-queued. That would then delay the completion of my jobs on the "lowpri" partition correspondingly.

To highlight some aspects of using the "lowpri" partition:

* it is especially relevant when you want to submit several jobs that significantly exceed the user allowance on the "normal" partition and need the entire cluster to finish;
* on average, the available resources on the "lowpri" partition are much *larger* than on the "normal" partition, especially during the night or on the weekend;
* please note that *no job ever gets lost* on the "lowpri" partition; if re-queuing occurs, the user gets an email (Subject: "SLURM Job_id=2563 Name=test_mpi_gather.slurm Failed, Run time 00:01:58, PREEMPTED, ExitCode 0") when the job is stopped, and subsequently when it starts again and when it finishes (see 1.);
* the "lowpri" partition also has a queue that decides which job comes first (of course only in the case of oversubscription);
* the preemption mechanism tries to minimize the number of re-queued jobs necessary to get the job in the "normal" partition going; so, if 8 CPUs are requested and the "lowpri" partition contains one job using 8 CPUs, three jobs using 4 CPUs and several dozen jobs using 1 CPU, only the job with 8 CPUs is re-scheduled, independent of run times and other parameters.

To submit a job to the "lowpri" partition, please insert the following lines into the slurm batch script (see also the example above):
<pre>
#SBATCH --account=<your_account>
#SBATCH -p lowpri
</pre>

with <your_account> being either "cosmo_lowpri" or "euclid_lowpri".

There are two typical scenarios where a user can gain from the lowpri queue:
* if a job stores intermediate results at regular intervals and picks up from there once started again, then even a long job loses only the computing time since the last storage point when it is re-scheduled;
* if a single job needs only a small amount of computing time (perhaps <12h) but a lot of jobs need to be run, then the loss of computing time is rather small when a job is re-scheduled.
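
The first scenario can be sketched as follows. This is a simplified illustration and not a recipe tied to any particular code: the function name and the checkpoint file are hypothetical, and the "work" of each step is left as a comment. The point is that a re-queued job starts its script from the top again, so the script itself has to find out how far the previous run got.

```shell
#!/bin/bash
# Sketch of a job body that survives re-queuing by recording its progress.
# Usage: run_with_checkpoint <checkpoint_file> <total_steps>
run_with_checkpoint() {
    local ckpt=$1 total=$2
    local done_steps=0
    # Resume after the last completed step if a checkpoint exists.
    if [ -f "$ckpt" ]; then
        done_steps=$(cat "$ckpt")
        done_steps=${done_steps:-0}
    fi
    local step=$(( done_steps + 1 ))
    while [ "$step" -le "$total" ]; do
        # ... the real work for this step would go here ...
        echo "$step" > "$ckpt"   # record the step as completed
        step=$(( step + 1 ))
    done
    echo "completed steps $(( done_steps + 1 ))..${total}"
}

ckpt_file=$(mktemp)
echo 3 > "$ckpt_file"               # pretend an earlier run was preempted after step 3
run_with_checkpoint "$ckpt_file" 5  # resumes at step 4
rm -f "$ckpt_file"
```

How often to write a checkpoint is a trade-off between the cost of writing it and the amount of computing time you are prepared to lose on a re-queue.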

h2. desdb node

Some specific jobs in cosmodb, such as the "catalog ingest", need to be performed on the machines desdb1/2. For those jobs there is the slurm account "euclid_cat_ing" with the partition "cat_ing". Only selected persons from the Euclid group have access to these machines. Please specify "-p cat_ing" and "--account euclid_cat_ing" on the command line or in the slurm script.

h2. Software specific setup

h3. Python environment

You can use the Python 2.7.3 installed on the euclides cluster by running

<pre>
source /data2/users/ccsoft/etc/setup_all
source /data2/users/ccsoft/etc/setup_python2.7.3
</pre>

h2. Notes For Euclid users

For those submitting jobs to the euclides* nodes through the Cosmo DM pipeline, here are some things that need to be specified for customized job submissions,
since a different interface to slurm is used.

* To use more memory per block, specify max_memory = 6000 (for 6G) and so on, inside the block definition or in the submit file (in
case you want to use it for all blocks).

* If you want to run on multiple nodes/cores, use
nodes='<number of nodes>:ppn=<number of cores>' inside the block definition of a particular block, or in the submit file in case you want
to use it for all blocks.

* If you want to use a longer wall time, specify wall_mod=<wall time in minutes> inside the module definition.

* Note that queue=serial does not work on cosmofs1 (we usually use it for c2pap).

h1. Admin

There is a user "slurm", which, however, is not really necessary for the administration work. The slurm administrator needs sudo access. Some scripts for adding a user and similar tasks are in "/data1/users/slurm". With sudo access the admin can execute those scripts. In the mysql database there is the username "slurmdb" with a password.

h2. Slurm configuration

h3. Slurm configuration file

The currently valid versions of the configuration file are "/data1/users/slurm/slurm.conf" and "/data1/users/slurm/cosmo/slurm.conf" on cosmofs1 and cosmogw, respectively. To apply a modified slurm configuration, the script "newconfig.sh" can be used.

The script

* copies the configuration file to the submit node and restarts the submit service;
* copies the configuration file to all computing nodes and triggers the reconfiguration there.

Then the slurm daemon needs to be started on the submit node and all computing nodes with the script "restart.sh".

*Note:* Right now the slurmd daemons do not start properly on cosmogw. Even if the start appears to fail, the slurmd daemon is there and working.

h2. User management

h3. Overview of users, accounts, etc.

No sudo access needed:
<pre>
/usr/local/bin/sacctmgr show account withassoc
</pre>

h3. Adding a new user

As root on @cosmofs1@,

<pre>
cd /data1/users/slurm/
# <account> is either cosmo or euclid
./add_user.sh <username> <account>
/usr/local/bin/scontrol reconfigure
</pre>

h3. To increase memory, cores etc. for a user

The script above contains various commands for changing user settings, e.g.

<pre>
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpCPUs=32
/usr/local/bin/sacctmgr -i modify user name=$1 set GrpMem=128000
</pre>

h2. Troubleshooting

h3. Information on a particular node

The command "/usr/local/bin/scontrol show node <nodename>" gives detailed information on a particular node (status, reason for being down and so on).

h3. Node in state "drain"

When @sinfo@ reports a node in the "drain" state, run
<pre>
/usr/local/bin/scontrol update nodename=NODE_NAME state=resume
</pre>
to put it back into operation.

h2. Nodes down

Sometimes nodes are reported as "down". This seems to happen as a result of network problems. Here is some "troubleshooting":https://computing.llnl.gov/linux/slurm/troubleshoot.html#nodes for this situation. Also, after a reboot of cosmofs1 some manual work on slurm might be necessary to get going again.

If a job does not finish and remains in the state "CG" (completing), then the sequence
<pre>
/usr/local/bin/scontrol update NodeName=euclides13-os State=down Reason=hung_proc
/usr/local/bin/scontrol update NodeName=euclides13-os State=resume Reason=hung_proc
</pre>
brings the node back again.

h2. History

* January 23rd 2018: Jobs on euclides12 were no longer finishing. They ended up in the state "CG" and hung there forever. In the slurmd log there was the entry "[2018-01-23T10:12:17.477] [18153] error: Unable to establish controller machine" roughly every 15 minutes. ssh from euclides12 to cosmogw via name and IP address was possible, so it is difficult to interpret this error message. In the end the problem was solved by:
** stopping slurmd
** removing /var/run/slurmd.pid
** creating /var/run/slurmd.pid via touch
** re-starting slurmd
** euclides12 had sometimes caused problems before; maybe this was the culmination.

* May 18th 2017: On cosmogw, three nodes were reported as "DOWN" despite running the slurmd daemon and having connections to the slurmctld daemon on the control node; it turns out that with a normal "/etc/init.d/slurm start" on the control machine only nodes that are *not* DOWN are considered; "/etc/init.d/slurm startclean" must be used to establish new connections to all nodes and take them back into the queue;

* May 2nd 2017: the control daemon on cosmofs1 was no longer working and could not be restarted; the corresponding commands "/etc/init.d/slurm status/start" gave no feedback of any kind and the log files were empty; the relevant daemon on the nodes, "slurmd", was running smoothly; a comparison revealed that the difference was whether the command "/usr/local/bin/scontrol show daemon" returns the daemon name or nothing, and in the latter case nothing happens and the daemon does not run properly; further investigation showed that the machine name given in "slurm.conf" as "ControlMachine=" needs to be identical to the name returned by the command "hostname"; this was no longer the case, likely as a result of moving the machines to the new sub-net (the exact mechanism is unclear);

* April 24th 2017: taking euclides11 out of the queues to free it for the new OS and the slurm test on it; euclides10 is now the development node;

* April 07th 2017: Applying "/usr/local/bin/scontrol show node euclides11" for the debug partition euclides11 says "Reason=Node unexpectedly rebooted [root@2016-12-14T13:25:01]"; internet research suggested changing "ReturnToService=" from 1 to 2 in the configuration file; after applying the new configuration file and restarting, the debug node works again;

* April 06th 2017: After the reconfiguration of the cluster the slurm configuration file was adjusted (to reflect the new machine names); minor changes also had to be applied to the scripts "newconfig.sh" and "restart.sh" to loop over the new names; the new configuration files were applied and slurm restarted; all computing nodes for the normal partition came up, the debug partition stayed down;

* March 29th 2017: euclides7 is in drain state; "/usr/local/bin/scontrol show node euclides2" says "Reason=Epilog error"; when resumed, it seems to work normally;

* March 28th 2017: euclides2 is in drain state; when resumed, it goes into drain state again the next time it is used; "/usr/local/bin/scontrol show node euclides2" says "Reason=Prolog error"; after a reboot the machine was in status "idle*"; when resumed, it worked again;