{{toc}}

h1. How to run jobs on the euclides nodes

Use slurm to submit jobs or log in to the euclides nodes (euclides1-12).

*Please read through this entire wiki page so everyone can make efficient use of this cluster.*

h2. alexandria

*Please do not use alexandria as a compute node* - its hardware is different from that of the compute nodes, and it hosts our file server and other services that are important to us.

You should use alexandria to
* transfer files
* compile your code
* submit jobs to the nodes

If you need to debug, please start an interactive job on one of the nodes using slurm. For instructions see below.

h2. euclides nodes

Job submission to the euclides nodes is handled by the slurm jobmanager (see http://slurm.schedmd.com and https://computing.llnl.gov/linux/slurm/).
*Important: In order to run jobs, you need to be added to the slurm accounting system - please contact Kerstin.*

All slurm commands listed below have very helpful man pages (e.g. man slurm, man squeue, ...).

If you are already familiar with another jobmanager, the following information may be helpful to you: http://slurm.schedmd.com/rosetta.pdf.

h3. Scheduling of Jobs

At this point there are two queues, called partitions in slurm:
* *normal*, the default partition your jobs will be sent to if you do not specify otherwise. It currently has a time limit of two days, and jobs can only run on 1 node.
* *debug*, which is meant for debugging. You can only run one job at a time; other jobs you submit will remain in the queue. The time limit is 12 hours.
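
You can look up the current limits and defaults of a partition yourself, e.g.:
<pre>
scontrol show partition normal   # shows time limit, node list, default memory, ...
</pre>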

The default memory per core is 2 GB; if you need more or less, please specify it with the --mem or --mem-per-cpu option.

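For example, to request 4 GB per core in a batch script (the value is just an illustration):
<pre>
#SBATCH --mem-per-cpu=4000   # memory per core in MB; illustrative value
</pre>
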
We have also set up a scheduler that goes beyond simple first come, first served: some jobs will be favoured over others depending
on how much you or your group have been using euclides in the past 2 weeks, how long the job has been queued and how many
resources it will consume.

This serves as a starting point; we may have to adjust parameters once the slurm jobmanager is in regular use. Job scheduling is a complex
issue and we still need to build expertise and gain experience with the needs of the users in our groups. Please feel free to speak up if
there is something that can be improved without creating an unfair disadvantage for other users.
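
If you are curious why a queued job is scheduled before or after another, slurm can show you the priority factors and recent usage this decision is based on (assuming the multifactor priority plugin is in use, as the fair-share description above suggests):
<pre>
sprio -l            # priority factors (age, fair-share, ...) of pending jobs
sshare -u $USER     # your recent usage and fair-share value
</pre>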

You can run interactive jobs on both partitions.

h3. Running an interactive job with slurm (a.k.a. logging in)

To run an interactive job with slurm in the default partition, use

<pre>
srun -u --pty bash
</pre>

If you want to use tcsh, use

<pre>
srun -u --pty tcsh
</pre>

If you need more memory per job, use for example

<pre>
srun -u --mem-per-cpu=8000 --pty tcsh
</pre>

If you want to open X11 applications, use the --x11=first option, e.g.
<pre>
srun --x11=first -u --pty bash
</pre>

If the 'normal' partition is overcrowded, you can use the 'debug' partition:
<pre>
srun --account cosmo_debug -p debug -u --pty bash # if you are part of the Cosmology group
srun --account euclid_debug -p debug -u --pty bash # if you are part of the EuclidDM group
</pre>
As soon as a slot is open, slurm will log you in to an interactive session on one of the nodes.

h3. Running a simple one-core batch job with slurm using the default partition

* To see what queues are available to you (called partitions in slurm), run:
<pre>
sinfo
</pre>

* To run a batch job, create a file myjob.slurm containing the following:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p normal

/bin/hostname
</pre>

* To submit a batch job, use:
<pre>
sbatch myjob.slurm
</pre>

* To see the status of your job, use
<pre>
squeue
</pre>
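
To list only your own jobs rather than everyone's:
<pre>
squeue -u <yourusername>
</pre>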

* To kill a job, use:
<pre>
scancel <jobid>
</pre>
You can get the <jobid> from squeue.

* For some more information on your job, use
<pre>
scontrol show job <jobid>
</pre>
Again, the <jobid> can be obtained from squeue.
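
scontrol only knows about jobs that are still queued, running or finished very recently; for older jobs you can query the accounting records instead, e.g.:
<pre>
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS
</pre>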

h3. Running a simple one-core batch job with slurm using the debug partition

Change the partition to debug and add the appropriate account, depending on whether you are part of
the euclid or cosmology group.

<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -p debug
#SBATCH --account=[cosmo_debug/euclid_debug]

/bin/hostname
</pre>

h3. Accessing a node where a job is running or starting additional processes on a node

You can attach an srun command to an already existing job (batch or interactive). This
means you can start an interactive session on a node where a job of yours is running
or start an additional process there.

First determine the jobid of the desired job using squeue, then use

<pre>
srun --jobid <jobid> [options] <executable>
</pre>
Or, more concretely:
<pre>
srun --jobid <jobid> -u --pty bash # to start an interactive session
srun --jobid <jobid> ps -eaFAl     # to get detailed process information
</pre>

The processes will only run on cores that have been allocated to you. This works
for batch as well as interactive jobs.
*Important: If the original job that was submitted finishes, any process
attached in this fashion will be killed.*


h3. Batch script for running a multi-core job

MPI is installed on alexandria.

To run a 4-core job for an executable compiled with MPI, you can use
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mail-user <put your email address here>
#SBATCH --mail-type=BEGIN
#SBATCH -n 4

mpirun <programname>

</pre>
and it will automatically start on the number of cores specified.

To ensure that the job is executed on only one node, add
<pre>
#SBATCH -N 1
</pre>
to the job script.

If you would like to run a program that itself starts processes, you can use the
environment variable $SLURM_NPROCS, which is automatically defined for slurm
jobs, to explicitly pass the number of cores the program can run on.
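
For example, a sketch for a multi-threaded program that takes its thread count as a command-line option:
<pre>
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH -n 4

# myprogram and its --nthreads option are placeholders for your own code
./myprogram --nthreads=$SLURM_NPROCS
</pre>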

To check if your job is actually running on the specified number of cores, you can check
the PSR column of
<pre>
ps -eaFAl
# or ps -eaFAl | egrep "<yourusername>|UID" if you just want to see your jobs
</pre>

h3. Environment for jobs

By default, slurm does not initialize the environment (using .bashrc, .profile, .tcshrc, ...).

To use your usual system environment, add the following line to the submission script:
<pre>
#SBATCH --get-user-env
</pre>


h2. Software-specific setup

h3. Python environment

You can use the Python 2.7.3 installation on the euclides cluster by using

<pre>
source /data2/users/ccsoft/etc/setup_all
source /data2/users/ccsoft/etc/setup_python2.7.3
</pre>


h2. Notes for Euclid users

For those submitting jobs to the euclides* nodes through the Cosmo DM pipeline, here are some settings that need to be specified for customized job submissions,
since a different interface to slurm is used (a combined sketch follows the list below).

* To use more memory per block, specify max_memory = 6000 (for 6 GB) and so on inside the block definition, or in the submit file in
case you want to use it for all blocks.

* If you want to run on multiple nodes/cores, use
nodes='<number of nodes>:ppn=<number of cores>' inside the block definition of a particular block, or in the submit file in case you want
to use it for all blocks.

* If you want to use a larger wall time, specify wall_mod=<wall time in minutes> inside the module definition.

* Note that queue=serial does not work on alexandria (we usually use it for c2pap).
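
Putting these together, the relevant part of a block definition (or submit file) might contain lines like the following. This is only a sketch; the exact surrounding syntax depends on your pipeline configuration, and the values are illustrative:
<pre>
max_memory = 6000
nodes='1:ppn=4'
wall_mod=120
</pre>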