^ Name ^ Model ^ Number ^ GPU ^ #GPU ^ CPU ^ #CPU ^ Cores ^ Threads ^ Logical CPUs ^ RAM ^ Disk ^ Network ^
| star242 | Dell R730 | 1 | Tesla P100 | 1 | [[https://ark.intel.com/fr/products/92986/Intel-Xeon-Processor-E5-2620-v4-20M-Cache-2-10-GHz-|intel-E5-2620]] | 2 | 8 | 16 | 32 | 128 GB | 1 TB | 2*10Gb/s |
| star[199-195] | Dell R415 | 5 | X | 0 | [[https://www.cpubenchmark.net/cpu.php?cpu=AMD+Opteron+6134&id=1566|amd-opteron-6134]] | 1 | 8 | 16 | 16 | 32 GB | 1 TB | 2*1Gb/s |
| star[194-190] | Dell R415 | 5 | X | 0 | [[https://www.cpubenchmark.net/cpu.php?cpu=AMD+Opteron+4184&id=278|amd-opteron-4184]] | 1 | 6 | 12 | 12 | 32 GB | 1 TB | 2*1Gb/s |
| star100 | Dell T640 | 1 | RTX 2080 Ti | 4 | [[https://ark.intel.com/content/www/fr/fr/ark/products/123540/intel-xeon-bronze-3106-processor-11m-cache-1-70-ghz.html|intel-xeon-bronze-3106]] | 1 | 8 | 16 | 16 | 96 GB | X | 2*10 Gb/s |
| star101 | Dell R740 | 1 | Tesla V100 32 GB | 3 | [[https://ark.intel.com/content/www/us/en/ark/products/193390/intel-xeon-silver-4208-processor-11m-cache-2-10-ghz.html|intel-xeon-silver-4208]] | 2 | 8 | 16 | 32 | 96 GB | X | 2*10 Gb/s |
==== Software architecture ====
The software architecture for job submission is based on the //Slurm// tool.
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system designed for Linux clusters.
In Slurm terminology, the compute servers are called //nodes//, and nodes are grouped into families called //partitions// (which have nothing to do with the partitions that segment a mass-storage device).
Our cluster has 5 named partitions:
* gpu
* intel-E5-2695
* ram
* amd
* std
Each of these partitions contains nodes.
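You can inspect the partitions and the nodes they contain with the standard Slurm command **sinfo** (the exact output format depends on the Slurm version):

```shell
# List partitions, their state, and their nodes; your default partition is marked with *
user@stargate:~$ sinfo

# Condensed summary, one line per partition
user@stargate:~$ sinfo -s
```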
The compute nodes run a Debian stable operating system. You can find the list of installed software in the [[leria:centre_de_calcul:cluster_english_version#lists_of_install_software_for_high_performance_calculating| List of installed software for high performance computing]] section.
==== Usage policy ====
Consider the following C++ program, main.cpp:
#include <iostream>

int main() {
    std::cout << "Hello world!" << std::endl;
    return 0;
}
It can be compiled using one of the nodes of the partition intel-E5-2695 via the command:
username_ENT@stargate:~$ srun --partition=intel-E5-2695 g++ -Wall -o hello main.cpp
Slurm assigns a default partition to each user. Therefore, if intel-E5-2695 is your default partition (marked with a star * in the output of **sinfo**), then the previous command is equivalent to the following:
username_ENT@stargate:~$ srun g++ -Wall -o hello main.cpp
==== Interactive execution ====
Finally, we can run this freshly compiled program with:
user@stargate:~$ srun -p intel-E5-2695 ./hello
Most of the time, interactive execution is not what you want: you should prefer, and must use, batch-mode job submission. Interactive execution is mainly interesting for compilation or debugging.
==== Batch mode execution ====
* This is the mode in which a compute cluster should be used to run your programs
* You must write a submission script, which contains 2 sections:
  * The resources you want to use
    * Slurm directives are lines prefixed with #SBATCH
  * The commands needed to run the program
=== Example ===
#!/bin/bash
# hello.slurm
#SBATCH --job-name=hello
#SBATCH --output=hello.out
#SBATCH --error=hello.err
#SBATCH --mail-type=end
#SBATCH --mail-user=user@univ-angers.fr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=intel-E5-2695
/path/to/hello && sleep 5
user@stargate:~$ sbatch hello.slurm # Job submission
user@stargate:~$ squeue # Place and status of jobs in the submission queue
user@stargate:~$ cat hello.out # Displays what the standard output would have shown in interactive mode (respectively hello.err for error output)
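Two other standard Slurm commands, not shown above, are often useful at this point (the job ID 12345 is a placeholder):

```shell
user@stargate:~$ scancel 12345     # Cancel a pending or running job by its ID
user@stargate:~$ sacct -j 12345    # Accounting info (state, elapsed time, exit code) for a job
```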
Very often, we want to run a single program over a set of files or a set of parameters. In that case, there are 2 solutions, in order of preference:
* use an array job (easy to use, **it is the preferred solution**)
* use job steps (more complex to implement)
===== IMPORTANT: Availability and Resource Management Policy =====
Slurm is a scheduler. Scheduling is a difficult and resource-intensive optimization problem. It is much easier for a scheduler to plan jobs if it knows, for each job:
* its duration
* the resources it will use (CPU, memory)
For this reason, default resource values have been defined:
* the duration of a job is 20 minutes
* the available memory per CPU is 200 MB
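As a sketch, a batch script requesting more than these defaults might start with the following header lines (the values here are purely illustrative):

```shell
#!/bin/bash
#SBATCH --time=02:00:00       # 2 hours of wall time instead of the 20-minute default
#SBATCH --mem-per-cpu=1G      # 1 GB per CPU instead of the 200 MB default
```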
It is quite possible to override these default values with the --mem-per-cpu and --time options. However,
CAUTION:
* You should not overstate the resources of your jobs. Slurm works with a fair-share concept: if you reserve resources, Slurm considers that you consumed them, whether you actually used them or not. In future submissions, you could then be seen as a greedy user and receive a lower priority than a user who correctly sized their resources for the same amount of work done.
* If you have a large number of jobs to do, ** you must use submission by array job **.
* If these jobs have long execution times (more than 1 day), **you must limit the number of parallel executions in order not to saturate the cluster**. We let users set this limit themselves, but if a resource-sharing problem with other users arises, **we will delete jobs that don't respect these conditions**.
==== Limitations ====
^ ^ MaxWallDurationPerJob ^ MaxJobs ^ MaxSubmitJobs ^ FairSharePriority ^
| leria-user | 14 days | | 10000 | 99 |
| guest-user | 7 days | 20 | 50 | 1 |
=== Disk space quota ===
See also [[leria:centre_de_calcul:cluster_english_version#usage_policy|usage policy]] and [[leria:centre_de_calcul:cluster_english_version#data_storage|data storage]].
By default, the disk space quota is limited to 50 GB. You can easily find out which files take up the most space with the command:
user@stargate~ # ncdu
===== Data storage =====
You can also see [[leria:centre_de_calcul:cluster_english_version#global_architecture|global architecture]].
* The compute cluster uses a pool of distributed storage servers running [[https://www.beegfs.io/content/|BeeGFS]]. This BeeGFS storage is independent of the compute servers. It is naturally accessible in the tree of any compute node under /home/$USER. Since this storage is remote, every read/write in your home depends on the network. Our BeeGFS storage and the underlying network perform very well, but for some I/O-heavy processing you may be better off using the local disks of the compute servers. To do this, use the /local_working_directory directory of the compute servers. It works like /tmp, except that the data persists when the server is restarted.
* If you want to create groups, please send an email to technique.info [at] listes.univ-angers.fr with the name of the group and the associated users.
* As a reminder, **by default**, the rights of your home are in 755, so **anyone can read and execute your data**.
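If you prefer to keep your data private, you can restrict the permissions yourself; a minimal sketch (the directory name is illustrative):

```shell
# Create a directory that only the owner can read, write, or traverse
mkdir -p ~/private_results
chmod 700 ~/private_results

# Check the resulting permission bits
stat -c '%a' ~/private_results    # prints 700
```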
===== Advanced use =====
==== Array jobs ====
You should start by reading the [[https://slurm.schedmd.com/job_array.html|official documentation]]. This [[http://scicomp.aalto.fi/triton/tut/array.html|page]] presents some interesting use cases.
If you have a large number of files or parameters to process with a single executable, you should use an [[https://slurm.schedmd.com/job_array.html|array job]].
It's easy to implement: just add the --array option to your batch script.
=== Parametric tests ===
It's easy to use array jobs for parametric tests, that is, to run the same executable, possibly on the same file, while varying one of its command-line parameters. If the parameter values are contiguous or regular, you can use a batch script like this one:
#!/bin/bash
#SBATCH -J Job_regular_parameter
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 10:00:00
#SBATCH --array=0-9
#SBATCH -p intel-E5-2695
#SBATCH -o %A-%a.out
#SBATCH -e %A-%a.err
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=username@univ-angers.fr
/path/to/exec --paramExecOption $SLURM_ARRAY_TASK_ID
The --array option accepts special syntaxes for irregular values or for strides:
# irregular values 0,3,7,11,35,359
--array=0,3,7,11,35,359
# value jumps of +2: 1, 3, 5 and 7
--array=1-7:2
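When the values you want to sweep are not integers at all, a common pattern (not from the original page; the executable path, the option name, and the values are placeholders) is to index a bash array with SLURM_ARRAY_TASK_ID:

```shell
#!/bin/bash
#SBATCH --array=0-3           # one array task per entry of PARAMS below

# Hypothetical parameter values to sweep
PARAMS=(0.1 0.5 2.0 10.0)

# Each array task picks its own value; e.g. task 2 gets 2.0
/path/to/exec --paramExecOption "${PARAMS[$SLURM_ARRAY_TASK_ID]}"
```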
=== Multiple instances job ===
It is common to run a program many times over many instances (a benchmark).
Consider the following tree:
job_name
├── error
├── instances
│ ├── bench1.txt
│ ├── bench2.txt
│ └── bench3.txt
├── job_name_exec
├── output
└── submit_instances_dir.slurm
It is easy to use an array job to execute job_name_exec on every file in the instances directory. Just run the following command:
mkdir -p error output; sbatch --job-name=$(basename $PWD) --array=0-$(($(ls -1 instances|wc -l)-1)) submit_instances_dir.slurm
with the following batch file submit_instances_dir.slurm:
#!/bin/bash
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR-EMAIL
#SBATCH -o output/%A-%a
#SBATCH -e error/%A-%a
# INSTANCES is an array of the instance files
INSTANCES=(instances/*)
./job_name_exec ${INSTANCES[$SLURM_ARRAY_TASK_ID]}
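You can check how the INSTANCES array maps a task ID to a file by reproducing the indexing locally, outside Slurm (a sketch with dummy files):

```shell
#!/bin/bash
# Recreate a dummy instances directory like the tree above
mkdir -p instances
touch instances/bench1.txt instances/bench2.txt instances/bench3.txt

# Same globbing as in submit_instances_dir.slurm; glob results are sorted
INSTANCES=(instances/*)

# Task ID 0 would select the first file
SLURM_ARRAY_TASK_ID=0
echo "${INSTANCES[$SLURM_ARRAY_TASK_ID]}"    # prints instances/bench1.txt
```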
=== Multiple instances job with multiple executions (Seed number) ===
Sometimes you need to launch several executions on the same instance while varying the seed used to generate reproducible random numbers.
Consider the following tree:
job_name
├── error
├── instances
│ ├── bench1.txt
│ ├── bench2.txt
│ └── bench3.txt
├── job_name_exec
├── output
├── submit_instances_dir_with_seed.slurm
└── submit.sh
Just run the following command:
./submit.sh
with the following submit.sh file (remember to change the NB_SEED variable):
#!/bin/bash
readonly NB_SEED=50
for instance in $(ls instances)
do
sbatch --output output/${instance}_%A-%a --error error/${instance}_%A-%a --array 0-${NB_SEED} submit_instances_dir_with_seed.slurm instances/${instance}
done
exit 0
and the following submit_instances_dir_with_seed.slurm batch:
#!/bin/bash
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR-EMAIL
echo "####### INSTANCE: ${1}"
echo "####### SEED NUMBER: ${SLURM_ARRAY_TASK_ID}"
echo
srun echo nameApplication ${1} ${SLURM_ARRAY_TASK_ID}
With this method, the variable SLURM_ARRAY_TASK_ID contains the seed, and you submit as many array jobs as there are instances in the instances directory.
You can easily find your output which is named like this:
output/instance_name-ID_job-seed_number
=== Dependencies between jobs ===
You can declare dependencies between jobs with the --depend option of sbatch:
== Example ==
# job can begin after the specified jobs have started
sbatch --depend=after:123_4 my.job
#job can begin after the specified jobs have run to completion with an exit code of zero
sbatch --depend=afterok:123_4:123_8 my.job2
# job can begin after the specified jobs have terminated
sbatch --depend=afterany:123 my.job
# job can begin after the specified array jobs have completely and successfully finished
sbatch --depend=afterok:123 my.job
You can also see this [[https://hpc.nih.gov/docs/job_dependencies.html|page]].
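A convenient pattern (not shown on the original page) is to capture the job ID with the --parsable option of sbatch and feed it to --depend; the script names are placeholders:

```shell
# Submit the first job; --parsable makes sbatch print only the job ID
jid=$(sbatch --parsable first.slurm)

# The second job starts only if the first completed with exit code zero
sbatch --depend=afterok:${jid} second.slurm
```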
==== Steps jobs ====
The use of job steps should be reserved for very rare cases. Most of the time you can get by with array jobs, which also let the scheduler (Slurm) place jobs more efficiently.
You can use job steps for multiple and varied executions.
Job steps:
* allow you to split a job into several tasks
* are created by prefixing the command to be executed with the Slurm command "srun"
* can run sequentially or in parallel
Each step can use n tasks on N compute nodes (the -n and -N options of srun). Each task has --cpus-per-task CPUs at its disposal, and --ntasks tasks are allocated per step.
=== Example ===
#!/bin/bash
#SBATCH --job-name=nameOfJob
#SBATCH --cpus-per-task=1 # Allocation of 1 CPU per task
#SBATCH --ntasks=2 # Number of tasks: 2
#SBATCH --mail-type=END # Email notification
#SBATCH --mail-user=username@univ-angers.fr # at the end of job.
# Step of 2 tasks
srun before.sh
# 2 steps in parallel (because of &): task1 and task2 run in parallel. There is only one task per step (option -n1)
srun -n1 -N1 /path/to/task1 -threads $SLURM_CPUS_PER_TASK &
srun -n1 -N1 /path/to/task2 -threads $SLURM_CPUS_PER_TASK &
# We wait for the end of task1 and task2 before running the last step after.sh
wait
srun after.sh
=== Bash structures for creating steps depending on the data source ===
From [[http://osirim.irit.fr/site/recette/fr/articles/description-slurm|here]]
== Example ==
# Loop on the elements of an array (here files) :
files=('file1' 'file2' 'file3' ...)
for f in "${files[@]}"; do
# Adapt "-n1" and "-N1" according to your needs
srun -n1 -N1 [...] "$f" &
done
# Loop on the files of a directory:
while read f; do
# Adapt "-n1" and "-N1" according to your needs
srun -n1 -N1 [...] "$f" &
done < <(ls "/path/to/files/")
# Use "ls -R" or "find" for a recursive file path
# Reading line by line of a file:
while read line; do
# Adapt "-n1" and "-N1" according to your needs
srun -n1 -N1 [...] "$line" &
done <"/path/to/file"
==== Using OpenMP ====
Just add the --cpus-per-task option and export the OMP_NUM_THREADS variable:
#!/bin/bash
# openmp_exec.slurm
#SBATCH --job-name=hello
#SBATCH --output=openmp_exec.out
#SBATCH --error=openmp_exec.err
#SBATCH --mail-type=end
#SBATCH --mail-user=user@univ-angers.fr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=intel-E5-2695
#SBATCH --cpus-per-task=20
export OMP_NUM_THREADS=20
/path/to/openmp_exec
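To avoid keeping --cpus-per-task and OMP_NUM_THREADS in sync by hand, you can derive the latter from the SLURM_CPUS_PER_TASK variable that Slurm sets inside the job (a sketch; the fallback value 1 covers runs outside Slurm):

```shell
# Inside the batch script, after the #SBATCH headers
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
/path/to/openmp_exec
```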
===== Specific use =====
==== Ssh access of compute nodes ====
By default, it's impossible to connect directly to the compute nodes via ssh. However, when justified, we can easily make temporary exceptions. In that case, please send an explicit request to technique [at] info.univ-angers.fr
Users with ssh access must be subscribed to the calcul-hpc-leria-no-slurm-mode@listes.univ-angers.fr list. To subscribe to this mailing list, simply send an email to sympa@listes.univ-angers.fr with the subject: subscribe calcul-hpc-leria-no-slurm-mode Last name First name
__Default rule:__ we don't launch a calculation on a server on which another user's calculation is already running, **even if this user doesn't use all the resources**. The exception is boinc processes: these pause when you perform your calculations.
The htop command lets you know who is calculating with which resources and for how long.
If in doubt, contact the user who calculates directly by email or via the calcul-hpc-leria-no-slurm-mode@listes.univ-angers.fr list.
==== Cuda ====
GPU cards are present on nodes star{242,253,254}:
* star242: P100
* star253: 2*k20m
* star254: 2*k20m
Currently, version 9.1 of the cuda-sdk-toolkit is installed.
These nodes are currently excluded from the Slurm submission lists (although the gpu partition already exists). To use them, please send an explicit request to technique [at] info.univ-angers.fr
==== RAM node ====
LERIA has a node with 1.5 TB of RAM: star243.
This node is accessible by submission via Slurm (ram partition). To use it, please send an explicit request to technique [at] info.univ-angers.fr
==== Cplex ====
Leria has an academic license for the Cplex software.
The path to the library cplex is the default path /opt/ibm/ILOG/CPLEX_Studio129 (version 12.9)
==== Conda environments (Python) ====
The **conda activate** command, which activates a conda environment, is unavailable under Slurm. Instead, use the following at the beginning of your script:
source ./anaconda3/bin/activate
It may also be necessary to update the environment variables and initialize conda on the node:
source .bashrc
conda init bash
The environment stays active after your tasks are over. To deactivate it, use:
source ./anaconda3/bin/deactivate
===== FAQ =====
* How can I know the resources of a partition (example with the partition std)?
user@stargate~# scontrol show Partition std
* What does “Some of your processes may have been killed by the cgroup out-of-memory handler” in the standard output of your job mean?
You have exceeded the memory limit (--mem-per-cpu parameter).
* How to get an interactive shell prompt in a compute node of your default partition?
user@stargate~# salloc
This is the default behavior. The command actually executed is:
srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL
* How to get an interactive shell prompt in a specific compute node?
user@stargate~# srun -w NODE_NAME -n1 -N1 --pty bash -i
user@NODE_NAME~#
* How can I quote the resources of LERIA in my scientific writings?
You can use the following BibTeX @Misc entry to cite the compute cluster in your publications:
@Misc{HPC_LERIA,
title = {High Performance Computing Cluster of LERIA},
year = {2018},
note = {slurm/debian cluster of 27 nodes(700 logical CPU, 2 nvidia GPU tesla k20m, 1 nvidia P100 GPU), 120TB of beegfs scratch storage}
}
==== Error during job submission ====
* When submitting my jobs, I get the following error message:
srun: error: Unable to allocate resources: Requested node configuration is not available
This probably means that you are trying to use a node without specifying the partition it belongs to. You must use the -p or --partition option followed by the name of the partition containing the node. To get this information, you can run:
user@stargate# scontrol show node NODE_NAME|grep Partitions
===== Lists of install software for high performance calculating =====
==== Via apt-get ====
* automake
* bison
* boinc-client
* bowtie2
* build-essential
* cmake
* flex
* freeglut3
* freeglut3-dev
* g++
* g++-8
* g++-7
* g++-6
* git
* glibc-doc
* glibc-source
* gnuplot
* libglpk-dev
* libgmp-dev
* liblapack3
* liblapack-dev
* liblas3
* liblas-dev
* libtool
* libopenblas-base
* maven
* nasm
* openjdk-8-jdk-headless
* r-base
* r-base-dev
* regina-rexx
* samtools
* screen
* strace
* subversion
* tmux
* valgrind
* valgrind-dbg
* valgrind-mpi
==== Via pip ====
* keras
* scikit-learn
* tensorflow
* tensorflow-gpu # On GPU nodes
==== GPU node via apt-get ====
* libglu1-mesa-dev
* libx11-dev
* libxi-dev
* libxmu-dev
* libgl1-mesa-dev
* linux-source
* linux-headers
* linux-image
* nvidia-cuda-toolkit
==== Software installation ====
Maybe a program is missing from the list above. In this case, several options are available to you:
* Send a request for the software you want installed to technique [at] info.univ-angers.fr
* Install it yourself via pip, pip2 or pip3
* Install it yourself via conda: [[https://www.anaconda.com/download/#linux|download]] and [[https://conda.io/docs/user-guide/install/linux.html|install]]
* Install it yourself by compiling the sources in your home directory
===== Visualize the load of the high-performance computing cluster =====
For the links below, you will need to authenticate with your LDAP login and password (the same as for the ENT).
==== Cluster load overview ====
https://grafana.leria.univ-angers.fr/d/_0Bh3sxiz/vue-densemble-du-cluster
==== Details per node ====
https://grafana.leria.univ-angers.fr/d/000000007/noeuds-du-cluster
You can select the node you are interested in using the "HOST" drop-down menu.
====== Acknowledgement ======
[[leria:centre_de_calcul:remerciements|Acknowledgement]]