<note tip>A French translation of this page is available on this wiki.</note>
+ | |||
+ | |||
<note important>
You can use the following BibTeX entry to cite the compute cluster in your publications:
<code latex>
@Misc{HPC_LERIA,
  title = {High Performance Computing Cluster of LERIA},
  year  = {2018},
  note  = {slurm/
}
</code>
</note>
+ | |||
+ | |||
====== Presentation of the high performance computing cluster "stargate" ======
<note>
  * This wiki page is also yours: do not hesitate to modify it directly or to propose changes to technique [at] info.univ-angers.fr.
  * All cluster users must be subscribed to the calcul-hpc-leria mailing list.
  * To subscribe to this mailing list, simply send an email to sympa@listes.univ-angers.fr with the subject: subscribe calcul-hpc-leria First-name Last-name
</note>
+ | |||
+ | ===== Summary ===== | ||
+ | |||
Stargate is the high performance computing cluster of the LERIA computing center. It is a set of 27 computing servers totalling 700 CPU cores and 3 GPUs, together with high performance storage.
+ | ===== Who can use stargate? ===== | ||
+ | |||
In order of priority:
  - All members and associate members of the LERIA laboratory,
  - Research professors of the University of Angers, with the prior authorization of the director of LERIA,
  - Visiting researchers, with the prior authorization of the director of LERIA.

  * To obtain access to the cluster, simply request the activation of your account by sending an email to technique [at] info.univ-angers.fr
+ | |||
+ | ===== Technical presentation ===== | ||
+ | |||
+ | ==== Global architecture ==== | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | You can also see[[leria: | ||
+ | ==== Hardware architecture ==== | ||
+ | |||
+ | | Hostname | ||
+ | | star[254-253] | ||
+ | | star[246-252] | ||
+ | | star[245-244] | ||
+ | | star243 | ||
+ | | < | ||
+ | | star[199-195] | ||
+ | | star[194-190] | ||
+ | | star100 | ||
+ | | star101 | ||
+ | ==== Software architecture ==== | ||
+ | |||
The software architecture for job submission is based on the //Slurm// tool.
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters.
In Slurm terminology, the compute servers are called //nodes//, and these nodes are grouped into families called //partitions//.
+ | |||
+ | Our cluster has 5 named partitions: | ||
+ | * gpu | ||
+ | * intel-E5-2695 | ||
+ | * ram | ||
+ | * amd | ||
+ | * std | ||
+ | |||
Each of these partitions contains one or more nodes.

The compute nodes run the Debian stable operating system. You can find the list of installed software at the end of this page.
+ | ==== Usage policy ==== | ||
+ | |||
<note important>
A high-performance computing cluster must allow users to use a large amount of storage space during calculations. Therefore, storage usage must be **temporary**. Once your calculations are done, it is your responsibility to:
  * **compress** your important data
  * **move** your important compressed data to another storage space
  * **backup** your important compressed data
  * **delete** unnecessary and unused data
  * Your file and directory names should not contain:
    * spaces
    * accented characters (é, è, â, ...)
    * symbols (*, $, %, ...)
    * punctuation marks (!, :, ;, ...)

System administrators reserve the right to rename, compress, or delete your files at any time.

**There is no backup of your files on the compute cluster: __you can lose all your data at any time!__**


In addition, to avoid uses that could affect other users, a quota of 50 GB is applied to your home directory. Users requiring more space should make an explicit request to technique [at] info.univ-angers.fr. You can also request access to large-capacity storage for a limited time: any data present for more than 40 days in this storage __is automatically deleted without any possibility of recovery__.
</note>
+ | |||
+ | |||
+ | ====== Using the high performance computing cluster ====== | ||
+ | |||
+ | ===== Quick start ===== | ||
+ | ==== Connection to stargate ==== | ||
+ | |||
Please contact technique [at] info.univ-angers.fr to get information about connecting to the compute cluster.
+ | |||
+ | < | ||
+ | |||
+ | <note important> | ||
+ | |||
+ | https:// | ||
+ | ==== Slurm: first tests and documentation ==== | ||
+ | |||
+ | |||
Slurm (Simple Linux Utility for Resource Management) is a job scheduler. Slurm determines where and when jobs are dispatched to the different compute nodes according to:

  * the current load of the compute servers (CPU, RAM, ...)
  * user history (the notion of //fairshare//: a user who has used the cluster little takes precedence over a user who has used it a lot)

It is strongly advised to read the official Slurm documentation before going further.
+ | |||
+ | Once connected, you can type the command **sinfo** which will inform you about the available partitions and their associated nodes: | ||
+ | |||
+ | username_ENT@stargate: | ||
+ | PARTITION | ||
+ | gpu up 14-00: | ||
+ | intel-E5-2695 | ||
+ | amd-opteron-4184 | ||
+ | std* up 14-00: | ||
+ | ram up 14-00: | ||
+ | username_ent@stargate: | ||
+ | |||
+ | < | ||
+ | |||
+ | There are two main ways to submit jobs to Slurm: | ||
  * interactive execution (via the **srun** command),
  * batch-mode execution (via the **sbatch** command).
+ | |||
+ | Batch mode execution is presented later in this wiki. | ||
+ | |||
+ | === Interactive execution === | ||
+ | In order to submit a job to slurm, simply prefix the name of the executable with the command **srun**. | ||
+ | |||
+ | In order to understand the difference between a process supported by the stargate OS and a process supported by slurm, you can for example type the following two commands: | ||
+ | |||
+ | username_ENT@stargate: | ||
+ | stargate | ||
+ | username_ENT@stargate: | ||
+ | star245 | ||
+ | username_ENT@stargate: | ||
+ | |||
For the first command, **hostname** returns stargate, while **srun hostname** returns star245: the name of the machine that slurm dynamically selected to execute the **hostname** command.
+ | |||
+ | You can also type the commands **srun free -h** or **srun cat / | ||
+ | |||
Whenever slurm is asked to perform a task, it places it in a waiting line called a //queue//.
The **squeue** command lists the jobs currently being processed. It is a bit like the GNU/Linux commands **ps aux** or **ps -efl**, but for cluster jobs rather than processes.
We can test this command by launching, for example, **srun sleep infinity &**. While this task is running, the **squeue** command will give:
+ | username_ENT@stargate: | ||
+ | JOBID PARTITION | ||
+ | 278 intel-E5- | ||
+ | |||
+ | It is possible to //kill// this task via the command **scancel** with the job identifier as argument: | ||
+ | |||
+ | username_ENT@stargate: | ||
+ | username_ENT@stargate: | ||
+ | srun: Job step aborted: Waiting up to 32 seconds for job step to finish. | ||
+ | slurmstepd: error: *** STEP 332.0 ON star245 CANCELLED AT 2018-11-27T11: | ||
+ | srun: error: star245: task 0: Terminated | ||
+ | | ||
+ | [1]+ Termine 143 srun sleep infinity | ||
+ | username_ENT@stargate: | ||
+ | |||
+ | === Documentation === | ||
+ | |||
To go further, you can watch this series of videos presenting and introducing slurm (in 8 parts):
+ | |||
+ | < | ||
+ | < | ||
+ | <iframe width=" | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | You will find [[https:// | ||
+ | |||
+ | ===== Hello world ! ===== | ||
+ | |||
+ | ==== Compilation ==== | ||
The stargate machine **is not** a compute node: it is the node from which you submit your calculations to the compute nodes; stargate is called the master node. Therefore, source code must be compiled on the compute nodes by prefixing the compilation command with **srun**.
+ | < | ||
+ | |||
+ | Let's look at the following file named **main.cpp**: | ||
+ | |||
<code c++>
#include <iostream>

int main() {
    std::cout << "Hello world!" << std::endl;
    return 0;
}
</code>
+ | |||
It can be compiled using one of the nodes of the intel-E5-2695 partition via the command:

  username_ENT@stargate:~$ srun --partition=intel-E5-2695 g++ main.cpp -o hello
+ | ==== Interactive execution ==== | ||
+ | |||
Finally, we can run this freshly compiled program with:

  user@stargate:~$ srun --partition=intel-E5-2695 ./hello
  Hello world!


Most of the time, interactive execution is not what you need: you should prefer, and indeed must use, batch-mode job submission. Interactive execution is mainly useful for compilation and debugging.
+ | ==== Batch mode execution ==== | ||
+ | |||
  * Batch mode is the normal way to run your programs on a computing cluster.
  * You must write a submission script, which contains 2 sections:
    * the resources you want to use
      * variables for slurm are preceded by #SBATCH
    * the commands needed to run the program
+ | |||
+ | === Example === | ||
+ | <code bash> | ||
+ | #!/bin/bash | ||
+ | # hello.slurm | ||
+ | #SBATCH --job-name=hello | ||
+ | #SBATCH --output=hello.out | ||
+ | #SBATCH --error=hello.err | ||
+ | #SBATCH --mail-type=end | ||
+ | #SBATCH --mail-user=user@univ-angers.fr | ||
+ | #SBATCH --nodes=1 | ||
+ | #SBATCH --ntasks-per-node=1 | ||
+ | #SBATCH --partition=intel-E5-2695 | ||
./hello
</code>
+ | |||
  user@stargate:~$ sbatch hello.slurm
  user@stargate:~$ cat hello.out
  user@stargate:~$ cat hello.err
+ | |||
Very often, we want to run a single program over a set of files or a set of parameters. In this case there are 2 solutions, in order of preference:
  * use an array job (easy to use, **this is the preferred solution**),
  * use job steps (more complex to implement).
+ | |||
+ | ===== IMPORTANT: Availability and Resource Management Policy ===== | ||
+ | |||
Slurm is a scheduler. Scheduling is a difficult and resource-intensive optimization problem. It is much easier for a scheduler to plan a job if it knows:
+ | |||
+ | * its duration | ||
+ | * the resources to use (CPU, Memory) | ||
+ | |||
For this reason, default resources have been defined:

  * the default duration of a job is 20 minutes
  * the default available memory per CPU is 200 MB

It is quite possible to override these default values with the **--mem-per-cpu** and **--time** options. However, keep the important note below in mind.
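As a sketch, a submission script that overrides both defaults might contain lines like these (the values are purely illustrative, not recommendations):

```shell
#SBATCH --time=02:00:00      # wall-clock limit: 2 hours instead of the default 20 minutes
#SBATCH --mem-per-cpu=2G     # 2 GB of RAM per CPU instead of the default 200 MB
```

The same options can also be passed directly on the command line to **srun** or **sbatch**.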
+ | |||
<note important>
  * You should not overstate the resources of your jobs. Slurm works with a fairshare concept: resources you reserve count against you whether you use them or not, and will lower your priority for future submissions.
  * If you have a large number of jobs to run, **you must use submission by array job**.
  * If these jobs have long execution times (more than 1 day), **you must limit the number of parallel executions in order not to saturate the cluster**. We let users set this limit themselves, but if there is a problem sharing resources with other users, **we will delete jobs that do not respect these conditions**.
</note>
+ | |||
+ | ==== Limitations ==== | ||
+ | |||
+ | | | MaxWallDurationPerJob | ||
+ | | leria-user | ||
+ | | guest-user | ||
+ | |||
+ | === Disk space quota === | ||
+ | |||
+ | See also [[leria: | ||
+ | |||
By default, the disk space quota is limited to 50 GB. You can easily find out which files and directories take up the most space with the **du** command.
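The exact command originally given on this page was lost; a common equivalent (a sketch, assuming GNU coreutils) is:

```shell
# List the top-level entries of your home directory, largest first
du -h --max-depth=1 ~ 2>/dev/null | sort -rh | head
```
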
+ | ===== Data storage ===== | ||
+ | |||
+ | You can also see [[leria: | ||
+ | |||
+ | * The compute cluster uses a pool of distributed storage servers [[https:// | ||
+ | |||
+ | * If you want to create groups, please send an email to technique.info [at] listes.univ-angers.fr with the name of the group and the associated users. | ||
+ | |||
  * As a reminder, **by default**, the permissions on your home directory are 755, so **anyone can read and execute your data**.
+ | ===== Advanced use ===== | ||
+ | |||
+ | ==== Array jobs ==== | ||
+ | |||
You should start by reading the official Slurm documentation on job arrays.

If you have a large number of files or parameters to process with a single executable, you must use a job array.

It is easy to implement: just add the --array option to your batch script, as in the examples below.
+ | |||
+ | === Parametric tests === | ||
+ | |||
It is easy to use array jobs to run parametric tests, that is, to run the same executable, possibly on the same file, while varying one of its command-line parameters. If the parameter values are contiguous or regular, use a batch script like this one:
+ | |||
+ | <code bash> | ||
+ | #!/bin/bash | ||
+ | #SBATCH -J Job_regular_parameter | ||
+ | #SBATCH -N 1 | ||
+ | #SBATCH --ntasks-per-node=1 | ||
+ | #SBATCH -t 10:00:00 | ||
+ | #SBATCH --array=0-9 | ||
#SBATCH -p intel-E5-2695
+ | #SBATCH -o %A-%a.out | ||
+ | #SBATCH -e %A-%a.err | ||
+ | #SBATCH --mail-type=end, | ||
+ | #SBATCH --mail-user=username@univ-angers.fr | ||
./executable_name "${SLURM_ARRAY_TASK_ID}"
</code>
+ | |||
+ | The --array option can take special syntaxes, for irregular values or for value jumps: | ||
+ | |||
<code bash>
# irregular values: 0, 4 and 16
--array=0,4,16

# value jumps of +2: 1, 3, 5 and 7
--array=1-7:2
</code>
+ | |||
+ | === Multiple instances job === | ||
+ | |||
+ | It is common to run a program many times over many instances (benchmark). | ||
+ | |||
Consider the following tree:
<code>
job_name
├── error
├── instances
│   ├── bench1.txt
│   ├── bench2.txt
│   └── bench3.txt
├── job_name_exec
├── output
└── submit_instances_dir.slurm
</code>
+ | |||
It is easy to use an array job to execute job_name_exec on all the files in the instances directory. Just run the following commands:

  mkdir error output 2>/dev/null
  sbatch --array=0-2 submit_instances_dir.slurm

with the following batch file submit_instances_dir.slurm:
+ | |||
<code bash>
#!/bin/bash

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR-EMAIL
#SBATCH -o output/%A-%a
#SBATCH -e error/%A-%a

# INSTANCES is an array containing one entry per instance file
INSTANCES=(instances/*)

./job_name_exec "${INSTANCES[$SLURM_ARRAY_TASK_ID]}"
</code>
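To see how the INSTANCES mapping works without touching slurm, you can reproduce it locally (the directory name below is illustrative; in a real array job, SLURM_ARRAY_TASK_ID is set by slurm for each task):

```shell
# bash expands demo_instances/* into an alphabetically sorted array of
# file names, and SLURM_ARRAY_TASK_ID indexes into that array.
mkdir -p demo_instances
touch demo_instances/bench1.txt demo_instances/bench2.txt demo_instances/bench3.txt
INSTANCES=(demo_instances/*)

SLURM_ARRAY_TASK_ID=1   # slurm would set this for each array task
echo "${INSTANCES[$SLURM_ARRAY_TASK_ID]}"   # → demo_instances/bench2.txt
```

This is why the array option in the submission command must cover indices 0 to (number of instances - 1).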
+ | |||
+ | === Multiple instances job with multiple executions (Seed number) === | ||
+ | |||
Sometimes it is necessary to run the executable several times on the same instance while varying the seed, which makes it possible to generate and reproduce random numbers.
+ | |||
Consider the following tree:
+ | < | ||
+ | job_name | ||
+ | ├── error | ||
+ | ├── instances | ||
+ | │ ├── bench1.txt | ||
+ | │ ├── bench2.txt | ||
+ | │ └── bench3.txt | ||
+ | ├── job_name_exec | ||
+ | ├── output | ||
+ | ├── submit_instances_dir_with_seed.slurm | ||
+ | └── submit.sh | ||
+ | </ | ||
+ | |||
+ | Just run the following command: | ||
+ | |||
+ | ./submit.sh | ||
+ | |||
+ | with the following submit.sh file (remember to change the NB_SEED variable): | ||
+ | |||
<code bash>
#!/bin/bash

readonly NB_SEED=50

for instance in $(ls instances)
do
    sbatch --output output/${instance}-%A-%a --error error/${instance}-%A-%a --array=1-${NB_SEED} submit_instances_dir_with_seed.slurm "${instance}"
done
exit 0
</code>
+ | |||
+ | and the following submit_instances_dir_with_seed.slurm batch: | ||
+ | |||
<code bash>
#!/bin/bash
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR-EMAIL

echo "####### instance: ${1}"
echo "####### seed: ${SLURM_ARRAY_TASK_ID}"
echo
# Replace "echo nameApplication" with your executable
srun echo nameApplication ${1} ${SLURM_ARRAY_TASK_ID}
</code>
+ | |||
With this method, the SLURM_ARRAY_TASK_ID variable contains the seed, and you submit as many array jobs as there are instances in the instances directory.
You can easily find your output, which is named like this:

  output/<instance_name>-<job_id>-<seed>
+ | |||
=== Dependencies between jobs ===

You can define dependencies between jobs through the --depend option of sbatch:
+ | |||
+ | == Example == | ||
+ | |||
<code bash>
# job can begin after the specified jobs have started
sbatch --depend=after:job_id job.slurm

# job can begin after the specified jobs have run to completion with an exit code of zero
sbatch --depend=afterok:job_id job.slurm

# job can begin after the specified jobs have terminated
sbatch --depend=afterany:job_id job.slurm

# job can begin after the specified array job has completely and successfully finished
sbatch --depend=afterok:array_job_id job.slurm
</code>
+ | |||
You can also consult the sbatch documentation for more details on dependencies.
+ | ==== Steps jobs ==== | ||
+ | |||
<note warning>Prefer array jobs whenever possible: job steps are more complex to implement.</note>

You can use job steps for multiple and varied executions.

Job steps:
  * allow a job to be split into several tasks
  * are created by prefixing the command to be executed with the Slurm command **srun**
  * can run sequentially or in parallel

Each step can use n tasks on N compute nodes (the -n and -N options of srun). Each task has --cpus-per-task CPUs at its disposal, and --ntasks tasks are allocated per step.
+ | |||
+ | === Example === | ||
<code bash>
#!/bin/bash
#SBATCH --job-name=nameOfJob
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=2

#SBATCH --mail-type=END
#SBATCH --mail-user=username@univ-angers.fr

# Step of 2 tasks
srun before.sh

# 2 steps in parallel (because of &): task1 and task2 run in parallel.
# There is only one task per step (option -n1)
srun -n1 -N1 ./task1.sh &
srun -n1 -N1 ./task2.sh &

# Wait for the end of task1 and task2 before running the last step after.sh
wait

srun after.sh
</code>
+ | |||
+ | === Steps creation shell bash structure according to the source of the data === | ||
+ | |||
Depending on where the data comes from, the following bash structures can be used to create steps:
+ | |||
+ | == Example == | ||
<code bash>

# Loop on the elements of an array (here files):
files=('file1' 'file2' 'file3')
for f in "${files[@]}"; do
    # Adapt "[...]" to your needs
    srun -n1 -N1 [...] "$f" &
done

# Loop on the files of a directory:
while read f; do
    # Adapt "[...]" to your needs
    srun -n1 -N1 [...] "$f" &
done < <(ls "/path/to/directory")
# Use "ls -R" or "find" for a recursive traversal

# Reading a file line by line:
while read line; do
    # Adapt "[...]" to your needs
    srun -n1 -N1 [...] "$line" &
done < "/path/to/file"
</code>
+ | |||
==== Using OpenMP ====

Just add the **--cpus-per-task** option and export the **OMP_NUM_THREADS** variable:
+ | |||
+ | <code bash> | ||
+ | #!/bin/bash | ||
+ | # openmp_exec.slurm | ||
+ | #SBATCH --job-name=hello | ||
+ | #SBATCH --output=openmp_exec.out | ||
+ | #SBATCH --error=openmp_exec.err | ||
+ | #SBATCH --mail-type=end | ||
+ | #SBATCH --mail-user=user@univ-angers.fr | ||
+ | #SBATCH --nodes=1 | ||
+ | #SBATCH --ntasks-per-node=1 | ||
+ | #SBATCH --partition=intel-E5-2695 | ||
+ | |||
+ | #SBATCH --cpus-per-task=20 | ||
+ | |||
+ | export OMP_NUM_THREADS=20 | ||
+ | |||
./openmp_program
</code>
+ | |||
+ | ===== Specific use ===== | ||
+ | |||
+ | ==== Ssh access of compute nodes ==== | ||
+ | |||
By default, it is impossible to connect directly to the compute nodes via ssh. However, when justified, we can easily make temporary exceptions. In this case, please make an explicit request to technique [at] info.univ-angers.fr
+ | |||
+ | Users with ssh access must be subscribed to the calcul-hpc-leria-no-slurm-mode@listes.univ-angers.fr list. To subscribe to this mailing list, simply send an email to sympa@listes.univ-angers.fr with the subject: subscribe calcul-hpc-leria-no-slurm-mode Last name First name | ||
+ | |||
__Default rule:__ we do not launch a calculation on a server on which another user's calculation is already running, **even if this user does not use all of the server's resources**.
+ | |||
The **htop** command lets you know who is computing, with which resources, and for how long.
+ | |||
+ | If in doubt, contact the user who calculates directly by email or via the calcul-hpc-leria-no-slurm-mode@listes.univ-angers.fr list. | ||
+ | ==== Cuda ==== | ||
GPU cards are present on nodes star242, star253 and star254:
+ | * star242: P100 | ||
+ | * star253: 2*k20m | ||
+ | * star254: 2*k20m | ||
+ | |||
+ | Currently, version 9.1 of cuda-sdk-toolkit is installed. | ||
+ | |||
These nodes are currently excluded from the slurm submission lists (although the gpu partition already exists). To be able to use them, please make an explicit request to technique [at] info.univ-angers.fr
+ | |||
+ | ==== RAM node ==== | ||
LERIA has a node with 1.5 TB of RAM: star243.

This node is accessible by submission via slurm (ram partition). To be able to use it, please make an explicit request to technique [at] info.univ-angers.fr
+ | |||
+ | ==== Cplex ==== | ||
+ | |||
LERIA has an academic license for the CPLEX software.

The CPLEX library is installed at its default path.
+ | |||
+ | ==== Conda environments (Python) ==== | ||
+ | |||
The **conda activate <env_name>** command does not work on the compute nodes. Instead, activate your environment by sourcing its activate script (the path depends on your conda installation):

  source ./<path_to_conda>/bin/activate <env_name>

It may also be necessary to update the environment variables and to initialize conda on the node:

  source .bashrc
  conda init bash

The environment will stay active after your tasks are over. To deactivate the environment, source the deactivate script:

  source ./<path_to_conda>/bin/deactivate
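Putting it together, a batch script using a conda environment might look like the following sketch (the conda path, environment name and script name are all assumptions to adapt to your installation):

```shell
#!/bin/bash
#SBATCH --job-name=conda_job
#SBATCH --output=conda_job.out
#SBATCH --partition=intel-E5-2695

# Activate the environment by sourcing its activate script
# (conda activate does not work here); <path_to_conda> and
# <env_name> are placeholders for your own installation.
source ./<path_to_conda>/bin/activate <env_name>

srun python my_script.py

# Deactivate when done
source ./<path_to_conda>/bin/deactivate
```
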
+ | ===== FAQ ===== | ||
+ | |||
  * How do I know the resources of a partition? Example with the std partition:

  user@stargate~# scontrol show partition std
+ | |||
  * What does "Some of your processes may have been killed by the cgroup out-of-memory handler" in the standard output of your job mean?

You have exceeded the memory limit (--mem-per-cpu parameter).
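In that case, resubmit with a larger memory reservation, for example (the value and the script name are illustrative):

```shell
# Reserve 4 GB of RAM per CPU instead of the default 200 MB
sbatch --mem-per-cpu=4G job.slurm
```
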
+ | |||
  * How to get an interactive shell prompt on a compute node of your default partition?

  user@stargate~# srun --pty $SHELL
  user@star245~#

<note>The full command, with all options made explicit, is:

  srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL
</note>
+ | |||
  * How to get an interactive shell prompt on a specific compute node?

  user@stargate~# srun --nodelist=NODE_NAME --pty $SHELL
  user@NODE_NAME~#
+ | |||
+ | * How can I quote the resources of LERIA in my scientific writings? | ||
+ | |||
You can use the following BibTeX entry to cite the compute cluster in your publications:
<code latex>
@Misc{HPC_LERIA,
  title = {High Performance Computing Cluster of LERIA},
  year  = {2018},
  note  = {slurm/
}
</code>
+ | ==== Error during job submission ==== | ||
+ | |||
+ | * When submitting my jobs, I get the following error message: | ||
+ | |||
+ | srun: error: Unable to allocate resources: Requested node configuration is not available | ||
+ | |||
This probably means that you are trying to use a node without specifying the partition it belongs to. You must use the -p or --partition option followed by the name of the partition in which the node is located. To obtain this information, run:

  user@stargate# sinfo
+ | |||
+ | |||
===== List of installed software for high performance computing =====
+ | |||
+ | ==== Via apt-get ==== | ||
+ | |||
+ | * automake | ||
+ | * bison | ||
+ | * boinc-client | ||
+ | * bowtie2 | ||
+ | * build-essential | ||
+ | * cmake | ||
+ | * flex | ||
+ | * freeglut3 | ||
+ | * freeglut3-dev | ||
+ | * g++ | ||
+ | * g++-8 | ||
+ | * g++-7 | ||
+ | * g++-6 | ||
+ | * git | ||
+ | * glibc-doc | ||
+ | * glibc-source | ||
+ | * gnuplot | ||
+ | * libglpk-dev | ||
+ | * libgmp-dev | ||
+ | * liblapack3 | ||
+ | * liblapack-dev | ||
+ | * liblas3 | ||
+ | * liblas-dev | ||
+ | * libtool | ||
+ | * libopenblas-base | ||
+ | * maven | ||
+ | * nasm | ||
+ | * openjdk-8-jdk-headless | ||
+ | * r-base | ||
+ | * r-base-dev | ||
+ | * regina-rexx | ||
+ | * samtools | ||
+ | * screen | ||
+ | * strace | ||
+ | * subversion | ||
+ | * tmux | ||
+ | * valgrind | ||
+ | * valgrind-dbg | ||
+ | * valgrind-mpi | ||
+ | |||
+ | ==== Via pip ==== | ||
+ | |||
+ | * keras | ||
+ | * scikit-learn | ||
  * tensorflow
  * tensorflow-gpu # on GPU nodes
+ | |||
+ | ==== GPU node via apt-get ==== | ||
+ | |||
+ | * libglu1-mesa-dev | ||
+ | * libx11-dev | ||
+ | * libxi-dev | ||
+ | * libxmu-dev | ||
+ | * libgl1-mesa-dev | ||
+ | * linux-source | ||
+ | * linux-headers | ||
+ | * linux-image | ||
+ | * nvidia-cuda-toolkit | ||
+ | |||
+ | ==== Software installation ==== | ||
+ | |||
Maybe a program is missing from the list above. In this case, four options are available to you:

  * request the software you want to install from technique [at] info.univ-angers.fr
  * install it yourself via pip, pip2 or pip3
  * install it yourself via conda
  * install it yourself by compiling the sources in your home directory
+ | |||
+ | ===== Visualize the load of the high-performance computing cluster ===== | ||
+ | |||
For the links below, you will need to authenticate with your LDAP login and password (the same as for the ENT).
+ | ==== Cluster load overview ==== | ||
+ | |||
+ | https:// | ||
+ | |||
+ | ==== Details per node ==== | ||
+ | |||
+ | https:// | ||
+ | |||
+ | |||
+ | ====== Acknowledgement ====== | ||
+ | |||
+ | [[leria: |