[Solved] Gromacs Error


beuk
Posts: 6
Joined: Fri Apr 10, 2020 2:31 pm

[Solved] Gromacs Error

Post by beuk »

Hi,

We are running F@H on our cluster at the Fabrique du Loch - FabLab, Auray, France, and we have an issue with one of our nodes.
While the slim nodes (16 cores/32 GB RAM; I think CHARMM is running on these) run fine, our SMP node (64 logical/32 physical cores, 1 TB RAM) hits errors from the Gromacs core:

Code: Select all

14:28:54:WU00:FS00:Starting
14:28:54:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /root/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 5606 -checkpoint 15 -np 63
14:28:54:WU00:FS00:Started FahCore on PID 6422
14:28:54:WU00:FS00:Core PID:6426
14:28:54:WU00:FS00:FahCore 0xa7 started
14:28:54:WU00:FS00:0xa7:*********************** Log Started 2020-04-10T14:28:54Z ***********************
14:28:54:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
14:28:54:WU00:FS00:0xa7:       Type: 0xa7
14:28:54:WU00:FS00:0xa7:       Core: Gromacs
14:28:54:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 6422 -checkpoint 15 -np
14:28:54:WU00:FS00:0xa7:             63
14:28:54:WU00:FS00:0xa7:************************************ CBang *************************************
14:28:54:WU00:FS00:0xa7:       Date: Nov 5 2019
14:28:54:WU00:FS00:0xa7:       Time: 06:06:57
14:28:54:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
14:28:54:WU00:FS00:0xa7:     Branch: master
14:28:54:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
14:28:54:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
14:28:54:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
14:28:54:WU00:FS00:0xa7:       Bits: 64
14:28:54:WU00:FS00:0xa7:       Mode: Release
14:28:54:WU00:FS00:0xa7:************************************ System ************************************
14:28:54:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
14:28:54:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 45 Stepping 7
14:28:54:WU00:FS00:0xa7:       CPUs: 64
14:28:54:WU00:FS00:0xa7:     Memory: 1007.37GiB
14:28:54:WU00:FS00:0xa7:Free Memory: 1003.03GiB
14:28:54:WU00:FS00:0xa7:    Threads: POSIX_THREADS
14:28:54:WU00:FS00:0xa7: OS Version: 4.18
14:28:54:WU00:FS00:0xa7:Has Battery: false
14:28:54:WU00:FS00:0xa7: On Battery: false
14:28:54:WU00:FS00:0xa7: UTC Offset: 2
14:28:54:WU00:FS00:0xa7:        PID: 6426
14:28:54:WU00:FS00:0xa7:        CWD: /root/work
14:28:54:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
14:28:54:WU00:FS00:0xa7:    Version: 0.0.18
14:28:54:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:28:54:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
14:28:54:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
14:28:54:WU00:FS00:0xa7:       Date: Nov 5 2019
14:28:54:WU00:FS00:0xa7:       Time: 06:13:26
14:28:54:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
14:28:54:WU00:FS00:0xa7:     Branch: master
14:28:54:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
14:28:54:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
14:28:54:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
14:28:54:WU00:FS00:0xa7:       Bits: 64
14:28:54:WU00:FS00:0xa7:       Mode: Release
14:28:54:WU00:FS00:0xa7:************************************ Build *************************************
14:28:54:WU00:FS00:0xa7:       SIMD: avx_256
14:28:54:WU00:FS00:0xa7:********************************************************************************
14:28:54:WU00:FS00:0xa7:Project: 16422 (Run 1449, Clone 1, Gen 18)
14:28:54:WU00:FS00:0xa7:Unit: 0x0000001396880e6e5e8bdfe2e38d5b85
14:28:54:WU00:FS00:0xa7:Reading tar file core.xml
14:28:54:WU00:FS00:0xa7:Reading tar file frame18.tpr
14:28:54:WU00:FS00:0xa7:Digital signatures verified
14:28:54:WU00:FS00:0xa7:Calling: mdrun -s frame18.tpr -o frame18.trr -x frame18.xtc -cpt 15 -nt 63
14:28:54:WU00:FS00:0xa7:Steps: first=4500000 total=250000
14:28:54:WU00:FS00:0xa7:ERROR:
14:28:54:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
14:28:54:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
14:28:54:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
14:28:54:WU00:FS00:0xa7:ERROR:
14:28:54:WU00:FS00:0xa7:ERROR:Fatal error:
14:28:54:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 49 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
14:28:54:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
14:28:54:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
14:28:54:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
14:28:54:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
14:28:54:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
14:28:59:WU00:FS00:0xa7:WARNING:Unexpected exit() call
14:28:59:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
14:28:59:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
14:28:59:WU00:FS00:0xa7:Saving result file md.log
14:28:59:WU00:FS00:0xa7:Saving result file science.log
14:29:00:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
I don't think the issue is on our side, but I'm not 100% sure.

How can we solve this?

With my best regards

Beuk
Last edited by beuk on Sun Apr 12, 2020 2:06 pm, edited 1 time in total.
felipeportella
Posts: 6
Joined: Sat Mar 28, 2020 6:06 pm

Re: Gromacs Error

Post by felipeportella »

This core "Gromacs" has know problems with many cores. Check this response from bruce about it in another thread:

viewtopic.php?f=16&t=34035&p=323426&hilit=GROMACS#p323426

As a workaround, I created some smaller CPU folding slots in config.xml to avoid the problem. For your 32-core nodes it would look like this:

Code: Select all

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <cpus v='16'/>
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='16'/>
  </slot>
Best
portella
Last edited by felipeportella on Sat Apr 11, 2020 4:19 pm, edited 1 time in total.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Gromacs Error

Post by Neil-B »

It is not just the slot size that can cause this issue … CPU slots whose core counts are large primes (7 and above, sometimes even 5), or multiples of such primes, can also fail this way.

Reducing the core count to something not divisible by any prime of 5 or greater may resolve the issue. Note that your log shows mdrun being called with -nt 63, and 63 = 7 × 9 is divisible by 7; see the sketch below.
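
Following felipeportella's config.xml example above, a slot pinned to a count whose only prime factors are 2 and 3 avoids those bad decompositions. This is just a sketch; 48 below is one such choice for the 64-thread node, not the only one:

Code: Select all

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <!-- 48 = 2^4 x 3: no prime factor of 5 or greater -->
    <cpus v='48'/>
  </slot>
Note that FAHClient reads config.xml at startup, so restart the client after changing the slot layout.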
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
JimboPalmer
Posts: 2573
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Gromacs Error

Post by JimboPalmer »

Windows (which you are not using) has an issue with more than 32 cores. I would be tempted to try 2 CPU slots of 32 CPUs each. If that still bites you, try 4 slots of 16 CPUs each.

More CPUs in one slot do harder science, so they earn more points. But a working node beats an ideal node that can't find work.
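
A sketch of that two-slot layout for the 64-thread node, in the same config.xml pattern felipeportella showed above:

Code: Select all

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <cpus v='32'/>
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='32'/>
  </slot>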
Last edited by JimboPalmer on Fri Apr 10, 2020 6:40 pm, edited 1 time in total.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Gromacs Error

Post by bruce »

The suggested 16 will probably work, but if not, try 12.

Unfortunately it's difficult to predict which WUs will or won't work with a given number of threads, so the next WU may work with 32 or whatever. I've never seen 12 fail.
beuk
Posts: 6
Joined: Fri Apr 10, 2020 2:31 pm

Re: Gromacs Error

Post by beuk »

Dear All,

Many thanks for these answers.

I will reduce to 16-core slots, since even our 32-core nodes face this issue.

It seems that FAHClient retries the same WU indefinitely when it fails, randomly leaving nodes stuck in a loop. I will create a cron job to detect that and restart FAHClient + erase the work dir when it occurs (see the sketch below).
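
For reference, a minimal watchdog sketch of that idea. Every path and value here is an assumption — the data directory is guessed from the log above (CWD /root/work, so log.txt and work/ under /root), and the init script name is the stock packaged one — so adapt it to your install:

Code: Select all

#!/bin/bash
# Watchdog: if the same fatal Gromacs error keeps repeating in the
# client log, restart FAHClient and dump the stuck work unit.
# LOG, WORKDIR and THRESHOLD are assumptions -- adjust for your setup.
LOG=/root/log.txt
WORKDIR=/root/work
THRESHOLD=5

# Count recent occurrences of the fatal error in the log tail.
fails=$(tail -n 2000 "$LOG" | grep -c "There is no domain decomposition")

if [ "$fails" -ge "$THRESHOLD" ]; then
    /etc/init.d/FAHClient stop     # or: systemctl stop FAHClient
    rm -rf "$WORKDIR"/*            # dumped WU will be reassigned by the server
    /etc/init.d/FAHClient start
fi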

After a few days of running, I have a few more questions, if I may ask them in this same thread:

1. Is a lot of RAM useful for F@H? It seems to me that the calculations do not use much RAM, so dedicating our 1 TB RAM node doesn't seem like a good idea; we should focus on the slim 32 GB RAM nodes. Am I wrong?
2. Would F@H make any use of an InfiniBand network, e.g. for MPI runs?

I wish you all a happy Easter Sunday! :D

With my best regards

Beuk
felipeportella
Posts: 6
Joined: Sat Mar 28, 2020 6:06 pm

Re: Gromacs Error

Post by felipeportella »

beuk,

1. I'm new to the project, but from what I've observed so far it's not memory intensive ... 1 TB of RAM will definitely not be fully used. We added our fat nodes for the extra compute power, even knowing that the whole memory would not be used.

2. FAH runs each work unit on a single node, so there is no need for InfiniBand at all. If you have multiple nodes, start a separate instance of FAH on each one ...

Best and a happy Easter for you too.
Portella
beuk
Posts: 6
Joined: Fri Apr 10, 2020 2:31 pm

Re: Gromacs Error

Post by beuk »

Dear Portella,

Many thanks for this feedback.

So I will concentrate on the slim nodes first, then the fat nodes, and give priority to the nodes without interconnect.

I will update the topic subject to solved. :)

By the way, in case it helps someone, we are using a home-made open-source cluster stack based on Ansible, available here: https://github.com/bluebanquise/bluebanquise
The Slurm addon is perfect for launching FAH jobs (a sample batch script is sketched below).
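
For anyone wanting to reproduce this, a minimal sbatch sketch that dedicates one node to one FAHClient instance. The partition name and config path are hypothetical, and the FAHClient path is guessed from the FAHCoreWrapper path in the log above:

Code: Select all

#!/bin/bash
#SBATCH --job-name=fah
#SBATCH --partition=compute      # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --exclusive              # one client per node
#SBATCH --time=7-00:00:00

# Run the client in the foreground so Slurm tracks it;
# the CPU slot sizes are defined in config.xml.
exec /usr/bin/FAHClient --config /root/config.xml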

With my best regards

Beuk
felipeportella
Posts: 6
Joined: Sat Mar 28, 2020 6:06 pm

Re: [Solved] Gromacs Error

Post by felipeportella »

beuk,

We are using SLURM as well, to launch a Singularity container with FAH ... if it helps your scenario, here is the repo: https://github.com/felipeportella/foldi ... larity-gpu
(we didn't include the SLURM jobs in the repo as they are very specific to each cluster due to queue names etc., but if you need them we can share some templates)
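
As a rough illustration of that approach (not our actual template): the image name and bind path below are hypothetical, and --nv is only needed on GPU nodes:

Code: Select all

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

# Launch the FAH container; fah.sif and /scratch/fah are hypothetical names.
singularity run --nv \
    --bind /scratch/fah:/fah \
    fah.sif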

Best,
Portella
beuk
Posts: 6
Joined: Fri Apr 10, 2020 2:31 pm

Re: [Solved] Gromacs Error

Post by beuk »

Dear Portella,

Many thanks for this! I am learning Singularity, so this will be a good place to start :-)

With my best regards

Beuk
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: [Solved] Gromacs Error

Post by _r2w_ben »

I noticed in your log that 63 cores (-nt 63) were being used. Do the 64-core nodes have a GPU? That would cause the client to reduce the cores assigned to the CPU slot by one. Based on some thread-count testing I've done, 63 is more likely to fail than 64.
beuk
Posts: 6
Joined: Fri Apr 10, 2020 2:31 pm

Re: [Solved] Gromacs Error

Post by beuk »

Dear _r2w_ben,

No, we removed the GPUs from our SMP node to use them somewhere else (and also because Nvidia dropped Linux driver support for our GPUs; CUDA drivers for them can only be found on Windows 10 today, so we now also have Windows 10 compute nodes alongside our CentOS ones :roll: ).

It seems FAH always uses N-1 cores in our case.

We reduced all jobs to 16-core slots, and we no longer experience any issues. So 16 was the magic number for us, and the cluster now computes a lot more :D
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: [Solved] Gromacs Error

Post by Neil-B »

… 32-core slots are usually pretty "bulletproof" as well, and they get the science done quicker, but if 16s work for you that is cool :)
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
beuk
Posts: 6
Joined: Fri Apr 10, 2020 2:31 pm

Re: [Solved] Gromacs Error

Post by beuk »

Dear Neil-B,

Thank you for this feedback.

I will add more 32-core nodes this week; I will let them run without explicit slots (so the default single 32-core slot) and share the error with you if it happens, so it can help with debugging if needed :)