There is no domain decomposition for 50 ranks that is compat

Posted: Thu Apr 23, 2020 8:37 am
by craftit
My CPU client keeps getting caught in an infinite restart loop with the following error. The only way to get things working again is to delete the 'work' directory and restart. Is there a way of preventing this particular type of work unit from running? Others work fine.

Code:

08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
08:33:22:WU03:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:33:22:WU03:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:33:22:WU03:FS00:0xa7:ERROR:
08:33:22:WU03:FS00:0xa7:ERROR:Fatal error:
08:33:22:WU03:FS00:0xa7:ERROR:There is no domain decomposition for 50 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
08:33:22:WU03:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:33:22:WU03:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:33:22:WU03:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:33:22:WU03:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
My system:

Code:

08:33:22:WU03:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:33:22:WU03:FS00:0xa7:       Type: 0xa7
08:33:22:WU03:FS00:0xa7:       Core: Gromacs
08:33:22:WU03:FS00:0xa7:       Args: -dir 03 -suffix 01 -version 705 -lifeline 16575 -checkpoint 15 -np
08:33:22:WU03:FS00:0xa7:             62
08:33:22:WU03:FS00:0xa7:************************************ CBang *************************************
08:33:22:WU03:FS00:0xa7:       Date: Nov 5 2019
08:33:22:WU03:FS00:0xa7:       Time: 06:06:57
08:33:22:WU03:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:33:22:WU03:FS00:0xa7:     Branch: master
08:33:22:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
08:33:22:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:33:22:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:33:22:WU03:FS00:0xa7:       Bits: 64
08:33:22:WU03:FS00:0xa7:       Mode: Release
08:33:22:WU03:FS00:0xa7:************************************ System ************************************
08:33:22:WU03:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
08:33:22:WU03:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
08:33:22:WU03:FS00:0xa7:       CPUs: 64
08:33:22:WU03:FS00:0xa7:     Memory: 125.53GiB
08:33:22:WU03:FS00:0xa7:Free Memory: 115.55GiB
08:33:22:WU03:FS00:0xa7:    Threads: POSIX_THREADS
08:33:22:WU03:FS00:0xa7: OS Version: 4.15
08:33:22:WU03:FS00:0xa7:Has Battery: false
08:33:22:WU03:FS00:0xa7: On Battery: false
08:33:22:WU03:FS00:0xa7: UTC Offset: 1
08:33:22:WU03:FS00:0xa7:        PID: 16579
08:33:22:WU03:FS00:0xa7:        CWD: /home/charles/work
08:33:22:WU03:FS00:0xa7:******************************** Build - libFAH ********************************
08:33:22:WU03:FS00:0xa7:    Version: 0.0.18
08:33:22:WU03:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:33:22:WU03:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:33:22:WU03:FS00:0xa7:   Homepage: https://foldingathome.org/
08:33:22:WU03:FS00:0xa7:       Date: Nov 5 2019
08:33:22:WU03:FS00:0xa7:       Time: 06:13:26
08:33:22:WU03:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:33:22:WU03:FS00:0xa7:     Branch: master
08:33:22:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
08:33:22:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:33:22:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:33:22:WU03:FS00:0xa7:       Bits: 64
08:33:22:WU03:FS00:0xa7:       Mode: Release
08:33:22:WU03:FS00:0xa7:************************************ Build *************************************
08:33:22:WU03:FS00:0xa7:       SIMD: avx_256

Re: There is no domain decomposition for 50 ranks that is co

Posted: Thu Apr 23, 2020 9:00 am
by Neil-B
Actually, if you reduce the number of cores on the slot you don't have to delete the work directory; the work unit will finish once you find an acceptable core count. It looks like you are running a 62-core slot, so 60 looks like a good choice to try. See this thread for some research into which core counts might or might not work well: viewtopic.php?f=72&t=34350&p=328632&hil ... on#p328109.

If you regularly hit this issue then a permanent shift from a 62-core slot to a 60-core slot might help. Or, even though I am an advocate for running the biggest slots possible, you may find that two smaller slots are necessary to keep you in WUs that don't have issues.

Re: There is no domain decomposition for 50 ranks that is co

Posted: Thu Apr 23, 2020 11:00 am
by anandhanju
Can you provide the entirety of the log that contains
a) the number of cores you've allocated to CPU folding (to confirm this is 50) and
b) the project number and Run, Clone, Gen identifiers for the work unit?

This will help the researchers block this project from getting allocated to clients that are using that many cores.

Re: There is no domain decomposition for 50 ranks that is co

Posted: Thu Apr 23, 2020 11:36 am
by Neil-B
I think you may find this is a 62-core slot which has offloaded 12 cores to PME and is having problems with the remaining 50, but let's see the full log.

Re: There is no domain decomposition for 50 ranks that is co

Posted: Fri Apr 24, 2020 10:53 am
by craftit
Yes, there are 62 cores allocated, though I see no obvious way of changing this. I will try to find more information when it happens again.

Re: There is no domain decomposition for 50 ranks that is co

Posted: Fri Apr 24, 2020 11:19 am
by _r2w_ben
craftit wrote: Yes, there are 62 cores allocated, though I see no obvious way of changing this? I will try and find more information when it happens again.
On Linux, you'll need to edit /etc/fahclient/config.xml

Replace this part

Code:

<slot id='0' type='CPU' />
with this

Code:

<slot id='0' type='CPU'>
    <cpus v='60'/>
</slot>
If you get another domain decomposition error in the future, change 60 to 45 and let the work unit finish. Then edit back to 60 to use all cores again on the following work unit.

Re: There is no domain decomposition for 50 ranks that is co

Posted: Fri Apr 24, 2020 5:18 pm
by foldy
Another option is to create two CPU slots with 30 threads each.
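
A minimal sketch of what that two-slot layout might look like, assuming the same /etc/fahclient/config.xml slot format shown earlier in the thread. The slot ids 0 and 1 are illustrative, and nothing here confirms that 30 threads avoids the domain decomposition error for every work unit:

Code:

<!-- Assumption: replaces the single 62-core CPU slot; ids are illustrative -->
<slot id='0' type='CPU'>
    <cpus v='30'/>
</slot>
<slot id='1' type='CPU'>
    <cpus v='30'/>
</slot>
Each slot would then fetch and run its own work unit, so a work unit that fails on one core count only stalls that slot.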