There is no domain decomposition for 50 ranks that is compat

Postby craftit » Thu Apr 23, 2020 9:37 am

My CPU client keeps getting caught in an infinite restart loop with the following error. The only way to get things working again is to delete the 'work' directory and restart. Is there a way of preventing this particular type of work unit from running? Others work fine.

Code:
08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
08:33:22:WU03:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:33:22:WU03:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:33:22:WU03:FS00:0xa7:ERROR:
08:33:22:WU03:FS00:0xa7:ERROR:Fatal error:
08:33:22:WU03:FS00:0xa7:ERROR:There is no domain decomposition for 50 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
08:33:22:WU03:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:33:22:WU03:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:33:22:WU03:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:33:22:WU03:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------


My system:

Code:
08:33:22:WU03:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:33:22:WU03:FS00:0xa7:       Type: 0xa7
08:33:22:WU03:FS00:0xa7:       Core: Gromacs
08:33:22:WU03:FS00:0xa7:       Args: -dir 03 -suffix 01 -version 705 -lifeline 16575 -checkpoint 15 -np
08:33:22:WU03:FS00:0xa7:             62
08:33:22:WU03:FS00:0xa7:************************************ CBang *************************************
08:33:22:WU03:FS00:0xa7:       Date: Nov 5 2019
08:33:22:WU03:FS00:0xa7:       Time: 06:06:57
08:33:22:WU03:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:33:22:WU03:FS00:0xa7:     Branch: master
08:33:22:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
08:33:22:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:33:22:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:33:22:WU03:FS00:0xa7:       Bits: 64
08:33:22:WU03:FS00:0xa7:       Mode: Release
08:33:22:WU03:FS00:0xa7:************************************ System ************************************
08:33:22:WU03:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
08:33:22:WU03:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
08:33:22:WU03:FS00:0xa7:       CPUs: 64
08:33:22:WU03:FS00:0xa7:     Memory: 125.53GiB
08:33:22:WU03:FS00:0xa7:Free Memory: 115.55GiB
08:33:22:WU03:FS00:0xa7:    Threads: POSIX_THREADS
08:33:22:WU03:FS00:0xa7: OS Version: 4.15
08:33:22:WU03:FS00:0xa7:Has Battery: false
08:33:22:WU03:FS00:0xa7: On Battery: false
08:33:22:WU03:FS00:0xa7: UTC Offset: 1
08:33:22:WU03:FS00:0xa7:        PID: 16579
08:33:22:WU03:FS00:0xa7:        CWD: /home/charles/work
08:33:22:WU03:FS00:0xa7:******************************** Build - libFAH ********************************
08:33:22:WU03:FS00:0xa7:    Version: 0.0.18
08:33:22:WU03:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:33:22:WU03:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:33:22:WU03:FS00:0xa7:   Homepage: https://foldingathome.org/
08:33:22:WU03:FS00:0xa7:       Date: Nov 5 2019
08:33:22:WU03:FS00:0xa7:       Time: 06:13:26
08:33:22:WU03:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:33:22:WU03:FS00:0xa7:     Branch: master
08:33:22:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
08:33:22:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:33:22:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:33:22:WU03:FS00:0xa7:       Bits: 64
08:33:22:WU03:FS00:0xa7:       Mode: Release
08:33:22:WU03:FS00:0xa7:************************************ Build *************************************
08:33:22:WU03:FS00:0xa7:       SIMD: avx_256
craftit
 
Posts: 2
Joined: Thu Apr 23, 2020 9:30 am

Re: There is no domain decomposition for 50 ranks that is co

Postby Neil-B » Thu Apr 23, 2020 10:00 am

Actually, if you reduce the number of cores on the slot you don't have to delete the work directory; the work unit will finish once you find an acceptable core count. It looks like you are running a 62-core slot, so 60 looks like a good choice to try. See this thread for some research into which core counts might or might not work well: https://foldingforum.org/viewtopic.php?f=72&t=34350&p=328632&hilit=decomposition#p328109.

If you regularly hit this issue, then a permanent shift from 62 cores to 60 cores might help. Or, even though I am an advocate for running the biggest slots possible, you may find that two smaller slots are necessary to keep you in WUs that don't have issues.
1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent, Quadro K420 1GB, FAH 7.6.13
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro, Quadro M1000M 2GB, FAH 7.6.13
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro, GTX 750Ti 2GB, FAH 7.6.13
Neil-B
 
Posts: 1217
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: There is no domain decomposition for 50 ranks that is co

Postby anandhanju » Thu Apr 23, 2020 12:00 pm

Can you provide the entirety of the log that contains
a) the number of cores you've allocated to CPU folding (to confirm this is 50) and
b) the project number and Run, Clone, Gen identifiers for the work unit?

This will help the researchers block this project from getting allocated to clients that are using that many cores.
anandhanju
 
Posts: 508
Joined: Mon Dec 03, 2007 5:33 am
Location: Australia

Re: There is no domain decomposition for 50 ranks that is co

Postby Neil-B » Thu Apr 23, 2020 12:36 pm

I think you may find this is a 62-core slot that has offloaded 12 cores to PME and is having problems with the remaining 50, but let's see the full log.
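
To illustrate why some rank counts fail, here is a rough Python sketch of the geometric check behind the error message: the ranks have to be arranged as an nx × ny × nz grid, and every cell must be at least the minimum cell size along each axis. This is only a simplified model, not what GROMACS actually does internally (the real code also weighs PME rank counts, `-dds`, and other settings). The 6.5 nm cubic box is a made-up value, since the log does not show the real box, and the function names are hypothetical.

```python
def grid_factorizations(ranks):
    """Yield every ordered (nx, ny, nz) with nx * ny * nz == ranks."""
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        rest = ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                yield nx, ny, rest // ny


def feasible_grids(ranks, box_nm, min_cell_nm):
    """Return the grids whose cells are at least min_cell_nm wide on every axis.

    box_nm is an (x, y, z) tuple of box lengths in nm. This assumes a plain
    rectangular box and ignores everything else GROMACS takes into account.
    """
    return [
        grid
        for grid in grid_factorizations(ranks)
        if all(length / n >= min_cell_nm for length, n in zip(box_nm, grid))
    ]


if __name__ == "__main__":
    hypothetical_box = (6.5, 6.5, 6.5)  # assumed; the real box is not in the log
    min_cell = 1.37225                  # taken from the error message
    for ranks in (50, 48):
        print(ranks, feasible_grids(ranks, hypothetical_box, min_cell))
```

With that assumed box, at most 4 cells fit along each axis, so 50 (= 2 × 5 × 5) has no valid grid while 48 (= 4 × 4 × 3) does. Counts whose prime factors are small tend to have more usable arrangements.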
Neil-B
 
Posts: 1217
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: There is no domain decomposition for 50 ranks that is co

Postby craftit » Fri Apr 24, 2020 11:53 am

Yes, there are 62 cores allocated, though I see no obvious way of changing this? I will try to find more information when it happens again.
craftit
 
Posts: 2
Joined: Thu Apr 23, 2020 9:30 am

Re: There is no domain decomposition for 50 ranks that is co

Postby _r2w_ben » Fri Apr 24, 2020 12:19 pm

craftit wrote:Yes, there are 62 cores allocated, though I see no obvious way of changing this? I will try to find more information when it happens again.

On Linux, you'll need to edit /etc/fahclient/config.xml

Replace this part
Code:
<slot id='0' type='CPU' />

with this
Code:
<slot id='0' type='CPU'>
    <cpus v='60'/>
</slot>

If you get another domain decomposition error in the future, change 60 to 45 and let the work unit finish. Then edit back to 60 to use all cores again on the following work unit.
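
If you end up switching between 60 and 45 often, the edit can be scripted. Below is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It assumes the slot elements look like the snippets above and sit under a single root element; the helper name `set_slot_cpus` and the `<config>` root in the usage example are my own assumptions. Note that ElementTree drops XML comments when it rewrites the text, so keep a backup of the original file.

```python
import xml.etree.ElementTree as ET


def set_slot_cpus(config_xml, slot_id, cpus):
    """Return config_xml with <cpus v='...'/> set on the slot with this id.

    config_xml is the full text of the config file. The <cpus> child is
    created if the slot does not have one yet.
    """
    root = ET.fromstring(config_xml)
    for slot in root.iter("slot"):
        if slot.get("id") == str(slot_id):
            node = slot.find("cpus")
            if node is None:
                node = ET.SubElement(slot, "cpus")
            node.set("v", str(cpus))
            break
    else:
        # No slot matched: refuse to guess rather than add a new slot.
        raise ValueError(f"no slot with id {slot_id!r} found")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical minimal config; a real file contains more settings.
    sample = "<config><slot id='0' type='CPU'/></config>"
    print(set_slot_cpus(sample, 0, 60))
```

To apply it for real you would read /etc/fahclient/config.xml, pass the text through the function, and write the result back with suitable permissions.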
_r2w_ben
 
Posts: 277
Joined: Wed Apr 23, 2008 4:11 pm

Re: There is no domain decomposition for 50 ranks that is co

Postby foldy » Fri Apr 24, 2020 6:18 pm

Another option is to create 2 CPU slots with 30 threads each.
foldy
 
Posts: 1942
Joined: Sat Dec 01, 2012 4:43 pm

