Domain decomposition error in project 14524

Moderators: Site Moderators, FAHC Science Team

Domain decomposition error in project 14524

Postby Sarr » Wed Jun 10, 2020 12:20 am

Checking on my machine, I noticed it was continually trying to start the following work unit but failing, stuck in a retry loop. I pasted the logs below. The CPU is a Ryzen 5 1600, and the machine also runs a GPU. I left the thread count at -1 so the client decides automatically; it configures the slot to 11 threads (12 minus 1 for the GPU), and then always reduces that to 10. However, this log seems to show that 10 threads is also incompatible with this project.

Code: Select all
23:11:46:WU02:FS00:Starting
23:11:46:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 02 -suffix 01 -version 706 -lifeline 2669 -checkpoint 15 -np 11
23:11:46:WU02:FS00:Started FahCore on PID 26940
23:11:46:WU02:FS00:Core PID:26944
23:11:46:WU02:FS00:FahCore 0xa7 started
23:11:46:WU02:FS00:0xa7:*********************** Log Started 2020-06-09T23:11:46Z ***********************
23:11:46:WU02:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
23:11:46:WU02:FS00:0xa7:       Type: 0xa7
23:11:46:WU02:FS00:0xa7:       Core: Gromacs
23:11:46:WU02:FS00:0xa7:       Args: -dir 02 -suffix 01 -version 706 -lifeline 26940 -checkpoint 15 -np
23:11:46:WU02:FS00:0xa7:             11
23:11:46:WU02:FS00:0xa7:************************************ CBang *************************************
23:11:46:WU02:FS00:0xa7:       Date: Nov 5 2019
23:11:46:WU02:FS00:0xa7:       Time: 06:06:57
23:11:46:WU02:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
23:11:46:WU02:FS00:0xa7:     Branch: master
23:11:46:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
23:11:46:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
23:11:46:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
23:11:46:WU02:FS00:0xa7:       Bits: 64
23:11:46:WU02:FS00:0xa7:       Mode: Release
23:11:46:WU02:FS00:0xa7:************************************ System ************************************
23:11:46:WU02:FS00:0xa7:        CPU: AMD Ryzen 5 1600 Six-Core Processor
23:11:46:WU02:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
23:11:46:WU02:FS00:0xa7:       CPUs: 12
23:11:46:WU02:FS00:0xa7:     Memory: 15.66GiB
23:11:46:WU02:FS00:0xa7:Free Memory: 1.27GiB
23:11:46:WU02:FS00:0xa7:    Threads: POSIX_THREADS
23:11:46:WU02:FS00:0xa7: OS Version: 4.15
23:11:46:WU02:FS00:0xa7:Has Battery: false
23:11:46:WU02:FS00:0xa7: On Battery: false
23:11:46:WU02:FS00:0xa7: UTC Offset: -4
23:11:46:WU02:FS00:0xa7:        PID: 26944
23:11:46:WU02:FS00:0xa7:        CWD: /var/lib/fahclient/work
23:11:46:WU02:FS00:0xa7:******************************** Build - libFAH ********************************
23:11:46:WU02:FS00:0xa7:    Version: 0.0.18
23:11:46:WU02:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:11:46:WU02:FS00:0xa7:  Copyright: 2019 foldingathome.org
23:11:46:WU02:FS00:0xa7:   Homepage: https://foldingathome.org/
23:11:46:WU02:FS00:0xa7:       Date: Nov 5 2019
23:11:46:WU02:FS00:0xa7:       Time: 06:13:26
23:11:46:WU02:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
23:11:46:WU02:FS00:0xa7:     Branch: master
23:11:46:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
23:11:46:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
23:11:46:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
23:11:46:WU02:FS00:0xa7:       Bits: 64
23:11:46:WU02:FS00:0xa7:       Mode: Release
23:11:46:WU02:FS00:0xa7:************************************ Build *************************************
23:11:46:WU02:FS00:0xa7:       SIMD: avx_256
23:11:46:WU02:FS00:0xa7:********************************************************************************
23:11:46:WU02:FS00:0xa7:Project: 14524 (Run 916, Clone 1, Gen 20)
23:11:46:WU02:FS00:0xa7:Unit: 0x0000002180fccb0a5e459b90f96c8f13
23:11:46:WU02:FS00:0xa7:Reading tar file core.xml
23:11:46:WU02:FS00:0xa7:Reading tar file frame20.tpr
23:11:46:WU02:FS00:0xa7:Digital signatures verified
23:11:46:WU02:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
23:11:46:WU02:FS00:0xa7:Calling: mdrun -s frame20.tpr -o frame20.trr -x frame20.xtc -cpt 15 -nt 10
23:11:46:WU02:FS00:0xa7:Steps: first=5000000 total=250000
23:11:46:WU02:FS00:0xa7:ERROR:
23:11:46:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
23:11:46:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
23:11:46:WU02:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
23:11:46:WU02:FS00:0xa7:ERROR:
23:11:46:WU02:FS00:0xa7:ERROR:Fatal error:
23:11:46:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
23:11:46:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
23:11:46:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
23:11:46:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
23:11:46:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
23:11:46:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
23:11:51:WU02:FS00:0xa7:WARNING:Unexpected exit() call
23:11:51:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
23:11:51:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
23:11:51:WU02:FS00:0xa7:Saving result file md.log
23:11:51:WU02:FS00:0xa7:Saving result file science.log
23:11:51:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Last edited by Sarr on Wed Jun 10, 2020 12:24 am, edited 1 time in total.
Sarr
 
Posts: 11
Joined: Fri Apr 10, 2020 2:12 am

Re: Domain decomposition error in project 14524

Postby Sarr » Wed Jun 10, 2020 12:23 am

Manually changing the thread count to 8 makes this WU run fine. I just wanted to report that it was assigned to a system configured for a thread count it is apparently incompatible with.
Sarr
 
Posts: 11
Joined: Fri Apr 10, 2020 2:12 am

Re: Domain decomposition error in project 14524

Postby bruce » Wed Jun 10, 2020 1:01 am

At the time the WU was downloaded, your slot was configured for 11 threads. That's a number FahCore_a7 has trouble with because it's prime. The client recognizes this problem, issues the message "Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3", and reduces the count to 10 threads. Unfortunately, 10 is also a "bad" number of threads, so I would expect to see a second message saying "Reducing thread count from 10 to 9 to avoid domain decomposition by a prime number > 3". That didn't happen, and I'm not sure why not. Apparently that's a bug.
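The reduction rule described above can be sketched in a few lines. This is only an illustration of the "no prime factor > 3" idea from the log message, not the actual FAHClient/FahCore source:

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (n >= 2)."""
    factor, largest = 2, 1
    while factor * factor <= n:
        while n % factor == 0:
            largest, n = factor, n // factor
        factor += 1
    return max(largest, n) if n > 1 else largest

def usable_thread_count(requested: int) -> int:
    """Reduce the thread count until it has no prime factor > 3,
    mirroring the 'avoid domain decomposition by a prime number > 3'
    message (illustration only, not the client's real logic)."""
    n = requested
    while n > 1 and largest_prime_factor(n) > 3:
        n -= 1
    return n
```

Under this rule, 11 (prime) steps down to 10 (factor 5) and then to 9 (3x3), which is why a second reduction message would be expected.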

Nevertheless, you can avoid this problem by manually reducing the number of threads allocated to that slot using FAHControl. If you really want to use all of your CPU threads, you can add a second CPU slot with 2 or more threads (keeping the total at 11).
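For anyone who wants to try the two-slot workaround by hand, FAHClient's config.xml supports per-slot thread counts along these lines (the slot ids and the 8/3 split here are just an example; make the change via FAHControl, or stop the client before editing the file):

```xml
<config>
  <!-- example: split 11 CPU threads into two slots of 8 and 3 -->
  <slot id='0' type='CPU'>
    <cpus v='8'/>
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='3'/>
  </slot>
  <!-- existing GPU slot kept as-is -->
  <slot id='2' type='GPU'/>
</config>
```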

I'm also concerned about all the other people who are encountering the same problem so I'll open a ticket and see if somebody will fix it for them.

EDIT: See https://github.com/FoldingAtHome/fah-issues/issues/1521
bruce
 
Posts: 19701
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Domain decomposition error in project 14524

Postby uyaem » Wed Jun 10, 2020 8:13 am

Adding to this:
Code: Select all
22:39:20:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:39:20:WU00:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
22:39:20:WU00:FS00:0xa7:ERROR:
22:39:20:WU00:FS00:0xa7:ERROR:Fatal error:
22:39:20:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm

I run 21 threads; they normally get decomposed into 16 PP and 5 PME.
I've not seen any project fail with 2^4 yet :shock:
CPU: Ryzen 9 3900X (1x21 CPUs) ~ GPU: nVidia GeForce GTX 1660 Super (Asus)
uyaem
 
Posts: 222
Joined: Sat Mar 21, 2020 8:35 pm
Location: Esslingen, Germany

Re: Domain decomposition error in project 14524

Postby bruce » Wed Jun 10, 2020 5:47 pm

It gets tricky when GROMACS decides to break up 21 into 5 pme + 16 pp. A different project might break it up into 6 pme + 15 pp and the results might be different.
bruce
 
Posts: 19701
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Domain decomposition error in project 14524

Postby uyaem » Wed Jun 10, 2020 10:55 pm

bruce wrote:It gets tricky when GROMACS decides to break up 21 into 5 pme + 16 pp. A different project might break it up into 6 pme + 15 pp and the results might be different.

I thought this was the normal allocation though?
This is the first project that didn't want to work with 21 cores. :)
uyaem
 
Posts: 222
Joined: Sat Mar 21, 2020 8:35 pm
Location: Esslingen, Germany

Re: Domain decomposition error in project 14524

Postby Joe_H » Wed Jun 10, 2020 11:59 pm

The allocation of PP versus PME threads appears to depend on the dimensions of the bounding box, based on the analysis that was posted here a few weeks ago. It is related to the minimum thickness of a "slice", as I understand it.

For example, two WUs from different projects might both have a volume of 60 cubic units (one unit being a length at least as long as the minimum cell size). One WU could be in a bounding box that is 3x4x5 and the other 2x5x6. The decompositions would be different, and that might lead to a different number of PME threads, which are used to handle interactions between adjacent parts.
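The box-geometry point can be made concrete with a toy enumeration. The 1.4227 nm minimum cell size comes from the error message above; the cubic ~4.3 nm box is a hypothetical stand-in for a project that allows at most three cells per axis, and the real GROMACS check is more involved:

```python
from itertools import product

def valid_grids(nranks, box, min_cell=1.4227):
    """Enumerate nx*ny*nz == nranks grids whose cells are at least
    min_cell wide along every axis. A simplified model of GROMACS
    domain decomposition (the real check also considers constraints)."""
    grids = []
    for nx, ny, nz in product(range(1, nranks + 1), repeat=3):
        if nx * ny * nz != nranks:
            continue
        if all(dim / n >= min_cell for dim, n in zip(box, (nx, ny, nz))):
            grids.append((nx, ny, nz))
    return grids
```

With a hypothetical 4.3 nm cube, 8 ranks fit as 2x2x2, but 10 ranks have no valid grid at all, since 10 = 2 x 5 forces five cells along some axis.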

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 6451
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: Domain decomposition error in project 14524

Postby _r2w_ben » Thu Jun 11, 2020 2:39 am

uyaem wrote:
bruce wrote:It gets tricky when GROMACS decides to break up 21 into 5 pme + 16 pp. A different project might break it up into 6 pme + 15 pp and the results might be different.

I thought this was the normal allocation though?
This in the first project that didn't want to work with 21 cores. :)

Only the smallest class of projects won't work with 21. 14524 is a 3x3x3. The 16 threads used for PP need at least one of those dimensions to be a 4 so that it can do 4x4x1 or 4x2x2.

21 can also be split as 3 PME + 18 PP or as 9 PME + 12 PP. The split is based on the estimated amount of time that will be spent doing PME work for that specific work unit.
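That arithmetic can be checked with a small brute force. The (3, 3, 3) limit models a project that allows at most three cells per axis, and the PME/PP splits are the ones mentioned in this thread:

```python
def pp_fits_box(npp, max_cells=(3, 3, 3)):
    """True if npp PP ranks can be arranged as an nx*ny*nz grid with
    each axis count within the per-axis cell limit (toy model only,
    not the actual GROMACS decomposition code)."""
    mx, my, mz = max_cells
    return any(nx * ny * nz == npp
               for nx in range(1, mx + 1)
               for ny in range(1, my + 1)
               for nz in range(1, mz + 1))

# PME/PP splits of 21 threads mentioned in the thread
for npme, npp in [(3, 18), (5, 16), (9, 12)]:
    print(f"{npme} PME + {npp} PP fits 3x3x3: {pp_fits_box(npp)}")
```

In this model, 18 PP (2x3x3) and 12 PP (2x2x3) fit a 3x3x3 project, while 16 PP does not, because 16 can't be factored without a dimension of 4 or more.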
_r2w_ben
 
Posts: 277
Joined: Wed Apr 23, 2008 4:11 pm

