Project 16404 (0, 4835, 72) -- no domain decomposition

Moderators: Site Moderators, FAHC Science Team

rusty
Posts: 17
Joined: Sun Mar 15, 2020 9:00 pm

Post by rusty »

Hello,

I have received a WU that repeatedly fails with the error reproduced below: there is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm.

So, that CPU slot was stuck in a loop, attempting to run the WU, erroring out, and then trying again.

I manually reduced my number of usable threads to 18 and that seems to have gotten the unit running again.

I just wanted to make sure this issue is known. It seems this WU should not have been served to my configuration.

Thanks in advance. Details follow.

Machine:

Code:

21:36:50:WU01:FS00:Started FahCore on PID 378401
21:36:50:WU01:FS00:Core PID:378405
21:36:50:WU01:FS00:FahCore 0xa7 started
21:36:50:WU01:FS00:0xa7:*********************** Log Started 2020-04-19T21:36:50Z ***********************
21:36:50:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
21:36:50:WU01:FS00:0xa7:       Type: 0xa7
21:36:50:WU01:FS00:0xa7:       Core: Gromacs
21:36:50:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 378401 -checkpoint 15 -np
21:36:50:WU01:FS00:0xa7:             29
21:36:50:WU01:FS00:0xa7:************************************ CBang *************************************
21:36:50:WU01:FS00:0xa7:       Date: Nov 5 2019
21:36:50:WU01:FS00:0xa7:       Time: 06:06:57
21:36:50:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
21:36:50:WU01:FS00:0xa7:     Branch: master
21:36:50:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
21:36:50:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
21:36:50:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
21:36:50:WU01:FS00:0xa7:       Bits: 64
21:36:50:WU01:FS00:0xa7:       Mode: Release
21:36:50:WU01:FS00:0xa7:************************************ System ************************************
21:36:50:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
21:36:50:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
21:36:50:WU01:FS00:0xa7:       CPUs: 32
21:36:50:WU01:FS00:0xa7:     Memory: 503.81GiB
21:36:50:WU01:FS00:0xa7:Free Memory: 417.90GiB
21:36:50:WU01:FS00:0xa7:    Threads: POSIX_THREADS
21:36:50:WU01:FS00:0xa7: OS Version: 5.5
21:36:50:WU01:FS00:0xa7:Has Battery: false
21:36:50:WU01:FS00:0xa7: On Battery: false
21:36:50:WU01:FS00:0xa7: UTC Offset: -4
21:36:50:WU01:FS00:0xa7:        PID: 378405
21:36:50:WU01:FS00:0xa7:        CWD: /opt/fah/work
21:36:50:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
21:36:50:WU01:FS00:0xa7:    Version: 0.0.18
21:36:50:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:36:50:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
21:36:50:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
21:36:50:WU01:FS00:0xa7:       Date: Nov 5 2019
21:36:50:WU01:FS00:0xa7:       Time: 06:13:26
21:36:50:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
21:36:50:WU01:FS00:0xa7:     Branch: master
21:36:50:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
21:36:50:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
21:36:50:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
21:36:50:WU01:FS00:0xa7:       Bits: 64
21:36:50:WU01:FS00:0xa7:       Mode: Release
21:36:50:WU01:FS00:0xa7:************************************ Build *************************************
21:36:50:WU01:FS00:0xa7:       SIMD: avx_256
21:36:50:WU01:FS00:0xa7:********************************************************************************

Error Message:

Code:

21:36:50:WU01:FS00:0xa7:Project: 16404 (Run 0, Clone 4835, Gen 72)
21:36:50:WU01:FS00:0xa7:Unit: 0x0000004fa8f5c67d5e7eb9072a30cb57
21:36:50:WU01:FS00:0xa7:Reading tar file core.xml
21:36:50:WU01:FS00:0xa7:Reading tar file frame72.tpr
21:36:50:WU01:FS00:0xa7:Digital signatures verified
21:36:50:WU01:FS00:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3
21:36:50:WU01:FS00:0xa7:Calling: mdrun -s frame72.tpr -o frame72.trr -x frame72.xtc -cpt 15 -nt 28
21:36:50:WU01:FS00:0xa7:Steps: first=36000000 total=500000
21:36:50:WU01:FS00:0xa7:ERROR:
21:36:50:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
21:36:50:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
21:36:50:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
21:36:50:WU01:FS00:0xa7:ERROR:
21:36:50:WU01:FS00:0xa7:ERROR:Fatal error:
21:36:50:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
21:36:50:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
21:36:50:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
21:36:50:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
21:36:50:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
21:36:50:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
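
The error above says that 20 ranks cannot be laid out on a grid whose cells are all at least 1.37225 nm wide. A minimal sketch of that kind of check follows, assuming a rectangular box. The box size used here is hypothetical (the real one is stored in frame72.tpr and is not shown in the log), and the function `feasible_grids` is an illustrative name, not part of GROMACS. The 20 in the error is smaller than the 28 threads passed to mdrun, presumably because some ranks were set aside for other work; the log does not state this, so treat it as an assumption.

```python
def feasible_grids(n_ranks, box_nm, min_cell_nm):
    """Return every (nx, ny, nz) with nx*ny*nz == n_ranks whose cells
    are at least min_cell_nm wide along each axis of a rectangular box."""
    grids = []
    for nx in range(1, n_ranks + 1):
        if n_ranks % nx:
            continue
        rest = n_ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            nz = rest // ny
            cells = (box_nm[0] / nx, box_nm[1] / ny, box_nm[2] / nz)
            if all(c >= min_cell_nm for c in cells):
                grids.append((nx, ny, nz))
    return grids


MIN_CELL_NM = 1.37225      # taken from the error message
BOX_NM = (5.0, 5.0, 5.0)   # HYPOTHETICAL box size, for illustration only

for ranks in (20, 18, 16, 12):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

With this made-up 5 nm box, at most three cells fit along each axis, so 20 ranks yields no grid while 18 yields (2, 3, 3) and its permutations. The real mdrun check also accounts for load-balancing margins and non-rectangular boxes, which this sketch ignores.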
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by Neil-B »

Try changing cpu slot to 24 cores ... 25 through 31 are prone to issues
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
rusty
Posts: 17
Joined: Sun Mar 15, 2020 9:00 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by rusty »

Fair enough. The system has two 8-core CPUs (with SMT), so I split the slot into two 16-thread CPU slots. Hopefully FAH is smart enough to set the affinity to one CPU per WU (or maybe it just punts to the kernel's scheduler?).
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by Neil-B »

If you have that and aren't running a GPU then run a single 32-core slot ... from my experience it is very stable and better for the science and points than 2x 16 ... can't see from your logs why it was running as a 29-core slot
Last edited by Neil-B on Sun Apr 19, 2020 10:16 pm, edited 1 time in total.
rusty
Posts: 17
Joined: Sun Mar 15, 2020 9:00 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by rusty »

Never mind. No, it is not smart enough.

Code:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
 379034 fah       39  19 1391160 257568  13068 R  1596   0.0 133:59.22 FahCore+ 
 379060 fah       39  19 1284932 106348  13200 R  1594   0.0 131:33.24 FahCore+ 

Code:

$ taskset -cp 379060
pid 379060's current affinity list: 0-31
$ taskset -cp 379034
pid 379034's current affinity list: 0-31

Oh well...
Last edited by rusty on Sun Apr 19, 2020 10:22 pm, edited 1 time in total.
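
The taskset check above can also be scripted. A minimal sketch, assuming Linux and Python 3: `os.sched_getaffinity` is a standard library call, while `format_cpu_list` and `show_affinity` are helper names made up for this example.

```python
import os


def format_cpu_list(cpus):
    """Collapse CPU ids into taskset-style ranges, e.g. {0,1,2,3,8} -> '0-3,8'."""
    ordered = sorted(cpus)
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current range while the next id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def show_affinity(pid=0):
    """Print the affinity list of a process (pid 0 means this process)."""
    cpus = os.sched_getaffinity(pid)
    print(f"pid {pid}'s current affinity list: {format_cpu_list(cpus)}")


# sched_getaffinity is only available on some platforms (Linux among them).
if hasattr(os, "sched_getaffinity"):
    show_affinity()
```

Passing a FahCore PID instead of 0 would report that process's mask. Pinning each core to one socket would additionally need the mapping of CPU ids to sockets (for example from lscpu), which the thread does not show, and each already-running thread carries its own mask, so changing only the main PID may not move the whole core.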
rusty
Posts: 17
Joined: Sun Mar 15, 2020 9:00 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by rusty »

Well, I was running it at 30 rather than 32 because the FAH client set it to 30 on install, presumably to avoid issues like the rank decomposition error I just ran into.

The default setting of 30 had been working well without issue for the last month or so when I commissioned this machine for folding.

In any case, I'm surprised that the work server passed this WU to my configuration.

Thanks for your help. I'll keep playing with this.
Last edited by rusty on Sun Apr 19, 2020 10:23 pm, edited 1 time in total.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by Neil-B »

30 is divisible by 5, which is sometimes an issue ... the install may have used 30 to leave cores for GPUs ... if not using GPUs for folding, 32 would be a solid choice tbh
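
The divisibility point can be checked quickly for any candidate thread count. A minimal sketch, assuming the rule of thumb is "avoid counts with a prime factor larger than 3"; that threshold is borrowed from the core's log message ("a prime number > 3"), and the core's actual selection logic is not shown in this thread. The function names are illustrative.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def largest_prime_factor(n):
    """Largest prime factor of n, or 1 when n has none (n <= 1)."""
    return max(prime_factors(n)) if n > 1 else 1


for threads in range(24, 33):
    note = "" if largest_prime_factor(threads) <= 3 else "  <- prime factor > 3"
    print(threads, prime_factors(threads), note)
```

Under this heuristic 24, 27 and 32 have no prime factor above 3, while 30 = 2 x 3 x 5 is flagged. It is only a heuristic: in the log above the core accepted 28 = 2 x 2 x 7 after stepping down from 29, so its own rule appears to be looser than the one sketched here.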
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by Neil-B »

A quick search for "large primes" on these forums should find you a thread where JimboPalmer explains the best core numbers and why
rusty
Posts: 17
Joined: Sun Mar 15, 2020 9:00 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by rusty »

Ah, yes... that's right. It left the other 2 cores for the two GPUs in this machine.

Thanks for the tip on divisibility by 5. Now let me see if I can find a sustainable configuration I don't have to keep an eye on...
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Post by Neil-B »

My 32 core has never faulted ... I probably shouldn't have typed that ?!