
Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Wed May 13, 2020 11:34 pm
by bruce
_r2w_ben wrote:I don't believe this is a GROMACS issue. The parameters FAHclient passes to mdrun result in dedicated PME threads being used, which allows for better utilization of high thread counts. They could be disabled by passing -npme=0, but that would cause this problem to occur more often.
Many years ago, FAH had an option specifically aimed at systems (often with NUMA memory designs) with large numbers of CPUs, colloquially referred to as "bigadv". One of the requirements imposed on those projects was to prevent the allocation of distinct CPUs for PME (i.e. -npme=0). Since that time, most CPU projects have been run on systems with small numbers of threads. With the advent of home-based systems with high CPU counts, we have re-entered that era.

GROMACS is designed for the scientist running on a dedicated computer who can freely adjust all of the parameters. When GROMACS was adopted by FAH, it inherited a requirement that projects must run on systems with an unpredictable number of threads, and there is nobody around who is responsible for intercepting a job that doesn't want to work on certain numbers of CPU threads and correcting that value. As CPU counts grew, it was easy to establish rules that intercepted 7 and 11 and 13 and 14 (and maybe 5 and 10) and reduced the requested number of threads by 1 or 2, allowing it to run. Those rules are based on both GROMACS policies and direct observation. (5 is often an acceptable factor. Early testing of a project may find that 20 does or does not work, and the project can then be excluded from assignment to systems with that CPU count.)
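
To make that interception idea concrete, here's a rough sketch of the kind of check involved (illustrative Python only, not the actual Core Wrapper code; the max_prime threshold is a guess, since the real rules also depend on per-project observation):

Code: Select all

# Illustrative only -- not the real Core Wrapper logic. It mimics the observed
# behaviour of dropping the thread count until its largest prime factor is small.

def largest_prime_factor(n):
    factor, largest = 2, 1
    while factor * factor <= n:
        while n % factor == 0:
            largest, n = factor, n // factor
        factor += 1
    return max(largest, n) if n > 1 else largest

def adjust_thread_count(requested, max_prime):
    # max_prime is an assumption; 5 is often acceptable, sometimes it isn't
    n = requested
    while n > 1 and largest_prime_factor(n) > max_prime:
        n -= 1
    return n

print(adjust_thread_count(62, 5))  # 62 -> 61 -> 60 (31 and 61 are large primes)
print(adjust_thread_count(26, 3))  # 26 -> 25 -> 24, treating 5 as unacceptable
print(adjust_thread_count(49, 3))  # 49 -> 48 (drops the factor of 7)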

The current procedures that FAH uses have not been extended to large values, nor do they accommodate -npme adjustments. 48 is a perfectly acceptable value PROVIDED PME reduces the PP count to 40. Perhaps you'd like to explain when GROMACS is going to change the number of PP threads and when it isn't?

For example, 49 contains the prohibited factor of 7, so the Core Wrapper will probably reduce it to 48, but that doesn't work because GROMACS will then re-adjust it. Will 48 ALWAYS become 40, or is that unpredictable?

For systems with fewer than 24 (or maybe fewer than 32) threads, I think the current system works provided npme=0.

Are we doing it wrong? Suppose we start with 26 (13 is a factor). Reducing it to 25 is bad (5 is an unpredictable factor) but 24 is officially acceptable. When npme=0, the actual PME calculations are still performed, just not by dedicated threads. Can we allocate npme=2 to the 26 we started with, and will GROMACS be more efficient than completing the PME calculations on the PP threads?

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Thu May 14, 2020 1:54 am
by _r2w_ben
bruce wrote:
_r2w_ben wrote:I don't believe this is a GROMACS issue. The parameters FAHclient passes to mdrun result in dedicated PME threads being used, which allows for better utilization of high thread counts. They could be disabled by passing -npme=0, but that would cause this problem to occur more often.
The current procedures that FAH uses have not been extended to large values, nor do they accommodate -npme adjustments. 48 is a perfectly acceptable value PROVIDED PME reduces the PP count to 40. Perhaps you'd like to explain when GROMACS is going to change the number of PP threads and when it isn't?
GROMACS will allocate PME threads whenever there are more than 18 total threads unless it is started with npme=0.
bruce wrote:For example, 49 contains the prohibited factor of 7, so the Core Wrapper will probably reduce it to 48, but that doesn't work because GROMACS will then re-adjust it. Will 48 ALWAYS become 40, or is that unpredictable?
It's unpredictable based on the molecules in a project. It is fairly consistent within a project but I have seen the box size change between 14x14x14 and 13x13x13 on different clones within the same project. In my testing I have observed the following unique breakdowns of 48 threads based on how much of the simulation is expected to be spent doing PME.

Code: Select all

Thread Division     Max Cells   PME load
48 = 40 +  8 PME    16x16x16    0.09
48 = 36 + 12 PME    18x18x18    0.22
48 = 32 + 16 PME    14x14x14    0.35
48 = 30 + 18 PME    13x13x13    0.36
48 = 27 + 21 PME    14x14x14    0.4
bruce wrote:For systems with fewer than 24 (or maybe fewer than 32) threads, I think the current system works provided npme=0.

Are we doing it wrong? Suppose we start with 26 (13 is a factor). Reducing it to 25 is bad (5 is an unpredictable factor) but 24 is officially acceptable. When npme=0, the actual PME calculations are still performed, just not by dedicated threads. Can we allocate npme=2 to the 26 we started with, and will GROMACS be more efficient than completing the PME calculations on the PP threads?
Hard-coding npme to a fixed value would not be good. Allocating too few PME threads will result in the cores assigned to PP waiting for the PME threads to finish before they can synchronize. The reverse would happen with too many PME threads, which would end up idling while the PP threads catch up. The thread allocation system within GROMACS works well and adapts to the needs of the simulation.

What FAH could do to reduce domain decomposition errors:
1) Catch decomposition errors when they occur. Retry with n-1 threads until reaching a core count that works.
2) Set max CPU limits per project. This would avoid having a 128-core slot fail all the way down to 27 threads for the smallest work units.

This can be done by finding the maximum allowed number of cells in md.log. Multiplying those values together gives a number below the absolute max, but it is a good limit if suggestion 1 is in place (a sketch combining both suggestions follows the examples below).

Example: p14593

Code: Select all

The maximum allowed number of cells is: X 3 Y 3 Z 3
3*3*3 = 18
The absolute max is 18 + 9 PME = 27 but 20-26 threads are all unsuccessful. 18 would be a good maximum.

Example: p16423

Code: Select all

The maximum allowed number of cells is: X 4 Y 4 Z 4
4*4*4 = 64
The absolute max is 64 + 16 PME = 80 but 65-79 threads are all unsuccessful. 64 would be a good maximum.
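
Here is a rough sketch of how suggestions 1 and 2 could fit together (illustrative Python only, not FAHClient code; max_cells_from_mdlog assumes the md.log wording quoted above, and try_run is a stand-in for launching mdrun with a given thread count):

Code: Select all

# Illustrative sketch only -- not actual FAHClient/Core Wrapper code.
import re

def max_cells_from_mdlog(md_log_text):
    """Suggestion 2: read 'The maximum allowed number of cells is: X a Y b Z c'
    and return a*b*c, or None if the line is missing."""
    m = re.search(r"maximum allowed number of cells is:"
                  r"\s*X\s*(\d+)\s*Y\s*(\d+)\s*Z\s*(\d+)", md_log_text)
    if not m:
        return None
    x, y, z = (int(v) for v in m.groups())
    return x * y * z  # e.g. 4*4*4 = 64 for p16423

def pick_thread_count(requested, md_log_text, try_run):
    """Cap the slot at the cell product, then apply suggestion 1:
    retry with n-1 threads until try_run(n) reports success."""
    cap = max_cells_from_mdlog(md_log_text)
    n = min(requested, cap) if cap else requested
    while n > 0 and not try_run(n):
        n -= 1  # fall back one thread at a time on decomposition errors
    return n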

These changes would help allocate projects that can use many threads to CPUs with high core counts. Not having to split a CPU into multiple slots would produce faster work unit returns, since as many cores as possible would be folding the same work unit.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Thu May 14, 2020 4:33 am
by alien88
Checking in on my machines I see that I got a unit from this project that keeps failing:

Code: Select all

04:22:04:WU00:FS00:Removing old file 'work/00/logfile_01-20200514-035004.txt'
04:22:04:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 29621 -checkpoint 15 -np 62
04:22:04:WU00:FS00:Started FahCore on PID 42699
04:22:05:WU00:FS00:Core PID:42703
04:22:05:WU00:FS00:FahCore 0xa7 started
04:22:05:WU00:FS00:0xa7:*********************** Log Started 2020-05-14T04:22:05Z ***********************
04:22:05:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
04:22:05:WU00:FS00:0xa7:       Type: 0xa7
04:22:05:WU00:FS00:0xa7:       Core: Gromacs
04:22:05:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 42699 -checkpoint 15 -np
04:22:05:WU00:FS00:0xa7:             62
04:22:05:WU00:FS00:0xa7:************************************ CBang *************************************
04:22:05:WU00:FS00:0xa7:       Date: Nov 5 2019
04:22:05:WU00:FS00:0xa7:       Time: 06:06:57
04:22:05:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
04:22:05:WU00:FS00:0xa7:     Branch: master
04:22:05:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
04:22:05:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
04:22:05:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
04:22:05:WU00:FS00:0xa7:       Bits: 64
04:22:05:WU00:FS00:0xa7:       Mode: Release
04:22:05:WU00:FS00:0xa7:************************************ System ************************************
04:22:05:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
04:22:05:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 7
04:22:05:WU00:FS00:0xa7:       CPUs: 64
04:22:05:WU00:FS00:0xa7:     Memory: 187.53GiB
04:22:05:WU00:FS00:0xa7:Free Memory: 48.47GiB
04:22:05:WU00:FS00:0xa7:    Threads: POSIX_THREADS
04:22:05:WU00:FS00:0xa7: OS Version: 5.3
04:22:05:WU00:FS00:0xa7:Has Battery: false
04:22:05:WU00:FS00:0xa7: On Battery: false
04:22:05:WU00:FS00:0xa7: UTC Offset: 9
04:22:05:WU00:FS00:0xa7:        PID: 42703
04:22:05:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
04:22:05:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
04:22:05:WU00:FS00:0xa7:    Version: 0.0.18
04:22:05:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:22:05:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
04:22:05:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
04:22:05:WU00:FS00:0xa7:       Date: Nov 5 2019
04:22:05:WU00:FS00:0xa7:       Time: 06:13:26
04:22:05:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
04:22:05:WU00:FS00:0xa7:     Branch: master
04:22:05:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
04:22:05:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
04:22:05:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
04:22:05:WU00:FS00:0xa7:       Bits: 64
04:22:05:WU00:FS00:0xa7:       Mode: Release
04:22:05:WU00:FS00:0xa7:************************************ Build *************************************
04:22:05:WU00:FS00:0xa7:       SIMD: avx_256
04:22:05:WU00:FS00:0xa7:********************************************************************************
04:22:05:WU00:FS00:0xa7:Project: 14576 (Run 0, Clone 2403, Gen 143)
04:22:05:WU00:FS00:0xa7:Unit: 0x000000b6287234c95e7b871320940533
04:22:05:WU00:FS00:0xa7:Reading tar file core.xml
04:22:05:WU00:FS00:0xa7:Reading tar file frame143.tpr
04:22:05:WU00:FS00:0xa7:Digital signatures verified
04:22:05:WU00:FS00:0xa7:Reducing thread count from 62 to 61 to avoid domain decomposition with large prime factor 31
04:22:05:WU00:FS00:0xa7:Reducing thread count from 61 to 60 to avoid domain decomposition by a prime number > 3
04:22:05:WU00:FS00:0xa7:Calling: mdrun -s frame143.tpr -o frame143.trr -x frame143.xtc -cpt 15 -nt 60
04:22:05:WU00:FS00:0xa7:Steps: first=71500000 total=500000
04:22:05:WU00:FS00:0xa7:ERROR:
04:22:05:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
04:22:05:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
04:22:05:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
04:22:05:WU00:FS00:0xa7:ERROR:
04:22:05:WU00:FS00:0xa7:ERROR:Fatal error:
04:22:05:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 50 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
04:22:05:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
04:22:05:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
04:22:05:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
04:22:05:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
04:22:05:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
04:22:10:WU00:FS00:0xa7:WARNING:Unexpected exit() call
04:22:10:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
04:22:10:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
04:22:10:WU00:FS00:0xa7:Saving result file md.log
04:22:10:WU00:FS00:0xa7:Saving result file science.log
04:22:10:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
This is on a 64 core machine.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Thu May 14, 2020 7:17 am
by PantherX
Can you please change the CPU value to 45? That should help it to fold. You can do that by:
Open up Advanced Control (AKA FAHControl) -> Configure -> Slots Tab -> CPU -> Edit -> Change value from -1 or 62 to 45 -> OK -> Save

The error "domain decomposition" means in simple terms that the WU is not large enough to be properly divided across the 60 threads hence is throwing a tiny fit. This has been pointed out to the researcher so will eventually be fixed.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri May 15, 2020 6:57 am
by alien88
PantherX wrote:Can you please change the CPU value to 45? That should help it to fold. You can do that by:
Open up Advanced Control (AKA FAHControl) -> Configure -> Slots Tab -> CPU -> Edit -> Change value from -1 or 62 to 45 -> OK -> Save

The "domain decomposition" error means, in simple terms, that the WU is not large enough to be properly divided across the 60 threads, hence it is throwing a tiny fit. This has been pointed out to the researcher, so it will eventually be fixed.
Ah, gotcha. That makes sense now. I'll update if I run into this again; I don't have any of the advanced stuff installed since it's just a Beast of a Linux Box. Would this be the right tweak to config.xml?

Code: Select all

  <cpus v='45'/> 
  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri May 15, 2020 7:13 am
by PantherX
Ah, in that case, stop FAHClient and then modify the config.xml file to this:

Code: Select all

  <slot id='0' type='CPU'>
    <cpus v='45'/>
  </slot>
Start up FAHClient and it should in theory pick up the WU and process it.