Project: 14576 (Run 0, Clone 2096, Gen 48)

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Post by bruce »

_r2w_ben wrote:I don't believe this is a GROMACS issue. The parameters FAHClient passes to mdrun result in PME being used, which allows better utilization of high thread counts. PME could be disabled by passing -npme 0, but that would cause this problem to occur more often.
Many years ago, FAH had an option, colloquially referred to as "bigadv", aimed specifically at systems with large numbers of CPUs (often with NUMA memory designs). One of the requirements imposed on those projects was to prevent the allocation of distinct CPUs for PME (i.e., -npme=0). Since that time, most CPU projects have run on systems with small numbers of threads. With the advent of home-based systems with high CPU counts, we have re-entered that era.

GROMACS is designed for a scientist running on a dedicated computer, who can freely adjust all of the parameters. When GROMACS was adopted by FAH, it inherited a requirement that projects must run on systems with an unpredictable number of threads, and there is nobody around who is responsible for intercepting a job that doesn't want to work on certain numbers of CPU threads and correcting that value. As CPU counts grew, it was easy to establish rules that intercepted 7, 11, 13, and 14 (and maybe 5 and 10) and reduced the requested number of threads by 1 or 2, allowing the job to run. Those rules are based on both GROMACS policies and direct observation. (5 is often an acceptable factor. Early testing of a project may find that 20 does or does not work, and the project can then be excluded from assignment to systems with that CPU count.)
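
As a rough illustration of that kind of interception rule, here is a minimal Python sketch. It is not the actual Core Wrapper code; the function names and the `max_prime` threshold are assumptions chosen so that 7, 11, 13, and 14 are each reduced by 1 or 2 as described above. Setting `max_prime=3` would also intercept 5 and 10.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (for n >= 2)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    # Whatever remains above 1 is itself a prime factor.
    return n if n > 1 else largest


def reduce_thread_count(requested: int, max_prime: int = 5) -> int:
    """Step down from `requested` until no prime factor exceeds `max_prime`.

    `max_prime` is a hypothetical knob, not a documented FAH setting.
    """
    threads = requested
    while threads > 1 and largest_prime_factor(threads) > max_prime:
        threads -= 1
    return threads


for n in (7, 11, 13, 14):
    print(n, "->", reduce_thread_count(n))
# 7 -> 6
# 11 -> 10
# 13 -> 12
# 14 -> 12
```

A rule like this only looks at the requested thread count. It knows nothing about how GROMACS later splits that count into PP and PME threads.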

The current procedures that FAH uses have not been extended to large values, nor do they accommodate -npme adjustments. 48 is a perfectly acceptable value PROVIDED PME does reduce it to 40. Perhaps you'd like to explain when GROMACS is going to change the number of PP threads and when it isn't?

For example, 49 contains the prohibited factor of 7, so the Core Wrapper will probably reduce it to 48, but that doesn't work because GROMACS will then re-adjust it. Will 48 ALWAYS become 40 or is that unpredictable?

For systems with fewer than 24 threads (or maybe fewer than 32), I think the current system works provided npme=0.

Are we doing it wrong? Suppose we start with 26 (13 is a factor). Reducing it to 25 is bad (5 is an unpredictable factor) but 24 is officially acceptable. When npme=0, the actual PME calculations are still performed, just not by dedicated threads. Can we allocate npme=2 to the 26 we started with and will GROMACS be more efficient but still complete the PME calculations on the PP threads?
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Post by _r2w_ben »

bruce wrote:
_r2w_ben wrote:I don't believe this is a GROMACS issue. The parameters FAHClient passes to mdrun result in PME being used, which allows better utilization of high thread counts. PME could be disabled by passing -npme 0, but that would cause this problem to occur more often.
The current procedures that FAH uses have not been extended to large values, nor do they accommodate -npme adjustments. 48 is a perfectly acceptable value PROVIDED PME does reduce it to 40. Perhaps you'd like to explain when GROMACS is going to change the number of PP threads and when it isn't?
GROMACS will allocate PME threads whenever there are more than 18 total threads unless it is started with npme=0.
bruce wrote:For example, 49 contains the prohibited factor of 7, so the Core Wrapper will probably reduce it to 48, but that doesn't work because GROMACS will then re-adjust it. Will 48 ALWAYS become 40 or is that unpredictable?
It's unpredictable because it depends on the molecules in a project. It is fairly consistent within a project, but I have seen the box size change between 14x14x14 and 13x13x13 on different clones within the same project. In my testing I have observed the following distinct breakdowns of 48 threads, depending on how much of the simulation is expected to be spent doing PME.

Code:

Thread Division    Max Cells  PME load
48 = 40 +  8 PME   16x16x16   0.09
48 = 36 + 12 PME   18x18x18   0.22
48 = 32 + 16 PME   14x14x14   0.35
48 = 30 + 18 PME   13x13x13   0.36
48 = 27 + 21 PME   14x14x14   0.4
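
One way to read the table: the PP count has to be arranged as an nx x ny x nz grid of cells, with each dimension no larger than the max cells for that axis. The Python sketch below enumerates such grids for the PP counts listed. It is a simplification under that assumption; GROMACS applies further constraints (for example on how PME threads pair with PP threads) that are not modelled here, and `pp_grids` is a made-up helper name.

```python
def pp_grids(pp_ranks, max_cells):
    """List (nx, ny, nz) with nx*ny*nz == pp_ranks and each n within its limit."""
    max_x, max_y, max_z = max_cells
    grids = []
    for nx in range(1, max_x + 1):
        if pp_ranks % nx:
            continue
        rest = pp_ranks // nx
        for ny in range(1, max_y + 1):
            if rest % ny:
                continue
            nz = rest // ny
            if nz <= max_z:
                grids.append((nx, ny, nz))
    return grids


# (total threads, PP threads, PME threads, max cells per axis) from the table
rows = [
    (48, 40, 8, 16),
    (48, 36, 12, 18),
    (48, 32, 16, 14),
    (48, 30, 18, 13),
    (48, 27, 21, 14),
]

for total, pp, pme, limit in rows:
    assert pp + pme == total  # every row splits the same 48 threads
    grids = pp_grids(pp, (limit, limit, limit))
    print(f"{pp} PP threads: {len(grids)} candidate grids, e.g. {grids[0]}")
```

When the list comes back empty, there is no grid for that PP count within the cell limits, which is the situation the domain decomposition error reports.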
bruce wrote:For systems with fewer than 24 threads (or maybe fewer than 32), I think the current system works provided npme=0.

Are we doing it wrong? Suppose we start with 26 (13 is a factor). Reducing it to 25 is bad (5 is an unpredictable factor) but 24 is officially acceptable. When npme=0, the actual PME calculations are still performed, just not by dedicated threads. Can we allocate npme=2 to the 26 we started with and will GROMACS be more efficient but still complete the PME calculations on the PP threads?
Hard-coding npme to a fixed value would not be good. Allocating too few PME threads leaves the cores assigned to PP waiting for those PME threads to finish before they can synchronize. The reverse happens with too many PME threads, which end up idling while the PP threads catch up. The thread allocation system within GROMACS works well and adapts to the needs of the simulation.

What FAH could do to reduce domain decomposition errors:
1) Catch decomposition errors when they occur. Retry with n-1 threads until reaching a core count that works.
2) Set max CPU limits per project. This would avoid forcing a 128-core slot to fail all the way down to 27 threads for the smallest work units.

The limit can be found from the maximum allowed number of cells reported in md.log. The product of those values is lower than the absolute max but is a good limit if suggestion 1 is in place.

Example: p14593

Code:

The maximum allowed number of cells is: X 3 Y 3 Z 3
3*3*3 = 18
The absolute max is 18 + 9 PME = 27 but 20-26 threads are all unsuccessful. 18 would be a good maximum.

Example: p16423

Code:

The maximum allowed number of cells is: X 4 Y 4 Z 4
4*4*4 = 64
The absolute max is 64 + 16 PME = 80 but 65-79 threads are all unsuccessful. 64 would be a good maximum.
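
Pulling that limit out of md.log could be scripted. The following is a small Python sketch, assuming the line appears in md.log exactly as quoted above; the function names are made up for illustration.

```python
import re
from math import prod

# Matches the md.log line quoted above and captures the three per-axis limits.
CELLS_RE = re.compile(
    r"maximum allowed number of cells is:\s*X\s+(\d+)\s+Y\s+(\d+)\s+Z\s+(\d+)"
)


def max_cells(md_log_text):
    """Return (x, y, z) cell limits from md.log text, or None if not found."""
    match = CELLS_RE.search(md_log_text)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def suggested_cpu_limit(md_log_text):
    """Product of the per-axis cell limits, used as a per-project CPU cap."""
    cells = max_cells(md_log_text)
    return prod(cells) if cells is not None else None


sample = "The maximum allowed number of cells is: X 4 Y 4 Z 4"
print(max_cells(sample), suggested_cpu_limit(sample))  # (4, 4, 4) 64
```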

These changes would help to allocate projects that can use many threads to CPUs with high core counts. Not having to split a CPU into multiple slots would produce faster work unit returns, since as many cores as possible would be folding the same work unit.
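
A minimal Python sketch of how suggestions 1 and 2 could fit together is below. `runs_ok` stands in for "launch the core with this many threads and report whether it got past domain decomposition"; it is a hypothetical callback, not an existing FAH or GROMACS API. The `fake_run` stub reuses the p16423 numbers above (80 works, 65-79 fail) and simply assumes that 64 and below work.

```python
from typing import Callable, Optional


def first_working_thread_count(
    start: int, runs_ok: Callable[[int], bool], minimum: int = 1
) -> Optional[int]:
    """Suggestion 1: retry with n-1 threads until a count works."""
    for threads in range(start, minimum - 1, -1):
        if runs_ok(threads):
            return threads
    return None


def fake_run(threads: int) -> bool:
    """Illustrative stand-in for a real launch, loosely based on p16423."""
    return threads == 80 or threads <= 64


slot_cpus = 128
project_max = 64  # suggestion 2: per-project cap, e.g. 4*4*4 from md.log

# Without a cap, the slot walks down one thread at a time from 128.
print(first_working_thread_count(slot_cpus, fake_run))  # 80

# With the cap, the first attempt already succeeds.
print(first_working_thread_count(min(slot_cpus, project_max), fake_run))  # 64
```

The cap mainly saves failed launches: the uncapped walk from 128 makes 48 failing attempts before reaching 80.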
alien88
Posts: 10
Joined: Mon Apr 13, 2020 1:37 am

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Post by alien88 »

Checking in on my machines I see that I got a unit from this project that keeps failing:

Code:

04:22:04:WU00:FS00:Removing old file 'work/00/logfile_01-20200514-035004.txt'
04:22:04:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 29621 -checkpoint 15 -np 62
04:22:04:WU00:FS00:Started FahCore on PID 42699
04:22:05:WU00:FS00:Core PID:42703
04:22:05:WU00:FS00:FahCore 0xa7 started
04:22:05:WU00:FS00:0xa7:*********************** Log Started 2020-05-14T04:22:05Z ***********************
04:22:05:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
04:22:05:WU00:FS00:0xa7:       Type: 0xa7
04:22:05:WU00:FS00:0xa7:       Core: Gromacs
04:22:05:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 42699 -checkpoint 15 -np
04:22:05:WU00:FS00:0xa7:             62
04:22:05:WU00:FS00:0xa7:************************************ CBang *************************************
04:22:05:WU00:FS00:0xa7:       Date: Nov 5 2019
04:22:05:WU00:FS00:0xa7:       Time: 06:06:57
04:22:05:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
04:22:05:WU00:FS00:0xa7:     Branch: master
04:22:05:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
04:22:05:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
04:22:05:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
04:22:05:WU00:FS00:0xa7:       Bits: 64
04:22:05:WU00:FS00:0xa7:       Mode: Release
04:22:05:WU00:FS00:0xa7:************************************ System ************************************
04:22:05:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
04:22:05:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 7
04:22:05:WU00:FS00:0xa7:       CPUs: 64
04:22:05:WU00:FS00:0xa7:     Memory: 187.53GiB
04:22:05:WU00:FS00:0xa7:Free Memory: 48.47GiB
04:22:05:WU00:FS00:0xa7:    Threads: POSIX_THREADS
04:22:05:WU00:FS00:0xa7: OS Version: 5.3
04:22:05:WU00:FS00:0xa7:Has Battery: false
04:22:05:WU00:FS00:0xa7: On Battery: false
04:22:05:WU00:FS00:0xa7: UTC Offset: 9
04:22:05:WU00:FS00:0xa7:        PID: 42703
04:22:05:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
04:22:05:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
04:22:05:WU00:FS00:0xa7:    Version: 0.0.18
04:22:05:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:22:05:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
04:22:05:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
04:22:05:WU00:FS00:0xa7:       Date: Nov 5 2019
04:22:05:WU00:FS00:0xa7:       Time: 06:13:26
04:22:05:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
04:22:05:WU00:FS00:0xa7:     Branch: master
04:22:05:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
04:22:05:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
04:22:05:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
04:22:05:WU00:FS00:0xa7:       Bits: 64
04:22:05:WU00:FS00:0xa7:       Mode: Release
04:22:05:WU00:FS00:0xa7:************************************ Build *************************************
04:22:05:WU00:FS00:0xa7:       SIMD: avx_256
04:22:05:WU00:FS00:0xa7:********************************************************************************
04:22:05:WU00:FS00:0xa7:Project: 14576 (Run 0, Clone 2403, Gen 143)
04:22:05:WU00:FS00:0xa7:Unit: 0x000000b6287234c95e7b871320940533
04:22:05:WU00:FS00:0xa7:Reading tar file core.xml
04:22:05:WU00:FS00:0xa7:Reading tar file frame143.tpr
04:22:05:WU00:FS00:0xa7:Digital signatures verified
04:22:05:WU00:FS00:0xa7:Reducing thread count from 62 to 61 to avoid domain decomposition with large prime factor 31
04:22:05:WU00:FS00:0xa7:Reducing thread count from 61 to 60 to avoid domain decomposition by a prime number > 3
04:22:05:WU00:FS00:0xa7:Calling: mdrun -s frame143.tpr -o frame143.trr -x frame143.xtc -cpt 15 -nt 60
04:22:05:WU00:FS00:0xa7:Steps: first=71500000 total=500000
04:22:05:WU00:FS00:0xa7:ERROR:
04:22:05:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
04:22:05:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
04:22:05:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
04:22:05:WU00:FS00:0xa7:ERROR:
04:22:05:WU00:FS00:0xa7:ERROR:Fatal error:
04:22:05:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 50 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
04:22:05:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
04:22:05:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
04:22:05:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
04:22:05:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
04:22:05:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
04:22:10:WU00:FS00:0xa7:WARNING:Unexpected exit() call
04:22:10:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
04:22:10:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
04:22:10:WU00:FS00:0xa7:Saving result file md.log
04:22:10:WU00:FS00:0xa7:Saving result file science.log
04:22:10:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
This is on a 64 core machine.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Post by PantherX »

Can you please change the CPU value to 45? That should help it to fold. You can do that by:
Open up Advanced Control (AKA FAHControl) -> Configure -> Slots Tab -> CPU -> Edit -> Change value from -1 or 62 to 45 -> OK -> Save

The "domain decomposition" error means, in simple terms, that the WU is not large enough to be divided properly across the 60 threads, so it is throwing a tiny fit. This has been pointed out to the researcher, so it will eventually be fixed.
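
For anyone who wants to pull the relevant numbers out of a log like the one posted above, here is a small Python sketch. The two sample lines are copied from that log. Reading the difference between the -nt value and the failing rank count as the number of dedicated PME threads is an inference, assuming one thread per rank; the function name is made up.

```python
import re


def decomposition_failure(log_text):
    """Extract thread and rank counts from a FahCore_a7 log, or None."""
    nt = re.search(r"mdrun .*-nt (\d+)", log_text)
    ranks = re.search(r"no domain decomposition for (\d+) ranks", log_text)
    if nt is None or ranks is None:
        return None
    threads = int(nt.group(1))
    pp_ranks = int(ranks.group(1))
    # Assumption: the remaining threads were set aside for PME.
    return {"threads": threads, "pp_ranks": pp_ranks, "pme_ranks": threads - pp_ranks}


sample_log = (
    "04:22:05:WU00:FS00:0xa7:Calling: mdrun -s frame143.tpr -o frame143.trr "
    "-x frame143.xtc -cpt 15 -nt 60\n"
    "04:22:05:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 50 ranks "
    "that is compatible with the given box and a minimum cell size of 1.37225 nm\n"
)

print(decomposition_failure(sample_log))
# {'threads': 60, 'pp_ranks': 50, 'pme_ranks': 10}
```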
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
alien88
Posts: 10
Joined: Mon Apr 13, 2020 1:37 am

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Post by alien88 »

PantherX wrote:Can you please change the CPU value to 45? That should help it to fold. You can do that by:
Open up Advanced Control (AKA FAHControl) -> Configure -> Slots Tab -> CPU -> Edit -> Change value from -1 or 62 to 45 -> OK -> Save

The "domain decomposition" error means, in simple terms, that the WU is not large enough to be divided properly across the 60 threads, so it is throwing a tiny fit. This has been pointed out to the researcher, so it will eventually be fixed.
Ah, gotcha. That makes sense now. I'll update if I run into this again; I don't have any of the advanced stuff installed since it's just a Beast of a Linux Box. Would this be the right tweak to config.xml?

Code:

  <cpus v='45'/> 
  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Post by PantherX »

Ah, in that case, stop FAHClient and then modify the config.xml file to this:

Code:

  <slot id='0' type='CPU'>
    <cpus v='45'/>
  </slot>
Start up FAHClient and it should in theory pick up the WU and process it.
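
If you would rather script that change than edit the file by hand, something along these lines could work. It is only a sketch using Python's standard xml.etree.ElementTree; the `<config>` root element in the sample and the `set_slot_cpus` helper are assumptions for illustration. Stop FAHClient and back up config.xml before trying it on a real file.

```python
import xml.etree.ElementTree as ET


def set_slot_cpus(config_xml: str, slot_id: str, cpus: int) -> str:
    """Return config XML with <cpus v='N'/> set inside the given slot."""
    root = ET.fromstring(config_xml)
    for slot in root.iter("slot"):
        if slot.get("id") == slot_id:
            node = slot.find("cpus")
            if node is None:
                # The slot has no cpus option yet, so add one.
                node = ET.SubElement(slot, "cpus")
            node.set("v", str(cpus))
    return ET.tostring(root, encoding="unicode")


# Assumed minimal config; a real config.xml will contain more options.
sample = "<config><slot id='0' type='CPU'/></config>"
print(set_slot_cpus(sample, "0", 45))
```

The output nests the cpus option inside the slot element, matching the layout shown above.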