WU 14572 INTERRUPTED

Moderators: Site Moderators, FAHC Science Team

seesturm
Posts: 4
Joined: Sun Mar 29, 2020 8:50 am

Post by seesturm »

My client is frequently blocked by faulty WUs. This time it is one from Project 14572.

The faulty WU is keeping the CPU from doing useful Folding@home work. The FAH client does not give up on running it. Is there a way to block faulty projects?

Code:

19:22:32:WU03:FS00:Starting
19:22:32:WU03:FS00:Removing old file './work/03/logfile_01-20200331-185031.txt'
19:22:32:WU03:FS00:Running FahCore: /home/ubuntu/FAHCoreWrapper /home/ubuntu/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 03 -suffix 01 -version 705 -lifeline 4172 -checkpoint 15 -np 29
19:22:32:WU03:FS00:Started FahCore on PID 75482
19:22:32:WU03:FS00:Core PID:75486
19:22:32:WU03:FS00:FahCore 0xa7 started
19:22:32:WU03:FS00:0xa7:*********************** Log Started 2020-03-31T19:22:32Z ***********************
19:22:32:WU03:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
19:22:32:WU03:FS00:0xa7:       Type: 0xa7
19:22:32:WU03:FS00:0xa7:       Core: Gromacs
19:22:32:WU03:FS00:0xa7:       Args: -dir 03 -suffix 01 -version 705 -lifeline 75482 -checkpoint 15 -np
19:22:32:WU03:FS00:0xa7:             29
19:22:32:WU03:FS00:0xa7:************************************ CBang *************************************
19:22:32:WU03:FS00:0xa7:       Date: Nov 5 2019
19:22:32:WU03:FS00:0xa7:       Time: 06:06:57
19:22:32:WU03:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
19:22:32:WU03:FS00:0xa7:     Branch: master
19:22:32:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
19:22:32:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
19:22:32:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
19:22:32:WU03:FS00:0xa7:       Bits: 64
19:22:32:WU03:FS00:0xa7:       Mode: Release
19:22:32:WU03:FS00:0xa7:************************************ System ************************************
19:22:32:WU03:FS00:0xa7:        CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
19:22:32:WU03:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
19:22:32:WU03:FS00:0xa7:       CPUs: 32
19:22:32:WU03:FS00:0xa7:     Memory: 62.84GiB
19:22:32:WU03:FS00:0xa7:Free Memory: 31.33GiB
19:22:32:WU03:FS00:0xa7:    Threads: POSIX_THREADS
19:22:32:WU03:FS00:0xa7: OS Version: 5.5
19:22:32:WU03:FS00:0xa7:Has Battery: false
19:22:32:WU03:FS00:0xa7: On Battery: false
19:22:32:WU03:FS00:0xa7: UTC Offset: 0
19:22:32:WU03:FS00:0xa7:        PID: 75486
19:22:32:WU03:FS00:0xa7:        CWD: /home/ubuntu/work
19:22:32:WU03:FS00:0xa7:******************************** Build - libFAH ********************************
19:22:32:WU03:FS00:0xa7:    Version: 0.0.18
19:22:32:WU03:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:22:32:WU03:FS00:0xa7:  Copyright: 2019 foldingathome.org
19:22:32:WU03:FS00:0xa7:   Homepage: https://foldingathome.org/
19:22:32:WU03:FS00:0xa7:       Date: Nov 5 2019
19:22:32:WU03:FS00:0xa7:       Time: 06:13:26
19:22:32:WU03:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
19:22:32:WU03:FS00:0xa7:     Branch: master
19:22:32:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
19:22:32:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
19:22:32:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
19:22:32:WU03:FS00:0xa7:       Bits: 64
19:22:32:WU03:FS00:0xa7:       Mode: Release
19:22:32:WU03:FS00:0xa7:************************************ Build *************************************
19:22:32:WU03:FS00:0xa7:       SIMD: avx_256
19:22:32:WU03:FS00:0xa7:********************************************************************************
19:22:32:WU03:FS00:0xa7:Project: 14572 (Run 0, Clone 1661, Gen 7)
19:22:32:WU03:FS00:0xa7:Unit: 0x0000000d287234c95e792c11c06e5a31
19:22:32:WU03:FS00:0xa7:Reading tar file core.xml
19:22:32:WU03:FS00:0xa7:Reading tar file frame7.tpr
19:22:32:WU03:FS00:0xa7:Digital signatures verified
19:22:32:WU03:FS00:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3
19:22:32:WU03:FS00:0xa7:Calling: mdrun -s frame7.tpr -o frame7.trr -x frame7.xtc -cpt 15 -nt 28
19:22:32:WU03:FS00:0xa7:Steps: first=3500000 total=500000
19:22:32:WU03:FS00:0xa7:ERROR:
19:22:32:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
19:22:32:WU03:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
19:22:32:WU03:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
19:22:32:WU03:FS00:0xa7:ERROR:
19:22:32:WU03:FS00:0xa7:ERROR:Fatal error:
19:22:32:WU03:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
19:22:32:WU03:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
19:22:32:WU03:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
19:22:32:WU03:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
19:22:32:WU03:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
19:22:32:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
19:22:37:WU03:FS00:0xa7:WARNING:Unexpected exit() call
19:22:37:WU03:FS00:0xa7:WARNING:Unexpected exit from science code
19:22:37:WU03:FS00:0xa7:Saving result file ../logfile_01.txt
19:22:37:WU03:FS00:0xa7:Saving result file md.log
19:22:37:WU03:FS00:0xa7:Saving result file science.log
19:22:37:WU03:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
jonault
Posts: 214
Joined: Fri Dec 14, 2007 9:53 pm

Re: WU 14572 INTERRUPTED

Post by jonault »

How many CPU cores is your CPU slot configured for? It looks like it initially tried to use 29 cores, which is a bad number because it is a large prime, and then dropped to 28 cores, which is also a bad number because it is a multiple of 7, another large prime. Ideally, your CPU core count should have no prime factors other than 2 and 3.
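
As a rough illustration of that rule of thumb, a small Python sketch can factor a candidate thread count and flag any prime factor larger than 3. The function names and the `max_prime` threshold are made up for this example; this is not code from the FAH client or the GROMACS core.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with repetition."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        # Whatever is left after trial division is itself prime.
        factors.append(n)
    return factors


def is_safe_thread_count(n, max_prime=3):
    """True when n >= 1 and every prime factor of n is <= max_prime.

    max_prime=3 encodes the "multiples of only 2 and/or 3" rule of thumb;
    it is an assumption for this sketch, not a documented client setting.
    """
    return n >= 1 and all(p <= max_prime for p in prime_factors(n))


if __name__ == "__main__":
    for count in (24, 28, 29, 30, 32):
        print(count, prime_factors(count), is_safe_thread_count(count))
```

Raising `max_prime` to 5 would also accept counts such as 30, if you are willing to tolerate multiples of 5.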

I can't tell from the log you posted whether this was a problem with your setup or with a badly configured work unit. If you're seeing this kind of thing happen a lot, though, it suggests the problem is on your end, since misconfigured work units aren't common. Post the first 200 lines of your log file and it will tell us more about how you have the client configured.
seesturm
Posts: 4
Joined: Sun Mar 29, 2020 8:50 am

Re: WU 14572 INTERRUPTED

Post by seesturm »

I've set the number of CPU cores to "-1", so the number is chosen by the WU (or the client?). Maybe a high number of CPU threads is unusual, so this case gets missed in testing.

Code:

<config>
  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>
  <slot id='2' type='GPU'/>
</config>
Most CPU WUs work for me. The problem is that when the client encounters a bad WU, it just doesn't give up. And there is no "give up" button.
jonault
Posts: 214
Joined: Fri Dec 14, 2007 9:53 pm

Re: WU 14572 INTERRUPTED

Post by jonault »

Well, you've got a 32-thread CPU, and the client is going to reserve 2 threads to manage your 2 GPU slots, leaving you with 30 threads for the CPU slot. That 30 can be a problem since it's a multiple of 5. You might instead try creating 2 CPU slots, one with 24 cores and one with 6; that would use all 30 remaining threads without any "high prime" issues.
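
To explore two-slot splits like this for other thread budgets, here is a hypothetical Python helper that lists every way to divide a total into two slots whose sizes have no prime factors other than 2 and 3. The names `is_3_smooth`, `two_slot_splits` and the `min_size` parameter are invented for this sketch and do not correspond to anything in the FAH client.

```python
def is_3_smooth(n):
    """True when n >= 1 and n has no prime factors other than 2 and 3."""
    if n < 1:
        return False
    for p in (2, 3):
        while n % p == 0:
            n //= p
    return n == 1


def two_slot_splits(total, min_size=2):
    """List (big, small) pairs with big + small == total and big >= small,
    where both sizes are 3-smooth and at least min_size threads.

    min_size is an arbitrary choice for this sketch so that trivial
    1-thread slots are skipped by default.
    """
    splits = []
    for big in range(total - min_size, (total + 1) // 2 - 1, -1):
        small = total - big
        if is_3_smooth(big) and is_3_smooth(small):
            splits.append((big, small))
    return splits


if __name__ == "__main__":
    print(two_slot_splits(30))
    print(two_slot_splits(28))
```

For a 30-thread budget this lists 27 + 3, 24 + 6 and 18 + 12. Which of those actually performs best is something only experiments on real work units can show.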
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: WU 14572 INTERRUPTED

Post by Neil-B »

I'll note that multiples of 5, though not advised, are nowhere near as bad as larger primes. I run a 30-thread slot quite often and have not yet had a problem with it, but you take a slight chance of running into issues at some point.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: WU 14572 INTERRUPTED

Post by bruce »

Another logical choice would be one slot with 12 and another with 16. The -1 option does a pretty good job of allocating CPU cores for more limited numbers of threads, but with many "home" CPUs becoming available with high thread counts, it's time to put better code behind the -1 option. IMHO, that sort of enhancement is overdue, but nobody is going to have time to do anything about it before this crisis is over.

It's a two-step process in the current client. First, the slot is predefined based on the number of CPUs less the number of GPUs. Then the client puts in a request for that predefined value, which is then reduced to something that avoids the "large prime" problem. Step 1 needs to be smarter.
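
The two steps can be sketched in Python roughly as follows. This is only a guess at the shape of the logic, with invented function names and an assumed "no prime factor above `max_prime`" rule. The log earlier in this thread shows the core going from 29 to 28 and stopping there, so the real check is evidently less strict than this sketch.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (returns 1 for n == 1)."""
    largest = 1
    d = 2
    while d * d <= n:
        while n % d == 0:
            largest = d
            n //= d
        d += 1
    # Any remainder greater than 1 is a prime larger than all found so far.
    return n if n > 1 else largest


def default_cpu_threads(logical_cpus, gpu_slots):
    """Step 1 (assumed): reserve one thread per GPU slot, keep at least 1."""
    return max(1, logical_cpus - gpu_slots)


def reduce_threads(requested, max_prime=3):
    """Step 2 (assumed rule): count down until no prime factor exceeds
    max_prime. Stricter than what the posted log shows the core doing."""
    n = requested
    while n > 1 and largest_prime_factor(n) > max_prime:
        n -= 1
    return n


if __name__ == "__main__":
    requested = default_cpu_threads(32, 2)
    print(requested, reduce_threads(requested))
```

Under these assumptions, a 32-thread CPU with 2 GPU slots starts at 30 and is reduced to 27, leaving three threads idle. That gap is the kind of waste a smarter step 1, or a split into two slots, would avoid.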

Two or more CPU slots are never considered except by the Person Between the Chair and the Keyboard (you). To some extent, the optimum setting is a matter of chance, since the absence or presence of a multitude of projects that do or do not work at specific numbers of threads is variable. If the project owner discovers his project doesn't work with K cores, he can simply prohibit assignments of his project at that count, but you can't predict that and can't really discover that it has been done for the number you happened to select.

Do some guessing and extract some experimental results. We can collect them as input for whoever eventually works on building better code.