Shutdown: BAD_WORK_UNIT (when run with 10 CPUs)

apdibbo
Posts: 2
Joined: Thu Mar 19, 2020 10:42 am

Shutdown: BAD_WORK_UNIT (when run with 10 CPUs)

Post by apdibbo »

Hi,

I am getting this error in my logs:

Code: Select all

10:47:43:WU00:FS00:0xa7:Project: 13851 (Run 0, Clone 14914, Gen 0)
10:47:43:WU00:FS00:0xa7:Unit: 0x00000000287234c95e7301ac882be327
10:47:43:WU00:FS00:0xa7:Reading tar file core.xml
10:47:43:WU00:FS00:0xa7:Reading tar file frame0.tpr
10:47:43:WU00:FS00:0xa7:Digital signatures verified
10:47:43:WU00:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -e frame0.edr -cpt 15 -nt 128
10:47:43:WU00:FS00:0xa7:Steps: first=0 total=500000
10:47:43:WU00:FS00:0xa7:ERROR:
10:47:43:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:47:43:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
10:47:43:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
10:47:43:WU00:FS00:0xa7:ERROR:
10:47:43:WU00:FS00:0xa7:ERROR:Fatal error:
10:47:43:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 96 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
10:47:43:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
10:47:43:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
10:47:43:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
10:47:43:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
10:47:43:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:47:48:WU00:FS00:0xa7:WARNING:Unexpected exit() call
10:47:48:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
10:47:48:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
10:47:48:WU00:FS00:0xa7:Saving result file md.log
10:47:48:WU00:FS00:0xa7:Saving result file science.log
10:47:48:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
10:47:48:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
I am seeing this on all thirteen nodes I am currently trying to fold on (one 28-core VM and twelve 128-core EPYC physical systems).

Any advice on how to get around this?
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Shutdown: BAD_WORK_UNIT

Post by Joe_H »

Try pausing, setting the thread count to a lower number, and then starting again. Some of these recent projects have been rushed out without a chance to check all of the higher thread counts and put in restrictions to keep them from being assigned to those counts.

Thread counts that are useful to try are those made up only of factors of 2, 3 and 5. Counts divisible by higher primes such as 7, 11, 13 and so on cannot be used.
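If it helps to see that rule of thumb spelled out, here is a quick Python check (just an illustration, not anything the client itself runs):

Code: Select all

# Illustration of the rule of thumb above -- not something the client runs.
# A count works if repeatedly dividing out 2, 3 and 5 leaves nothing else.
def usable_thread_count(n):
    if n < 1:
        return False
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1  # any remainder means a larger prime factor (7, 11, 13, ...)

for count in (10, 14, 28, 60, 64):
    print(count, usable_thread_count(count))
# 10 True, 14 False, 28 False, 60 True, 64 True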

If you are willing to try this, post the results here and I will get that information back to the researcher so the lists can be updated.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
apdibbo
Posts: 2
Joined: Thu Mar 19, 2020 10:42 am

Re: Shutdown: BAD_WORK_UNIT (when run with 10 CPUs)

Post by apdibbo »

Thanks, splitting the 128-core nodes into two 64-core slots seems to have worked.

As does splitting the 28-core node into 16-, 8- and 4-core slots.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Shutdown: BAD_WORK_UNIT (when run with 10 CPUs)

Post by _r2w_ben »

I'm interested to see how it ends up splitting the work unit between threads at different core counts. Deep in the work folder, corresponding to the Work Queue/Unit #, is a file named science.md. It has lots of information produced by GROMACS, the molecular dynamics software used for CPU work units.

Domain decomposition is how the work is broken down for each thread. The way I understand it, the simulation box is like a rectangular layer cake that can be cut into equal-sized pieces, with the cake size based on the volume of the molecules in the work unit. The cake is first cut into pieces along the x and y axes, and then each layer (z axis) is separated. Just as people do not like too small a piece of cake, there is a minimum piece (cell) size. The core exited with BAD_WORK_UNIT because the cell size was too small. The minimum exists because molecules in adjacent pieces exert forces on each other, and that data has to be passed back and forth between threads at each step of the work unit; too much communication between threads would bottleneck performance. Reducing the number of threads therefore increases the cell size, leaving fewer molecules near the edges to influence adjacent cells.
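To put rough numbers on that, here is a back-of-the-envelope check in Python. The box size is made up purely for illustration and the real GROMACS decomposition is much more involved, but the minimum cell size is the one from the error log above:

Code: Select all

# Back-of-the-envelope cell-size check -- a big simplification of what GROMACS does.
# Assumes a cubic box; the 8 nm edge is made up, the minimum comes from the error log.
def cells_big_enough(box_nm, grid, min_cell_nm):
    return all(box_nm / n >= min_cell_nm for n in grid)

BOX_NM   = 8.0      # hypothetical box edge, for illustration only
MIN_CELL = 1.37225  # minimum cell size reported in the error above

print(cells_big_enough(BOX_NM, (6, 4, 4), MIN_CELL))  # one possible 96-rank grid: False (8/6 ~ 1.33 nm)
print(cells_big_enough(BOX_NM, (4, 4, 4), MIN_CELL))  # a 64-rank grid: True (8/4 = 2.00 nm)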

With 2 threads allocated to a slot, the following is produced in science.md:

Code: Select all

Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.433 nm, LJ-14, atoms 3867 3870
  multi-body bonded interactions: 0.433 nm, Proper Dih., atoms 3867 3870
Minimum cell size due to bonded interactions: 0.476 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.218 nm
Estimated maximum distance required for P-LINCS: 0.218 nm
Using 0 separate PME ranks, as there are too few total
 ranks for efficient splitting
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 2 cells with a minimum initial size of 0.595 nm
The maximum allowed number of cells is: X 20 Y 20 Z 20
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0

Using 2 MPI threads
This is a one-layer cake with a single cut to split it into 2 pieces.

Your log file mentioned "domain decomposition for 96 ranks". The cube root of 96 isn't an integer, so I wonder if it was broken down as 6x4x4.
64 could be 4x4x4 or 8x4x2; GROMACS must have logic to determine which would be more efficient.
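Out of curiosity, the possible grids for a given rank count are easy to list (again just a sketch; which grid GROMACS actually picks depends on its own heuristics):

Code: Select all

# List every nx x ny x nz split whose product equals the rank count.
# Purely illustrative; GROMACS applies its own heuristics when choosing a grid.
def grids(ranks):
    out = []
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        for ny in range(1, ranks // nx + 1):
            if (ranks // nx) % ny:
                continue
            out.append((nx, ny, ranks // (nx * ny)))
    return out

# filter to fairly cubic grids just to keep the printout short
print([g for g in grids(96) if max(g) <= 8])  # includes (6, 4, 4), (8, 4, 3), (4, 6, 4), ...
print([g for g in grids(64) if max(g) <= 8])  # includes (4, 4, 4), (8, 4, 2), (2, 8, 4), ...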

Getting back to the problem... ideally, the client would reduce the number of threads used until the decomposition results in pieces bigger than the minimum size. The assignment servers do something like this already: if you have a CPU slot allocated 11 cores, they will normally tell the client to use 10 cores instead. I hope this gets included in a future version of the client. It would save bandwidth, because the work unit would be finished rather than reporting a failure and having the next donor, hopefully with a different number of cores, download it and start over.
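A toy version of that fallback, using the same made-up cubic box as before (my own sketch; the real client and GROMACS would decide this very differently):

Code: Select all

# Toy sketch of a step-down fallback -- not actual client or GROMACS code.
# Uses a hypothetical cubic box; the minimum cell size is the one from the log.
def decomposition_fits(threads, box_nm, min_cell_nm):
    # does any nx * ny * nz = threads keep every cell above the minimum size?
    for nx in range(1, threads + 1):
        if threads % nx:
            continue
        for ny in range(1, threads // nx + 1):
            if (threads // nx) % ny:
                continue
            nz = threads // (nx * ny)
            if all(box_nm / d >= min_cell_nm for d in (nx, ny, nz)):
                return True
    return False

def pick_thread_count(requested, box_nm=8.0, min_cell_nm=1.37225):
    # step down from the requested count instead of failing the work unit
    for t in range(requested, 0, -1):
        if decomposition_fits(t, box_nm, min_cell_nm):
            return t
    return 1

print(pick_thread_count(96))  # -> 80 for the made-up 8 nm box (a 5 x 4 x 4 grid fits)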