Project: 14576 (0,3055,4) domain decomposition

tedder
Posts: 7
Joined: Mon Mar 16, 2020 4:39 pm

Post by tedder »

This is a CPU WU, info with project and unit at the top.

Code:

WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:0xa7:Unit: 0x00000009287234c95e7b86cfe9c5549b
WU01:FS00:0xa7:Reading tar file core.xml
WU01:FS00:0xa7:Reading tar file frame4.tpr 
WU01:FS00:0xa7:Digital signatures verified 
WU01:FS00:0xa7:Calling: mdrun -s frame4.tpr -o frame4.trr -x frame4.xtc -cpt 15 -nt 24
WU01:FS00:0xa7:Steps: first=2000000 total=500000
WU01:FS00:0xa7:ERROR:
WU01:FS00:0xa7:ERROR:-------------------------------------------------------
WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
WU01:FS00:0xa7:ERROR:
WU01:FS00:0xa7:ERROR:Fatal error:
WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings

It's likely a high-core-count issue. I'm curious whether there's something I should do (I see references to 'excluding the unit'), whether something needs to be done on FAH's side, or whether it's a no-op. I can also post the full log.
Last edited by tedder on Tue Mar 31, 2020 11:36 pm, edited 1 time in total.
tedder
Posts: 7
Joined: Mon Mar 16, 2020 4:39 pm

Re: Project: 14576 (0,3055,4) domain decomposition

Post by tedder »

Hmm, I must need to do something to work past it.

Code:

$ cat log | egrep "Project.*Run.*Clone|INTERRUPTED" | cut -c 21-
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
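
For anyone who would rather tally these in a script, here is a minimal Python sketch. It assumes log lines shaped like the ones above; `count_interrupted` is a hypothetical helper name, not part of the FAH client. An INTERRUPTED line that appears before any Project line is ignored, since there is no unit to attribute it to.

```python
import re
from collections import Counter

# Matches e.g. "Project: 14576 (Run 0, Clone 3055, Gen 4)"
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def count_interrupted(lines):
    """Count INTERRUPTED core returns per (project, run, clone, gen)."""
    counts = Counter()
    current = None  # most recent unit seen in the log
    for line in lines:
        match = PROJECT_RE.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
        elif "FahCore returned: INTERRUPTED" in line and current is not None:
            counts[current] += 1
    return counts


sample = [
    "WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)",
    "WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)",
    "WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
]
for unit, n in count_interrupted(sample).items():
    print(unit, n)
```

To run it on a real log, pass an open file object instead of `sample`.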
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Project: 14576 (0,3055,4) domain decomposition

Post by Joe_H »

Try pausing the folding process and setting the CPU thread count to 18 or 16, then see if the WU goes ahead. Some projects have problems with decompositions into multiples of 5; the log shows it was trying 20. Sometimes WUs in prerelease testing do work at that setting, and then this problem pops up later.

I will report this and the assignment to 20 threads can be restricted.
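
As a rough illustration of the workaround, this Python sketch lists thread counts at or below a given limit that are not multiples of 5. The function name `candidate_thread_counts` and the preference for even counts are assumptions made for the example; this is not how the FAH client or assignment servers choose a thread count.

```python
def candidate_thread_counts(max_threads, avoid_multiple_of=5, even_only=True):
    """Return thread counts <= max_threads, largest first.

    Skips multiples of `avoid_multiple_of`. `even_only` is an assumption
    for this sketch, chosen so that 20 yields the suggested 18 and 16.
    """
    counts = []
    for n in range(max_threads, 0, -1):
        if n % avoid_multiple_of == 0:
            continue
        if even_only and n % 2 != 0:
            continue
        counts.append(n)
    return counts


# First two candidates at or below 20 threads
print(candidate_thread_counts(20)[:2])
```

The chosen value still has to be set by hand in the CPU slot's thread count.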

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
tedder
Posts: 7
Joined: Mon Mar 16, 2020 4:39 pm

Re: Project: 14576 (0,3055,4) domain decomposition

Post by tedder »

Joe_H wrote: Try pausing the folding process and setting the CPU thread count to 18 or 16, then see if the WU goes ahead. Some projects have problems with decompositions into multiples of 5; the log shows it was trying 20. Sometimes WUs in prerelease testing do work at that setting, and then this problem pops up later.

I will report this and the assignment to 20 threads can be restricted.

Thanks much! After realizing it had been failing for days on end, I killed the process. I'll shuffle it around if it happens again; I didn't know that was an option.
tedder
Posts: 7
Joined: Mon Mar 16, 2020 4:39 pm

Re: Project: 14576 (0,3055,4) domain decomposition

Post by tedder »

I went back and looked in my logs: I attempted that WU 2700 times over the past three days. Doh!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14576 (0,3055,4) domain decomposition

Post by bruce »

Doh.

FAH will exclude future assignments to configurations with thread counts that are multiples of 5.