
Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Posted: Fri Jun 19, 2020 4:24 pm
by HendricksSA
Project 16451 caused a domain decomposition error with 48 CPUs. As r2w_ben has suggested for other problem children in the past, I tried it at 45. It runs perfectly with 45 CPUs. I searched for other 16451 reports and found nothing but announcements. I'm not sure how this made it to gen 56 without triggering an error. Just letting y'all know. I can post appropriate logs if anyone needs the specifics.

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Posted: Sat Jun 20, 2020 11:15 pm
by _r2w_ben
The sample I have of p16451 would run on 48 threads. There are rare work units where the atoms have moved enough that the box changes shape or the estimated PME load changes.

For 4x4x3 projects, 48 works when PME load is around 0.18. Once it drops towards 0.17, these CPU counts no longer work: 24, 30, 36, 48, 54, and 60. Temporarily decreasing to 21, 27, 32, or 45 in those scenarios should allow the work unit to finish.
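
The counts above can be written down as a small lookup. This is only a sketch: the two lists are copied from this post, but the rule of picking the largest suggested count below the failing one is an assumption for illustration, and `fallback_threads` is a hypothetical helper, not part of any Folding@home tool.

```python
# Thread counts reported above as failing on 4x4x3 projects once the
# estimated PME load drops towards 0.17, and the suggested temporary counts.
FAILING_COUNTS = {24, 30, 36, 48, 54, 60}
FALLBACK_COUNTS = [21, 27, 32, 45]


def fallback_threads(requested: int) -> int:
    """Return a temporary thread count for a failing work unit.

    Counts that are not on the failing list are returned unchanged.
    Assumption: use the largest suggested count below the requested one.
    """
    if requested not in FAILING_COUNTS:
        return requested
    candidates = [n for n in FALLBACK_COUNTS if n < requested]
    return max(candidates)


if __name__ == "__main__":
    for count in sorted(FAILING_COUNTS):
        print(count, "->", fallback_threads(count))
```

For the work unit in the first post this gives 45 for a 48-thread slot, which matches what was tried there.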

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Posted: Mon Jun 22, 2020 6:34 am
by foldy
Maybe a workaround for the affected work units would be for the FahCore itself to scale down the number of threads when this error occurs?

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Posted: Mon Jun 22, 2020 1:57 pm
by Joe_H
Yes, that has been suggested and is being investigated. But it is not just a change to the core, but also to the FAHCoreWrapper process it runs within. It would have to capture the domain decomposition error and restart the CPU core with the changed core count. It would also have to do this in a way that does not trigger the max error threshold and cause the WU to be returned as faulty.

But as mentioned, the bounding box can change in size during folding. That, plus the possibility that the split between regular processing threads and PME threads shifts for thread counts over 18-20, can make it a bit complicated.
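
To make the idea concrete, here is a rough sketch of the capture-and-restart logic described above. It is not how FAHCoreWrapper or the core actually work: `DomainDecompositionError`, `run_core`, and `run_with_downscaling` are all hypothetical names, and the sketch assumes decomposition failures can be told apart from other errors so they are not counted against the WU.

```python
class DomainDecompositionError(Exception):
    """Hypothetical stand-in for the core's domain decomposition failure."""


def run_with_downscaling(run_core, threads, candidates):
    """Run a work unit, retrying with fewer threads on decomposition errors.

    `run_core(n)` is a hypothetical callable that runs the work unit with n
    threads and raises DomainDecompositionError if decomposition fails.
    The requested count is tried first, then each smaller candidate in
    descending order. Returns (result, threads_used).
    """
    smaller = sorted((n for n in candidates if n < threads), reverse=True)
    for n in [threads] + smaller:
        try:
            return run_core(n), n
        except DomainDecompositionError:
            # Handled here and retried with fewer threads, so it is not
            # reported as a WU error. Any other exception still propagates.
            continue
    raise DomainDecompositionError(
        f"no workable thread count at or below {threads}"
    )
```

Because the box shape and the PME split can change during a run, a real implementation would likely need to make this decision again at each restart instead of settling on a count once.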

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Posted: Mon Jun 22, 2020 8:48 pm
by HendricksSA
_r2w_ben and Joe_H, 16451 behaved normally after this one work unit. I picked up eight more 16451s over the last two days and all ran perfectly with 48 threads. The domain decomposition problem is going to be a tough one to solve. Thank goodness it does not happen constantly - hats off to the beta crew and staffers for insulating us from this for the most part.