Project: 16451 (Run 48, Clone 1, Gen 56) Domain



Postby HendricksSA » Fri Jun 19, 2020 5:24 pm

This work unit from project 16451 caused a domain decomposition error with 48 CPUs. As _r2w_ben has suggested for other problem children in the past, I tried it at 45, and it runs perfectly with 45 CPUs. I searched for reports on 16451 and found nothing but announcements. I'm not sure how this made it to gen 56 without triggering an error. Just letting y'all know. I can post the relevant logs if anyone needs the specifics.
HendricksSA
 
Posts: 336
Joined: Fri Jun 26, 2009 5:34 am

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Postby _r2w_ben » Sun Jun 21, 2020 12:15 am

The sample I have of p16451 would run on 48 threads. There are rare work units where the atoms have moved enough that the box changes shape or the estimated PME load changes.

For 4x4x3 projects, 48 works when PME load is around 0.18. Once it drops towards 0.17, these CPU counts no longer work: 24, 30, 36, 48, 54, and 60. Temporarily decreasing to 21, 27, 32, or 45 in those scenarios should allow the work unit to finish.
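
As a rough illustration of the fallback above, here is a minimal Python sketch. The two lists are copied from this post and apply only to 4x4x3 projects once PME load drops towards 0.17. The name `fallback_threads` is made up for this example and is not part of FAHClient or the core.

```python
# CPU counts reported above to fail on 4x4x3 projects once PME load nears 0.17
FAILING = {24, 30, 36, 48, 54, 60}
# CPU counts reported above as temporary replacements
WORKING = (21, 27, 32, 45)


def fallback_threads(configured):
    """Return a thread count to try for the affected work unit.

    Hypothetical helper: counts not on the failing list are returned
    unchanged; failing counts map to the largest listed working count
    below them.
    """
    if configured not in FAILING:
        return configured
    lower = [n for n in WORKING if n < configured]
    return max(lower)


print(fallback_threads(48))
```

For the work unit in this thread, that picks 45 for a 48-thread slot. The count can go back to normal once the work unit finishes.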
_r2w_ben
 
Posts: 281
Joined: Wed Apr 23, 2008 4:11 pm

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Postby foldy » Mon Jun 22, 2020 7:34 am

Maybe a workaround for the problem in some work units would be for the FahCore itself to scale down the number of threads when this error occurs?
foldy
 
Posts: 2045
Joined: Sat Dec 01, 2012 4:43 pm

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Postby Joe_H » Mon Jun 22, 2020 2:57 pm

Yes, that has been suggested and is being investigated. It is not just a change to the core, though, but also to the FAHCoreWrapper process the core runs within. The wrapper would have to capture the domain decomposition error and restart the CPU core with the changed core count. It would also have to do this in a way that does not trigger the max error threshold and cause the WU to be returned as faulty.

But as mentioned, the bounding box can change in size during folding. That, together with a possible shift in the distribution between regular processing threads and PME threads for thread counts over 18-20, can make it a bit complicated.
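
To make the idea concrete, here is a very rough Python sketch of the retry logic described above. It is not how FAHCoreWrapper works or will work. `DomainDecompositionError`, `run_core`, and `run_with_fallback` are all hypothetical names, and the fallback counts are just the ones mentioned earlier in this thread.

```python
class DomainDecompositionError(Exception):
    """Hypothetical stand-in for the core's domain decomposition failure."""


def run_with_fallback(run_core, threads, fallbacks=(45, 32, 27, 21)):
    """Try the configured thread count, then lower fallback counts.

    run_core is an assumed callable that takes a thread count and either
    returns a result or raises DomainDecompositionError. Retries happen
    here, so a decomposition failure is not reported as a faulty WU
    unless every candidate count fails.
    """
    lower = [n for n in sorted(fallbacks, reverse=True) if n < threads]
    for n in [threads] + lower:
        try:
            return n, run_core(n)
        except DomainDecompositionError:
            continue  # try the next lower count
    raise DomainDecompositionError(
        f"no workable thread count at or below {threads}"
    )
```

A fixed fallback list like this would not cover the complication above: if the box size or the PME split shifts during folding, the set of workable counts can change too.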

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 6693
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain

Postby HendricksSA » Mon Jun 22, 2020 9:48 pm

_r2w_ben and Joe_H, 16451 behaved normally after this one work unit. I picked up eight more 16451s over the last two days and all ran perfectly with 48 threads. The domain decomposition problem is going to be a tough one to solve. Thank goodness it does not happen constantly - hats off to the beta crew and staffers for insulating us from this for the most part.
HendricksSA
 
Posts: 336
Joined: Fri Jun 26, 2009 5:34 am

