What are "bad WUs"?

codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

What are "bad WUs"?

Post by codysluder »

Has anybody done a study of the so-called "bad WUs" to categorize them and determine how many different types of failures are truly involved (as opposed to the number of symptoms)? I'd think that a beneficial advance in the field of MD might "solve" some of these problem cases. Surely some refinements to the equations of motion or to the force equations might enable Gromacs to bypass the condition that is causing the failure without aborting that trajectory.

I'm going to guess that the fact that some repeatable EUEs can be avoided by stopping and restarting the simulation probably means that some EUEs can also be bypassed by a re-characterization of the random motions around the time of the EUE.

nwkelley calls it "just one bad data point" (quoted below), and it's pretty obvious to me that if a reasonably long trajectory has already been calculated, it's best not to abort it as long as a reasonable solution can be obtained by restarting at some earlier time.

On the other hand, if the trajectory leads to a very early EUE, it may not be worth worrying about it. Just start a new one.
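To make the idea concrete, here is a toy sketch of "roll back to an earlier checkpoint and re-draw the random motions." It is not how Gromacs or the FAH cores work; `run_with_rollback`, `SimulationError`, and the `step` callback are hypothetical names invented for illustration, and the state is assumed to be an immutable value so that a checkpoint is just a saved copy.

```python
import random


class SimulationError(Exception):
    """Hypothetical stand-in for an early unit end (EUE)."""


def run_with_rollback(step, state, n_steps, checkpoint_every=10,
                      max_retries=3, seed=0):
    """Advance `state` n_steps times; on failure, rewind to the last
    checkpoint and continue with a freshly seeded random generator."""
    checkpoint = (0, state)          # (step index, saved state)
    rng = random.Random(seed)
    retries = 0
    i = 0
    while i < n_steps:
        try:
            state = step(state, rng)
        except SimulationError:
            if retries >= max_retries:
                raise                # give up: abort the trajectory
            retries += 1
            i, state = checkpoint    # restart at some earlier time
            rng = random.Random(seed + retries)  # new random motions
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, state)
    return state, retries
```

A failure very early in the run rewinds to step 0, which amounts to starting a new trajectory, so the same loop covers both cases.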

Subject: Project: 2665 (Run 0, Clone 479, Gen 20)
nwkelley wrote:hmm, well proj 2665 has had a couple work units that were simply bad imo, but since there could be many reasons on this one, we hate to stop an entire run/clone series for just one bad data point. If you end up with it again or if anyone else has noticed this one causing trouble could you let us know??
thank you!
nick

(and try NOT to delete your log ;)
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: What are "bad WUs"?

Post by 7im »

"Bad WU" is an overused euphemism for any WU that doesn't reach 100% completion, whether by design, by construct, or by bad hardware/software.

And yes, many have been categorized. See the EUE and ERROR sections in the WIKI.

Now if you are asking for more details about how a problematic WU impacts the science of a specific project number, I'd better leave the full explanation to the Pande Group. The short answer is that a few "Bad WUs" are expected and do not hurt the science. If you need to know why they don't hurt, see PG for that one also.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: What are "bad WUs"?

Post by codysluder »

Thanks for the prompt reply, but that's not what I was looking for. The WIKI categorizes the symptoms, not the causes.

I'll wait for an answer from the Pande Group.
toTOW
Site Moderator
Posts: 6309
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: What are "bad WUs"?

Post by toTOW »

I see 3 main causes:

- computing error: the WU will fail (typically EUE) on one machine, but not on another one ... it is often related to overclocking or bad hardware.
- simulation error: the WU will reach an abnormal state (two atoms too far away, or too close, ...), and the simulation will end (typically EUE too), but in this case, the WU will EUE on multiple machines at the same point.
- bad WU: something is wrong in the wudata file and the WU never starts, giving an immediate EUE, or Segmentation Fault, or MPI error ... it sometimes just gets stuck at 0% and % CPU usage.
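
As a rough illustration of how the three categories above could be told apart from reports of the same WU on several machines, here is a minimal sketch. It is only a heuristic under stated assumptions, not anything the FAH servers are known to do: the `Attempt` record, its field names, and the `classify` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """One machine's attempt at a WU (hypothetical record format)."""
    machine: str
    failed: bool
    percent_at_failure: Optional[float] = None  # None if the WU completed


def classify(attempts: List[Attempt]) -> str:
    failures = [a for a in attempts if a.failed]
    if not failures:
        return "ok"
    # Never starts anywhere: every attempt died at 0%.
    if len(failures) == len(attempts) and all(
        a.percent_at_failure == 0 for a in failures
    ):
        return "bad WU"
    # Fails on several machines at the same point.
    machines = {a.machine for a in failures}
    points = {a.percent_at_failure for a in failures}
    if len(machines) >= 2 and len(points) == 1:
        return "simulation error"
    # Fails on one machine but completes on another.
    if len(failures) < len(attempts):
        return "computing error"
    return "undetermined"
```

With a single failed attempt and no second machine to compare against, the sketch returns "undetermined", since one report alone cannot distinguish the causes.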

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: What are "bad WUs"?

Post by VijayPande »

The issue is that there are many possible causes, as mentioned above. Overclocking (faulty hardware) could lead to instabilities, as could some numerics issue in the code (faulty software), as could some bad setup (e.g. 2 atoms initially too close). It's difficult for the code to distinguish these causes without knowing the "answer" for the calculation (and if we knew that, we wouldn't need to run the simulation).