Requesting same WU on 2nd system (after hardware failure)

Postby wuffy68 » Wed Jun 04, 2014 11:42 pm

I occasionally use AWS EC2 Spot Requests for BigAdv WUs when the price is right...

I've found a way to work around unexpected terminations (being outbid): every 8 hours I pause the FAH core and create an image of my current VM, which requires a reboot.
I'm working on scripting this process, but right now it's manual...
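
A minimal sketch of how the manual pause-and-image steps might be scripted. It assumes the v7 `FAHClient` binary and the AWS CLI are on the PATH with credentials configured; the instance ID and the `fah-bigadv-` image name prefix are placeholders, not values from this thread. By default it only prints the commands (dry run).

```python
import shlex
import subprocess
from datetime import datetime, timezone


def build_image_steps(instance_id, when=None):
    """Return the commands for one pause-then-image cycle, in order."""
    when = when or datetime.now(timezone.utc)
    # Timestamped name so successive images do not collide (prefix is made up).
    image_name = f"fah-bigadv-{when:%Y%m%d-%H%M}"
    return [
        # Ask the local FAH client to pause so the WU files are at rest.
        ["FAHClient", "--send-pause"],
        # By default, create-image reboots the instance while it is imaged.
        ["aws", "ec2", "create-image",
         "--instance-id", instance_id, "--name", image_name],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run with a placeholder instance ID: prints the two commands only.
run_steps(build_image_steps("i-0123456789abcdef0"))
```

Because the imaging step reboots the instance, resuming the client (for example with `FAHClient --send-unpause`) would have to happen after boot, and a short wait after the pause may be needed so the core can finish writing its checkpoint. Both points are assumptions to verify on a real instance.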


That being said, if a new WU has started and is only 4% in when my VM instance is terminated, I lose all progress on that WU (and the WU itself).

How can I find the WU that was assigned to me and request it again? (I would imagine I could, as long as it hasn't expired yet.)

Any advice in this matter would be appreciated...
1x nVidia 1070, 1x nVidia 1060 3g,
1x nVidia 970, 2x nVidia 960,
1x nVidia 555, 1x AMD R7, 2x AMD 295,
6x i5 CPU-only rigs
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

Re: Requesting same WU on 2nd system (after hardware failure

Postby Joe_H » Thu Jun 05, 2014 12:26 am

For a short period after assigning a WU, the WS will resend it in response to a request from the same machine ID. Otherwise, there is no way to request a specific WU. As for losing the WU after 4% progress: the default checkpoint interval is 15 minutes. Unless you change that, 15 minutes of work is the most you should lose, except when the checkpoint is corrupted because the client was terminated in the middle of writing it.
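
To put rough numbers on this: the 15-minute bound holds only if the checkpoint files survive the failure. The sketch below assumes, as the original post describes, that a terminated spot instance loses its disk and the only fallback is an image taken every 8 hours. The function and parameter names are illustrative.

```python
def worst_case_loss_minutes(disk_survives,
                            checkpoint_minutes=15.0,
                            image_interval_hours=8.0):
    """Upper bound on compute time lost to one failure, in minutes."""
    if disk_survives:
        # The client resumes from the last checkpoint on disk.
        return checkpoint_minutes
    # The disk is gone: fall back to the most recent system image.
    return image_interval_hours * 60.0


print(worst_case_loss_minutes(True))   # 15.0
print(worst_case_loss_minutes(False))  # 480.0
```

This counts compute time only. If the WU in progress was assigned after the last image was taken, the WU itself is lost as well.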

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 4534
Joined: Tue Apr 21, 2009 4:41 pm
Location: W. MA

Re: Requesting same WU on 2nd system (after hardware failure

Postby 7im » Thu Jun 05, 2014 12:31 am

Once the WU is lost and your VM is gone (with no image made), there is no way to request the same WU again. It's lost; it has to expire before it is assigned again. It's like trying to mine on a RAM drive in a 3rd world country where the power cuts out every few hours. All the data is lost.

That's why this is not a recommended configuration for folding BA work units. The WUs are too long and spot pricing changes too often. And waiting for the WU to expire and be reassigned slows down the science of folding.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
7im
 
Posts: 14648
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Requesting same WU on 2nd system (after hardware failure

Postby wuffy68 » Thu Jun 05, 2014 12:49 am

Thanks - yeah, so the trick would be to create a full, powered-off system image (not a live snapshot) right after the WU starts, at about 1% complete... then one could recover the job from an early failure (by restoring the entire system).

Edit (took out the word snapshot, replaced with "image")
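
The "image right after a new WU starts" idea could be reduced to a small check like the one below. This is a sketch only: it assumes the script can read the current WU's Project/Run/Clone/Gen and its progress from the client (how to do that is left open), and the names `WorkUnit` and `needs_image` are made up for illustration.

```python
from typing import NamedTuple, Optional


class WorkUnit(NamedTuple):
    """Identifies a WU by Project, Run, Clone, and Gen."""
    project: int
    run: int
    clone: int
    gen: int


def needs_image(current: WorkUnit,
                percent_done: float,
                last_imaged: Optional[WorkUnit],
                min_percent: float = 1.0) -> bool:
    """True when the current WU has started and has not been imaged yet."""
    return current != last_imaged and percent_done >= min_percent


# Placeholder PRCG values, purely for demonstration.
old = WorkUnit(1, 0, 0, 1)
new = WorkUnit(1, 0, 0, 2)
print(needs_image(new, 1.0, last_imaged=old))  # True: new WU, 1% in
print(needs_image(new, 4.0, last_imaged=new))  # False: already imaged
```

A scheduler would call this periodically and, when it returns True, run the pause-and-image steps and then record the WU as imaged.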
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

