Work being lost after reboot

Post by Rreini » Wed Dec 09, 2009 11:50 pm

I've recently been experiencing problems with folding work being lost whenever I have to reboot my iMac. I'm running the 6.26.3 client.
Code:
[16:31:39] Completed 155000 out of 250000 steps  (62%)
[16:48:04] Completed 157500 out of 250000 steps  (63%)
[17:04:28] Completed 160000 out of 250000 steps  (64%)

Folding@Home Client Shutdown.


--- Opening Log file [December 9 19:15:56 UTC]


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.26

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/rreini/Library/Folding@home
Executable: /Applications/Folding@home.app/fah6


[19:15:56] - Ask before connecting: No
[19:15:56] - User name: rreini (Team 3446)
[19:15:56] - User ID: 4D80B13159B3B63D
[19:15:56] - Machine ID: 1
[19:15:56]
[19:15:56] Loaded queue successfully.
[19:15:56]
[19:15:56] + Processing work unit
[19:15:56] At least 4 processors must be requested; read 1.
[19:15:56] Core required: FahCore_a2.exe
[19:15:56] Core found.
[19:15:56] Working on queue slot 01 [December 9 19:15:56 UTC]
[19:15:56] + Working ...
[19:15:56]
[19:15:56] *------------------------------*
[19:15:56] Folding@Home Gromacs SMP Core
[19:15:56] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[19:15:56]
[19:15:56] Preparing to commence simulation
[19:15:56] - Ensuring status. Please wait.
[19:15:56] Files status OK
[19:15:58] - Expanded 4833455 -> 23979541 (decompressed 496.1 percent)
[19:15:58] Called DecompressByteArray: compressed_data_size=4833455 data_size=23979541, decompressed_data_size=23979541 diff=0
[19:15:58] - Digital signature verified
[19:15:58]
[19:15:58] Project: 2669 (Run 11, Clone 156, Gen 177)
[19:15:58]
[19:15:59] Assembly optimizations on if available.
[19:15:59] Entering M.D.
[19:16:05] Using Gromacs checkpoints
[19:16:08]
[19:16:11] Entering M.D.
[19:16:17] Using Gromacs checkpoints
[19:16:26] me: file hashes different -- aborting.
[19:16:30] CoreStatus = FF (255)
[19:16:30] Sending work to server
[19:16:30] Project: 2669 (Run 11, Clone 156, Gen 177)
[19:16:30] - Error: Could not get length of results file work/wuresults_01.dat
[19:16:30] - Error: Could not read unit 01 file. Removing from queue.
[19:16:30] - Preparing to get new work unit...
[19:16:30] Cleaning up work directory
[19:16:30] + Attempting to get work packet
[19:16:30] - Connecting to assignment server
[19:16:31] - Successful: assigned to (171.64.65.56).
[19:16:31] + News From Folding@Home: Welcome to Folding@Home
[19:16:31] Loaded queue successfully.
[19:16:48] + Closed connections
[19:16:53]
[19:16:53] + Processing work unit
[19:16:53] At least 4 processors must be requested; read 1.
[19:16:53] Core required: FahCore_a2.exe
[19:16:53] Core found.
[19:16:53] Working on queue slot 02 [December 9 19:16:53 UTC]
[19:16:53] + Working ...
[19:16:53]
[19:16:53] *------------------------------*
[19:16:53] Folding@Home Gromacs SMP Core
[19:16:53] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[19:16:53]
[19:16:53] Preparing to commence simulation
[19:16:53] - Ensuring status. Please wait.
[19:17:03] - Looking at optimizations...
[19:17:03] - Working with standard loops on this execution.
[19:17:03] - Files status OK
[19:17:05] - Expanded 4833455 -> 23979541 (decompressed 496.1 percent)
[19:17:05] Called DecompressByteArray: compressed_data_size=4833455 data_size=23979541, decompressed_data_size=23979541 diff=0
[19:17:06] - Digital signature verified
[19:17:06]
[19:17:06] Project: 2669 (Run 11, Clone 156, Gen 177)
[19:17:06]
[19:17:06] Entering M.D.
[19:17:17] Completed 0 out of 250000 steps  (0%)
[19:32:48] Completed 2500 out of 250000 steps  (1%)
[19:48:14] Completed 5000 out of 250000 steps  (2%)
[20:03:38] Completed 7500 out of 250000 steps  (3%)


Any idea what might be going on here?
Rreini
 
Posts: 12
Joined: Thu Jul 24, 2008 10:07 pm

Re: Work being lost after reboot

Post by bruce » Thu Dec 10, 2009 8:18 pm

Was there a power failure or did OSX shut down the normal way? There has always been a small probability of loss of work but it's a much higher probability when there's a power failure or other forced shutdown.
bruce
 
Posts: 21322
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Work being lost after reboot

Post by Rreini » Thu Dec 10, 2009 9:00 pm

bruce wrote: Was there a power failure or did OSX shut down the normal way? There has always been a small probability of loss of work but it's a much higher probability when there's a power failure or other forced shutdown.


It's been happening with normal shutdowns.
Rreini
 
Posts: 12
Joined: Thu Jul 24, 2008 10:07 pm

Re: Work being lost after reboot

Post by AlanH » Fri Dec 25, 2009 12:48 am

I continue to get this sort of problem. It doesn't require a reboot; just stopping and restarting F@H can do it.

Today I needed to change my local LAN IP range in order to support a visitor. She wanted to VPN to a remote LAN that uses the same private subnet range as I do. Knowing that changing the interface IP address was likely to upset the stupid loopback comms links used by F@H, I figured I had better stop F@H before doing it.

I used the F@H preference pane to Disable F@H, waited for it to stop, then changed the IP structure on my LAN - my Mac's manually defined IP address, and the local net address plus a couple of incoming port mappings on my NAT router. I then went back into the preference pane and restarted F@H. I'm using the latest preference pane version, in Snow Leopard.

Here's the restart log ...

Code:
--- Opening Log file [December 24 22:50:28 UTC]


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.26

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/alan/Library/Folding@home
Executable: /Applications/Folding@home.app/fah6


[22:50:28] - Ask before connecting: No
[22:50:28] - User name: AlanH (Team 47958)
[22:50:28] - User ID: 76FFEC7656959A7
[22:50:28] - Machine ID: 1
[22:50:28]
[22:50:28] Loaded queue successfully.
[22:50:28]
[22:50:28] + Processing work unit
[22:50:28] At least 4 processors must be requested; read 1.
[22:50:28] Core required: FahCore_a2.exe
[22:50:28] Core found.
[22:50:28] Working on queue slot 09 [December 24 22:50:28 UTC]
[22:50:28] + Working ...
[22:50:28]
[22:50:28] *------------------------------*
[22:50:28] Folding@Home Gromacs SMP Core
[22:50:28] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[22:50:28]
[22:50:28] Preparing to commence simulation
[22:50:28] - Ensuring status. Please wait.
[22:50:38] - Looking at optimizations...
[22:50:38] - Working with standard loops on this execution.
[22:50:38] - Files status OK
[22:50:45] - Expanded 4842506 -> 23982741 (decompressed 495.2 percent)
[22:50:45] Called DecompressByteArray: compressed_data_size=4842506 data_size=23982741, decompressed_data_size=23982741 diff=0
[22:50:45] - Digital signature verified
[22:50:46]
[22:50:46] Project: 2669 (Run 6, Clone 162, Gen 193)
[22:50:46]
[22:51:21] Entering M.D.
[22:51:27] Using Gromacs checkpoints
[22:52:49] CoreStatus = 1 (1)
[22:52:49] Sending work to server
[22:52:49] Project: 2669 (Run 6, Clone 162, Gen 193)
[22:52:49] - Error: Could not get length of results file work/wuresults_09.dat
[22:52:49] - Error: Could not read unit 09 file. Removing from queue.
[22:52:49] - Preparing to get new work unit...
[22:52:49] Cleaning up work directory
[22:52:49] + Attempting to get work packet
[22:52:49] - Connecting to assignment server
[22:53:47] - Successful: assigned to (171.64.65.56).
[22:53:47] + News From Folding@Home: Welcome to Folding@Home
[22:53:47] Loaded queue successfully.
[22:55:15] + Closed connections
[22:55:20]
[22:55:20] + Processing work unit
[22:55:20] At least 4 processors must be requested; read 1.
[22:55:20] Core required: FahCore_a2.exe
[22:55:20] Core found.
[22:55:20] Working on queue slot 00 [December 24 22:55:20 UTC]
[22:55:20] + Working ...
[22:55:20]
[22:55:20] *------------------------------*
[22:55:20] Folding@Home Gromacs SMP Core
[22:55:20] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[22:55:20]
[22:55:20] Preparing to commence simulation
[22:55:20] - Ensuring status. Please wait.
[22:55:29] - Looking at optimizations...
[22:55:29] - Working with standard loops on this execution.
[22:55:29] - Files status OK
[22:55:31] - Expanded 4842506 -> 23982741 (decompressed 495.2 percent)
[22:55:31] Called DecompressByteArray: compressed_data_size=4842506 data_size=23982741, decompressed_data_size=23982741 diff=0
[22:55:31] - Digital signature verified
[22:55:31]
[22:55:31] Project: 2669 (Run 6, Clone 162, Gen 193)
[22:55:31]
[22:55:31] Entering M.D.
[22:55:42] Completed 0 out of 250000 steps  (0%)
[23:04:40] Completed 2500 out of 250000 steps  (1%)
[23:13:22] Completed 5000 out of 250000 steps  (2%)
[23:22:03] Completed 7500 out of 250000 steps  (3%)


The previous work unit was 30% complete, so that's about 4 hours of folding time wasted. This is also a typical result if I have to reboot the Mac after an OS upgrade.
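The "about 4 hours" estimate can be roughly cross-checked from the log timestamps. The following is a minimal Python sketch, assuming the `[HH:MM:SS] Completed N out of M steps` line format shown in the log above and a single uninterrupted run; the function names are illustrative and not part of the F@H client.

```python
import re
from datetime import datetime, timedelta

# Matches progress lines such as:
# [22:55:42] Completed 0 out of 250000 steps  (0%)
PROGRESS = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps")


def seconds_per_percent(log_lines):
    """Average wall-clock seconds per 1% of a WU, from first to last progress line."""
    points = []
    for line in log_lines:
        m = PROGRESS.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%H:%M:%S")
            percent = 100.0 * int(m.group(2)) / int(m.group(3))
            points.append((stamp, percent))
    if len(points) < 2:
        return None
    elapsed = points[-1][0] - points[0][0]
    if elapsed < timedelta(0):
        # The log records only the time of day, so assume one midnight rollover.
        elapsed += timedelta(days=1)
    gained = points[-1][1] - points[0][1]
    return elapsed.total_seconds() / gained if gained > 0 else None


def estimated_hours_lost(log_lines, percent_done):
    """Rough hours of folding represented by percent_done at the measured rate."""
    rate = seconds_per_percent(log_lines)
    return None if rate is None else rate * percent_done / 3600.0


# Progress lines copied from the restart log above.
sample = [
    "[22:55:42] Completed 0 out of 250000 steps  (0%)",
    "[23:04:40] Completed 2500 out of 250000 steps  (1%)",
    "[23:13:22] Completed 5000 out of 250000 steps  (2%)",
    "[23:22:03] Completed 7500 out of 250000 steps  (3%)",
]
print(round(estimated_hours_lost(sample, 30), 1))
```

At the rate seen in these four lines (about 527 seconds per percent), 30% works out to roughly 4.4 hours, which is in line with the estimate. The rate is taken from the replacement WU, so it is only a proxy for the one that was lost.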
Folding for TeamCFC
- Mac Pro Dual 2.66GHz Xeon, 4 GBytes running Mac SMP2 client
AlanH
 
Posts: 174
Joined: Mon Dec 03, 2007 9:54 pm

Re: Work being lost after reboot

Post by codysluder » Fri Dec 25, 2009 2:33 pm

Three things that you didn't mention might help.

Did you shut down FAH manually, or was it still running when you rebooted?
If you shut FAH down, how long did you wait before rebooting?
Presumably you did a normal reboot.
codysluder
 
Posts: 2128
Joined: Sun Dec 02, 2007 12:43 pm

Re: Work being lost after reboot

Post by stevew » Mon Dec 28, 2009 5:45 am

I run 2 Macs folding SMP WUs. Gromacs is really sensitive to any interruption of the localhost communications (~2.8 MB/sec on a 2.16 GHz C2D). I have lost over a dozen SMP WUs over the past 3 years to reboots and to hiccups in connectivity, such as switching WiFi, which stops F@h. Now I make no changes until the currently running WU is complete and has successfully uploaded its results. When Apple pushes out security and other "reboot required" software updates, I click the "Not Now" button and install them after the current WU is done. Time Machine WiFi backups (a different channel) wait until WUs are completed too.

The following procedure can sometimes restart a WU. Check Activity Monitor (I sort by Process Name); if 4 FahCore_a2.exe processes are sitting idle, you may be able to save it. Use the Pref Pane to Disable F@h. This takes a few seconds to complete. In a terminal window in ~/Library/Folding@home, check the log with "tail FAHlog.txt" and you should see "Folding@Home Client Shutdown". Start F@h again in the Pref Pane and maybe you'll be lucky and see:
[04:21:11] Using Gromacs checkpoints
[04:21:18] Resuming from checkpoint
Hope that this may help save a WU :)
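The log check in that procedure can also be scripted. Here is a rough Python sketch, assuming the FAHlog.txt file mentioned above and that the marker strings appear exactly as in the log excerpts in this thread; the helper names are made up for illustration.

```python
from pathlib import Path

SHUTDOWN_MARKER = "Folding@Home Client Shutdown"
RESUME_MARKER = "Resuming from checkpoint"
OPEN_MARKER = "--- Opening Log file"


def last_lines(path, count=10):
    """Return the last `count` lines of a text file, similar to `tail`."""
    text = Path(path).read_text(errors="replace")
    return text.splitlines()[-count:]


def stopped_cleanly(path):
    """True if the shutdown message appears near the end of the log."""
    return any(SHUTDOWN_MARKER in line for line in last_lines(path))


def resumed_from_checkpoint(path):
    """True if the most recent client start logged a checkpoint resume."""
    lines = Path(path).read_text(errors="replace").splitlines()
    # Look only at the part of the log after the last "Opening Log file" line.
    starts = [i for i, line in enumerate(lines) if OPEN_MARKER in line]
    recent = lines[starts[-1]:] if starts else lines
    return any(RESUME_MARKER in line for line in recent)
```

For example, after disabling F@h, `stopped_cleanly(Path.home() / "Library/Folding@home/FAHlog.txt")` should be true before the client is started again, and `resumed_from_checkpoint(...)` on the same file shows afterwards whether the restart picked up the old WU.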

I liked it better when the SMP interprocess communication did not show up in Activity Monitor's Network monitor. Then I could use it to check the Network.
stevew
 
Posts: 143
Joined: Mon Dec 03, 2007 11:53 pm
Location: Team Hack-A-Day

Re: Work being lost after reboot

Post by AlanH » Wed Jan 06, 2010 3:51 pm

codysluder wrote: Three things that you didn't mention might help.
Did you shut down FAH manually, or was it still running when you rebooted?

Not sure if you are referring to my observations, but:
As stated, I used the Preferences panel to "Disable F@H". Is that what you mean by manually?

codysluder wrote: If you shut FAH down, how long did you wait before rebooting?

As stated, I waited until the F@H processes stopped. I did this by watching until they disappeared from the list in Activity Monitor.

codysluder wrote: Presumably you did a normal reboot.

Correct.

@stevew: Waiting for the inactive period between work units works for me as well. But with around 12 hours per work unit, it can be difficult to be available at the precise time when this occurs. Real life does intervene from time to time. In the specific situation I was in, if you have guests who can't get to their work, waiting another 8 hours to reconfigure the network is not an option.
AlanH
 
Posts: 174
Joined: Mon Dec 03, 2007 9:54 pm

Re: Work being lost after reboot

Post by stevew » Thu Jan 07, 2010 12:48 am

@AlanH -- 12 hours per WU !! ?? Wow.
Sigh, my 1.83 GHz CD mini rips through SMP WUs in 50 hours . . . and the other, a 2.16 GHz, takes 28. For a while I took solace in their watt sipping (35 and 95 watts @ 99% CPU), but caution has been thrown out and $$ sent for a new 8-core. I hope to do a little more for F@h and S@h. :)
stevew
 
Posts: 143
Joined: Mon Dec 03, 2007 11:53 pm
Location: Team Hack-A-Day

Re: Work being lost after reboot

Post by lanbrown » Thu Jan 07, 2010 3:35 pm

I would rather have the work lost after a reboot than have the client finish it and then complain about the results. I have had that issue a few times. The client was stopped at around 10%, and when the other 90% completed it didn't even try to send it. Talk about a waste of CPU time. I guess if you need to stop a client, you might as well delete that slot.
lanbrown
 
Posts: 173
Joined: Thu Jul 09, 2009 1:21 am

Re: Work being lost after reboot

Post by bruce » Thu Jan 07, 2010 5:08 pm

If a WU is corrupted by a shutdown, the best available option is either to restart it from 0 or to download a new one. What it should NOT do is resume work from a corrupted checkpoint and then detect it much later.

When a WU dies at 100% and it's reasonable to assume that this resulted from a stop/restart earlier in the run, we need to be sure that development knows about it. I'm aware of a couple of issues of that sort that were fixed by an update to the FahCore. You don't happen to have a backup of the WU data from just before the restart (as of the 10% completion), do you? If we can provide development with a WU that can be restarted on their system, and which will then process to 100% and fail, I'm confident that they'll be able to figure out what to fix.

Of course, if you discard it and don't have a FAHlog showing the actual error at 100%, they're not going to be interested, because there's no reason to believe the checkpoint has been corrupted in any way, or to believe that the corruption will not be detected when work is resumed.
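For anyone who wants to be able to supply such a WU, one option is to copy the client's data folder while the client is stopped and before it is restarted. Below is a minimal Python sketch, assuming the launch directory shown in the logs in this thread; the backup folder name is an arbitrary choice for illustration, and this is not an official F@H procedure.

```python
import shutil
import time
from pathlib import Path


def backup_fah_folder(source, backup_root):
    """Copy the whole F@H folder into a new timestamped folder under backup_root.

    Run this only while the client is stopped, so the files are not changing.
    """
    source = Path(source).expanduser()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # "fah-backup-..." is a hypothetical naming scheme, not something the client uses.
    destination = Path(backup_root).expanduser() / f"fah-backup-{stamp}"
    shutil.copytree(source, destination)
    return destination
```

A call such as `backup_fah_folder("~/Library/Folding@home", "~/fah-backups")` would keep the queue, the work files and the FAHlog together, so the copy can be matched later to whatever error shows up at 100%. The `~/fah-backups` location is only an example.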
bruce
 
Posts: 21322
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Work being lost after reboot

Post by lanbrown » Thu Jan 07, 2010 9:47 pm

I have the data files for that WU. It never actually deleted the work that was done. Some of them are around 20 MB in size. Could they figure out what went wrong from that? I don't have a copy from when it was at 10%.
lanbrown
 
Posts: 173
Joined: Thu Jul 09, 2009 1:21 am

Re: Work being lost after reboot

Post by AlanH » Thu Jan 07, 2010 9:57 pm

bruce wrote: If a WU is corrupted by a shutdown, the best available option is either to restart it from 0 or to download a new one. What it should NOT do is resume work from a corrupted checkpoint and then detect it much later.

Agreed, and it isn't doing that for me. But since there is supposed to be a checkpoint system in place I would have thought it could manage to recover and continue reasonably often, rather than abort pretty well every time.

@stevew: Mine's a 1st generation 2.66 GHz quad MacPro. I get around 3,000 points per day. With eight modern cores you should be able to fold up a storm.
AlanH
 
Posts: 174
Joined: Mon Dec 03, 2007 9:54 pm

Re: Work being lost WITHOUT reboot

Post by mikebikemusic » Mon Jan 11, 2010 4:29 pm

I am experiencing the same loss of work after resuming F@H, even without any restart. Last week I installed the latest F@H on a white MacBook running Snow Leopard. No prior versions have run on this machine. After about 50% of the calculation was done, I used the F@H system preference pane to disable it. The log and Activity Monitor (and the laptop's fan) all indicated that it stopped normally. A few minutes later, without any reboot, I pressed the button in the system preferences to enable it. The logs indicated that it could not resume the work it started because the files were corrupted. It then proceeded to download the exact same WU.

I let it run for another 50%, disabled and enabled without rebooting and it found the prior work files corrupted again. At this point I gave up running this on the laptop. The fan (understandably) runs continuously when F@H is processing. But sometimes I need to concentrate and make it quiet. If I can't reliably disable this process without losing work, why bother running it at all?

If it helps any, I noticed that the logs for the PC show "writing local files" between successive "Completed steps" messages, but the Mac version just writes "Completed steps" messages. I think the checkpoints are not being written out.
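That comparison between the two logs can be made mechanically. Here is a Python sketch, assuming progress lines look like the "Completed ... steps" lines earlier in this thread and that the PC client's message contains the text "writing local files" as quoted here; the exact wording in a real log may differ, so the match is case-insensitive and the marker is a parameter.

```python
def progress_gaps_without_write(log_lines, marker="writing local files"):
    """Count progress intervals that contain no checkpoint-write message.

    Returns (intervals_without_marker, total_intervals), where an interval is
    the stretch of log between two consecutive "Completed ... steps" lines.
    """
    total = 0
    missing = 0
    seen_progress = False
    seen_marker = False
    for line in log_lines:
        lowered = line.lower()
        if marker in lowered:
            seen_marker = True
        elif "completed" in lowered and "steps" in lowered:
            if seen_progress:
                total += 1
                if not seen_marker:
                    missing += 1
            seen_progress = True
            seen_marker = False
    return missing, total
```

A result such as `(3, 3)` means none of the three intervals logged a write message. That only shows what the client chose to log; it does not by itself prove that nothing was written to disk.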
mikebikemusic
 
Posts: 1
Joined: Mon Jan 11, 2010 4:00 pm

Re: Work being lost WITHOUT reboot

Post by AlanH » Tue Jan 12, 2010 1:50 am

mikebikemusic wrote: If it helps any, I noticed that the logs for the PC show "writing local files" between successive "Completed steps" messages, but the Mac version just writes "Completed steps" messages. I think the checkpoints are not being written out.

An interesting suggestion.

Checking my ~/Library/Folding@Home/Work/ folder, it looks as if I'm currently processing wudata_09, as there are a number of work files with this root and various file extensions. There's a 3.5 MByte file that's updated every 5 minutes called wudata_09.cpt. I'm guessing that's the checkpoint file. There's also one called wudata_09_prev.cpt that is always five minutes older - presumably the previous checkpoint file renamed. And there's a 77.8 KByte file called wudata_09.ckp that's updated at the same time as the .cpt file. I would imagine this is a status file listing the checkpoint file structure.

If my guesses are right, then the software seems to be updating some checkpoint files regularly. However, it appears that the files themselves don't contain viable checkpoint info.
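These observations can be repeated with a small script. This is a sketch in Python, assuming the work folder and the `.cpt` and `.ckp` file names described above; it reports only sizes and modification times, which says nothing about whether the contents are valid.

```python
import time
from pathlib import Path


def checkpoint_report(work_dir, now=None):
    """Return (name, size_bytes, age_seconds) for each .cpt/.ckp file, newest first."""
    now = time.time() if now is None else now
    rows = []
    for path in Path(work_dir).expanduser().iterdir():
        if path.suffix in (".cpt", ".ckp"):
            info = path.stat()
            rows.append((path.name, info.st_size, now - info.st_mtime))
    # Smallest age first, i.e. most recently updated file at the top.
    return sorted(rows, key=lambda row: row[2])


def print_report(work_dir):
    """Print one line per checkpoint-related file in the work folder."""
    for name, size, age in checkpoint_report(work_dir):
        print(f"{name:24} {size / 1e6:8.2f} MB  updated {age / 60:6.1f} min ago")
```

Running `print_report("~/Library/Folding@Home/Work")` a few times while the client is folding should show whether the `.cpt` file keeps being refreshed and whether the `_prev.cpt` file trails it by one checkpoint interval.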
AlanH
 
Posts: 174
Joined: Mon Dec 03, 2007 9:54 pm

Re: Work being lost after reboot

Post by Miller855 » Sat Jan 16, 2010 1:41 pm

I hope this helps; it is just a simple observation.
I never shut down the WU; I use the pause feature of Increase 2.4.3, which I downloaded here in the forums.
Then when I do a software update, the auto restart turns the WU back on where it left off, with no problems.
I also use SMC fan control to keep the CPU at 50 degrees C +/- 3 or 4 degrees on my 2 iMac desktops.
iMac 2.93 GHz, C2D 4 GB DDR3 RAM, OSX 10.6.3, built-in NVidia GeForce GT 120
iMac 2.0 GHz, C2D 2 GB DDR2 RAM, OSX 10.6.3, built-in ATI Radeon HD 2400
120 Gig PS3, 80 Gig PS3
Miller855
 
Posts: 117
Joined: Fri Apr 17, 2009 5:36 am
Location: Phila (Suburbs) PA USA
