Lost Four Days of Work; Seeking Guidance

This client will only use a single CPU


Post by DrBB1 » Thu Sep 15, 2011 5:07 pm

I apologize if this has been answered before, or if I am posting to the wrong section of the forum.

I have been processing a WU for Project 10722, originally downloaded six days ago. On my machine, this WU has an ETA of about 9 days, significantly longer than the previous record (for this machine) of about four days. Every day or two I normally quit F@H briefly for a variety of reasons and have rarely had a problem resuming where I left off. However, after five days of crunching away and completing almost two-thirds of the WU, it failed to resume and has restarted.

I have not seen any problems reported for this WU, so my main concern is that, because it will take so long to complete and because I need to stop it periodically, the chances are good that it will again fail to complete, wasting everyone's time and resources. I have included a chunk of the logfile below.

Code:
[23:08:00] Completed 3640001 out of 7000001 steps  (52%)
[23:20:31] - Autosending finished units... [September 13 23:20:31 UTC]
[23:20:31] Trying to send all finished work units
[23:20:31] + No unsent completed units remaining.
[23:20:31] - Autosend completed
[23:20:31] + Working...
[01:15:24] Completed 3710001 out of 7000001 steps  (53%)
[03:22:46] Completed 3780001 out of 7000001 steps  (54%)
[05:20:31] - Autosending finished units... [September 14 05:20:31 UTC]
[05:20:31] Trying to send all finished work units
[05:20:31] + No unsent completed units remaining.
[05:20:31] - Autosend completed
[05:20:31] + Working...
[05:30:26] Completed 3850001 out of 7000001 steps  (55%)
[07:41:53] Completed 3920001 out of 7000001 steps  (56%)
[09:48:34] Completed 3990001 out of 7000001 steps  (57%)
[11:20:31] - Autosending finished units... [September 14 11:20:31 UTC]
[11:20:31] Trying to send all finished work units
[11:20:31] + No unsent completed units remaining.
[11:20:31] - Autosend completed
[11:20:31] + Working...
[11:55:06] Completed 4060001 out of 7000001 steps  (58%)
[14:01:33] Completed 4130001 out of 7000001 steps  (59%)
[16:16:47] Completed 4200001 out of 7000001 steps  (60%)
[17:20:31] - Autosending finished units... [September 14 17:20:31 UTC]
[17:20:31] Trying to send all finished work units
[17:20:31] + No unsent completed units remaining.
[17:20:31] - Autosend completed
[17:20:31] + Working...
[18:25:58] Completed 4270001 out of 7000001 steps  (61%)
[20:37:46] Completed 4340001 out of 7000001 steps  (62%)
[22:49:45] Completed 4410001 out of 7000001 steps  (63%)
[23:20:31] - Autosending finished units... [September 14 23:20:31 UTC]
[23:20:31] Trying to send all finished work units
[23:20:31] + No unsent completed units remaining.
[23:20:31] - Autosend completed
[23:20:31] + Working...
[00:57:14] Completed 4480001 out of 7000001 steps  (64%)
[02:51:18] ***** Got a SIGTERM signal (2)
[02:51:18] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [September 15 03:31:58 UTC]


# Windows CPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\Bruce\AppData\Roaming\Folding@home-x86
Arguments: -verbosity 9 -advmethods

[03:31:58] - Ask before connecting: No
[03:31:58] - User name: DrBB1 (Team 39340)
[03:31:58] - User ID: B5E074A208A4BE8
[03:31:58] - Machine ID: 3
[03:31:58]
[03:31:58] Loaded queue successfully.
[03:31:58] Initialization complete
[03:31:58]
[03:31:58] + Processing work unit
[03:31:58] - Autosending finished units... [September 15 03:31:58 UTC]
[03:31:58] Trying to send all finished work units
[03:31:58] + No unsent completed units remaining.
[03:31:58] - Autosend completed
[03:31:58] Core required: FahCore_a4.exe
[03:31:58] Core found.
[03:31:58] Working on queue slot 01 [September 15 03:31:58 UTC]
[03:31:58] + Working ...
[03:31:58] - Calling '.\FahCore_a4.exe -dir work/ -suffix 01 -priority 96 -nocpulock -checkpoint 21 -verbose -lifeline 9132 -version 623'

[03:32:20]
[03:32:20] *------------------------------*
[03:32:20] Folding@Home Gromacs GB Core
[03:32:20] Version 2.27 (Dec. 15, 2010)
[03:32:20]
[03:32:20] Preparing to commence simulation
[03:32:20] - Looking at optimizations...
[03:32:21] - Files status OK
[03:32:21] - Expanded 271101 -> 354128 (decompressed 130.6 percent)
[03:32:21] Called DecompressByteArray: compressed_data_size=271101 data_size=354128, decompressed_data_size=354128 diff=0
[03:32:21] - Digital signature verified
[03:32:21]
[03:32:21] Project: 10722 (Run 0, Clone 1110, Gen 2)
[03:32:21]
[03:32:22] Assembly optimizations on if available.
[03:32:22] Entering M.D.
[03:32:28] Using Gromacs checkpoints
[03:32:28] Mapping NT from 1 to 1
[03:32:31] Resuming from checkpoint
[03:32:31] Verified work/wudata_01.log
[03:32:33] Verified work/wudata_01.trr
[03:32:34] Verified work/wudata_01.xtc
[03:32:35] Verified work/wudata_01.edr
[03:32:36] Completed 4542530 out of 7000001 steps  (64%)
[03:50:05] Completed 4550001 out of 7000001 steps  (65%)
[05:26:59] ***** Got a SIGTERM signal (2)
[05:26:59] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [September 15 05:29:33 UTC]


# Windows CPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\Bruce\AppData\Roaming\Folding@home-x86
Arguments: -verbosity 9 -advmethods

[05:29:33] - Ask before connecting: No
[05:29:33] - User name: DrBB1 (Team 39340)
[05:29:33] - User ID: B5E074A208A4BE8
[05:29:33] - Machine ID: 3
[05:29:33]
[05:29:33] Loaded queue successfully.
[05:29:33] Initialization complete
[05:29:33]
[05:29:33] + Processing work unit
[05:29:33] - Autosending finished units... [September 15 05:29:33 UTC]
[05:29:33] Trying to send all finished work units
[05:29:33] + No unsent completed units remaining.
[05:29:33] - Autosend completed
[05:29:33] Core required: FahCore_a4.exe
[05:29:33] Core found.
[05:29:33] Working on queue slot 01 [September 15 05:29:33 UTC]
[05:29:33] + Working ...
[05:29:33] - Calling '.\FahCore_a4.exe -dir work/ -suffix 01 -priority 96 -nocpulock -checkpoint 21 -verbose -lifeline 4844 -version 623'

[05:29:35]
[05:29:35] *------------------------------*
[05:29:35] Folding@Home Gromacs GB Core
[05:29:35] Version 2.27 (Dec. 15, 2010)
[05:29:35]
[05:29:35] Preparing to commence simulation
[05:29:35] - Looking at optimizations...
[05:29:35] - Files status OK
[05:29:35] - Expanded 271101 -> 354128 (decompressed 130.6 percent)
[05:29:35] Called DecompressByteArray: compressed_data_size=271101 data_size=354128, decompressed_data_size=354128 diff=0
[05:29:35] - Digital signature verified
[05:29:35]
[05:29:35] Project: 10722 (Run 0, Clone 1110, Gen 2)
[05:29:35]
[05:29:35] Assembly optimizations on if available.
[05:29:35] Entering M.D.
[05:29:41] Using Gromacs checkpoints
[05:29:41] Mapping NT from 1 to 1
[05:29:42] Resuming from checkpoint
[05:29:42] fcSaveRestoreState: I/O failed dir=0, var=037EC56C, varsize=20
[05:29:42] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[05:29:42] mdrun returned 3
[05:29:42] Gromacs detected an invalid checkpoint.  Restarting...
[05:29:50] Folding@home Core Shutdown: UNKNOWN_ERROR
[05:29:53] CoreStatus = 62 (98)
[05:29:53] + Restarting core (settings changed)
[05:29:53]
[05:29:53] + Processing work unit
[05:29:53] Core required: FahCore_a4.exe
[05:29:53] Core found.
[05:29:53] Working on queue slot 01 [September 15 05:29:53 UTC]
[05:29:53] + Working ...
[05:29:53] - Calling '.\FahCore_a4.exe -dir work/ -suffix 01 -priority 96 -nocpulock -checkpoint 21 -notermcheck -verbose -lifeline 4844 -version 623'

[05:29:54]
[05:29:54] *------------------------------*
[05:29:54] Folding@Home Gromacs GB Core
[05:29:54] Version 2.27 (Dec. 15, 2010)
[05:29:54]
[05:29:54] Preparing to commence simulation
[05:29:54] - Looking at optimizations...
[05:29:54] - Not checking prior termination.
[05:29:54] - Expanded 271101 -> 354128 (decompressed 130.6 percent)
[05:29:54] Called DecompressByteArray: compressed_data_size=271101 data_size=354128, decompressed_data_size=354128 diff=0
[05:29:54] - Digital signature verified
[05:29:54]
[05:29:54] Project: 10722 (Run 0, Clone 1110, Gen 2)
[05:29:54]
[05:29:54] Assembly optimizations on if available.
[05:29:54] Entering M.D.
[05:30:00] Mapping NT from 1 to 1
[05:30:00] Completed 0 out of 7000001 steps  (0%)
[05:31:42] ***** Got a SIGTERM signal (2)
[05:31:42] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [September 15 05:31:49 UTC]


# Windows CPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\Bruce\AppData\Roaming\Folding@home-x86
Arguments: -verbosity 9 -advmethods

[05:31:49] - Ask before connecting: No
[05:31:49] - User name: DrBB1 (Team 39340)
[05:31:49] - User ID: B5E074A208A4BE8
[05:31:49] - Machine ID: 3
[05:31:49]
[05:31:49] Loaded queue successfully.
[05:31:49] Initialization complete
[05:31:49]
[05:31:49] + Processing work unit
[05:31:49] Core required: FahCore_a4.exe
[05:31:49] Core found.
[05:31:49] - Autosending finished units... [September 15 05:31:49 UTC]
[05:31:49] Trying to send all finished work units
[05:31:49] + No unsent completed units remaining.
[05:31:49] - Autosend completed
[05:31:49] Working on queue slot 01 [September 15 05:31:49 UTC]
[05:31:49] + Working ...
[05:31:49] - Calling '.\FahCore_a4.exe -dir work/ -suffix 01 -priority 96 -nocpulock -checkpoint 21 -verbose -lifeline 5144 -version 623'

[05:31:49]
[05:31:49] *------------------------------*
[05:31:49] Folding@Home Gromacs GB Core
[05:31:49] Version 2.27 (Dec. 15, 2010)
[05:31:49]
[05:31:49] Preparing to commence simulation
[05:31:49] - Looking at optimizations...
[05:31:49] - Files status OK
[05:31:49] - Expanded 271101 -> 354128 (decompressed 130.6 percent)
[05:31:49] Called DecompressByteArray: compressed_data_size=271101 data_size=354128, decompressed_data_size=354128 diff=0
[05:31:49] - Digital signature verified
[05:31:49]
[05:31:49] Project: 10722 (Run 0, Clone 1110, Gen 2)
[05:31:49]
[05:31:50] Assembly optimizations on if available.
[05:31:50] Entering M.D.
[05:31:56] Using Gromacs checkpoints
[05:31:56] Mapping NT from 1 to 1
[05:31:56] Resuming from checkpoint
[05:31:56] Verified work/wudata_01.log
[05:31:56] Verified work/wudata_01.trr
[05:31:56] Verified work/wudata_01.xtc
[05:31:56] Verified work/wudata_01.edr
[05:31:56] Completed 860 out of 7000001 steps  (0%)
[07:41:45] Completed 70001 out of 7000001 steps  (1%)
[09:48:25] Completed 140001 out of 7000001 steps  (2%)
[11:31:49] - Autosending finished units... [September 15 11:31:49 UTC]
[11:31:49] Trying to send all finished work units
[11:31:49] + No unsent completed units remaining.
[11:31:49] - Autosend completed
[11:31:49] + Working...
[11:55:04] Completed 210001 out of 7000001 steps  (3%)
[14:02:02] Completed 280001 out of 7000001 steps  (4%)
[16:11:50] Completed 350001 out of 7000001 steps  (5%)


Unless the logfile suggests something occurred that is unlikely to repeat itself, I would like to kill this WU before too much more time passes and get a new one. However, I am not sure what I need to do to accomplish this without having any unintended consequences--for me or for F@H--or having the same WU sent. If there is a documented procedure for WU "mercy killings" I would appreciate a link to it, or if someone would reply with a step-by-step guide that would be good, too.

I would be willing to let the WU continue if the logfile reveals to a knowledgeable person (guess what I am not...) that quitting the WU in another 4-8 days will not have a similar result...though I am not optimistic right now.
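For anyone reading a long client log like the one above, the point where progress was lost can be located mechanically. The following is a minimal sketch, not part of the F@H client: `find_lost_progress` is a hypothetical helper name, and the two patterns it looks for ("invalid checkpoint" and "Completed N out of M steps") are simply copied from the log lines in this post.

```python
import re

# Matches progress lines such as
# "[03:50:05] Completed 4550001 out of 7000001 steps  (65%)"
STEP = re.compile(r"Completed (\d+) out of \d+ steps")


def find_lost_progress(log_lines):
    """Return (line_number, description) pairs for suspicious log events.

    Two things are flagged: any line mentioning an invalid checkpoint,
    and any progress line whose step count is lower than the previous one.
    """
    events = []
    last_steps = None
    for number, line in enumerate(log_lines, start=1):
        if "invalid checkpoint" in line:
            events.append((number, "invalid checkpoint reported"))
        match = STEP.search(line)
        if match:
            steps = int(match.group(1))
            if last_steps is not None and steps < last_steps:
                events.append(
                    (number, f"progress fell from {last_steps} to {steps} steps")
                )
            last_steps = steps
    return events
```

Run against the log in this post, it would flag the "Gromacs detected an invalid checkpoint" line and the following drop from 4550001 steps to 0. It only reports where the loss happened, not why.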
========
DrBB1

Re: Lost Four Days of Work; Seeking Guidance

Post by 7im » Thu Sep 15, 2011 5:26 pm

v6 is known to occasionally lose a work unit when the client is restarted. It's rare in the latest version, but it still happens, and there is no way to know whether the WU caused the problem. Stopping and restarting could be the cause as much as the work unit. Without a specific error in the log, there's just no way to know.

I'd start by checking the Psummary link at the top of the page, and find out how long the Pref Deadline is for this project. If it looks like you can finish this WU before that deadline, then let it continue. (Remember, it's the time from first download, to when the redownload will finish).

If the ETA to complete is after that Pref Deadline, go ahead and kill it, getting a new WU.
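The deadline check described above can be worked through with a short script. This is a minimal sketch under stated assumptions, not an official tool: `remaining_time` and `finishes_before` are hypothetical helper names, the progress-line format is taken from the log posted earlier in this thread, and the preferred deadline (in days) is a value you would read off the Psummary page yourself.

```python
import re
from datetime import datetime, timedelta

# Matches "[HH:MM:SS] Completed N out of M steps" as seen in the v6 log.
PROGRESS = re.compile(
    r"\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps"
)


def remaining_time(log_lines):
    """Estimate the time still needed, from the current run's progress lines.

    Log timestamps carry no date, so a time earlier than the previous one
    is treated as having crossed midnight. If the step count goes backwards
    (the WU restarted), earlier points are discarded.
    """
    points = []        # (elapsed_seconds, steps_done)
    total = None
    day_offset = 0
    prev_clock = None
    for line in log_lines:
        match = PROGRESS.search(line)
        if not match:
            continue
        h, m, s, done, total = map(int, match.groups())
        clock = h * 3600 + m * 60 + s
        if prev_clock is not None and clock < prev_clock:
            day_offset += 86400
        prev_clock = clock
        if points and done < points[-1][1]:
            points = []
        points.append((clock + day_offset, done))
    if len(points) < 2 or points[-1][1] <= points[0][1]:
        raise ValueError("need at least two progress lines with forward progress")
    (t0, s0), (t1, s1) = points[0], points[-1]
    seconds_per_step = (t1 - t0) / (s1 - s0)
    return timedelta(seconds=seconds_per_step * (total - s1))


def finishes_before(downloaded_at, deadline_days, now, remaining):
    """True if now + remaining falls on or before the preferred deadline.

    The deadline is counted from the first download, as noted above.
    """
    return now + remaining <= downloaded_at + timedelta(days=deadline_days)
```

Fed the progress lines after the restart in the posted log (860 steps at 05:31:56 through 350001 steps at 16:11:50), `remaining_time` gives roughly eight and a half days of further crunching, which is then compared against whatever deadline Psummary lists for the project.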

Re: Lost Four Days of Work; Seeking Guidance

Post by DrBB1 » Thu Sep 15, 2011 10:33 pm

7im-- Thanks for the guidance. I will check the Psummary later tonight and consider whether to continue.

7im wrote:If the ETA to complete is after that Pref Deadline, go ahead and kill it, getting a new WU.

DrBB1 wrote:If there is a documented procedure for WU "mercy killings" I would appreciate a link to it, or if someone would reply with a step-by-step guide that would be good, too.


In case I opt to kill it, how do I do this? I suspect I need to delete some files, but I'm not sure which ones. If the procedure is documented, just send me a link; if not, I'd appreciate it if someone could point me to the files to delete, and the ones to avoid deleting. My guess is to go to the "Work" directory and delete all files "wudata_xx.*", where "xx" is the queue identifier and "*" matches all files in that queue. Is that right? And do I need to delete any files in the parent folder, "Folding@home-x86"?

Thanks in advance.

Re: Lost Four Days of Work; Seeking Guidance

Post by Zagen30 » Thu Sep 15, 2011 10:45 pm

Deleting the entire work folder and queue.dat should get you a new WU.
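For those who prefer to script this step, here is a minimal sketch in Python. It assumes the client has been shut down first and that the folder passed in is the client's launch directory (the one containing `work` and `queue.dat`). `discard_work_unit` is a hypothetical helper name, not something shipped with the client, and it defaults to a dry run so nothing is removed by accident.

```python
import shutil
from pathlib import Path


def discard_work_unit(client_dir, dry_run=True):
    """Remove the 'work' folder and 'queue.dat' from a stopped client directory.

    Returns the paths that were removed (or that would be removed when
    dry_run is True). Nothing else in the directory is touched.
    """
    client_dir = Path(client_dir)
    candidates = (client_dir / "work", client_dir / "queue.dat")
    targets = [path for path in candidates if path.exists()]
    if not dry_run:
        for path in targets:
            if path.is_dir():
                shutil.rmtree(path)   # the whole work folder
            else:
                path.unlink()         # queue.dat
    return targets
```

Calling `discard_work_unit(r"C:\Users\Bruce\AppData\Roaming\Folding@home-x86")` would only list what is about to go; passing `dry_run=False` performs the deletion. The client should then fetch a new WU the next time it starts.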

Re: Lost Four Days of Work; Seeking Guidance

Post by 7im » Thu Sep 15, 2011 10:48 pm

If you get the same WU again, repeat, with the added step of changing the Machine ID.

Re: Lost Four Days of Work; Seeking Guidance

Post by DrBB1 » Fri Sep 16, 2011 9:00 pm

Thanks all!