Losing progress after pausing - is this normal?

Postby flu2146 » Thu Sep 25, 2014 4:54 am

If I pause folding or change the power setting, the progress bar (on a pretty large WU) drops by a few percent. The number of completed steps in the log goes down as well. Compare the beginning of the log to the end.

Code:
03:27:57:WU00:FS00:0xa4:Completed 45000 out of 1500000 steps  (3%)
03:38:34:WU00:FS00:0xa4:Completed 60000 out of 1500000 steps  (4%)
03:39:27:FS00:Paused
03:39:27:FS00:Shutting core down
03:39:32:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:39:38:Saving configuration to config.xml
03:39:38:<config>
03:39:38:  <!-- Slot Control -->
03:39:38:  <power v='medium'/>
03:39:38:
03:39:38:  <!-- User Information -->
03:39:38:  <passkey v='********************************'/>
03:39:38:  <user v='flu2146'/>
03:39:38:
03:39:38:  <!-- Folding Slots -->
03:39:38:  <slot id='0' type='CPU'>
03:39:38:    <paused v='true'/>
03:39:38:  </slot>
03:39:38:</config>
03:44:39:FS00:Unpaused
03:44:39:WU00:FS00:Starting
03:44:39:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Fred/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 704 -lifeline 3128 -checkpoint 15 -np 7
03:44:39:WU00:FS00:Started FahCore on PID 9520
03:44:39:WU00:FS00:Core PID:5904
03:44:39:WU00:FS00:FahCore 0xa4 started
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:*------------------------------*
03:44:40:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
03:44:40:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:Preparing to commence simulation
03:44:40:WU00:FS00:0xa4:- Looking at optimizations...
03:44:40:WU00:FS00:0xa4:- Files status OK
03:44:40:WU00:FS00:0xa4:- Expanded 2053747 -> 5365960 (decompressed 261.2 percent)
03:44:40:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=2053747 data_size=5365960, decompressed_data_size=5365960 diff=0
03:44:40:WU00:FS00:0xa4:- Digital signature verified
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:Project: 7808 (Run 4, Clone 342, Gen 66)
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:Assembly optimizations on if available.
03:44:40:WU00:FS00:0xa4:Entering M.D.
03:44:43:Saving configuration to config.xml
03:44:43:<config>
03:44:43:  <!-- Slot Control -->
03:44:43:  <power v='medium'/>
03:44:43:
03:44:43:  <!-- User Information -->
03:44:43:  <passkey v='********************************'/>
03:44:43:  <user v='flu2146'/>
03:44:43:
03:44:43:  <!-- Folding Slots -->
03:44:43:  <slot id='0' type='CPU'/>
03:44:43:</config>
03:44:46:WU00:FS00:0xa4:Using Gromacs checkpoints
03:44:46:WU00:FS00:0xa4:Mapping NT from 7 to 7
03:44:46:WU00:FS00:0xa4:Resuming from checkpoint
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.log
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.trr
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.xtc
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.edr
03:44:46:WU00:FS00:0xa4:Completed 42250 out of 1500000 steps  (2%)
03:46:44:WU00:FS00:0xa4:Completed 45000 out of 1500000 steps  (3%)
flu2146
 
Posts: 3
Joined: Thu Sep 25, 2014 4:49 am

Re: Losing progress after pausing - is this normal?

Postby Joe_H » Thu Sep 25, 2014 5:09 am

Yes, this is normal. After a pause, the client restarts from the last checkpoint written by the folding core. For CPU WUs such as the Core_A4 one being worked on in your log, the default checkpoint frequency is every 15 minutes, so at most you will redo that much processing.

For current GPU folding cores, the checkpoint frequency is set by the researcher running the project. Typically it is set between 2 and 5% of progress.
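To see how much work a particular pause cost, you can compare the "Completed ... steps" lines before and after the restart. Below is a minimal sketch, not part of the FAH client: the function name `find_rollbacks` is made up for illustration, and it assumes the log covers a single work unit. The sample lines are copied from the log in the first post.

```python
import re

# Matches core progress lines such as
# "03:38:34:WU00:FS00:0xa4:Completed 60000 out of 1500000 steps  (4%)"
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def find_rollbacks(log_lines):
    """Return (steps_before, steps_after, total_steps) for each drop in completed steps.

    Hypothetical helper; assumes all lines belong to the same work unit.
    """
    rollbacks = []
    previous = None
    for line in log_lines:
        match = PROGRESS_RE.search(line)
        if match is None:
            continue
        done, total = int(match.group(1)), int(match.group(2))
        if previous is not None and done < previous:
            rollbacks.append((previous, done, total))
        previous = done
    return rollbacks


sample_log = [
    "03:38:34:WU00:FS00:0xa4:Completed 60000 out of 1500000 steps  (4%)",
    "03:39:27:FS00:Paused",
    "03:44:46:WU00:FS00:0xa4:Resuming from checkpoint",
    "03:44:46:WU00:FS00:0xa4:Completed 42250 out of 1500000 steps  (2%)",
]

for before, after, total in find_rollbacks(sample_log):
    lost = before - after
    print(f"Rolled back {lost} steps ({100 * lost / total:.2f}% of the WU)")
# prints: Rolled back 17750 steps (1.18% of the WU)
```

For the log above, that is 17750 of 1500000 steps, a little over 1% of the WU.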
Joe_H
Site Admin
 
Posts: 6451
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: Losing progress after pausing - is this normal?

Postby flu2146 » Thu Sep 25, 2014 5:10 am

Well, that's frustrating. Is there any way to determine an optimal place/time to pause?
flu2146
 
Posts: 3
Joined: Thu Sep 25, 2014 4:49 am

Re: Losing progress after pausing - is this normal?

Postby Joe_H » Thu Sep 25, 2014 5:20 am

If you examine the work folder inside the F@H data folder on your system, you can see the modification times of the current and previous checkpoint files. If you pause about a minute after the current checkpoint file is written, you can be sure that its entire contents have been flushed from the write cache to your drive.
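Checking the modification time by hand can also be scripted. The following is a rough sketch under stated assumptions: the `*.cpt` pattern is only a guess at the checkpoint file name (adjust it to whatever your work folder actually contains), and the helper names `checkpoint_age` and `safe_to_pause` are hypothetical. The 60-second default mirrors the "about a minute" rule of thumb above.

```python
import time
from pathlib import Path


def checkpoint_age(work_dir, pattern="*.cpt", now=None):
    """Return (path, age_in_seconds) of the newest file matching pattern, or None.

    The default pattern is an assumption, not a documented FAH file name.
    """
    files = [p for p in Path(work_dir).glob(pattern) if p.is_file()]
    if not files:
        return None
    newest = max(files, key=lambda p: p.stat().st_mtime)
    now = time.time() if now is None else now
    return newest, now - newest.stat().st_mtime


def safe_to_pause(work_dir, settle_seconds=60, pattern="*.cpt", now=None):
    """True if the newest checkpoint file was last modified at least settle_seconds ago."""
    result = checkpoint_age(work_dir, pattern, now)
    return result is not None and result[1] >= settle_seconds
```

You would call `safe_to_pause(...)` with the path to the numbered WU folder inside your own work folder. Note that waiting much longer than the settle time only increases the amount of work redone after the restart, so the useful window is shortly after each checkpoint.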
Joe_H
Site Admin
 
Posts: 6451
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: Losing progress after pausing - is this normal?

Postby flu2146 » Thu Sep 25, 2014 5:34 am

Ok, thank you for your help!
flu2146
 
Posts: 3
Joined: Thu Sep 25, 2014 4:49 am

Re: Losing progress after pausing - is this normal?

Postby czeski » Fri Sep 26, 2014 9:11 am

It would be nice if the new cores being developed, like core17 and core18 (and preferably updates to old ones such as a3, a4, and core15), reported in the logs when they have successfully written checkpoint files. That would make it possible in the future to implement a GUI option to pause a WU after the next checkpoint.

It doesn't seem like much work for a developer to add unified lines to the core output, such as:
Writing checkpoint at xx%
Checkpoint at xx% written and verified successfully

The GUI could show a timer since the last checkpoint was written, and maybe even the average time between checkpoints, so you can see whether to pause the WU immediately or wait for the next checkpoint.
This would really help users who are not 24h folders.
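As a rough illustration of how a "pause after next checkpoint" option could use such lines: the sketch below scans log output for the proposed confirmation line and reports where a pause could safely be triggered. Everything here is hypothetical. No core currently prints these lines, and `pause_point` is a made-up name, not a client API.

```python
import re

# Hypothetical line format taken from the suggestion above; no current core prints it.
CHECKPOINT_DONE_RE = re.compile(
    r"Checkpoint at (\d+)% written and verified successfully"
)


def pause_point(log_lines):
    """Return (line_index, percent) of the first confirmed checkpoint, or None."""
    for index, line in enumerate(log_lines):
        match = CHECKPOINT_DONE_RE.search(line)
        if match:
            return index, int(match.group(1))
    return None


sample = [
    "Completed 60000 out of 1500000 steps  (4%)",
    "Writing checkpoint at 4%",
    "Checkpoint at 4% written and verified successfully",
    "Completed 75000 out of 1500000 steps  (5%)",
]
print(pause_point(sample))  # (2, 4)
```

Only the "written and verified" line is treated as a safe pause point; the "Writing checkpoint" line alone means the file may still be incomplete.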
czeski
 
Posts: 11
Joined: Sat Oct 27, 2012 5:59 pm

Re: Losing progress after pausing - is this normal?

Postby 7im » Fri Sep 26, 2014 3:15 pm

Most core 17 and 18 projects write checkpoints every 5%: at 5, 10, 15, etc. It's a known quantity.

It's a good idea, but development on the cores you listed has all but stopped. Maybe their replacements will do better.
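With a fixed checkpoint interval, the progress you would fall back to after a pause is simple arithmetic. Here is a small sketch, assuming checkpoints land exactly on multiples of the interval (5% is the typical figure mentioned above, and individual projects may differ). The function names are illustrative only.

```python
def last_checkpoint(progress_percent, interval_percent=5):
    """Progress (in %) of the most recent checkpoint, assuming fixed intervals."""
    return (progress_percent // interval_percent) * interval_percent


def work_lost_if_paused(progress_percent, interval_percent=5):
    """Progress (in %) that would be redone if the WU were paused right now."""
    return progress_percent - last_checkpoint(progress_percent, interval_percent)


print(last_checkpoint(13))       # 10
print(work_lost_if_paused(13))   # 3
print(work_lost_if_paused(15))   # 0
```

So pausing just after the progress display passes a multiple of 5% loses the least work on those projects.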
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

