
Losing progress after pausing - is this normal?

Posted: Thu Sep 25, 2014 3:54 am
by flu2146
If I pause, or change the power setting that controls how much of my computer is dedicated to folding, the progress bar (on a fairly large WU) drops by a few percent. The number of completed steps in the log goes down as well. Compare the step counts just before the pause with those just after the resume.

Code:

03:27:57:WU00:FS00:0xa4:Completed 45000 out of 1500000 steps  (3%)
03:38:34:WU00:FS00:0xa4:Completed 60000 out of 1500000 steps  (4%)
03:39:27:FS00:Paused
03:39:27:FS00:Shutting core down
03:39:32:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:39:38:Saving configuration to config.xml
03:39:38:<config>
03:39:38:  <!-- Slot Control -->
03:39:38:  <power v='medium'/>
03:39:38:
03:39:38:  <!-- User Information -->
03:39:38:  <passkey v='********************************'/>
03:39:38:  <user v='flu2146'/>
03:39:38:
03:39:38:  <!-- Folding Slots -->
03:39:38:  <slot id='0' type='CPU'>
03:39:38:    <paused v='true'/>
03:39:38:  </slot>
03:39:38:</config>
03:44:39:FS00:Unpaused
03:44:39:WU00:FS00:Starting
03:44:39:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Fred/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 704 -lifeline 3128 -checkpoint 15 -np 7
03:44:39:WU00:FS00:Started FahCore on PID 9520
03:44:39:WU00:FS00:Core PID:5904
03:44:39:WU00:FS00:FahCore 0xa4 started
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:*------------------------------*
03:44:40:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
03:44:40:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:Preparing to commence simulation
03:44:40:WU00:FS00:0xa4:- Looking at optimizations...
03:44:40:WU00:FS00:0xa4:- Files status OK
03:44:40:WU00:FS00:0xa4:- Expanded 2053747 -> 5365960 (decompressed 261.2 percent)
03:44:40:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=2053747 data_size=5365960, decompressed_data_size=5365960 diff=0
03:44:40:WU00:FS00:0xa4:- Digital signature verified
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:Project: 7808 (Run 4, Clone 342, Gen 66)
03:44:40:WU00:FS00:0xa4:
03:44:40:WU00:FS00:0xa4:Assembly optimizations on if available.
03:44:40:WU00:FS00:0xa4:Entering M.D.
03:44:43:Saving configuration to config.xml
03:44:43:<config>
03:44:43:  <!-- Slot Control -->
03:44:43:  <power v='medium'/>
03:44:43:
03:44:43:  <!-- User Information -->
03:44:43:  <passkey v='********************************'/>
03:44:43:  <user v='flu2146'/>
03:44:43:
03:44:43:  <!-- Folding Slots -->
03:44:43:  <slot id='0' type='CPU'/>
03:44:43:</config>
03:44:46:WU00:FS00:0xa4:Using Gromacs checkpoints
03:44:46:WU00:FS00:0xa4:Mapping NT from 7 to 7 
03:44:46:WU00:FS00:0xa4:Resuming from checkpoint
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.log
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.trr
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.xtc
03:44:46:WU00:FS00:0xa4:Verified 00/wudata_01.edr
03:44:46:WU00:FS00:0xa4:Completed 42250 out of 1500000 steps  (2%)
03:46:44:WU00:FS00:0xa4:Completed 45000 out of 1500000 steps  (3%)

Re: Losing progress after pausing - is this normal?

Posted: Thu Sep 25, 2014 4:09 am
by Joe_H
Yes, this is normal. After a pause, the client restarts from the last checkpoint written by the folding core. For CPU WUs such as the Core_A4 one in your log, the default checkpoint interval is 15 minutes, so at most you will redo that much processing.

For current GPU folding cores, the checkpoint frequency is set by the researcher running the project. Typically it is set to somewhere between 2% and 5% of progress.
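
To see how much work a resume actually rolled back, the "Completed N out of M steps" lines in the log can be compared. The following is a minimal Python sketch, assuming the log lines keep the format shown in the first post; the function name `steps_lost_on_resume` is made up for illustration and is not part of the client.

```python
import re

# Matches the progress lines the core prints, e.g.
# "Completed 60000 out of 1500000 steps  (4%)"
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def steps_lost_on_resume(log_lines):
    """Return (steps_before, steps_after) for every drop in the step count."""
    drops = []
    previous = None
    for line in log_lines:
        match = STEP_RE.search(line)
        if not match:
            continue  # not a progress line
        done = int(match.group(1))
        if previous is not None and done < previous:
            drops.append((previous, done))
        previous = done
    return drops


# Three lines copied from the log in the first post.
log = [
    "03:38:34:WU00:FS00:0xa4:Completed 60000 out of 1500000 steps  (4%)",
    "03:39:27:FS00:Paused",
    "03:44:46:WU00:FS00:0xa4:Completed 42250 out of 1500000 steps  (2%)",
]

for before, after in steps_lost_on_resume(log):
    print(f"rolled back {before - after} steps ({before} -> {after})")
# prints: rolled back 17750 steps (60000 -> 42250)
```

In that log the core took about two minutes to get from 42250 back to 45000 steps (03:44:46 to 03:46:44), so the rollback cost a few minutes of work, which fits within the 15-minute checkpoint interval.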

Re: Losing progress after pausing - is this normal?

Posted: Thu Sep 25, 2014 4:10 am
by flu2146
Well, that's frustrating. Is there any way to determine an optimal place/time to pause?

Re: Losing progress after pausing - is this normal?

Posted: Thu Sep 25, 2014 4:20 am
by Joe_H
If you examine the work folder inside the F@H data folder on your system, you can see the modification times of the current and previous checkpoint files. If you pause about a minute after the current checkpoint file is written, you can be confident that its entire contents have been flushed from the write cache to your drive.
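
Checking those modification times can be scripted. Below is a rough Python sketch, assuming you pass it the path of the WU's work folder; the thread does not say which file in that folder is the checkpoint, so the sketch simply lists every file with its age and leaves it to you to pick the checkpoint file out by name. The function name `files_by_age` is hypothetical.

```python
import time
from pathlib import Path


def files_by_age(work_dir, now=None):
    """Return (file_name, seconds_since_last_write), newest file first.

    work_dir is an assumption: the folder for the running WU inside the
    F@H data folder. The `now` parameter exists only to make testing easy.
    """
    if now is None:
        now = time.time()
    entries = []
    for path in Path(work_dir).iterdir():
        if path.is_file():
            entries.append((path.name, now - path.stat().st_mtime))
    entries.sort(key=lambda item: item[1])  # smallest age = most recent write
    return entries


# Example usage (the path is a placeholder, adjust it to your system):
# for name, age in files_by_age(r"C:\path\to\FAHClient\work\00"):
#     print(f"{name}: last written {age:.0f} s ago")
```

Note that log and trajectory files may be written more often than the checkpoint itself, so the newest file in the folder is not necessarily the checkpoint.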

Re: Losing progress after pausing - is this normal?

Posted: Thu Sep 25, 2014 4:34 am
by flu2146
Ok, thank you for your help!

Re: Losing progress after pausing - is this normal?

Posted: Fri Sep 26, 2014 8:11 am
by czeski
It would be nice if the new cores being developed, like core17 and core18 (and preferably updates to older ones such as a3, a4, and core15), reported in the log when they have successfully written a checkpoint file. That would make it possible to add a GUI option to pause a WU after the next checkpoint.

It doesn't seem like it would take much work for a developer to add standardized lines to the core output, such as:
Writing checkpoint at xx%
Checkpoint at xx% written and verified successfully

The GUI could show a timer for when the last checkpoint was written, and maybe even the average time between checkpoints, so you can see whether to pause the WU immediately or wait for the next checkpoint.
This would really help users who are not 24h folders.

Re: Losing progress after pausing - is this normal?

Posted: Fri Sep 26, 2014 2:15 pm
by 7im
Most core 17 and 18 projects write checkpoints every 5%: 5, 10, 15, etc. It's a known quantity.

It's a good idea, but development on the cores you listed has all but stopped. Maybe their replacements will do better.
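
For a project that checkpoints at a fixed percentage interval, the cost of pausing at a given point is simple arithmetic. Here is a small Python sketch, assuming the 5% interval mentioned above and whole-number progress values; the function names are made up for illustration.

```python
def last_checkpoint(progress_percent, interval_percent=5):
    """Most recent checkpoint at or below the current progress."""
    return (progress_percent // interval_percent) * interval_percent


def progress_lost_if_paused(progress_percent, interval_percent=5):
    """Progress (in percentage points) that would be redone after a pause."""
    return progress_percent - last_checkpoint(progress_percent, interval_percent)


print(last_checkpoint(23))          # 20
print(progress_lost_if_paused(23))  # 3
print(progress_lost_if_paused(25))  # 0
```

So pausing shortly after the progress display passes a multiple of 5% loses the least work, provided the project really uses that interval.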