Checkpointing apparently not working properly

Postby Valkyrie » Wed Oct 03, 2012 11:59 pm

Hello

Apologies if this has been covered. My machine is currently running a WU that takes ~40 min/frame, and my checkpointing frequency is set at 15 min. If I pause folding to do something else with the machine, it repeats two frames that the log marked as completed. It seems that it is not really checkpointing every 15 min. Is there a remedy for this (other than never pausing folding)?

Thanks

V
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby 7im » Thu Oct 04, 2012 12:08 am

No remedy. Not all fahcores behave the same. GPU fahcores behave differently than SMP fahcores. As you've seen, some cores only write a checkpoint after a completed frame.
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 12:19 am

7im thanks for the quick reply,

The machine is on Win 7 two core/4 thread running SMP only. I still don't get why it has to redo two completed 40 minute frames after a pause. Sorry if I'm missing something obvious.

Please elaborate.

Thanks

V
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby 7im » Thu Oct 04, 2012 12:24 am

I misunderstood what you said. Sorry. I took it to mean the client restarted more than 2 x 15 min checkpoints back from where it left off, not that it had gone back 2 whole completed frames.

I have not seen that on SMP work units (GPU yes), which is why I misunderstood. What are the PRCG numbers for that work unit?
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 12:42 am

The numbers are 7647(95,1,9). And yes, it seems to be wasting an hour and twenty minutes redoing frames the log says were completed.

Thanks again.

V
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby 7im » Thu Oct 04, 2012 12:50 am

Please post the part of the log that shows the frames before the pause and the restart after it, including the 2 lost frames.
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 2:33 am

Here is an excerpt from the log. It's just a cut and paste, not how I see others in here doing it. I'll have to figure out the right way to put up log files.

Code: Select all
15:06:24:WU01:FS00:0xa4:Completed 1200000 out of 2500000 steps  (48%)
15:45:13:WU01:FS00:0xa4:Completed 1225000 out of 2500000 steps  (49%)
16:23:59:WU01:FS00:0xa4:Completed 1250000 out of 2500000 steps  (50%)
16:28:41:FS00:Paused
16:28:41:FS00:Shutting core down
16:28:48:WU01:FS00:0xa4:Client no longer detected. Shutting down core
16:28:48:WU01:FS00:0xa4:
16:28:48:WU01:FS00:0xa4:Folding@home Core Shutdown: CLIENT_DIED
16:28:48:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
17:32:59:FS00:Unpaused
17:32:59:WU01:FS00:Starting
17:32:59:WU01:FS00:Core PID:3412
17:32:59:WU01:FS00:FahCore 0xa4 started
17:33:00:WU01:FS00:0xa4:
17:33:00:WU01:FS00:0xa4:*------------------------------*
17:33:00:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
17:33:00:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
17:33:00:WU01:FS00:0xa4:
17:33:00:WU01:FS00:0xa4:Preparing to commence simulation
17:33:00:WU01:FS00:0xa4:- Looking at optimizations...
17:33:00:WU01:FS00:0xa4:- Files status OK
17:33:00:WU01:FS00:0xa4:- Expanded 548697 -> 848028 (decompressed 154.5 percent)
17:33:00:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=548697 data_size=848028, decompressed_data_size=848028 diff=0
17:33:00:WU01:FS00:0xa4:- Digital signature verified
17:33:00:WU01:FS00:0xa4:
17:33:00:WU01:FS00:0xa4:Project: 7647 (Run 95, Clone 1, Gen 9)
17:33:00:WU01:FS00:0xa4:
17:33:00:WU01:FS00:0xa4:Assembly optimizations on if available.
17:33:00:WU01:FS00:0xa4:Entering M.D.
17:33:05:WU01:FS00:0xa4:Using Gromacs checkpoints
17:33:05:WU01:FS00:0xa4:Mapping NT from 4 to 4
17:33:05:WU01:FS00:0xa4:Resuming from checkpoint
17:33:05:WU01:FS00:0xa4:Verified 01/wudata_01.log
17:33:05:WU01:FS00:0xa4:Verified 01/wudata_01.trr
17:33:05:WU01:FS00:0xa4:Verified 01/wudata_01.xtc
17:33:05:WU01:FS00:0xa4:Verified 01/wudata_01.edr
17:33:06:WU01:FS00:0xa4:Completed 1246750 out of 2500000 steps  (49%)
17:38:09:WU01:FS00:0xa4:Completed 1250000 out of 2500000 steps  (50%)
******************************** Date: 03/10/12 ********************************
18:17:39:WU01:FS00:0xa4:Completed 1275000 out of 2500000 steps  (51%)
18:57:47:WU01:FS00:0xa4:Completed 1300000 out of 2500000 steps  (52%)



Again, thanks for helping

V

Mod note: you use Code tags selected from the top of the full editor for posts. I have inserted them for you.
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby 7im » Thu Oct 04, 2012 3:32 am

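While the frame numbers make it look like one whole frame was repeated, if you look closer at the timestamps, only about 5 minutes were repeated, not 2 whole frames: the core resumed at step 1246750 at 17:33:06 and was back at 1250000 by 17:38:09.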

So it looks like the checkpoints are working.
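
For anyone who wants to check their own log the same way, here is a minimal Python sketch of that timestamp comparison. It assumes the progress lines look like the ones in the excerpt above; the function names and the regular expression are just illustrative, not part of any FAH tool.

```python
import re
from datetime import timedelta

# Matches lines such as:
# 17:33:06:WU01:FS00:0xa4:Completed 1246750 out of 2500000 steps  (49%)
LINE = re.compile(r"^(\d\d):(\d\d):(\d\d):.*Completed (\d+) out of (\d+) steps")

def parse_progress(log_text):
    """Return [(seconds_since_midnight, steps_done)] for each progress line."""
    points = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:
            h, mi, s, steps, _total = map(int, m.groups())
            points.append((h * 3600 + mi * 60 + s, steps))
    return points

def redone_work(points):
    """Find the first place the step count goes backwards and measure the repeat.

    Returns (steps_redone, seconds_until_caught_up), or None if progress
    never goes backwards. The seconds value is None if the log ends before
    the earlier step count is reached again.
    """
    for i in range(1, len(points)):
        _prev_t, prev_steps = points[i - 1]
        t, steps = points[i]
        if steps < prev_steps:
            for later_t, later_steps in points[i + 1:]:
                if later_steps >= prev_steps:
                    # Modulo tolerates a log that crosses midnight.
                    return prev_steps - steps, (later_t - t) % 86400
            return prev_steps - steps, None
    return None

LOG = """\
16:23:59:WU01:FS00:0xa4:Completed 1250000 out of 2500000 steps  (50%)
16:28:41:FS00:Paused
17:33:06:WU01:FS00:0xa4:Completed 1246750 out of 2500000 steps  (49%)
17:38:09:WU01:FS00:0xa4:Completed 1250000 out of 2500000 steps  (50%)
"""

steps, seconds = redone_work(parse_progress(LOG))
print(steps, timedelta(seconds=seconds))  # prints: 3250 0:05:03
```

This only measures the logged progress that was repeated. The work done between the 50% line at 16:23:59 and the pause at 16:28:41 was never logged, so it was discarded too; the total loss here is roughly 10 minutes, which still fits inside a 15 minute checkpoint interval.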
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 4:27 am

Okay....after looking at the next pause/restart and seeing something more reasonable, I shook out the cobwebs and took a closer look at the timestamps. :oops:

It seems that on occasion it takes the program five or six minutes to figure out that it's working on a frame that it's already completed. It makes for weird-looking log entries if you don't closely examine the timestamps, and it is a waste of CPU time, but it's not wasting an hour and twenty. Sometimes it restarts with what appears to be close to 15 min checkpointing, and sometimes it crunches on a previously completed frame for five or six minutes before it moves on. It's strange...

Oh well,

Thanks again,

V
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 4:37 am

Thanks 7im,

I had realized my error. You Ninja'd me. I submitted my reply and there you were.
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby bruce » Thu Oct 04, 2012 5:56 am

Valkyrie wrote:Okay....after looking at the next pause/restart and seeing something more reasonable, I shook out the cobwebs and took a closer look at the timestamps. :oops:

It seems that on occasion it takes the program five or six minutes to figure out that it's working on a frame that it's already completed. It makes for weird-looking log entries if you don't closely examine the timestamps, and it is a waste of CPU time, but it's not wasting an hour and twenty. Sometimes it restarts with what appears to be close to 15 min checkpointing, and sometimes it crunches on a previously completed frame for five or six minutes before it moves on. It's strange...

Oh well,

Thanks again,

V


If I remember correctly, the checkpointing process used by FahCore_a4 makes a checkpoint 15 minutes after an unpause or after the start of a WU, and every 15 minutes after that. It totally ignores frames, so perhaps you should, too. From the time of the pause, back up 15 minutes. There should be one checkpoint at an undetermined point somewhere in that 15-minute interval. After it is unpaused, it will start from that unknown point, no matter how many frame messages appear in the log.

There's a possibility that I'm thinking of a different core than A4.
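
To make that schedule concrete, here is a rough Python sketch. It assumes checkpoints really do fall at fixed 15 minute steps counted from the last start or unpause, which is only the recollection above and not confirmed core behaviour. The function name and the 18:20:00 pause time are made up for illustration; the 17:32:59 unpause is taken from the posted log.

```python
from datetime import datetime, timedelta

def last_checkpoint(started, paused, interval=timedelta(minutes=15)):
    """Time of the most recent checkpoint before `paused`, or None if none yet.

    Assumption (not confirmed for FahCore_a4): checkpoints are written at
    started + k * interval for k = 1, 2, ..., regardless of frame boundaries.
    """
    completed = (paused - started) // interval  # whole intervals since (re)start
    if completed < 1:
        return None
    return started + completed * interval

started = datetime(2012, 10, 3, 17, 32, 59)   # "Unpaused" line in the log
paused = datetime(2012, 10, 3, 18, 20, 0)     # hypothetical pause time

cp = last_checkpoint(started, paused)
print(cp.time(), paused - cp)  # prints: 18:17:59 0:02:01
```

Under that assumption the work lost to a pause is always less than one interval, and it does not depend on where the frame messages fall in the log.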
bruce
 
Posts: 20140
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 9:42 am

Thanks Bruce,

Watching F@H Control, I have observed sudden slowdowns in TPF at frame turnovers (nothing good on cable or satellite). I had assumed that this was due to checkpointing, even though it had nothing to do with the checkpointing interval. I know that assuming makes... an awful lot of problems. The machine in question is a daily driver, and what it can do for F@H is what it can do. That's all well and good. When it hooks into these larger WUs and the number of days ETA sometimes seems to go up instead of down, I wonder if there is something I can do. Maybe reducing the checkpoint interval. Maybe an SSD for much quicker checkpoints. I have other machines that just crunch away and don't get looked at as much.

We'll soldier on.

Cheers to all,

V
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby P5-133XL » Thu Oct 04, 2012 10:17 am

An SSD isn't going to help significantly, because the amount of time used to read or write to disk is insignificant. Checkpointing is also not going to help with the accuracy of TPFs or ETAs.

Since you are seeing variations in TPF, I would like to know whether you are using the estimated TPFs (and estimated ETAs) from v7's FAHControl, which others have observed can be relatively inaccurate. Accurate TPFs and ETAs can be calculated from the logs, or you can use a third-party monitoring program like HFM.net, which produces accurate TPFs and ETAs.
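
As a rough illustration of calculating TPF and ETA from the logs, here is a small Python sketch using the progress lines posted earlier in this thread. It is not how FAHControl or HFM.net do it; the function names are made up, and it assumes a frame is 1% of the total steps, as the log lines suggest.

```python
import re
from datetime import timedelta

PROGRESS = re.compile(r"^(\d\d):(\d\d):(\d\d):.*Completed (\d+) out of (\d+) steps")

def read_frames(log_text):
    """Return [(seconds_since_midnight, steps_done, total_steps)] per progress line."""
    rows = []
    for line in log_text.splitlines():
        m = PROGRESS.match(line)
        if m:
            h, mi, s, steps, total = map(int, m.groups())
            rows.append((h * 3600 + mi * 60 + s, steps, total))
    return rows

def tpf_and_eta(rows):
    """Average seconds per frame and the remaining time estimated from it.

    Only consecutive lines exactly one frame (1% of the steps) apart are
    used, which skips the partial frame logged after a resume.
    Returns (tpf_seconds, eta) or None if no full frame was found.
    """
    deltas = [
        (t2 - t1) % 86400                      # tolerate a midnight rollover
        for (t1, s1, total), (t2, s2, _) in zip(rows, rows[1:])
        if s2 - s1 == total // 100
    ]
    if not deltas:
        return None
    tpf = sum(deltas) / len(deltas)
    _, steps, total = rows[-1]
    frames_left = (total - steps) / (total // 100)
    return tpf, timedelta(seconds=round(tpf * frames_left))

LOG_LINES = """\
15:45:13:WU01:FS00:0xa4:Completed 1225000 out of 2500000 steps  (49%)
16:23:59:WU01:FS00:0xa4:Completed 1250000 out of 2500000 steps  (50%)
17:33:06:WU01:FS00:0xa4:Completed 1246750 out of 2500000 steps  (49%)
17:38:09:WU01:FS00:0xa4:Completed 1250000 out of 2500000 steps  (50%)
18:17:39:WU01:FS00:0xa4:Completed 1275000 out of 2500000 steps  (51%)
18:57:47:WU01:FS00:0xa4:Completed 1300000 out of 2500000 steps  (52%)
"""

tpf, eta = tpf_and_eta(read_frames(LOG_LINES))
print(timedelta(seconds=round(tpf)), eta)  # prints: 0:39:28 1 day, 7:34:24
```

Averaging over more frames smooths out the occasional slow frame. A pause that falls between two full-frame lines with no resume line in between would still inflate the average, so check the log for "Paused" entries before trusting the number.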
P5-133XL
 
Posts: 2948
Joined: Sun Dec 02, 2007 5:36 am
Location: Salem. OR USA

Re: Checkpointing apparently not working properly

Postby Valkyrie » Thu Oct 04, 2012 10:41 am

Hello P5-133XL,

Yes, as I said, just observing FAHControl I see these slowdowns in TPF on frame turnover, sometimes by a factor of 1.5 or more depending on the WU. This happens on machines that seldom get paused as well. That's why I thought the program was doing something other than folding at these times.

I'm just trying to get the best out of them, but we'll take what we can get....

Thanks

V

Edit: I only run FAHControl occasionally.
Valkyrie
 
Posts: 43
Joined: Sat Apr 14, 2012 10:03 pm
Location: Canada

Re: Checkpointing apparently not working properly

Postby bruce » Thu Oct 04, 2012 5:48 pm

As P5-133XL has said, the magnitude of the checkpoint interval does not change the speed of the WU -- though it may have something to do with how often the TPF/ETA are recalculated, depending partly on which FahCore is used by the project.

You said SMP on a 2C/4T dedicated machine. If you're running SMP:4, there's a good chance that the speed is actually changing because something else is using the CPU. The SMP calculations are particularly sensitive to computational distractions. That may be as simple as a virus scan or a disk check or even an index rebuild for the find function. Windows schedules activities like that for background processing, and ordinarily people don't notice that they're running. Even some of the fancier screensavers can chew up enough resources to slow down SMP. (You'd probably notice if you're running FAH-GPU on an AMD GPU or some other distributed computing project.)

Consider adjusting SMP:4 to be SMP:3. You should allocate at least the number of cores your CPU has to SMP, but allocating additional threads is much less significant. (Adding the threads to FAH-SMP makes a relatively small contribution to speed, but whatever you allocate must be readily available.) Eliminating those background tasks is a reasonable approach, but so is leaving a free thread or two for them to use.
bruce
 
Posts: 20140
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.
