What about checkpoints?


iceman1992
Posts: 527
Joined: Fri Mar 23, 2012 5:16 pm

What about checkpoints?

Post by iceman1992 »

bruce wrote:The fundamental concept is that you need to record all the information about the current state of the protein. That's a list of numbers. They have to describe the exact condition of everything that's happening, so that if you need to interrupt the folding process, you can later restart it at the exact point where you stopped and continue processing as if it had never been interrupted.
(A bit off topic) How many checkpoints does the client keep? I mean, if the latest checkpoint somehow gets corrupted, can it roll back to the previous checkpoint? And if I pause the client, why doesn't it take a snapshot right at that moment so I don't lose any progress?
bruce wrote:If you turn that list of numbers into a series of graphic images and run them as a video, you'll see a movie of the folding process, so in that sense, a checkpoint is similar to a real snapshot but it actually contains more information.
Speaking of the folding process movie, any new developments on the FAHViewer?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Runs, Clones, Gens

Post by bruce »

The client doesn't actually keep the checkpoints, the FahCore does. That's significant because various projects may use various FahCores, and therefore the answer can be different for different projects.

Nevertheless, the short answer is: one.

In the past, that was the only answer. A couple of recent cores have started keeping two checkpoints, but the second one isn't used yet: the code that would restart from the previous checkpoint when the primary one is found to be corrupt hasn't been written. Nevertheless, there's hope for a future version.

There's a certain amount of data that is also retained and uploaded in the final result but it's probably not the entire contents of each checkpoint -- just enough to contain the important scientific data plus whatever is needed to resume work from the final state.
Joe_H
Site Admin
Posts: 7878
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Runs, Clones, Gens

Post by Joe_H »

As for which cores keep two checkpoints, from looking at the work directories I know the SMP A3 and A4 cores do. Since the A5 core was developed from the A3 code, I would assume it does as well. Hopefully they will add the code or settings in the near future to use the secondary checkpoint if the primary is corrupt. I don't run into that often, but only losing 15 minutes of processing time versus starting over from the beginning would be far preferable.
iceman1992
Posts: 527
Joined: Fri Mar 23, 2012 5:16 pm

Re: Runs, Clones, Gens

Post by iceman1992 »

How much processing time is used when writing a checkpoint? If it is negligible, can't it write checkpoints more often (say, every time I pause the client)? That way I can pause and not lose any work.
jimerickson
Posts: 533
Joined: Tue May 27, 2008 11:56 pm
Hardware configuration: Parts:
Asus H370 Mining Master motherboard (X2)
Patriot Viper DDR4 memory 16gb stick (X4)
Nvidia GeForce GTX 1080 gpu (X16)
Intel Core i7 8700 cpu (X2)
Silverstone 1000 watt psu (X4)
Veddha 8 gpu miner case (X2)
Thermaltake hsf (X2)
Ubit riser card (X16)
Location: ames, iowa

Re: Runs, Clones, Gens

Post by jimerickson »

It's not about processing time; I believe it's about disk activity. Writing lots of checkpoints requires lots of disk activity. That's why it defaults to 15 minutes. At least that's how I understand it.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Runs, Clones, Gens

Post by bruce »

The time to write a checkpoint depends on several factors, but it's not a large number so you can add more frequent checkpoints if you choose to.

The biggest disadvantage of too many checkpoints is that there's always a small risk of checkpoint corruption every time you stop and resume a WU. If you do that too often, you'll lose some work to corrupted checkpoints. This seems to be related to how quickly you either restart FAH or you shut down your computer. The active WU does take some time to shut down.
iceman1992
Posts: 527
Joined: Fri Mar 23, 2012 5:16 pm

Re: What about checkpoints?

Post by iceman1992 »

jimerickson wrote:It's not about processing time; I believe it's about disk activity. Writing lots of checkpoints requires lots of disk activity. That's why it defaults to 15 minutes. At least that's how I understand it.
Well other than the risk of corrupting the checkpoint that bruce mentioned, is there any other reason not to shorten the interval?
bruce wrote:The biggest disadvantage of too many checkpoints is that there's always a small risk of checkpoint corruption every time you stop and resume a WU.
That's why we need the fahcore to keep more than 1 checkpoint :roll:
bruce wrote:This seems to be related to how quickly you either restart FAH or you shut down your computer. The active WU does take some time to shut down.
I always make sure the log says "Interrupted" and I wait for the log to finish showing things before turning off my computer. Does that reduce the risk?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: What about checkpoints?

Post by bruce »

iceman1992 wrote:
bruce wrote:This seems to be related to how quickly you either restart FAH or you shut down your computer. The active WU does take some time to shut down.
I always make sure the log says "Interrupted" and I wait for the log to finish showing things before turning off my computer. Does that reduce the risk?
Yes. (I don't remember the last time I had a corrupt checkpoint, but I rarely shut down and I'm careful to wait.)
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: What about checkpoints?

Post by 7im »

iceman1992 wrote:Well other than the risk of corrupting the checkpoint that bruce mentioned, is there any other reason not to shorten the interval?
Yes, back in the day, I tested checkpoints at 3 minutes, and checkpoints at 30 minutes. Over a full day, it can add one to several minutes (depending on WU size) if you write a lot of checkpoints. Minutes add up to a long time when you are folding thousands of WUs. If you don't stop and start your client very often, then you don't need checkpoints every 3 minutes. 15 is the recommended setting for a reason.
iceman1992 wrote:That's why we need the fahcore to keep more than 1 checkpoint :roll:
And that's why we've had that as a feature request for a VERY long time. :roll: Gromacs only recently added support for that feature. FAH has yet to incorporate it.
Joe_H
Site Admin
Posts: 7878
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: What about checkpoints?

Post by Joe_H »

iceman1992 wrote:I always make sure the log says "Interrupted" and I wait for the log to finish showing things before turning off my computer. Does that reduce the risk?
It helps with some FAHcores, but the log entry does not actually have a direct relationship with whether a checkpoint was written correctly. Beyond that, from what I have noticed over the years of folding, different cores write checkpoints based on different criteria. Most follow the setting in the client, i.e. every 15 minutes; others write a checkpoint after a set number of steps. Some WUs and cores will correctly process an interrupt and write out a checkpoint before exiting, and others don't.

To be more certain on whether a checkpoint has been done and written to disk you need to look into the work directory for the WU and watch the last modification time of the checkpoint file. Since files are first written to memory cache from the running core, you do need to add a bit of time to be sure the entire file has been flushed to disk. That extra bit of time will depend on which OS you are folding on; Linux, Windows and OS X all handle flushing the cached file writes slightly differently and OS settings can modify that action. But generally waiting an extra minute or two past the file modification time is enough.
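For anyone who wants to script that check, here is a minimal Python sketch, not anything that ships with FAH. The function names and the 120-second default are my own assumptions (the default simply encodes "an extra minute or two"). The checkpoint file name differs between cores, so the path has to be supplied by hand.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def probably_flushed(path, grace_seconds=120, now=None):
    """Heuristic only: True if the last modification is older than the
    grace period, i.e. the OS has likely had time to flush its cache.
    The 120-second default is an assumption, not an FAH setting."""
    return seconds_since_modified(path, now) >= grace_seconds
```

Call `probably_flushed()` on whichever file in the WU's work directory turns out to be the checkpoint before shutting down. An old modification time does not prove the data reached the disk; it only says the usual flush delay has passed.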
Jesse_V
Site Moderator
Posts: 2851
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: What about checkpoints?

Post by Jesse_V »

Joe_H's recommendation really is an effective way to see if the checkpoint has been written. I've rarely had a problem with corrupted checkpoints; it seems like a pretty rare event. But if you want to be absolutely sure that it doesn't occur for you, use that method.
iceman1992
Posts: 527
Joined: Fri Mar 23, 2012 5:16 pm

Re: What about checkpoints?

Post by iceman1992 »

7im wrote: Yes, back in the day, I tested checkpoints at 3 minutes, and checkpoints at 30 minutes. Over a full day, it can add one to several minutes (depending on WU size) if you write a lot of checkpoints. Minutes add up to a long time when you are folding thousands of WUs. If you don't stop and start your client very often, then you don't need checkpoints every 3 minutes. 15 is the recommended setting for a reason.
But if the client is paused a lot, minutes of lost work add up too. Thanks for the info, I'll keep it at 15 for now.
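To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the per-checkpoint write cost is a made-up parameter, and the lost-work estimate only applies to a core that does not write a checkpoint when interrupted, with the pause landing at a random point between checkpoints (so half an interval is lost on average).

```python
def daily_cost_minutes(interval_min, pauses_per_day, write_seconds):
    """Return (write_overhead, lost_on_pause) in minutes per day.

    write_seconds is a hypothetical fixed cost per checkpoint write.
    Each pause is assumed to lose half an interval of work on average.
    """
    checkpoints_per_day = 24 * 60 / interval_min
    write_overhead = checkpoints_per_day * write_seconds / 60
    lost_on_pause = pauses_per_day * interval_min / 2
    return write_overhead, lost_on_pause


# Compare the minimum, default and a long interval for two pauses a day,
# assuming (hypothetically) half a second per checkpoint write.
for interval in (3, 15, 30):
    overhead, lost = daily_cost_minutes(interval, pauses_per_day=2, write_seconds=0.5)
    print(f"{interval:>2} min: {overhead:.1f} min writing, {lost:.1f} min lost to pauses")
```

With these made-up inputs the shorter interval wins as soon as the client is paused more than once or twice a day, but the sketch ignores the corruption risk discussed elsewhere in the thread.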
Joe_H wrote:To be more certain on whether a checkpoint has been done and written to disk you need to look into the work directory for the WU and watch the last modification time of the checkpoint file. Since files are first written to memory cache from the running core, you do need to add a bit of time to be sure the entire file has been flushed to disk. That extra bit of time will depend on which OS you are folding on; Linux, Windows and OS X all handle flushing the cached file writes slightly differently and OS settings can modify that action. But generally waiting an extra minute or two past the file modification time is enough.
How much time is generally required to write a checkpoint?
bruce wrote:Yes. (I don't remember the last time I had a corrupt checkpoint, but I rarely shut down and I'm careful to wait.)
Well I don't shut down often either. I usually hibernate. :D

To Jesse_V and Joe_H, where do I find the work directory on Linux? Tried and couldn't find it.
Joe_H
Site Admin
Posts: 7878
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: What about checkpoints?

Post by Joe_H »

iceman1992 wrote:How much time is generally required to write a checkpoint?
From the small to medium sized WUs that I have been doing recently, just a few seconds. But that is written to cache in system RAM. It may take a bit longer when dealing with large and bigadv WUs. The vulnerable time is between when the checkpoint is written to RAM and when the entire contents are flushed to the hard drive's media. There is also caching going on in the drive between the interface and the media, which depends on the drive settings and can also be an issue sometimes. Some of the file metadata is already updated at the time the checkpoint is written and in cache.

As for the time before cache contents are flushed to disk, as I said, that depends. A recent version of OS X, for instance, had a system process that once a minute flushed all cached writes that were still pending. I have not checked if that has been changed in current versions. Windows and Linux have similar processes; it has just been a while since I looked up what their defaults were. With Linux, the choice of filesystem type and its settings will vary that even more.

The gap in time between a checkpoint write and it being fully flushed to disk is another reason to not do them too frequently. That increases the chance that you will interrupt and corrupt the checkpoint. To use the current minimum setting of 3 minutes as an example, each time it is done there would be a period of up to a minute, on that version of OS X I mentioned, where it was not on disk. Even if the average was 30 seconds, with 20 an hour that adds up to potentially 1/6th of the time you could corrupt the checkpoint. With 15 minutes it is reduced to about 1 in 30. Anyways, when I do check first before shutting down folding or my machine, I have yet to get a corrupted checkpoint. I have had a few corrupted when I did not check.
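The arithmetic behind those fractions can be checked with a few lines of Python. This is only a worked example of the estimate above; the function name is made up, and the 30-second window is the assumed average from the paragraph, not a measured value.

```python
from fractions import Fraction


def vulnerable_fraction(interval_minutes, window_seconds):
    """Fraction of wall-clock time during which the newest checkpoint
    may still be in cache rather than on disk."""
    return Fraction(window_seconds, interval_minutes * 60)


# 3 minutes is the minimum setting, 15 minutes the default.
for interval in (3, 15):
    frac = vulnerable_fraction(interval, 30)
    print(f"{interval:>2} min interval, 30 s window: {frac} of the time")
```

With a 30-second window this gives 1/6 for the 3-minute setting and 1/30 for the 15-minute default. A full one-minute window doubles both figures.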
iceman1992 wrote:To Jesse_V and Joe_H, where do I find the work directory on linux? Tried and couldn't find it.
It has been a while since I tried a Linux install in a VM, so no idea where current clients stick the work directory. At one time they used similar paths to what is used for OS X, but that may have changed.
iceman1992
Posts: 527
Joined: Fri Mar 23, 2012 5:16 pm

Re: What about checkpoints?

Post by iceman1992 »

I forgot to check last night. I was in a rush but fortunately the time it took me to type in the "sudo pm-hibernate" command and the password was enough :lol:. The checkpoint was not corrupted. That was like 6-7 seconds after hitting pause. I fold normal WUs, not big or bigadv. But yeah I'm keeping it at 15.
Joe_H wrote:It has been a while since I tried a Linux install in a VM, so no idea where current clients stick the work directory. At one time they used similar paths to what is used for OS X, but that may have changed.
Ah okay then I'll find it one day lol
Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Re: What about checkpoints?

Post by Stonecold »

iceman1992 wrote:
Joe_H wrote:It has been a while since I tried a Linux install in a VM, so no idea where current clients stick the work directory. At one time they used similar paths to what is used for OS X, but that may have changed.
Ah okay then I'll find it one day lol
For the *buntus, it is in the folder ".FAHClient" in your home directory (it's a hidden folder, so you'll have to turn on "show hidden files"). Under .FAHClient is "work", which is the work directory. .FAHClient also has all the configs, logs, cores, etc. I assume it's the same for other Linux distros but I'm not sure.
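If a file manager is not handy, a short Python sketch like the one below lists everything under that directory with its last-modified time, newest first. It assumes the ~/.FAHClient/work location described above; the function name is made up, and the script only reads file metadata.

```python
from datetime import datetime
from pathlib import Path


def list_work_files(work_dir):
    """Return (relative path, last-modified time) for every file under
    work_dir, newest first."""
    work_dir = Path(work_dir)
    entries = [
        (p.relative_to(work_dir), datetime.fromtimestamp(p.stat().st_mtime))
        for p in work_dir.rglob("*")
        if p.is_file()
    ]
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    # Location as described for the *buntus; other distros may differ.
    work = Path.home() / ".FAHClient" / "work"
    if work.is_dir():
        for rel, mtime in list_work_files(work):
            print(f"{mtime:%Y-%m-%d %H:%M:%S}  {rel}")
    else:
        print(f"No work directory found at {work}")
```

The file that keeps moving to the top of the list at roughly the checkpoint interval is a good candidate for the checkpoint, though which file that is depends on the core.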