Fatal error on step 0

Fatal error on step 0

Postby RHN » Wed Mar 02, 2011 4:39 pm

Perhaps this is my fault, but I'm not sure. I had to restart my machine, so I stopped the client with Ctrl+C, came back a few minutes later, restarted it, and thought I was good to go. But then I saw this:

Code: Select all
[ryanhn@neptune folding]$ ./fah6

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

8 cores detected


--- Opening Log file [March 2 16:17:20 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/ryanhn/folding
Executable: ./fah6
Arguments: -smp -bigadv -verbosity 9

[16:17:20] - Ask before connecting: No
[16:17:20] - User name: RHN (Team 199966)
[16:17:20] - User ID: 7C216A345C507C3F
[16:17:20] - Machine ID: 1
[16:17:20]
[16:17:20] Loaded queue successfully.
[16:17:20]
[16:17:20] + Processing work unit
[16:17:20] Core required: FahCore_a5.exe
[16:17:20] - Autosending finished units... [March 2 16:17:20 UTC]
[16:17:20] Core found.
[16:17:20] Trying to send all finished work units
[16:17:20] + No unsent completed units remaining.
[16:17:20] - Autosend completed
[16:17:20] Working on queue slot 01 [March 2 16:17:20 UTC]
[16:17:20] + Working ...
[16:17:20] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 01 -np 8 -checkpoint 15 -verbose -lifeline 2312 -version 634'

[16:17:20]
[16:17:20] *------------------------------*
[16:17:20] Folding@Home Gromacs SMP Core
[16:17:20] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[16:17:20]
[16:17:20] Preparing to commence simulation
[16:17:20] - Looking at optimizations...
[16:17:20] - Files status OK
[16:17:23] - Expanded 24861639 -> 30796292 (decompressed 123.8 percent)
[16:17:23] Called DecompressByteArray: compressed_data_size=24861639 data_size=30796292, decompressed_data_size=30796292 diff=0
[16:17:23] - Digital signature verified
[16:17:23]
[16:17:23] Project: 6901 (Run 9, Clone 14, Gen 1)
[16:17:23]
[16:17:23] Assembly optimizations on if available.
[16:17:23] Entering M.D.
[16:17:29] Using Gromacs checkpoints
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

[16:17:31] Mapping NT from 8 to 8
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 70, software version 73
Starting 8 threads

Reading checkpoint file work/wudata_01.cpt generated: Wed Mar  2 09:11:07 2011


Making 1D domain decomposition 8 x 1 x 1

-------------------------------------------------------
Program Gromacs, VERSION 4.5.3
Source code file: /vspm58/VM/fah-converted/mnt/fah_windows_build/LinuxBuilds/gromacs-4.5.3/src/kernel/md.c, line: 1539

Fatal error:
Checkpoint error on step 0

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day
: No such process
[16:17:34] fcSaveRestoreState: I/O failed dir=0, var=00007F033F70D8E0, varsize=20
[16:17:34] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[16:17:34] mdrun returned 3
[16:17:34] Gromacs detected an invalid checkpoint.  Restarting...starting mdrun 'SINGLE VESICLE in water'
500000 steps,   2000.0 ps (continuing from step 386296,   1545.2 ps).
fcSaveRestoreState: I/O failed dir=0, var=00007F0340F108E0, varsize=20
[16:17:35] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[16:17:35] Can't open checkpoint file
[16:17:35] ResumiCan't open checkpoint file
[16:17:35] Can't open checkpoint file
[16:17:35] Can't open checkpoint file
[16:17:35] Can't open checkpoint file
[16:17:35] Can't open checkpoint file
[16:22:58]
[16:22:58] Folding@home Core Shutdown: UNKNOWN_ERROR
[16:22:59] CoreStatus = 62 (98)
[16:22:59] + Restarting core (settings changed)
[16:22:59]
[16:22:59] + Processing work unit
[16:22:59] Core required: FahCore_a5.exe
[16:22:59] Core found.
[16:22:59] Working on queue slot 01 [March 2 16:22:59 UTC]
[16:22:59] + Working ...
[16:22:59] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 01 -np 8 -checkpoint 15 -notermcheck -verbose -lifeline 2312 -version 634'

[16:22:59]
[16:22:59] *------------------------------*
[16:22:59] Folding@Home Gromacs SMP Core
[16:22:59] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[16:22:59]
[16:22:59] Preparing to commence simulation
[16:22:59] - Looking at optimizations...
[16:22:59] - Not checking prior termination.
[16:23:01] - Expanded 24861639 -> 30796292 (decompressed 123.8 percent)
[16:23:01] Called DecompressByteArray: compressed_data_size=24861639 data_size=30796292, decompressed_data_size=30796292 diff=0
[16:23:01] - Digital signature verified
[16:23:01]
[16:23:01] Project: 6901 (Run 9, Clone 14, Gen 1)
[16:23:01]
[16:23:01] Assembly optimizations on if available.
[16:23:01] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
[16:23:08] Mapping NT from 8 to 8
Note: tpx file_version 70, software version 73
Starting 8 threads
Making 1D domain decomposition 8 x 1 x 1
starting mdrun 'SINGLE VESICLE in water'
500000 steps,   2000.0 ps (continuing from step 250000,   1000.0 ps).
[16:23:10] Completed 0 out of 250000 steps  (0%)


It recognized that I was at step 386,296, which was correct from before my restart, so I don't know why it failed after that. I'm running Fedora 14 (Linux 2.6.35.11-x64) on an i7-875K with 8 GB of RAM. I'd already completed a -bigadv unit in Linux before this one, and have completed dozens in Windows.

Did I make some big mistake?
RHN
 
Posts: 22
Joined: Tue Feb 23, 2010 2:04 pm

Re: Fatal error on step 0

Postby tear » Wed Mar 02, 2011 4:42 pm

Nope, I've seen failures with checkpoint resumption in A5 too.
One man's ceiling is another man's floor.
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Fatal error on step 0

Postby linuxfah » Wed Mar 02, 2011 9:33 pm

The checkpoint-resume issue was fairly common with the previous cores (A2/A3) in Linux after shutting down a work unit midway. It may come from shutting down the client while a checkpoint is being written, or possibly from a file-system sync issue. I believe what I did before was write a script that checked the timestamp of the previous checkpoint and waited long enough for the next checkpoint to be written before shutting down the client.
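
A minimal sketch of what such a script could look like, assuming the client runs from its own directory, the work unit sits in queue slot 01, and checkpoints are written every 15 minutes (all of these are assumptions; adjust for your setup):

Code: Select all
#!/bin/bash
# Stop fah6 only after a fresh checkpoint has landed on disk.
# Assumptions: run from the client directory, queue slot 01, 15-min checkpoints.
CPT=work/wudata_01.cpt
INTERVAL=$((15 * 60))                  # checkpoint interval in seconds

last=$(stat -c %Y "$CPT")              # mtime of the most recent checkpoint
age=$(( $(date +%s) - last ))

# If the next checkpoint is due within a minute (or overdue), wait for it.
if [ "$age" -ge $((INTERVAL - 60)) ]; then
    while [ "$(stat -c %Y "$CPT")" -eq "$last" ]; do
        sleep 5
    done
    sleep 10                           # give the write time to complete
fi

kill -INT "$(pidof fah6)"              # request a clean client shutdown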
linuxfah
 
Posts: 333
Joined: Tue Sep 22, 2009 9:28 pm

Re: Fatal error on step 0

Postby tear » Wed Mar 02, 2011 9:47 pm

If a clean shutdown is involved (or no shutdown at all, for that matter), "sync" isn't really applicable.

I think the problem arises from signal handling in the FahCore. If you send a SIGINT
directly to the client (i.e. to the fah6 PID), A5 gives you this:

Code: Select all
Received the TERM signal, stopping at the next NS step


Whereas if you hit ^C in the terminal, the signal gets handled twice and
you see yet another message (I don't have it at hand here) saying that
INT/TERM was received "again". Perhaps handling that second signal
affects the checkpoint-writing process?
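
For context on why ^C behaves differently: the terminal delivers SIGINT to the entire foreground process group, so both fah6 and its FahCore_a5.exe child receive it, while kill targets a single PID. A minimal sketch of both (the process names are taken from the logs above; the rest is an assumption):

Code: Select all
# tear's approach: signal only the client, so the core is told to stop once.
kill -INT "$(pidof fah6)"

# What ^C effectively does: SIGINT to the whole foreground process group
# (client and core alike). Reproducing it by hand:
pgid=$(ps -o pgid= -p "$(pidof fah6)" | tr -d ' ')
kill -s INT -- "-$pgid"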
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Fatal error on step 0

Postby gwildperson » Thu Mar 03, 2011 1:51 am

tear wrote:Whereas if you hit ^C in the terminal, the signal gets handled twice and
you see yet another message (I don't have it at hand here) saying that
INT/TERM was received "again". Perhaps handling that second signal
affects the checkpoint-writing process?

Is this from your own personal experience, or from somebody else's report? One thing that might have happened is that the person typed ^C, which was accepted, but no immediate response was visible because processing continued to the next full NS step before the core would write a checkpoint and shut down. If the person got impatient and hit ^C again, it could have been interpreted as "Oops, apparently the normal shutdown procedure is not working; abort writing the checkpoint and shut down NOW." (I'm not saying that is what happened, just suggesting that it might be.) Many on-line bill-paying sites carry a distinct warning, "Press the submit button ONLY ONCE...", because processing takes long enough that a person may get impatient.
gwildperson
 
Posts: 785
Joined: Tue Dec 04, 2007 8:36 pm

Re: Fatal error on step 0

Postby tear » Thu Mar 03, 2011 2:07 am

Yes, this is my own genuine observation.

Hitting ^C eventually gives this (in 100% of cases at my end):
Code: Select all
Received the second INT/TERM signal, stopping at the next step

You can try to reproduce it at your end as well.

Correlation of this behaviour with checkpoint corruption is no more than a hypothesis
at this point. Though, OTOH, I've only been killing the client PID with SIGINT for a while
now and no corruption has occurred.
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Fatal error on step 0

Postby tear » Thu Mar 03, 2011 2:19 am

Linuxfah, you might be right here. I've tried to reproduce the issue (on purpose) to no avail. Perhaps it is a matter of a ^C vs. checkpoint-storing race?
I'll closely examine timestamps the next time the corruption occurs.

It also seems that A5 doesn't make an explicit checkpoint at exit, which pretty much invalidates my theory.
EDIT: update: it does, but not every time; weird. It's dinner time…
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Fatal error on step 0

Postby linuxfah » Thu Mar 03, 2011 3:28 am

tear wrote:Linuxfah, you might be right here. I've tried to reproduce the issue (on purpose) to no avail. Perhaps it is a matter of a ^C vs. checkpoint-storing race?
I'll closely examine timestamps the next time the corruption occurs.

It also seems that A5 doesn't make an explicit checkpoint at exit, which pretty much invalidates my theory.
EDIT: update: it does, but not every time; weird. It's dinner time…


I shut my work unit down twice, and in both cases the client wrote a checkpoint just before exiting. The exit was with ^C. Otherwise the checkpoint appears to be written every 15 minutes. I'll try to time the shutdown for exactly when the checkpoint is being written, to see whether the work unit fails. I'll post my results.
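
A minimal sketch of that timing experiment, using the slot number and process name from the logs above (the tight polling interval is my own choice):

Code: Select all
#!/bin/bash
# Send SIGINT the instant the checkpoint file changes, to probe the
# suspected ^C-vs-checkpoint-write race.
CPT=work/wudata_01.cpt
last=$(stat -c %Y "$CPT")

while [ "$(stat -c %Y "$CPT")" -eq "$last" ]; do
    sleep 0.2          # poll tightly so the signal lands mid-write
done
kill -INT "$(pidof fah6)"
echo "SIGINT sent while $CPT was (probably) still being written"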
linuxfah
 
Posts: 333
Joined: Tue Sep 22, 2009 9:28 pm

Re: Fatal error on step 0

Postby tear » Thu Mar 03, 2011 4:53 am

linuxfah wrote:I shut my work unit down twice, and in both cases the client wrote a checkpoint just before exiting.

How long were your runs? I've noticed that short runs (3 minutes, maybe) do not yield a checkpoint at exit.

On a side note -- shortening the checkpoint interval might expedite the process (just a thought).
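
For anyone following along: the v6 client sets the checkpoint interval through its configuration dialog, something like the following (the exact prompt wording is from memory and may differ between client versions):

Code: Select all
$ ./fah6 -configonly
...
Interval, in minutes, between checkpoints (3-30) [15]? 4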
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Fatal error on step 0

Postby linuxfah » Thu Mar 03, 2011 3:30 pm

The first time I stopped the client was after around 21 hours of runtime; the second time was at 10 minutes, and the third at 15 minutes. I'll try shortening the checkpoint interval. I did not get much chance to experiment yesterday, but I'll hopefully have some more time today.
linuxfah
 
Posts: 333
Joined: Tue Sep 22, 2009 9:28 pm

Re: Fatal error on step 0

Postby linuxfah » Sat Mar 05, 2011 10:09 pm

After adjusting the checkpoint interval to 4 minutes and restarting multiple times, I was able to get my work unit to crash:

Code: Select all

Making 1D domain decomposition 8 x 1 x 1

-------------------------------------------------------
Program Gromacs, VERSION 4.5.3
Source code file: /vspm58/VM/fah-converted/mnt/fah_windows_build/LinuxBuilds/gromacs-4.5.3/src/kernel/md.c, line: 1539

Fatal error:
Checkpoint error on step 0

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day
: No such process
[22:06:33] fcSaveRestoreState: I/O failed dir=0, var=00007F1E971CB8E0, varsize=20
[22:06:33] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[22:06:33] fcSaveRestoreState: I/O failed dir=0, var=00007F1E951C78E0, varsize=20
[22:06:33] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[22:06:34] fcSaveRestoreState: I/O failed dir=0, var=00007F1E969CA8E0, varsize=20
[22:06:34] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[22:06:34] fcSaveRestoreState: I/O failed dir=0, var=00007F1E979CC8E0, varsize=20
[22:06:34] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
starting mdrun 'SINGLE VESICLE in water'
16000001 steps,  64000.0 ps (continuing from step 15817981,  63271.9 ps).
[22:06:35] fcSaveRestoreState: I/O failed dir=0, var=00007F1E959C88E0, varsize=20
[22:06:35] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[22:06:35] mdrun returned 3
[22:06:35] Gromacs detected an invalid checkpoint.  Restarting...fcCheckPointResume: file hashes different -- aborting.
[22:06:35] Can't open checkpoint file
[22:06:35] Resuming from checkpoint
[22:06:35] Can't open checkpoint file
[22:06:35]
[22:06:35] Folding@home Core Shutdown: UNKNOWN_ERROR
[22:06:35] CoreStatus = 62 (98)
[22:06:35] + Restarting core (settings changed)
[22:06:35]
[22:06:35] + Processing work unit
[22:06:35] Core required: FahCore_a5.exe
[22:06:35] Core found.
[22:06:35] Working on queue slot 09 [March 5 22:06:35 UTC]
[22:06:35] + Working ...


It took quite a few tries to make this happen, though, and the client did attempt to restart the work unit after the failure. Core A5 seems more stable overall than what I saw with Core A2/A3.

Edit: Also, one of the more recent times I stopped the client, it did not write a checkpoint before exit. It does write the checkpoint before exit in most cases, though.
linuxfah
 
Posts: 333
Joined: Tue Sep 22, 2009 9:28 pm

Re: Fatal error on step 0

Postby bruce » Sun Mar 06, 2011 2:24 am

How about posting FAHlog.txt? It's the one that collects data from several sources and adds timestamps. The logs you're posting have a lot of data that's only useful to the scientists, and they're missing some other important information.
bruce
Site Admin
 
Posts: 20180
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Fatal error on step 0

Postby tear » Sun Mar 06, 2011 5:05 am

Got another one (slightly different, though). I have the complete (albeit corrupted) client directory if anyone's interested. The client was stopped with ^C from the keyboard.
Code: Select all
[04:47:35] Project: 6901 (Run 17, Clone 13, Gen 3)
[04:47:35]
[04:47:35] Assembly optimizations on if available.
[04:47:35] Entering M.D.
[04:47:41] Using Gromacs checkpoints
[04:47:45] Mapping NT from 48 to 48
[04:47:50] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA6FE98E0, varsize=20
[04:47:50] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:51] fcSaveRestoreState: I/O failed dir=0, var=00007FFFABFF38E0, varsize=20
[04:47:51] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:52] fcSaveRestoreState: I/O failed dir=0, var=00007FFFCD7F68E0, varsize=20
[04:47:52] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:53] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAC7F48E0, varsize=20
[04:47:53] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:54] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA77EA8E0, varsize=20
[04:47:54] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:56] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAE7F88E0, varsize=20
[04:47:56] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:57] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA9FEF8E0, varsize=20
[04:47:57] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:58] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA97EE8E0, varsize=20
[04:47:58] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:47:59] fcSaveRestoreState: I/O failed dir=0, var=00007FFFEBFFB8E0, varsize=20
[04:47:59] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:00] fcSaveRestoreState: I/O failed dir=0, var=00007FFFD67F88E0, varsize=20
[04:48:00] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:01] fcSaveRestoreState: I/O failed dir=0, var=00007FFFF0D338E0, varsize=20
[04:48:01] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:02] fcSaveRestoreState: I/O failed dir=0, var=00007FFFF1D358E0, varsize=20
[04:48:02] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:04] fcSaveRestoreState: I/O failed dir=0, var=00007FFFE8FF58E0, varsize=20
[04:48:04] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:05] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAB7F28E0, varsize=20
[04:48:05] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:06] fcSaveRestoreState: I/O failed dir=0, var=00007FFFCEFF98E0, varsize=20
[04:48:06] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:07] fcSaveRestoreState: I/O failed dir=0, var=00007FFFEAFF98E0, varsize=20
[04:48:07] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:08] fcSaveRestoreState: I/O failed dir=0, var=00007FFFD57F68E0, varsize=20
[04:48:08] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:09] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA8FED8E0, varsize=20
[04:48:09] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:10] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA5FE78E0, varsize=20
[04:48:10] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:12] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA67E88E0, varsize=20
[04:48:12] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:13] fcSaveRestoreState: I/O failed dir=0, var=00007FFFEA7F88E0, varsize=20
[04:48:13] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:14] fcSaveRestoreState: I/O failed dir=0, var=00007FFFCF7FA8E0, varsize=20
[04:48:14] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:15] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAAFF18E0, varsize=20
[04:48:15] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:16] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAEFF98E0, varsize=20
[04:48:16] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:17] fcSaveRestoreState: I/O failed dir=0, var=00007FFFD7FFB8E0, varsize=20
[04:48:17] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:19] fcSaveRestoreState: I/O failed dir=0, var=00007FFFEB7FA8E0, varsize=20
[04:48:19] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:19] Resuming from checkpoint
[04:48:20] fcSaveRestoreState: I/O failed dir=0, var=00007FFFF4535D30, varsize=20
[04:48:20] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:21] fcSaveRestoreState: I/O failed dir=0, var=00007FFFF2D378E0, varsize=20
[04:48:21] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:22] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAA7F08E0, varsize=20
[04:48:22] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:23] fcSaveRestoreState: I/O failed dir=0, var=00007FFFE97F68E0, varsize=20
[04:48:23] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:24] fcSaveRestoreState: I/O failed dir=0, var=00007FFFCE7F88E0, varsize=20
[04:48:24] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:25] fcSaveRestoreState: I/O failed dir=0, var=00007FFFA7FEB8E0, varsize=20
[04:48:25] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:27] fcSaveRestoreState: I/O failed dir=0, var=00007FFFE9FF78E0, varsize=20
[04:48:27] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:28] fcSaveRestoreState: I/O failed dir=0, var=00007FFFCCFF58E0, varsize=20
[04:48:28] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:29] fcSaveRestoreState: I/O failed dir=0, var=00007FFFF15348E0, varsize=20
[04:48:29] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:30] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAF7FA8E0, varsize=20
[04:48:30] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:31] fcSaveRestoreState: I/O failed dir=0, var=00007FFFACFF58E0, varsize=20
[04:48:31] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:32] fcSaveRestoreState: I/O failed dir=0, var=00007FFFD6FF98E0, varsize=20
[04:48:32] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:34] fcSaveRestoreState: I/O failed dir=0, var=00007FFFD5FF78E0, varsize=20
[04:48:34] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:35] fcSaveRestoreState: I/O failed dir=0, var=00007FFFAFFFB8E0, varsize=20
[04:48:35] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:36] fcSaveRestoreState: I/O failed dir=0, var=00007FFFD4FF58E0, varsize=20
[04:48:36] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:37] fcSaveRestoreState: I/O failed dir=0, var=00007FFFF25368E0, varsize=20
[04:48:37] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:38] fcSaveRestoreState: I/O failed dir=0, var=00007FFFCFFFB8E0, varsize=20
[04:48:38] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[04:48:38] mdrun returned 3
[04:48:38] Gromacs detected an invalid checkpoint.  Restarting...fcCheckPointResume: file hashes different -- aborting.
[04:48:38] Can't open checkpoint file
[04:48:38] Can't open checkpoint file
[04:48:38] Can't open checkpoint file
[04:48:38] Can't open checkpoint file
[04:48:38]
[04:48:38] Folding@home Core Shutdown: UNKNOWN_ERROR
[04:48:39] CoreStatus = 62 (98)
[04:48:39] + Restarting core (settings changed)
[04:48:39]
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Fatal error on step 0

Postby linuxfah » Sun Mar 06, 2011 5:29 am

Those errors look similar to what I saw with my client. Perhaps these log files would be helpful to the developers.

I can try using -verbosity 9 next time to see if there is any other debug information available.
linuxfah
 
Posts: 333
Joined: Tue Sep 22, 2009 9:28 pm

Re: Fatal error on step 0

Postby tear » Wed Mar 16, 2011 5:41 pm

Two more observations (I've run into the issue several more times since):
1. All the failures I experienced show the ckp file being 20 bytes short (75140 vs. 75160 bytes);
  this is consistent with the error message (fcSaveRestoreState: I/O failed dir=0, var=00007FFFA6FE98E0, varsize=20), where dir == 0 means "reading" and dir == 1 means "writing".
2. Re:
linuxfah wrote:Edit: Also, one of the more recent times I stopped the client, it did not write a checkpoint before exit. It does write the checkpoint before exit in most cases, though.

As part of a different endeavor I traced client shutdowns a number of times. It appears that the no-checkpoint scenarios are consistent with one of the FahCore threads raising SIGABRT (the application calling abort(3)).
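
A quick way to check a suspect unit for the same truncation (the exact file name under work/ is an assumption on my part; tear only says "the ckp file", and the 75160-byte figure is specific to this work unit):

Code: Select all
# Corrupted checkpoints in tear's failures were 20 bytes short (75140 vs 75160).
$ stat -c '%s %n' work/wudata_01.ckp
75140 work/wudata_01.ckp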
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains
