Lost two 6901s at 50%&23%, resume from checkpoint problems

Postby Agencyman » Mon May 07, 2012 11:11 pm

Scenario:
6.34 console client, installed via Terminal the old-fashioned way. I wiped the drive because the new 7.x automated package did not work for me on the first two tries. I'll look at that later.

I had just finished a new Ubuntu 12.04 build, alone on a clean drive: dual hex-core processors at 3.6 GHz, with a TPF of about 12 min on these 6901 WUs. I have plenty of time before the deadline, so I shut the client down at night and whenever I need the machine for other things.

I always use [Ctrl+C] before exiting the terminal; that has always been reliable in the past.

I accepted the first loss as simple fate, but now a second WU has been wasted because the client failed to restart from the last checkpoint. I wonder whether other folks here have experienced this recently, which would indicate a core/WU issue, or whether it is all due to some glitch in my new Ubuntu install.

I intend to rebuild yet again (darn, all that time making my huge but perfect 3D Docky down the drain!) unless there are other systems with this problem. If there is a pattern, then I'll join that chase.

Bruce Hinton

Log:
This is a long log but a scan will show the points where resume worked, and the other points where disaster came for me.

Code:
--- Opening Log file [May 3 15:20:28 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[15:20:28] - Ask before connecting: No
[15:20:28] - User name: Agencyman (Team 196420)
[15:20:28] - User ID: 2F92F17A07EB3ACD
[15:20:28] - Machine ID: 15
[15:20:28]
[15:20:28] Loaded queue successfully.
[15:20:28]
[15:20:28] - Autosending finished units... [May 3 15:20:28 UTC]
[15:20:28] Trying to send all finished work units
[15:20:28] + No unsent completed units remaining.
[15:20:28] - Autosend completed
[15:20:28] + Processing work unit
[15:20:28] Core required: FahCore_a5.exe
[15:20:28] Core found.
[15:20:28] Working on queue slot 02 [May 3 15:20:28 UTC]
[15:20:28] + Working ...
[15:20:28] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 19551 -version 634'

[15:20:28]
[15:20:28] *------------------------------*
[15:20:28] Folding@Home Gromacs SMP Core
[15:20:28] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[15:20:28]
[15:20:28] Preparing to commence simulation
[15:20:28] - Ensuring status. Please wait.
[15:20:38] - Looking at optimizations...
[15:20:38] - Working with standard loops on this execution.
[15:20:38] - Previous termination of core was improper.
[15:20:38] - Files status OK
[15:20:40] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[15:20:40] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[15:20:40] - Digital signature verified
[15:20:40]
[15:20:40] Project: 6901 (Run 16, Clone 6, Gen 207)
[15:20:40]
[15:20:40] Entering M.D.
[15:20:47] Mapping NT from 24 to 24
[15:20:49] Completed 0 out of 250000 steps  (0%)
[15:32:44] Completed 2500 out of 250000 steps  (1%)
[15:44:43] Completed 5000 out of 250000 steps  (2%)
[15:56:47] Completed 7500 out of 250000 steps  (3%)
[16:08:54] Completed 10000 out of 250000 steps  (4%)
[16:18:49] ***** Got an Activate signal (2)
[16:18:49] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 3 18:01:11 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[18:01:11] - Ask before connecting: No
[18:01:11] - User name: Agencyman (Team 196420)
[18:01:11] - User ID: 2F92F17A07EB3ACD
[18:01:11] - Machine ID: 15
[18:01:11]
[18:01:11] Loaded queue successfully.
[18:01:11]
[18:01:11] + Processing work unit
[18:01:11] Core required: FahCore_a5.exe
[18:01:11] Core found.
[18:01:11] - Autosending finished units... [May 3 18:01:11 UTC]
[18:01:11] Trying to send all finished work units
[18:01:11] + No unsent completed units remaining.
[18:01:11] - Autosend completed
[18:01:11] Working on queue slot 02 [May 3 18:01:11 UTC]
[18:01:11] + Working ...
[18:01:11] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2739 -version 634'

[18:01:12]
[18:01:12] *------------------------------*
[18:01:12] Folding@Home Gromacs SMP Core
[18:01:12] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[18:01:12]
[18:01:12] Preparing to commence simulation
[18:01:12] - Looking at optimizations...
[18:01:12] - Files status OK
[18:01:14] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[18:01:14] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[18:01:14] - Digital signature verified
[18:01:14]
[18:01:14] Project: 6901 (Run 16, Clone 6, Gen 207)
[18:01:14]
[18:01:14] Assembly optimizations on if available.
[18:01:14] Entering M.D.
[18:01:20] Using Gromacs checkpoints
[18:01:22] Mapping NT from 24 to 24
[18:01:25] Resuming from checkpoint
[18:01:34] Verified work/wudata_02.log
[18:01:35] Verified work/wudata_02.trr
[18:01:35] Verified work/wudata_02.xtc
[18:01:35] Verified work/wudata_02.edr
[18:01:35] Completed 9370 out of 250000 steps  (3%)
[18:04:25] Completed 10000 out of 250000 steps  (4%)
[18:15:34] Completed 12500 out of 250000 steps  (5%)
[18:26:43] Completed 15000 out of 250000 steps  (6%)
[18:37:52] Completed 17500 out of 250000 steps  (7%)
[18:49:01] Completed 20000 out of 250000 steps  (8%)
[19:00:04] Completed 22500 out of 250000 steps  (9%)
[19:11:15] Completed 25000 out of 250000 steps  (10%)
[19:22:31] Completed 27500 out of 250000 steps  (11%)
[19:33:58] Completed 30000 out of 250000 steps  (12%)
[19:45:26] Completed 32500 out of 250000 steps  (13%)
[19:56:50] Completed 35000 out of 250000 steps  (14%)
[20:08:05] Completed 37500 out of 250000 steps  (15%)
[20:19:20] Completed 40000 out of 250000 steps  (16%)
[20:30:36] Completed 42500 out of 250000 steps  (17%)
[20:41:50] Completed 45000 out of 250000 steps  (18%)
[20:53:12] Completed 47500 out of 250000 steps  (19%)
[21:04:30] Completed 50000 out of 250000 steps  (20%)
[21:15:41] Completed 52500 out of 250000 steps  (21%)
[21:20:21] ***** Got an Activate signal (2)
[21:20:21] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 4 10:59:04 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[10:59:04] - Ask before connecting: No
[10:59:04] - User name: Agencyman (Team 196420)
[10:59:04] - User ID: 2F92F17A07EB3ACD
[10:59:04] - Machine ID: 15
[10:59:04]
[10:59:04] Loaded queue successfully.
[10:59:04]
[10:59:04] + Processing work unit
[10:59:04] Core required: FahCore_a5.exe
[10:59:04] - Autosending finished units... [May 4 10:59:04 UTC]
[10:59:04] Core found.
[10:59:04] Trying to send all finished work units
[10:59:04] + No unsent completed units remaining.
[10:59:04] - Autosend completed
[10:59:04] Working on queue slot 02 [May 4 10:59:04 UTC]
[10:59:04] + Working ...
[10:59:04] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2841 -version 634'

[10:59:04]
[10:59:04] *------------------------------*
[10:59:04] Folding@Home Gromacs SMP Core
[10:59:04] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[10:59:04]
[10:59:04] Preparing to commence simulation
[10:59:04] - Looking at optimizations...
[10:59:04] - Files status OK
[10:59:06] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[10:59:06] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[10:59:06] - Digital signature verified
[10:59:06]
[10:59:06] Project: 6901 (Run 16, Clone 6, Gen 207)
[10:59:06]
[10:59:06] Assembly optimizations on if available.
[10:59:06] Entering M.D.
[10:59:12] Using Gromacs checkpoints
[10:59:14] Mapping NT from 24 to 24
[10:59:26] Resuming from checkpoint
[10:59:29] Verified work/wudata_02.log
[10:59:29] Verified work/wudata_02.trr
[10:59:30] Verified work/wudata_02.xtc
[10:59:30] Verified work/wudata_02.edr
[10:59:30] Completed 53545 out of 250000 steps  (21%)
[11:06:03] Completed 55000 out of 250000 steps  (22%)
[11:17:12] Completed 57500 out of 250000 steps  (23%)
[11:28:26] Completed 60000 out of 250000 steps  (24%)
[11:39:57] Completed 62500 out of 250000 steps  (25%)
[11:51:27] Completed 65000 out of 250000 steps  (26%)
[12:03:18] Completed 67500 out of 250000 steps  (27%)
[12:14:50] Completed 70000 out of 250000 steps  (28%)
[12:26:14] Completed 72500 out of 250000 steps  (29%)
[12:37:42] Completed 75000 out of 250000 steps  (30%)
[12:49:12] Completed 77500 out of 250000 steps  (31%)
[12:55:55] ***** Got an Activate signal (2)
[12:55:55] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 4 14:33:21 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[14:33:21] - Ask before connecting: No
[14:33:21] - User name: Agencyman (Team 196420)
[14:33:21] - User ID: 2F92F17A07EB3ACD
[14:33:21] - Machine ID: 15
[14:33:21]
[14:33:21] Loaded queue successfully.
[14:33:21]
[14:33:21] - Autosending finished units... [May 4 14:33:21 UTC]
[14:33:21] Trying to send all finished work units
[14:33:21] + No unsent completed units remaining.
[14:33:21] - Autosend completed
[14:33:21] + Processing work unit
[14:33:21] Core required: FahCore_a5.exe
[14:33:21] Core found.
[14:33:21] Working on queue slot 02 [May 4 14:33:21 UTC]
[14:33:21] + Working ...
[14:33:21] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2769 -version 634'

[14:33:21]
[14:33:21] *------------------------------*
[14:33:21] Folding@Home Gromacs SMP Core
[14:33:21] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[14:33:21]
[14:33:21] Preparing to commence simulation
[14:33:21] - Looking at optimizations...
[14:33:21] - Files status OK
[14:33:23] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[14:33:23] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[14:33:23] - Digital signature verified
[14:33:23]
[14:33:23] Project: 6901 (Run 16, Clone 6, Gen 207)
[14:33:23]
[14:33:23] Assembly optimizations on if available.
[14:33:23] Entering M.D.
[14:33:29] Using Gromacs checkpoints
[14:33:31] Mapping NT from 24 to 24
[14:33:46] Resuming from checkpoint
[14:33:47] Verified work/wudata_02.log
[14:33:47] Verified work/wudata_02.trr
[14:33:47] Verified work/wudata_02.xtc
[14:33:47] Verified work/wudata_02.edr
[14:33:48] Completed 78965 out of 250000 steps  (31%)
[14:38:25] Completed 80000 out of 250000 steps  (32%)
[14:49:40] Completed 82500 out of 250000 steps  (33%)
[15:00:55] Completed 85000 out of 250000 steps  (34%)
[15:12:09] Completed 87500 out of 250000 steps  (35%)
[15:23:31] Completed 90000 out of 250000 steps  (36%)
[15:35:02] Completed 92500 out of 250000 steps  (37%)
[15:46:29] Completed 95000 out of 250000 steps  (38%)
[15:58:02] Completed 97500 out of 250000 steps  (39%)
[16:09:53] Completed 100000 out of 250000 steps  (40%)
[16:22:03] Completed 102500 out of 250000 steps  (41%)
[16:33:59] Completed 105000 out of 250000 steps  (42%)
[16:45:40] Completed 107500 out of 250000 steps  (43%)
[16:57:23] Completed 110000 out of 250000 steps  (44%)
[17:12:35] Completed 112500 out of 250000 steps  (45%)
[17:25:20] Completed 115000 out of 250000 steps  (46%)
[17:36:42] Completed 117500 out of 250000 steps  (47%)
[17:48:04] Completed 120000 out of 250000 steps  (48%)
[17:59:23] Completed 122500 out of 250000 steps  (49%)
[18:10:39] Completed 125000 out of 250000 steps  (50%)
[18:13:08] ***** Got an Activate signal (2)
[18:13:08] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 5 11:29:51 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[11:29:51] - Ask before connecting: No
[11:29:51] - User name: Agencyman (Team 196420)
[11:29:51] - User ID: 2F92F17A07EB3ACD
[11:29:51] - Machine ID: 15
[11:29:51]
[11:29:52] Loaded queue successfully.
[11:29:52]
[11:29:52] + Processing work unit
[11:29:52] Core required: FahCore_a5.exe
[11:29:52] Core found.
[11:29:52] - Autosending finished units... [11:29:52]
[11:29:52] Trying to send all finished work units
[11:29:52] + No unsent completed units remaining.
[11:29:52] - Autosend completed
[11:29:52] Working on queue slot 02 [May 5 11:29:52 UTC]
[11:29:52] + Working ...
[11:29:52] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2730 -version 634'

[11:29:52]
[11:29:52] *------------------------------*
[11:29:52] Folding@Home Gromacs SMP Core
[11:29:52] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[11:29:52]
[11:29:52] Preparing to commence simulation
[11:29:52] - Looking at optimizations...
[11:29:52] - Files status OK
[11:29:54] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[11:29:54] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[11:29:54] - Digital signature verified
[11:29:54]
[11:29:54] Project: 6901 (Run 16, Clone 6, Gen 207)
[11:29:54]
[11:29:54] Assembly optimizations on if available.
[11:29:54] Entering M.D.
[11:30:00] Using Gromacs checkpoints
[11:30:02] Mapping NT from 24 to 24
[11:30:05] fcSaveRestoreState: I/O failed dir=0, var=00007FF90FFFB8E0, varsize=20
[11:30:05] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:05] fcSaveRestoreState: I/O failed dir=0, var=00007FF90EFF98E0, varsize=20
[11:30:05] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:06] fcSaveRestoreState: I/O failed dir=0, var=00007FF92428F8E0, varsize=20
[11:30:06] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:06] fcSaveRestoreState: I/O failed dir=0, var=00007FF921A8A8E0, varsize=20
[11:30:06] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:07] fcSaveRestoreState: I/O failed dir=0, var=00007FF919FF78E0, varsize=20
[11:30:07] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:07] fcSaveRestoreState: I/O failed dir=0, var=00007FF923A8E8E0, varsize=20
[11:30:07] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:08] fcSaveRestoreState: I/O failed dir=0, var=00007FF920A888E0, varsize=20
[11:30:08] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:08] fcSaveRestoreState: I/O failed dir=0, var=00007FF90D7F68E0, varsize=20
[11:30:08] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:09] fcSaveRestoreState: I/O failed dir=0, var=00007FF9252918E0, varsize=20
[11:30:09] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:09] fcSaveRestoreState: I/O failed dir=0, var=00007FF90F7FA8E0, varsize=20
[11:30:09] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:10] fcSaveRestoreState: I/O failed dir=0, var=00007FF92228B8E0, varsize=20
[11:30:10] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:10] fcSaveRestoreState: I/O failed dir=0, var=00007FF91AFF98E0, varsize=20
[11:30:10] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:11] fcSaveRestoreState: I/O failed dir=0, var=00007FF91A7F88E0, varsize=20
[11:30:11] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:11] mdrun returned 3
[11:30:11] Gromacs detected an invalid checkpoint.  Restarting...fcSaveRestoreState: I/O failed dir=0, var=00007FF92328D8E0, varsize=20
[11:30:11] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Resuming from checkpoint
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:11] Can't open checkpoint file
[11:30:40] ***** Got an Activate signal (2)
[11:30:40] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 5 11:30:48 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[11:30:48] - Ask before connecting: No
[11:30:48] - User name: Agencyman (Team 196420)
[11:30:48] - User ID: 2F92F17A07EB3ACD
[11:30:48] - Machine ID: 15
[11:30:48]
[11:30:48] Loaded queue successfully.
[11:30:48]
[11:30:48] + Processing work unit
[11:30:48] Core required: FahCore_a5.exe
[11:30:48] Core found.
[11:30:48] - Autosending finished units... [May 5 11:30:48 UTC]
[11:30:48] Trying to send all finished work units
[11:30:48] + No unsent completed units remaining.
[11:30:48] - Autosend completed
[11:30:48] Working on queue slot 02 [May 5 11:30:48 UTC]
[11:30:48] + Working ...
[11:30:48] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2805 -version 634'

[11:30:48]
[11:30:48] *------------------------------*
[11:30:48] Folding@Home Gromacs SMP Core
[11:30:48] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[11:30:48]
[11:30:48] Preparing to commence simulation
[11:30:48] - Looking at optimizations...
[11:30:48] - Files status OK
[11:30:50] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[11:30:50] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[11:30:50] - Digital signature verified
[11:30:50]
[11:30:50] Project: 6901 (Run 16, Clone 6, Gen 207)
[11:30:50]
[11:30:50] Assembly optimizations on if available.
[11:30:50] Entering M.D.
[11:30:57] Mapping NT from 24 to 24
[11:30:59] Completed 0 out of 250000 steps  (0%)
[11:44:43] Completed 2500 out of 250000 steps  (1%)
[11:56:33] Completed 5000 out of 250000 steps  (2%)
[12:07:51] Completed 7500 out of 250000 steps  (3%)
[12:19:03] Completed 10000 out of 250000 steps  (4%)
[12:30:42] Completed 12500 out of 250000 steps  (5%)
[12:42:08] Completed 15000 out of 250000 steps  (6%)
[12:53:37] Completed 17500 out of 250000 steps  (7%)
[13:05:08] Completed 20000 out of 250000 steps  (8%)
[13:19:04] Completed 22500 out of 250000 steps  (9%)
[13:30:51] Completed 25000 out of 250000 steps  (10%)
[13:42:42] Completed 27500 out of 250000 steps  (11%)
[13:54:21] Completed 30000 out of 250000 steps  (12%)
[14:05:42] Completed 32500 out of 250000 steps  (13%)
[14:17:11] Completed 35000 out of 250000 steps  (14%)
[14:28:49] Completed 37500 out of 250000 steps  (15%)
[14:29:41] ***** Got an Activate signal (2)
[14:29:41] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 6 11:00:14 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[11:00:14] - Ask before connecting: No
[11:00:14] - User name: Agencyman (Team 196420)
[11:00:14] - User ID: 2F92F17A07EB3ACD
[11:00:14] - Machine ID: 15
[11:00:14]
[11:00:14] Loaded queue successfully.
[11:00:14]
[11:00:14] + Processing work unit
[11:00:14] Core required: FahCore_a5.exe
[11:00:14] Core found.
[11:00:14] - Autosending finished units... [May 6 11:00:14 UTC]
[11:00:14] Trying to send all finished work units
[11:00:14] + No unsent completed units remaining.
[11:00:14] - Autosend completed
[11:00:14] Working on queue slot 02 [May 6 11:00:14 UTC]
[11:00:14] + Working ...
[11:00:14] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2889 -version 634'

[11:00:14]
[11:00:14] *------------------------------*
[11:00:14] Folding@Home Gromacs SMP Core
[11:00:14] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[11:00:14]
[11:00:14] Preparing to commence simulation
[11:00:14] - Looking at optimizations...
[11:00:14] - Files status OK
[11:00:16] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[11:00:16] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[11:00:17] - Digital signature verified
[11:00:17]
[11:00:17] Project: 6901 (Run 16, Clone 6, Gen 207)
[11:00:17]
[11:00:17] Assembly optimizations on if available.
[11:00:17] Entering M.D.
[11:00:23] Using Gromacs checkpoints
[11:00:24] Mapping NT from 24 to 24
[11:00:40] Resuming from checkpoint
[11:00:41] Verified work/wudata_02.log
[11:00:42] Verified work/wudata_02.trr
[11:00:42] Verified work/wudata_02.xtc
[11:00:42] Verified work/wudata_02.edr
[11:00:42] Completed 37685 out of 250000 steps  (15%)
[11:10:55] Completed 40000 out of 250000 steps  (16%)
[11:22:31] Completed 42500 out of 250000 steps  (17%)
[11:34:20] Completed 45000 out of 250000 steps  (18%)
[11:46:20] Completed 47500 out of 250000 steps  (19%)
[11:58:09] Completed 50000 out of 250000 steps  (20%)
[12:09:49] Completed 52500 out of 250000 steps  (21%)
[12:21:26] Completed 55000 out of 250000 steps  (22%)
[12:29:11] ***** Got an Activate signal (2)
[12:29:12] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [May 6 20:58:11 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bruce/folding
Executable: ./fah6
Arguments: -smp 24 -bigadv -oneunit -verbosity 9

[20:58:11] - Ask before connecting: No
[20:58:11] - User name: Agencyman (Team 196420)
[20:58:11] - User ID: 2F92F17A07EB3ACD
[20:58:11] - Machine ID: 15
[20:58:11]
[20:58:11] Loaded queue successfully.
[20:58:11]
[20:58:11] + Processing work unit
[20:58:11] Core required: FahCore_a5.exe
[20:58:11] Core found.
[20:58:11] - Autosending finished units... [May 6 20:58:11 UTC]
[20:58:11] Trying to send all finished work units
[20:58:11] + No unsent completed units remaining.
[20:58:11] - Autosend completed
[20:58:11] Working on queue slot 02 [May 6 20:58:11 UTC]
[20:58:11] + Working ...
[20:58:11] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 02 -np 24 -checkpoint 15 -verbose -lifeline 2818 -version 634'

[20:58:11]
[20:58:11] *------------------------------*
[20:58:11] Folding@Home Gromacs SMP Core
[20:58:11] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:58:11]
[20:58:11] Preparing to commence simulation
[20:58:11] - Looking at optimizations...
[20:58:11] - Files status OK
[20:58:13] - Expanded 24875160 -> 30796292 (decompressed 123.8 percent)
[20:58:13] Called DecompressByteArray: compressed_data_size=24875160 data_size=30796292, decompressed_data_size=30796292 diff=0
[20:58:14] - Digital signature verified
[20:58:14]
[20:58:14] Project: 6901 (Run 16, Clone 6, Gen 207)
[20:58:14]
[20:58:14] Assembly optimizations on if available.
[20:58:14] Entering M.D.
[20:58:20] Using Gromacs checkpoints
[20:58:21] Mapping NT from 24 to 24
[20:58:24] fcSaveRestoreState: I/O failed dir=0, var=00007FAE277FA8E0, varsize=20
[20:58:24] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:25] fcSaveRestoreState: I/O failed dir=0, var=00007FAE24FF58E0, varsize=20
[20:58:25] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:25] fcSaveRestoreState: I/O failed dir=0, var=00007FAE304E58E0, varsize=20
[20:58:25] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:26] fcSaveRestoreState: I/O failed dir=0, var=00007FAE257F68E0, varsize=20
[20:58:26] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:26] fcSaveRestoreState: I/O failed dir=0, var=00007FAE1BFFB8E0, varsize=20
[20:58:26] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:27] fcSaveRestoreState: I/O failed dir=0, var=00007FAE31CE88E0, varsize=20
[20:58:27] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:27] fcSaveRestoreState: I/O failed dir=0, var=00007FAE2CCDE8E0, varsize=20
[20:58:27] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:27] fcSaveRestoreState: I/O failed dir=0, var=00007FAE2D4DF8E0, varsize=20
[20:58:27] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:28] fcSaveRestoreState: I/O failed dir=0, var=00007FAE314E78E0, varsize=20
[20:58:28] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:28] fcSaveRestoreState: I/O failed dir=0, var=00007FAE1AFF98E0, varsize=20
[20:58:28] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:29] fcSaveRestoreState: I/O failed dir=0, var=00007FAE2DCE08E0, varsize=20
[20:58:29] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:29] fcSaveRestoreState: I/O failed dir=0, var=00007FAE2ECE28E0, varsize=20
[20:58:29] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:30] fcSaveRestoreState: I/O failed dir=0, var=00007FAE324E98E0, varsize=20
[20:58:30] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:30] mdrun returned 3
[20:58:30] Gromacs detected an invalid checkpoint.  Restarting...Resuming from checkpoint
[20:58:30] fcSaveRestoreState: I/O failed dir=0, var=00007FAE33CE7D30, varsize=20
[20:58:30] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore cpt hash.
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:58:30] Can't open checkpoint file
[20:59:56] ***** Got an Activate signal (2)
[20:59:56] Killing all core threads

Folding@Home Client Shutdown.
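
For anyone who would rather not scan a log this long by eye, here is a rough Python sketch that summarizes each client start. It only assumes the line formats visible in the log above ("--- Opening Log file [...]", "Completed N out of M steps", "Can't open checkpoint file", "Gromacs detected an invalid checkpoint"). The function name `summarize_sessions` is made up for illustration and is not part of the FAH client.

```python
import re

# Patterns taken from the log lines shown above.
SESSION_START = re.compile(r"^--- Opening Log file \[(.+)\]")
COMPLETED = re.compile(r"Completed (\d+) out of (\d+) steps")


def summarize_sessions(log_text):
    """Return one dict per client start: when it began, whether the resume
    failed, and the first step count it reported (None if it never reported)."""
    sessions = []
    current = None
    for line in log_text.splitlines():
        start = SESSION_START.match(line.strip())
        if start:
            current = {"started": start.group(1), "resume_failed": False, "first_step": None}
            sessions.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first session header
        if "Can't open checkpoint file" in line or "invalid checkpoint" in line:
            current["resume_failed"] = True
        done = COMPLETED.search(line)
        if done and current["first_step"] is None:
            current["first_step"] = int(done.group(1))
    return sessions


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python scan_fahlog.py FAHlog.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for session in summarize_sessions(handle.read()):
                print(session)
```

A session with `resume_failed` set, followed by a session whose `first_step` is 0, is the pattern of a lost WU: in the log above that is the May 5 11:29:51 start followed by the May 5 11:30:48 start.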
Agencyman
 
Posts: 44
Joined: Thu Oct 28, 2010 7:59 pm

Postby 7im » Mon May 07, 2012 11:24 pm

Forum search for EXT3.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
7im
 
Posts: 15147
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Postby codysluder » Mon May 07, 2012 11:27 pm

Are you running an ext4 filesystem? There are many topics on the forum saying data on it can be corrupted by shutting down too rapidly and suggesting either putting FAH on an ext3 partition or changing how the ext4 filesystem is mounted.
codysluder
 
Posts: 2238
Joined: Sun Dec 02, 2007 12:43 pm

Postby Agencyman » Mon May 07, 2012 11:47 pm

A couple of months back I got into that subject and did a build with ext3, but at the time I could get no one to answer my questions about exactly why I was doing it. Perhaps now I'll know.

I kept reading how much better ext4 (or maybe soon 5) will be, so I just let this one run its course.

I'll get to it!

Ubuntu is on an old favorite but small drive. If I can re-partition, and move things I will, or just give up and do it all over again... :)

Thanks for the new direction.

Bruce

Postby Agencyman » Tue May 08, 2012 12:54 am

OK, now I have a new ext3 partition of 15 GB and left 10 GB for swap, though I have not seen the swap used yet.

Do I install FAH again into the new ext3 or can I move parts of it?

I would never assume the W operating system could do this but a good friend tells me of such wonderful things with his Linux boxes.

I wish I could intuitively know all this and not have to ask, but, someday I promise to help out some noobs, and give you guys a break.

Bruce

Postby Leonardo » Tue May 08, 2012 7:18 am

For Folding, 4GB swap is plenty.

Please check your private messages...
Leonardo
 
Posts: 613
Joined: Tue Dec 04, 2007 5:09 am
Location: Eagle River, Alaska

Postby Agencyman » Tue May 08, 2012 11:31 am

Thanks for that, I've never used any guide other than the basic console directions in the FAH guide. This should be interesting;

Wrappers! Off to try all this now

Bruce
Agencyman
 
Posts: 44
Joined: Thu Oct 28, 2010 7:59 pm

Re: Lost two 6901s at 50%&23%, resume from checkpoint proble

Postby tear » Fri May 11, 2012 2:15 am

This issue isn't related to the underlying filesystem at all (that is, this is not an "install ext3 and you'll be good" issue).

It's a FahCore_a3 and _a5 bug that has been there since their release.

In a nutshell (from documentation I created for one of my tools):
Code: Select all
    CAUTION: FahCore_a5 is known to be problematic at user-induced shutdowns*.
             To be on the safe side, make a backup of the complete client directory
             before hitting Ctrl+C!

    To tell whether checkpoint was written correctly check the size
    of work/wudata_XX.ckp file (XX being current slot number).
    It should be 75160 (for core A5). If it's not -- better switch to backed up
    directory.

    *) http://foldingforum.org/viewtopic.php?f=55&t=17774


It was claimed that removing -verbosity 9 from the client's command-line options
hides the issue, but that claim was eventually found to be false.
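
The size check described in the quoted documentation could be scripted along these lines. This is only a minimal sketch: the 75160-byte figure and the work/wudata_XX.ckp location come from the text above, while the function names and the two-digit zero padding of the slot number are assumptions made for illustration.

```python
from pathlib import Path

# Size quoted above for a correctly written core A5 checkpoint, in bytes.
EXPECTED_A5_CKP_SIZE = 75160


def checkpoint_path(client_dir, slot):
    """Return <client_dir>/work/wudata_XX.ckp for the given slot number.

    Assumption: XX is the slot number zero-padded to two digits.
    """
    return Path(client_dir) / "work" / f"wudata_{slot:02d}.ckp"


def checkpoint_looks_intact(client_dir, slot, expected_size=EXPECTED_A5_CKP_SIZE):
    """True if the checkpoint file exists and has exactly the expected size."""
    ckp = checkpoint_path(client_dir, slot)
    return ckp.is_file() and ckp.stat().st_size == expected_size
```

For example, `checkpoint_looks_intact("/home/bruce/folding", 1)` would check slot 1 in the launch directory shown in the log. A size match does not prove the checkpoint is good; a mismatch is simply the signal to switch to the backed-up directory.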
One man's ceiling is another man's floor.
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Lost two 6901s at 50%&23%, resume from checkpoint proble

Postby Grandpa_01 » Fri May 11, 2012 2:53 am

As tear said, make a backup before shutting FAH down. In practice, with Ubuntu 10.10, I go to Places / Home Folder, right-click on the fah folder, and choose "copy to desktop" before closing FAH down. That way, if FAH does not start up correctly, I just delete the fah folder from the home folder, then right-click on the fah folder on the desktop and choose "copy to home folder". Works every time. :wink:
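
For anyone who prefers a script to the file manager, the same copy-out and copy-back routine might look roughly like this. It is a sketch only: the function names are made up for illustration, and the directory paths are whatever you use on your own machine.

```python
import shutil
from pathlib import Path


def backup_client_dir(client_dir, backup_dir):
    """Copy the whole client directory aside before stopping the client."""
    client_dir, backup_dir = Path(client_dir), Path(backup_dir)
    if backup_dir.exists():
        shutil.rmtree(backup_dir)  # discard the previous, now stale, backup
    shutil.copytree(client_dir, backup_dir)
    return backup_dir


def restore_client_dir(backup_dir, client_dir):
    """Replace a client directory that failed to resume with the backup copy."""
    client_dir, backup_dir = Path(client_dir), Path(backup_dir)
    if client_dir.exists():
        shutil.rmtree(client_dir)  # delete the copy that would not restart
    shutil.copytree(backup_dir, client_dir)
    return client_dir
```

Run `backup_client_dir` before hitting Ctrl+C, and `restore_client_dir` only if the client refuses to resume. The backup is copied rather than moved on restore, so it is still there if the second attempt also fails.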
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
Grandpa_01
 
Posts: 1863
Joined: Wed Mar 04, 2009 7:36 am

Re: Lost two 6901s at 50%&23%, resume from checkpoint proble

Postby Agencyman » Fri May 11, 2012 11:05 am

Interesting! I have thrown away over half a million points, that could have been saved, but knowledge comes hard and slow sometimes.

So, thanks: I now have a new procedure for any shutdown, to save partials. One problem, though: my client loaded a new WU the instant it decided the other was not recoverable, and it would then be lost through no fault of my computer. I hope the percentage of those stays well under the 20%. Does anyone know how far back the servers track the ratio?

Well, now I have gotten two 8101s in a row. The points are about 2/3 of the 6901-&-3s, even worse on 6904. My 12/24 rig is at 3.6 GHz so deadlines are no problem, (40 hrs v. 2.4 day preferred), but I don't have enough horsepower to get to the steep part of the Kfactor curve on these WUs designed for 18 true cores. And that is where the bonus points are hanging about.

Do these work the machine harder and hotter? I need to get HWMon or something, not all my old Win tools work in Ubuntu, even with Wine.

Thanks to you all, for the help and advice,
Bruce
Agencyman
 
Posts: 44
Joined: Thu Oct 28, 2010 7:59 pm

Re: Lost two 6901s at 50%&23%, resume from checkpoint proble

Postby Grandpa_01 » Fri May 11, 2012 1:43 pm

One way to stop it from downloading a new WU when it starts is to disconnect from the Internet until FAH starts. Yes, the 8101s work the machine harder. There are several temperature monitoring tools out there for Linux: https://www.google.com/search?client=ub ... 20&bih=855
Grandpa_01
 
Posts: 1863
Joined: Wed Mar 04, 2009 7:36 am

Re: Lost two 6901s at 50%&23%, resume from checkpoint proble

Postby PinHead » Mon Jul 02, 2012 2:16 am

I'm sure I'll get lit up for this, but I found this in the forum about 6 months ago. There seems to be no confirmation that it should or should not work, but it has worked for me for the last few months, and before that I lost about 1 out of 3 when performing a [Ctrl-C]. Using ext4 will reduce the results compression time, but it seems to have little to do with the issue you bring to the table.

There was a thread here a few months back about reducing -verbosity 9 to -verbosity 5, after which the problem went away. When I made the change, my problem went away. Use at your own risk!

There seems to be no justification as to why it should or should not work, but I'm a fan. Even when top showed everything was idle, I still lost WUs. Just saying, give it a try!!
PinHead
 
Posts: 368
Joined: Tue Jan 24, 2012 3:43 am

Re: Lost two 6901s at 50%&23%, resume from checkpoint proble

Postby HayesK » Sun Jul 29, 2012 2:52 pm

I used to subscribe to the "back up \work" approach before stopping client v6.34 with CTRL-C, but still lost an occasional WU due to a corrupted checkpoint file.

The key to avoiding client 6.34 checkpoint problems is to look at the timestamp on the checkpoint file, wudata_0X.ckp (where X is the current WU), located in the \work folder, to ensure the client is not writing a checkpoint while being stopped with CTRL-C. The checkpoint frequency in the client.cfg file defaults to 15 minutes, unless changed by the user, and is independent of frame completion. Stopping the client just after the timestamp is written minimizes the work lost between checkpoints, since the client resumes the WU from the last checkpoint rather than from the last completed frame %. That is also why the first frame completed after restarting takes less time than is typical.

My client v6.34 checkpoints are typically set at 30 minutes, which reduces the chance of a write being in progress during a random reboot and makes it easy to stop manually between checkpoints.
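
The timestamp check can be roughly automated. The sketch below assumes the checkpoint file's modification time marks the last completed write, and it treats the first and last minute of each interval as unsafe; that one-minute margin and the function names are assumptions for illustration, not anything the client defines.

```python
import time
from pathlib import Path


def seconds_since_checkpoint(ckp_path, now=None):
    """Age of the checkpoint file in seconds, based on its modification time."""
    now = time.time() if now is None else now
    return now - Path(ckp_path).stat().st_mtime


def safe_to_stop(age_seconds, interval_minutes=15, margin_seconds=60):
    """True if we are clear of the last write and of the next expected one.

    interval_minutes is the checkpoint frequency (15 by default, as noted
    above); margin_seconds is an assumed safety margin on either side.
    """
    interval = interval_minutes * 60
    return margin_seconds <= age_seconds <= interval - margin_seconds
```

With a 30-minute interval, `safe_to_stop(seconds_since_checkpoint(path), interval_minutes=30)` leaves a 28-minute window. If the file is older than the interval, the function returns False, since the next write may be due at any moment.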
<- 20-GPUs -> folding for OCF T32 as HayesK
HayesK
 
Posts: 314
Joined: Sun Feb 22, 2009 4:23 pm
Location: LaPorte, Texas

