WUs fail after 100%, leftover work files, ext4 the cause?

Moderators: Site Moderators, PandeGroup

WUs fail after 100%, leftover work files, ext4 the cause?

Postby chrisretusn » Thu Dec 31, 2009 2:48 pm

This is the second time I have lost a WU. The first was after I shut down the client. When I started it back up I got "[14:46:55] - Error: Could not get length of results file work/wuresults_02.dat" and "[14:46:55] - Error: Could not read unit 02 file. Removing from queue." I had started that client in a tty and stopped it with Ctrl-C; it took a very long time to actually quit.

The second time was after the WU completed. I got the error right after the log said "Sending work to server". This particular WU had also been shut down, this time via /etc/rc.d/rc.folding stop (Slackware64). This rc script uses killall to stop fah6. It also includes a sleep 10 plus a sync to ensure data is written to disk. On restart all looked well: the client started up and proceeded to run normally, resuming from a checkpoint with no complaints ("Verified work/wudata_09.log, work/wudata_09.trr, work/wudata_09.xtc and work/wudata_09.edr"). It did not verify, or apparently have any problem with, work/wuresults_09.dat. Only after completing 100% did it fail. That is frustrating for sure.
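For reference, the stop half of that rc.folding script boils down to something like this (a simplified sketch, not the exact Slackware script; the function name and layout are just illustrative):

Code: Select all
#!/bin/sh
# Stop portion of a minimal rc.folding (sketch only).
folding_stop() {
  killall fah6    # sends SIGTERM to the client
  sleep 10        # give FahCore_a2.exe time to finish writing its checkpoint
  sync            # flush dirty buffers to disk before we return
}

case "$1" in
  stop) folding_stop ;;
esac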

I saw a thread here discussing a problem with ext4 (which is what I am using now on this machine): Folding Forum • View topic - Core hangs: jbd2_log_wait_commit. I am now thinking that this may be part of the problem. The only time I have had this failure is when shutting down the client; in all other cases the client finished and sent its work. It could also be the 6.24beta client, but if it is, it only happens when using the -smp switch. I was folding on this dual-core machine with 6.24beta as a uniprocessor client for quite some time with no problems.

I have been looking at qfix; it appears that I may be able to reconstruct the second (09) WU at this point, since it has not passed its deadline yet. The client is currently crunching on another WU (00), and both the first WU (02) and (09) still have several files present. I am guessing 02 is lost because it is past its deadline. The directions for qfix say the client must be stopped, and I don't want to chance losing another WU. Can I run qfix without affecting the currently running WU?

In answer to my own question, I am guessing that since another WU has started and qfix needs to use the queue.dat file, the answer is no. Which means 09 is lost, as is 02 (past deadline). Is there a way to rebuild the results file from the remaining files and send it? These are the 09 files still sitting in the work directory (a sketch of how I would snapshot things before trying qfix follows the list):

wudata_09.ckp
wudata_09.cpt
wudata_09.dat
wudata_09.dyn
wudata_09.edr
wudata_09.gro
wudata_09.log
wudata_09.tpr
wudata_09.trr
wudata_09.xtc
wudata_09_prev.cpt
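If I do get a chance to run qfix later, I would rather point it at a copy than at the live client directory; something along these lines (just a sketch, and the backup path is made up):

Code: Select all
# Snapshot the whole client directory so qfix never touches the live files.
# /home/folder/fah6-backup is an arbitrary destination.
cp -a /home/folder/fah6-1 /home/folder/fah6-backup
cd /home/folder/fah6-backup
ls -l queue.dat work/wudata_09.*   # confirm queue.dat and the 09 files came along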

This is the log from the first failure:
Code: Select all
[13:35:16] Completed 70000 out of 250000 steps  (28%)
[13:59:57] Completed 72500 out of 250000 steps  (29%)
[14:13:37] - Autosending finished units... [December 19 14:13:37 UTC]
[14:13:37] Trying to send all finished work units
[14:13:37] + No unsent completed units remaining.
[14:13:37] - Autosend completed
[14:30:47] Completed 75000 out of 250000 steps  (30%)
[14:43:44] ***** Got an Activate signal (2)
[14:43:44] Killing all core threads

Folding@Home Client Shutdown.
#####################

--- Opening Log file [December 19 14:46:31 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah6-1
Executable: /home/folder/fah6-1/fah6
Arguments: -verbosity 9 -smp

[14:46:31] - Ask before connecting: No
[14:46:31] - User name: chrisretusn (Team 2291)
[14:46:31] - User ID: xx
[14:46:31] - Machine ID: 1
[14:46:31]
[14:46:31] Loaded queue successfully.
[14:46:31] - Autosending finished units... [December 19 14:46:31 UTC]
[14:46:31] Trying to send all finished work units
[14:46:31] + No unsent completed units remaining.
[14:46:31] - Autosend completed
[14:46:31]
[14:46:31] + Processing work unit
[14:46:31] At least 4 processors must be requested.Core required: FahCore_a2.exe
[14:46:31] Core found.
[14:46:31] Working on queue slot 02 [December 19 14:46:31 UTC]
[14:46:31] + Working ...
[14:46:31] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 18271 -version 624'

[14:46:31]
[14:46:31] *------------------------------*
[14:46:31] Folding@Home Gromacs SMP Core
[14:46:31] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[14:46:31]
[14:46:31] Preparing to commence simulation
[14:46:31] - Ensuring status. Please wait.
[14:46:32] Called DecompressByteArray: compressed_data_size=4923220 data_size=24360573, decompressed_data_size=24360573 diff=0
[14:46:33] - Digital signature verified
[14:46:33]
[14:46:33] Project: 2662 (Run 2, Clone 214, Gen 50)
[14:46:33]
[14:46:33] Assembly optimizations on if available.
[14:46:33] Entering M.D.
[14:46:39] Using Gromacs checkpoints
[14:46:43]
[14:46:43] Entering M.D.
[14:46:49] Using Gromacs checkpoints
[14:46:55] CoreStatus = 0 (0)
[14:46:55] Sending work to server
[14:46:55] Project: 2662 (Run 2, Clone 214, Gen 50)
[14:46:55] - Error: Could not get length of results file work/wuresults_02.dat
[14:46:55] - Error: Could not read unit 02 file. Removing from queue.
[14:46:55] Trying to send all finished work units
[14:46:55] + No unsent completed units remaining.
[14:46:55] - Preparing to get new work unit...


This is the log from the second failure:
Code: Select all
--- Opening Log file [December 30 09:56:19 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah6-1
Executable: /home/folder/fah6-1/fah6
Arguments: -verbosity 9 -smp

[09:56:19] - Ask before connecting: No
[09:56:19] - User name: chrisretusn (Team 2291)
[09:56:19] - User ID: xx
[09:56:19] - Machine ID: 1
[09:56:19]
[09:56:20] Loaded queue successfully.
[09:56:20] - Autosending finished units... [December 30 09:56:20 UTC]
[09:56:20] Trying to send all finished work units
[09:56:20] + No unsent completed units remaining.
[09:56:20] - Autosend completed
[09:56:20]
[09:56:20] + Processing work unit
[09:56:20] At least 4 processors must be requested.Core required: FahCore_a2.exe
[09:56:20] Core found.
[09:56:20] Working on queue slot 09 [December 30 09:56:20 UTC]
[09:56:20] + Working ...
[09:56:20] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 14288 -version 624'

[09:56:20]
[09:56:20] *------------------------------*
[09:56:20] Folding@Home Gromacs SMP Core
[09:56:20] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[09:56:20]
[09:56:20] Preparing to commence simulation
[09:56:20] - Ensuring status. Please wait.
[09:56:21] Called DecompressByteArray: compressed_data_size=4855104 data_size=24045785, decompressed_data_size=24045785 diff=0
[09:56:21] - Digital signature verified
[09:56:21]
[09:56:21] Project: 2662 (Run 0, Clone 467, Gen 44)
[09:56:21]
[09:56:21] Assembly optimizations on if available.
[09:56:21] Entering M.D.
[09:56:27] Using Gromacs checkpoints
[09:56:31]
[09:56:31] Entering M.D.
[09:56:37] Using Gromacs checkpoints
[09:56:45] Resuming from checkpoint
[09:56:45] Verified work/wudata_09.log
[09:56:46] Verified work/wudata_09.trr
[09:56:46] Verified work/wudata_09.xtc
[09:56:46] Verified work/wudata_09.edr
[09:56:47] Completed 114001 out of 250000 steps  (45%)
[10:05:54] Completed 115000 out of 250000 steps  (46%)
[10:30:27] Completed 117500 out of 250000 steps  (47%)
[10:55:04] Completed 120000 out of 250000 steps  (48%)
[11:18:00] Completed 122500 out of 250000 steps  (49%)
[11:40:54] Completed 125000 out of 250000 steps  (50%)
[12:03:51] Completed 127500 out of 250000 steps  (51%)
[12:26:46] Completed 130000 out of 250000 steps  (52%)
[12:49:42] Completed 132500 out of 250000 steps  (53%)
[13:12:37] Completed 135000 out of 250000 steps  (54%)
[13:35:31] Completed 137500 out of 250000 steps  (55%)
[13:58:27] Completed 140000 out of 250000 steps  (56%)
[14:21:21] Completed 142500 out of 250000 steps  (57%)
[14:44:18] Completed 145000 out of 250000 steps  (58%)
[15:07:12] Completed 147500 out of 250000 steps  (59%)
[15:32:14] Completed 150000 out of 250000 steps  (60%)
[15:56:20] - Autosending finished units... [December 30 15:56:20 UTC]
[15:56:20] Trying to send all finished work units
[15:56:20] + No unsent completed units remaining.
[15:56:20] - Autosend completed
[15:59:25] Completed 152500 out of 250000 steps  (61%)
[16:33:08] Completed 155000 out of 250000 steps  (62%)
[17:14:04] Completed 157500 out of 250000 steps  (63%)
[17:36:32] Completed 160000 out of 250000 steps  (64%)
[17:58:58] Completed 162500 out of 250000 steps  (65%)
[18:21:27] Completed 165000 out of 250000 steps  (66%)
[18:44:01] Completed 167500 out of 250000 steps  (67%)
[19:06:22] Completed 170000 out of 250000 steps  (68%)
[19:28:37] Completed 172500 out of 250000 steps  (69%)
[19:50:55] Completed 175000 out of 250000 steps  (70%)
[20:13:11] Completed 177500 out of 250000 steps  (71%)
[20:35:29] Completed 180000 out of 250000 steps  (72%)
[20:58:14] Completed 182500 out of 250000 steps  (73%)
[21:20:44] Completed 185000 out of 250000 steps  (74%)
[21:43:08] Completed 187500 out of 250000 steps  (75%)
[21:56:20] - Autosending finished units... [December 30 21:56:20 UTC]
[21:56:20] Trying to send all finished work units
[21:56:20] + No unsent completed units remaining.
[21:56:20] - Autosend completed
[22:05:51] Completed 190000 out of 250000 steps  (76%)
[22:28:46] Completed 192500 out of 250000 steps  (77%)
[22:51:43] Completed 195000 out of 250000 steps  (78%)
[23:14:31] Completed 197500 out of 250000 steps  (79%)
[23:37:06] Completed 200000 out of 250000 steps  (80%)
[23:59:56] Completed 202500 out of 250000 steps  (81%)
[00:21:53] Completed 205000 out of 250000 steps  (82%)
[00:44:03] Completed 207500 out of 250000 steps  (83%)
[01:06:31] Completed 210000 out of 250000 steps  (84%)
[01:29:24] Completed 212500 out of 250000 steps  (85%)
[01:55:44] Completed 215000 out of 250000 steps  (86%)
[02:29:55] Completed 217500 out of 250000 steps  (87%)
[03:07:36] Completed 220000 out of 250000 steps  (88%)
[03:37:11] Completed 222500 out of 250000 steps  (89%)
[03:56:20] - Autosending finished units... [December 31 03:56:20 UTC]
[03:56:20] Trying to send all finished work units
[03:56:20] + No unsent completed units remaining.
[03:56:20] - Autosend completed
[04:00:56] Completed 225000 out of 250000 steps  (90%)
[04:26:04] Completed 227500 out of 250000 steps  (91%)
[04:49:10] Completed 230000 out of 250000 steps  (92%)
[05:12:55] Completed 232500 out of 250000 steps  (93%)
[05:41:22] Completed 235000 out of 250000 steps  (94%)
[06:05:59] Completed 237500 out of 250000 steps  (95%)
[06:28:46] Completed 240000 out of 250000 steps  (96%)
[06:51:43] Completed 242500 out of 250000 steps  (97%)
[07:14:37] Completed 245000 out of 250000 steps  (98%)
[07:37:36] Completed 247500 out of 250000 steps  (99%)
[08:00:27] Completed 250000 out of 250000 steps  (100%)
[08:00:35] CoreStatus = 0 (0)
[08:00:35] Sending work to server
[08:00:35] Project: 2662 (Run 0, Clone 467, Gen 44)
[08:00:35] - Error: Could not get length of results file work/wuresults_09.dat
[08:00:35] - Error: Could not read unit 09 file. Removing from queue.
[08:00:35] Trying to send all finished work units
[08:00:35] + No unsent completed units remaining.
[08:00:35] - Preparing to get new work unit...
[08:00:35] + Attempting to get work packet
[08:00:35] - Will indicate memory of 3816 MB
[08:00:35] - Connecting to assignment server
[08:00:35] Connecting to http://assign.stanford.edu:8080/
[08:00:37] Posted data.
[08:00:37] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[08:00:37] + News From Folding@Home: Welcome to Folding@Home
[08:00:37] Loaded queue successfully.
[08:00:37] Connecting to http://171.64.65.56:8080/
[08:00:45] Posted data.
[08:00:45] Initial: 0000; - Receiving payload (expected size: 4855616)
[08:02:15] - Downloaded at ~52 kB/s
[08:02:15] - Averaged speed for that direction ~51 kB/s
[08:02:15] + Received work.
[08:02:15] Trying to send all finished work units
[08:02:15] + No unsent completed units remaining.
[08:02:15] + Closed connections
[08:02:20]
[08:02:20] + Processing work unit
[08:02:20] At least 4 processors must be requested.Core required: FahCore_a2.exe
[08:02:20] Core found.
[08:02:20] Working on queue slot 00 [December 31 08:02:20 UTC]
[08:02:20] + Working ...
[08:02:20] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 15 -verbose -lifeline 14288 -version 624'

[08:02:20]
[08:02:20] *------------------------------*
[08:02:20] Folding@Home Gromacs SMP Core
[08:02:20] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[08:02:20]
[08:02:20] Preparing to commence simulation
[08:02:20] - Ensuring status. Please wait.
[08:02:29] - Looking at optimizations...
[08:02:29] - Working with standard loops on this execution.
[08:02:29] - Files status OK
[08:02:31] - Expanded 4855104 -> 24045785 (decompressed 495.2 percent)
[08:02:32] Called DecompressByteArray: compressed_data_size=4855104 data_size=24045785, decompressed_data_size=24045785 diff=0
[08:02:32] - Digital signature verified
[08:02:32]
[08:02:32] Project: 2662 (Run 0, Clone 467, Gen 44)
[08:02:32]
[08:02:32] Entering M.D.
[08:02:46] Completed 0 out of 250000 steps  (0%)
[08:25:46] Completed 2500 out of 250000 steps  (1%)
[08:48:35] Completed 5000 out of 250000 steps  (2%)
[09:11:27] Completed 7500 out of 250000 steps  (3%)
[09:34:15] Completed 10000 out of 250000 steps  (4%)
[09:56:20] - Autosending finished units... [December 31 09:56:20 UTC]
[09:56:20] Trying to send all finished work units
[09:56:20] + No unsent completed units remaining.
[09:56:20] - Autosend completed
[09:57:04] Completed 12500 out of 250000 steps  (5%)
[10:20:17] Completed 15000 out of 250000 steps  (6%)
[10:43:44] Completed 17500 out of 250000 steps  (7%)
[11:06:45] Completed 20000 out of 250000 steps  (8%)
[11:30:49] Completed 22500 out of 250000 steps  (9%)
[11:55:00] Completed 25000 out of 250000 steps  (10%)
[12:20:26] Completed 27500 out of 250000 steps  (11%)
[12:45:24] Completed 30000 out of 250000 steps  (12%)
Last edited by chrisretusn on Wed Feb 10, 2010 11:00 pm, edited 1 time in total.
Folding on Slackware Linux.
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Sun Jan 17, 2010 6:00 am

This is from a draft I saved Fri Jan 08, 2010 12:39 pm and forgot to post.

Was my first post too much info? Not understandable? Did it just get missed?

I was trying to point out a problem I noticed on my system: when I stop a client running the a2 core, that WU ends up being lost. It has happened to me more than once. When I restart the client, in all but one case it continued to run from where it left off. The problem only shows up after the work has been completed: at that point a problem is found with work/wuresults_##.dat and all the work is lost. The client removes it from the queue and gets another assignment.

One thing I noticed, and this has been discussed elsewhere on the forum, is the time it takes to shut down the client. Note the time span between "... Done." and "- Shutting down core" in the log snippets below (a quick way to pull these pairs out of the log follows the snippets):

[08:02:32] Project: 2662 (Run 0, Clone 467, Gen 44)
[00:15:11] - Writing 26056329 bytes of core data to disk...
[00:15:13] ... Done.
[00:24:38] - Shutting down core
-- 9 minutes

[00:53:10] Project: 2662 (Run 0, Clone 148, Gen 65)
[16:18:19] - Writing 26011656 bytes of core data to disk...
[16:18:21] ... Done.
[16:29:51] - Shutting down core
-- 11 minutes

[17:00:39] Project: 2669 (Run 3, Clone 182, Gen 186)
[08:52:20] - Writing 26007282 bytes of core data to disk...
[08:52:21] ... Done.
[09:03:52] - Shutting down core
-- 11 minutes

[09:32:31] Project: 2677 (Run 39, Clone 80, Gen 64)
[23:36:09] - Writing 49211080 bytes of core data to disk...
[23:36:11] ... Done.
[23:51:59] - Shutting down core
-- 15 minutes
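To pull out just those pairs (assuming the client's default FAHlog.txt in the launch directory; the minute figures above were eyeballed from it):

Code: Select all
# Print only the "... Done." and "- Shutting down core" lines so the gaps are easy to see.
grep -E "\.\.\. Done\.|Shutting down core" FAHlog.txt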

I have also noticed this behavior when shutting down a client (via Ctrl-C or by killing fah6). At least one FahCore_a2.exe process continues to run for several minutes before it stops, and during this time it is obvious from the disk activity that it is still writing to disk. Once the a2 process dies, so does the disk I/O.
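A crude way to watch this after stopping the client is to poll for the leftover core process and the kernel's dirty/writeback counters (sketch only; pgrep comes from procps):

Code: Select all
# After killing fah6, watch the leftover core and the pending writeback.
while pgrep -x FahCore_a2.exe > /dev/null; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 5
done
echo "FahCore_a2.exe has exited; disk I/O should settle down now."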

After reviewing the thread about possible problems with the ext4 file system's delayed allocation (data=writeback), I am of the opinion that folding a2 cores on ext4 is not a good idea. I am in the process of verifying this: I have changed to a filesystem other than ext4, namely jfs. I could have selected ext3 but chose not to for my own reasons (I normally use jfs and should not have strayed).
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby bruce » Sun Jan 17, 2010 6:15 am

Is it reasonable to assume that this is only a problem if (A) Linux, and (B) FahCore_a2, and (C) ext4? If that's true, development will have a lot better chance of isolating the problem.
bruce
 
Posts: 22531
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Sun Jan 17, 2010 7:18 am

More to report on this. I have since gone back to jfs on the partition where fah is located. I have gone through several WUs now and have not lost one of them. (There is an exception, which I will cover later.) These WUs have been subjected to Ctrl-C, kill and killall (signals 2 and 15), and one power loss, and all of the interrupted WUs finished to 100% and successfully sent their results. These are the times I am getting now:

Note: This WU ran from start to finish without interruption.
[12:26:46] - Writing 48698707 bytes of core data to disk...
[12:26:48] ... Done.
[12:27:11] - Shutting down core
[12:27:11]
[12:27:11] Folding@home Core Shutdown: FINISHED_UNIT
[12:30:08] CoreStatus = 64 (100)
[12:30:08] Project: 2677 (Run 30, Clone 82, Gen 67)

Note: This WU was stopped once at 78% with a signal 15.
[12:50:28] - Writing 26399589 bytes of core data to disk...
[12:50:29] ... Done.
[12:50:47] - Shutting down core
[12:50:47]
[12:50:47] Folding@home Core Shutdown: FINISHED_UNIT
[12:53:48] CoreStatus = 64 (100)
[12:53:48] Project: 2662 (Run 2, Clone 493, Gen 44)

Note: This WU was killed via signal 15 at 31%
[06:57:29] - Writing 26004167 bytes of core data to disk...
[06:57:30] ... Done.
[06:57:47] - Shutting down core
[06:57:47]
[06:57:47] Folding@home Core Shutdown: FINISHED_UNIT
[07:00:51] CoreStatus = 64 (100)
[07:00:51] Project: 2669 (Run 3, Clone 57, Gen 43)

Note: This WU was stopped at 91% and 96% with a signal 15
[21:57:26] - Writing 25891479 bytes of core data to disk...
[21:57:27] ... Done.
[21:57:43] - Shutting down core
[21:57:43]
[21:57:43] Folding@home Core Shutdown: FINISHED_UNIT
[22:00:46] CoreStatus = 64 (100)
[22:00:46] Project: 2669 (Run 3, Clone 2, Gen 73)

As you can see, the time from "... Done." to "- Shutting down core" is now seconds rather than many minutes, and the whole wrap-up to CoreStatus is a consistent three minutes or so.

Another observation: when the a2 core fails, it leaves all of its data behind. Note the "[22:22:08] - Warning: Could not delete all work unit files (1): Core file absent". Also note that a new WU is downloaded and assigned to queue slot 02. Problem: there is an old (failed) WU in slot 02 that never got sent; it was "removed from the queue", but its data files are still present in the work directory. The result is "[22:24:12] - Error: Could not get length of results file work/wuresults_02.dat" and "[22:24:12] - Error: Could not read unit 02 file. Removing from queue." So fah6 downloads another WU and shifts to slot 03.

Code: Select all
--- Opening Log file [January 16 20:37:19 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp

[20:37:19] - Ask before connecting: No
[20:37:19] - User name: chrisretusn (Team 2291)
[20:37:19] - User ID: XXXX
[20:37:19] - Machine ID: 1
[20:37:19]
[20:37:19] Loaded queue successfully.
[20:37:19] - Autosending finished units... [January 16 20:37:19 UTC]
[20:37:19] Trying to send all finished work units
[20:37:19] + No unsent completed units remaining.
[20:37:19] - Autosend completed
[20:37:19]
[20:37:19] + Processing work unit
[20:37:19] At least 4 processors must be requested.Core required: FahCore_a2.exe
[20:37:19] Core found.
[20:37:19] Working on queue slot 01 [January 16 20:37:19 UTC]
[20:37:19] + Working ...
[20:37:19] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[20:37:19]
[20:37:19] *------------------------------*
[20:37:19] Folding@Home Gromacs SMP Core
[20:37:19] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[20:37:19]
[20:37:19] Preparing to commence simulation
[20:37:19] - Ensuring status. Please wait.
[20:37:19] Files status OK
[20:37:20] - Expanded 4844654 -> 24003985 (decompressed 495.4 percent)
[20:37:20] Called DecompressByteArray: compressed_data_size=4844654 data_size=24003985, decompressed_data_size=24003985 diff=0
[20:37:20] - Digital signature verified
[20:37:20]
[20:37:20] Project: 2669 (Run 3, Clone 2, Gen 73)
[20:37:20]
[20:37:20] Assembly optimizations on if available.
[20:37:20] Entering M.D.
[20:37:26] Using Gromacs checkpoints
[20:37:30]
[20:37:30] Entering M.D.
[20:37:36] Using Gromacs checkpoints
[20:37:44] data_01.log
[20:37:44] Verified work/wudata_01.trr
[20:37:44] Verified work/wudata_01.xtc
[20:37:44] Verified work/wudata_01.edr
[20:37:46] Completed 241230 out of 250000 steps  (96%)
[20:49:28] Completed 242500 out of 250000 steps  (97%)
[21:12:02] Completed 245000 out of 250000 steps  (98%)
[21:34:35] Completed 247500 out of 250000 steps  (99%)
[21:57:10] Completed 250000 out of 250000 steps  (100%)
[21:57:13] DynamicWrapper: Finished Work Unit: sleep=10000
[21:57:23]
[21:57:23] Finished Work Unit:
[21:57:23] - Reading up to 21148416 from "work/wudata_01.trr": Read 21148416
[21:57:23] trr file hash check passed.
[21:57:23] - Reading up to 4398316 from "work/wudata_01.xtc": Read 4398316
[21:57:23] xtc file hash check passed.
[21:57:23] edr file hash check passed.
[21:57:23] logfile size: 194611
[21:57:23] Leaving Run
[21:57:26] - Writing 25891479 bytes of core data to disk...
[21:57:27]   ... Done.
[21:57:43] - Shutting down core
[21:57:43]
[21:57:43] Folding@home Core Shutdown: FINISHED_UNIT
[22:00:46] CoreStatus = 64 (100)
[22:00:46] Unit 1 finished with 47 percent of time to deadline remaining.
[22:00:46] Updated performance fraction: 0.467449
[22:00:46] Sending work to server
[22:00:46] Project: 2669 (Run 3, Clone 2, Gen 73)


[22:00:46] + Attempting to send results [January 16 22:00:46 UTC]
[22:00:46] - Reading file work/wuresults_01.dat from core
[22:00:46]   (Read 25891479 bytes from disk)
[22:00:46] Connecting to http://171.64.65.56:8080/
[22:21:55] Posted data.
[22:21:55] Initial: 0000; - Uploaded at ~19 kB/s
[22:22:01] - Averaged speed for that direction ~19 kB/s
[22:22:01] + Results successfully sent
[22:22:01] Thank you for your contribution to Folding@Home.
[22:22:01] + Number of Units Completed: 55

[22:22:08] - Warning: Could not delete all work unit files (1): Core file absent
[22:22:08] Trying to send all finished work units
[22:22:08] + No unsent completed units remaining.
[22:22:08] - Preparing to get new work unit...
[22:22:08] + Attempting to get work packet
[22:22:08] - Will indicate memory of 3816 MB
[22:22:08] - Connecting to assignment server
[22:22:08] Connecting to http://assign.stanford.edu:8080/
[22:22:10] Posted data.
[22:22:10] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[22:22:10] + News From Folding@Home: Welcome to Folding@Home
[22:22:10] Loaded queue successfully.
[22:22:10] Connecting to http://171.64.65.56:8080/
[22:22:15] Posted data.
[22:22:15] Initial: 0000; - Receiving payload (expected size: 4855336)
[22:23:45] - Downloaded at ~52 kB/s
[22:23:45] - Averaged speed for that direction ~52 kB/s
[22:23:45] + Received work.
[22:23:45] Trying to send all finished work units
[22:23:45] + No unsent completed units remaining.
[22:23:45] + Closed connections
[22:23:45]
[22:23:45] + Processing work unit
[22:23:45] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:23:45] Core found.
[22:23:45] Working on queue slot 02 [January 16 22:23:45 UTC]
[22:23:45] + Working ...
[22:23:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[22:23:45]
[22:23:45] *------------------------------*
[22:23:45] Folding@Home Gromacs SMP Core
[22:23:45] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:23:45]
[22:23:45] Preparing to commence simulation
[22:23:45] - Ensuring status. Please wait.
[22:23:46] - Digital signature verAssembly optimizations on if available.
[22:23:46] Entering Entering M.D.
[22:23:52] Using Gromacs checkpoints
[22:23:56]  M.D.
[22:24:02] Using Gromacs checkpoints
[22:24:12] CoreStatus = FF (255)
[22:24:12] Sending work to server
[22:24:12] Project: 2662 (Run 0, Clone 428, Gen 52)
[22:24:12] - Error: Could not get length of results file work/wuresults_02.dat
[22:24:12] - Error: Could not read unit 02 file. Removing from queue.
[22:24:12] Trying to send all finished work units
[22:24:12] + No unsent completed units remaining.
[22:24:12] - Preparing to get new work unit...
[22:24:12] + Attempting to get work packet
[22:24:12] - Will indicate memory of 3816 MB
[22:24:12] - Connecting to assignment server
[22:24:12] Connecting to http://assign.stanford.edu:8080/
[22:24:13] Posted data.
[22:24:13] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[22:24:13] + News From Folding@Home: Welcome to Folding@Home
[22:24:13] Loaded queue successfully.
[22:24:13] Connecting to http://171.64.65.56:8080/
[22:24:18] Posted data.
[22:24:18] Initial: 0000; - Receiving payload (expected size: 4855336)
[22:25:48] - Downloaded at ~52 kB/s
[22:25:48] - Averaged speed for that direction ~52 kB/s
[22:25:48] + Received work.
[22:25:48] Trying to send all finished work units
[22:25:48] + No unsent completed units remaining.
[22:25:48] + Closed connections
[22:25:53]
[22:25:53] + Processing work unit
[22:25:53] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:25:53] Core found.
[22:25:53] Working on queue slot 03 [January 16 22:25:53 UTC]
[22:25:53] + Working ...
[22:25:53] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[22:25:53]
[22:25:53] *------------------------------*
[22:25:53] Folding@Home Gromacs SMP Core
[22:25:53] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:25:53]
[22:25:53] Preparing to commence simulation
[22:25:53] - Ensuring status. Please wait.
[22:26:02] - Looking at optimizations...
[22:26:02] - Working with standard loops on this execution.
[22:26:02] - Files status OK
[22:26:04] - Expanded 4854824 -> 24045785 (decompressed 495.2 percent)
[22:26:04] Called DecompressByteArray: compressed_data_size=4854824 data_size=24045785, decompressed_data_size=24045785 diff=0
[22:26:05] - Digital signature verified
[22:26:05]
[22:26:05] Project: 2662 (Run 0, Clone 428, Gen 52)
[22:26:05]
[22:26:05] Entering M.D.
[22:26:19] Completed 0 out of 250000 steps  (0%)
[22:49:13] Completed 2500 out of 250000 steps  (1%)
[23:12:05] Completed 5000 out of 250000 steps  (2%)
[23:34:57] Completed 7500 out of 250000 steps  (3%)
[23:57:48] Completed 10000 out of 250000 steps  (4%)
[00:20:39] Completed 12500 out of 250000 steps  (5%)
[00:43:32] Completed 15000 out of 250000 steps  (6%)
[01:06:23] Completed 17500 out of 250000 steps  (7%)
[01:29:14] Completed 20000 out of 250000 steps  (8%)
[01:52:02] Completed 22500 out of 250000 steps  (9%)
[02:14:52] Completed 25000 out of 250000 steps  (10%)
[02:37:19] - Autosending finished units... [January 17 02:37:19 UTC]
[02:37:19] Trying to send all finished work units
[02:37:19] + No unsent completed units remaining.
[02:37:19] - Autosend completed
[02:37:28] Completed 27500 out of 250000 steps  (11%)
[03:05:54] Completed 30000 out of 250000 steps  (12%)
[03:31:02] Completed 32500 out of 250000 steps  (13%)
[03:54:36] Completed 35000 out of 250000 steps  (14%)
[04:18:00] Completed 37500 out of 250000 steps  (15%)
[04:42:50] Completed 40000 out of 250000 steps  (16%)
[05:09:18] Completed 42500 out of 250000 steps  (17%)
[05:36:06] Completed 45000 out of 250000 steps  (18%)
[06:02:34] Completed 47500 out of 250000 steps  (19%)
[06:28:28] Completed 50000 out of 250000 steps  (20%)
[06:56:35] Completed 52500 out of 250000 steps  (21%)
[07:25:13] Completed 55000 out of 250000 steps  (22%)


This is the same period, but with the log captured using "./fah6 >> /home/folder/fah6-1/fah.log 2>&1" in my startup script.
Code: Select all
--- Opening Log file [January 16 20:37:19 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp

[20:37:19] - Ask before connecting: No
[20:37:19] - User name: chrisretusn (Team 2291)
[20:37:19] - User ID: XXXX
[20:37:19] - Machine ID: 1
[20:37:19]
[20:37:19] Loaded queue successfully.
[20:37:19] - Autosending finished units... [January 16 20:37:19 UTC]
[20:37:19] Trying to send all finished work units
[20:37:19] + No unsent completed units remaining.
[20:37:19] - Autosend completed
[20:37:19]
[20:37:19] + Processing work unit
[20:37:19] At least 4 processors must be requested.Core required: FahCore_a2.exe
[20:37:19] Core found.
[20:37:19] Working on queue slot 01 [January 16 20:37:19 UTC]
[20:37:19] + Working ...
[20:37:19] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[20:37:19]
[20:37:19] *------------------------------*
[20:37:19] Folding@Home Gromacs SMP Core
[20:37:19] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[20:37:19]
[20:37:19] Preparing to commence simulation
[20:37:19] - Ensuring status. Please wait.
[20:37:19] Files status OK
[20:37:20] - Expanded 4844654 -> 24003985 (decompressed 495.4 percent)
[20:37:20] Called DecompressByteArray: compressed_data_size=4844654 data_size=24003985, decompressed_data_size=24003985 diff=0
[20:37:20] - Digital signature verified
[20:37:20]
[20:37:20] Project: 2669 (Run 3, Clone 2, Gen 73)
[20:37:20]
[20:37:20] Assembly optimizations on if available.
[20:37:20] Entering M.D.
[20:37:26] Using Gromacs checkpoints
[20:37:30]
[20:37:30] Entering M.D.
[20:37:36] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=0 argc=23
NNODES=4, MYRANK=3, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_01.cpt generated: Sun Jan 17 04:27:30 2010


NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22890 system'
18500002 steps,  37000.0 ps (continuing from step 18491232,  36982.5 ps).
[20:37:44] data_01.log
[20:37:44] Verified work/wudata_01.trr
[20:37:44] Verified work/wudata_01.xtc
[20:37:44] Verified work/wudata_01.edr
[20:37:46] Completed 241230 out of 250000 steps  (96%)
[20:49:28] Completed 242500 out of 250000 steps  (97%)
[21:12:02] Completed 245000 out of 250000 steps  (98%)
[21:34:35] Completed 247500 out of 250000 steps  (99%)
[21:57:10] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 1.2 %
 Part of the total run time spent waiting due to load imbalance: 0.8 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:   4769.204   4769.204    100.0
                       1h19:29
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     95.230      4.004      0.318     75.520

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[21:57:13] DynamicWrapper: Finished Work Unit: sleep=10000
[21:57:23]
[21:57:23] Finished Work Unit:
[21:57:23] - Reading up to 21148416 from "work/wudata_01.trr": Read 21148416
[21:57:23] trr file hash check passed.
[21:57:23] - Reading up to 4398316 from "work/wudata_01.xtc": Read 4398316
[21:57:23] xtc file hash check passed.
[21:57:23] edr file hash check passed.
[21:57:23] logfile size: 194611
[21:57:23] Leaving Run
[21:57:26] - Writing 25891479 bytes of core data to disk...
[21:57:27]   ... Done.
[21:57:43] - Shutting down core
[21:57:43]
[21:57:43] Folding@home Core Shutdown: FINISHED_UNIT
Attempting to use an MPI routine after finalizing MPICH
[22:00:46] CoreStatus = 64 (100)
[22:00:46] Unit 1 finished with 47 percent of time to deadline remaining.
[22:00:46] Updated performance fraction: 0.467449
[22:00:46] Sending work to server
[22:00:46] Project: 2669 (Run 3, Clone 2, Gen 73)


[22:00:46] + Attempting to send results [January 16 22:00:46 UTC]
[22:00:46] - Reading file work/wuresults_01.dat from core
[22:00:46]   (Read 25891479 bytes from disk)
[22:00:46] Connecting to http://171.64.65.56:8080/
[22:21:55] Posted data.
[22:21:55] Initial: 0000; - Uploaded at ~19 kB/s
[22:22:01] - Averaged speed for that direction ~19 kB/s
[22:22:01] + Results successfully sent
[22:22:01] Thank you for your contribution to Folding@Home.
[22:22:01] + Number of Units Completed: 55

[22:22:08] - Warning: Could not delete all work unit files (1): Core file absent
[22:22:08] Trying to send all finished work units
[22:22:08] + No unsent completed units remaining.
[22:22:08] - Preparing to get new work unit...
[22:22:08] + Attempting to get work packet
[22:22:08] - Will indicate memory of 3816 MB
[22:22:08] - Connecting to assignment server
[22:22:08] Connecting to http://assign.stanford.edu:8080/
[22:22:10] Posted data.
[22:22:10] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[22:22:10] + News From Folding@Home: Welcome to Folding@Home
[22:22:10] Loaded queue successfully.
[22:22:10] Connecting to http://171.64.65.56:8080/
[22:22:15] Posted data.
[22:22:15] Initial: 0000; - Receiving payload (expected size: 4855336)
[22:23:45] - Downloaded at ~52 kB/s
[22:23:45] - Averaged speed for that direction ~52 kB/s
[22:23:45] + Received work.
[22:23:45] Trying to send all finished work units
[22:23:45] + No unsent completed units remaining.
[22:23:45] + Closed connections
[22:23:45]
[22:23:45] + Processing work unit
[22:23:45] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:23:45] Core found.
[22:23:45] Working on queue slot 02 [January 16 22:23:45 UTC]
[22:23:45] + Working ...
[22:23:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[22:23:45]
[22:23:45] *------------------------------*
[22:23:45] Folding@Home Gromacs SMP Core
[22:23:45] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:23:45]
[22:23:45] Preparing to commence simulation
[22:23:45] - Ensuring status. Please wait.
[22:23:46] - Digital signature verAssembly optimizations on if available.
[22:23:46] Entering Entering M.D.
[22:23:52] Using Gromacs checkpoints
[22:23:56]  M.D.
[22:24:02] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NODEID=0 argc=23
NODEID=1 argc=23
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=2 argc=23
NODEID=3 argc=23
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_02.cpt generated: Sat Dec 19 22:31:39 2009


-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147986 atoms, while the current system consists of 147170 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[22:24:12] CoreStatus = FF (255)
[22:24:12] Sending work to server
[22:24:12] Project: 2662 (Run 0, Clone 428, Gen 52)
[22:24:12] - Error: Could not get length of results file work/wuresults_02.dat
[22:24:12] - Error: Could not read unit 02 file. Removing from queue.
[22:24:12] Trying to send all finished work units
[22:24:12] + No unsent completed units remaining.
[22:24:12] - Preparing to get new work unit...
[22:24:12] + Attempting to get work packet
[22:24:12] - Will indicate memory of 3816 MB
[22:24:12] - Connecting to assignment server
[22:24:12] Connecting to http://assign.stanford.edu:8080/
[22:24:13] Posted data.
[22:24:13] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[22:24:13] + News From Folding@Home: Welcome to Folding@Home
[22:24:13] Loaded queue successfully.
[22:24:13] Connecting to http://171.64.65.56:8080/
[22:24:18] Posted data.
[22:24:18] Initial: 0000; - Receiving payload (expected size: 4855336)
[22:25:48] - Downloaded at ~52 kB/s
[22:25:48] - Averaged speed for that direction ~52 kB/s
[22:25:48] + Received work.
[22:25:48] Trying to send all finished work units
[22:25:48] + No unsent completed units remaining.
[22:25:48] + Closed connections
[22:25:53]
[22:25:53] + Processing work unit
[22:25:53] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:25:53] Core found.
[22:25:53] Working on queue slot 03 [January 16 22:25:53 UTC]
[22:25:53] + Working ...
[22:25:53] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[22:25:53]
[22:25:53] *------------------------------*
[22:25:53] Folding@Home Gromacs SMP Core
[22:25:53] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:25:53]
[22:25:53] Preparing to commence simulation
[22:25:53] - Ensuring status. Please wait.
[22:26:02] - Looking at optimizations...
[22:26:02] - Working with standard loops on this execution.
[22:26:02] - Files status OK
[22:26:04] - Expanded 4854824 -> 24045785 (decompressed 495.2 percent)
[22:26:04] Called DecompressByteArray: compressed_data_size=4854824 data_size=24045785, decompressed_data_size=24045785 diff=0
[22:26:05] - Digital signature verified
[22:26:05]
[22:26:05] Project: 2662 (Run 0, Clone 428, Gen 52)
[22:26:05]
[22:26:05] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=racermach
NODEID=0 argc=20
Reading file work/wudata_03.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=1, HOSTNAME=racermach
NODEID=1 argc=20
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=2 argc=20
NNODES=4, MYRANK=3, HOSTNAME=racermach
NODEID=3 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
13250000 steps,  26500.0 ps (continuing from step 13000000,  26000.0 ps).
[22:26:19] Completed 0 out of 250000 steps  (0%)
[22:49:13] Completed 2500 out of 250000 steps  (1%)
[23:12:05] Completed 5000 out of 250000 steps  (2%)
[23:34:57] Completed 7500 out of 250000 steps  (3%)
[23:57:48] Completed 10000 out of 250000 steps  (4%)
[00:20:39] Completed 12500 out of 250000 steps  (5%)
[00:43:32] Completed 15000 out of 250000 steps  (6%)
[01:06:23] Completed 17500 out of 250000 steps  (7%)
[01:29:14] Completed 20000 out of 250000 steps  (8%)
[01:52:02] Completed 22500 out of 250000 steps  (9%)
[02:14:52] Completed 25000 out of 250000 steps  (10%)
[02:37:19] - Autosending finished units... [January 17 02:37:19 UTC]
[02:37:19] Trying to send all finished work units
[02:37:19] + No unsent completed units remaining.
[02:37:19] - Autosend completed
[02:37:28] Completed 27500 out of 250000 steps  (11%)
[03:05:54] Completed 30000 out of 250000 steps  (12%)
[03:31:02] Completed 32500 out of 250000 steps  (13%)
[03:54:36] Completed 35000 out of 250000 steps  (14%)
[04:18:00] Completed 37500 out of 250000 steps  (15%)
[04:42:50] Completed 40000 out of 250000 steps  (16%)
[05:09:18] Completed 42500 out of 250000 steps  (17%)
[05:36:06] Completed 45000 out of 250000 steps  (18%)
[06:02:34] Completed 47500 out of 250000 steps  (19%)
[06:28:28] Completed 50000 out of 250000 steps  (20%)
[06:56:35] Completed 52500 out of 250000 steps  (21%)
[07:25:13] Completed 55000 out of 250000 steps  (22%)


My current setup, for those interested, is an AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ running Slackware64 13.0. At present root is ext4, while /home is back to jfs from ext4, and that is where fah resides. I plan on dumping ext4 on root and going back to jfs; that is what I normally run as a file system anyway. I start fah6 with an rc.folding script, and I have since added lines to stop fah6 to my rc.local_shutdown script. Just to be on the safe side, after killing fah6 with a signal 15 it does a "sleep 15" and then a sync.
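The lines added to rc.local_shutdown amount to this (sketch):

Code: Select all
# Excerpt from /etc/rc.d/rc.local_shutdown (sketch only).
killall -15 fah6    # SIGTERM the client
sleep 15            # let FahCore_a2.exe finish writing
sync                # flush buffers before the filesystems are unmounted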

Since shifting to jfs I have not had any more problems with losing WUs that were shut down while running.

On the leftover queue slot files in the work directory: I am deleting all of the wudata_## files except those for the active WU in slot 03. It does appear that I can leave wudata_##_prev.cpt and logfile_##.txt alone. I also plan on deleting the old core78.sta, plus those wuinfo_##.dat files. Here is a listing of my work directory (a sketch of the cleanup follows the listing).
Code: Select all
-rw-r--r-- 1 folder users        0 2010-01-17 06:23 core78.sta         
-rw-r--r-- 1 folder users   476137 2009-12-17 23:12 current.xyz       
-rwxr-x--- 1 folder users      221 2010-01-15 15:22 logfile_00.txt*   
-rwxr-x--- 1 folder users      221 2010-01-17 06:22 logfile_01.txt*   
-rwxr-x--- 1 folder users      560 2010-01-17 06:24 logfile_02.txt*   
-rwxr-x--- 1 folder users     1463 2010-01-17 14:56 logfile_03.txt*   
-rwxr-x--- 1 folder users      221 2010-01-07 08:41 logfile_04.txt*   
-rwxr-x--- 1 folder users     3651 2010-01-09 09:30 logfile_05.txt*   
-rwxr-x--- 1 folder users      560 2010-01-10 05:27 logfile_06.txt*   
-rwxr-x--- 1 folder users      221 2010-01-11 21:10 logfile_07.txt*   
-rwxr-x--- 1 folder users      221 2010-01-13 21:15 logfile_08.txt*   
-rwxr-x--- 1 folder users      559 2010-01-13 21:17 logfile_09.txt*   
-rw-r--r-- 1 folder users  3526352 2010-01-15 14:45 wudata_00_prev.cpt
-rw-r--r-- 1 folder users  3526352 2010-01-17 05:52 wudata_01_prev.cpt
-rw-r--r-- 1 folder users    75160 2009-12-19 22:31 wudata_02.ckp     
-rw-r--r-- 1 folder users  3553400 2009-12-19 22:31 wudata_02.cpt     
-rw-r--r-- 1 folder users  4855336 2010-01-17 06:23 wudata_02.dat     
-rw-r--r-- 1 folder users        0 2009-12-19 22:43 wudata_02.dyn     
-rw-r--r-- 1 folder users    45672 2009-12-19 22:30 wudata_02.edr     
-rw-r--r-- 1 folder users    64338 2009-12-19 22:31 wudata_02.log     
-rw-r--r-- 1 folder users 24045785 2010-01-17 06:23 wudata_02.tpr     
-rw-r--r-- 1 folder users  7103568 2009-12-19 18:24 wudata_02.trr     
-rw-r--r-- 1 folder users  1480532 2009-12-19 22:30 wudata_02.xtc     
-rw-r--r-- 1 folder users  3553400 2009-12-19 22:16 wudata_02_prev.cpt
-rw-r--r-- 1 folder users    75160 2010-01-17 14:56 wudata_03.ckp     
-rw-r--r-- 1 folder users  3533816 2010-01-17 14:56 wudata_03.cpt     
-rw-r--r-- 1 folder users  4855336 2010-01-17 06:25 wudata_03.dat     
-rw-r--r-- 1 folder users    32056 2010-01-17 14:51 wudata_03.edr     
-rw-r--r-- 1 folder users    48074 2010-01-17 14:56 wudata_03.log     
-rw-r--r-- 1 folder users 24045785 2010-01-17 06:26 wudata_03.tpr     
-rw-r--r-- 1 folder users  7064400 2010-01-17 14:28 wudata_03.trr     
-rw-r--r-- 1 folder users   959868 2010-01-17 14:28 wudata_03.xtc     
-rw-r--r-- 1 folder users  3533816 2010-01-17 14:41 wudata_03_prev.cpt
-rw-r--r-- 1 folder users  3533336 2010-01-07 07:32 wudata_04_prev.cpt
-rw-r--r-- 1 folder users    75160 2010-01-09 09:30 wudata_05.ckp     
-rw-r--r-- 1 folder users  3534976 2010-01-09 09:30 wudata_05.cpt     
-rw-r--r-- 1 folder users  4830045 2010-01-07 08:43 wudata_05.dat     
-rw-r--r-- 1 folder users        0 2010-01-08 10:27 wudata_05.dyn
-rw-r--r-- 1 folder users   149512 2010-01-09 09:30 wudata_05.edr
-rw-r--r-- 1 folder users 10160241 2010-01-09 09:30 wudata_05.gro
-rw-r--r-- 1 folder users   189407 2010-01-09 09:30 wudata_05.log
-rw-r--r-- 1 folder users 24054765 2010-01-08 10:31 wudata_05.tpr
-rw-r--r-- 1 folder users 24738672 2010-01-09 09:30 wudata_05.trr
-rw-r--r-- 1 folder users 27702512 2010-01-09 09:30 wudata_05.xtc
-rw-r--r-- 1 folder users  3535712 2010-01-09 09:16 wudata_05_prev.cpt
-rw-r--r-- 1 folder users        0 2010-01-10 04:48 wudata_06.ckp
-rw-r--r-- 1 folder users        0 2010-01-10 04:48 wudata_06.cpt
-rw-r--r-- 1 folder users  4835244 2010-01-09 09:32 wudata_06.dat
-rw-r--r-- 1 folder users    76808 2010-01-10 04:41 wudata_06.edr
-rw-r--r-- 1 folder users   101398 2010-01-10 04:48 wudata_06.log
-rw-r--r-- 1 folder users 24042365 2010-01-10 05:27 wudata_06.tpr
-rw-r--r-- 1 folder users  7064448 2010-01-10 00:49 wudata_06.trr
-rw-r--r-- 1 folder users 13580804 2010-01-10 04:16 wudata_06.xtc
-rw-r--r-- 1 folder users  3533840 2010-01-10 04:33 wudata_06_prev.cpt
-rw-r--r-- 1 folder users  3533840 2010-01-11 20:14 wudata_07_prev.cpt
-rw-r--r-- 1 folder users  3553400 2010-01-13 20:48 wudata_08_prev.cpt
-rw-r--r-- 1 folder users    75160 2009-12-31 16:00 wudata_09.ckp
-rw-r--r-- 1 folder users  3533816 2009-12-31 16:00 wudata_09.cpt
-rw-r--r-- 1 folder users  4843124 2010-01-13 21:17 wudata_09.dat
-rw-r--r-- 1 folder users        0 2009-12-30 17:55 wudata_09.dyn
-rw-r--r-- 1 folder users   149272 2009-12-31 16:00 wudata_09.edr
-rw-r--r-- 1 folder users 10154781 2009-12-31 16:00 wudata_09.gro
-rw-r--r-- 1 folder users   188947 2009-12-31 16:00 wudata_09.log
-rw-r--r-- 1 folder users 24003985 2010-01-13 21:17 wudata_09.tpr
-rw-r--r-- 1 folder users 21193200 2009-12-31 16:00 wudata_09.trr
-rw-r--r-- 1 folder users  4513328 2009-12-31 16:00 wudata_09.xtc
-rw-r--r-- 1 folder users  3533816 2009-12-31 15:56 wudata_09_prev.cpt
-rwxr-x--- 1 folder users      512 2010-01-17 06:23 wuinfo_02.dat*
-rwxr-x--- 1 folder users      512 2010-01-17 14:56 wuinfo_03.dat*
-rwxr-x--- 1 folder users      512 2010-01-09 09:30 wuinfo_05.dat*
-rwxr-x--- 1 folder users      512 2010-01-10 05:27 wuinfo_06.dat*
-rwxr-x--- 1 folder users      512 2010-01-13 21:17 wuinfo_09.dat*
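The cleanup itself is along these lines, with the client stopped and 03 as the active slot (only a sketch; double-check which slot is actually active before deleting anything):

Code: Select all
# Client stopped; slot 03 is the active WU, so its files stay put.
cd /home/folder/fah6-1/work
for s in 00 01 02 04 05 06 07 08 09; do
    rm -f wudata_${s}.* wuinfo_${s}.dat
done
# logfile_##.txt and wudata_##_prev.cpt appear to be safe to leave alone;
# the *_prev.cpt files are not touched by the globs above.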
Last edited by chrisretusn on Sun Jan 17, 2010 8:35 am, edited 3 times in total.
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Sun Jan 17, 2010 7:21 am

bruce wrote:Is it reasonable to assume that this is only a problem if (A) Linux, and (B) FahCore_a2, and (C) ext4? If that's true, development will have a lot better chance of isolating the problem.

Yes, I think that is reasonable. I am using jfs now and the problem has gone away, though I think I remember someone mentioning problems with XFS too, with either write times or losing WUs if shut down.

I think it has a lot to do with the ext4 file system's delayed allocation (data=writeback) problem (see the mount-option sketch after the links below).

Delayed allocation and the zero-length file problem | Thoughts by Ted

Delayed allocation and the zero-length file problem | The Linux Foundation

Linus Torvalds Upset over Ext3 and Ext4 - Linux Magazine Online
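For anyone who wants to experiment before giving up on ext4 entirely, the knobs live in the mount options; something like this is what I would try (sketch only, the device name is made up, and I have simply switched to jfs instead):

Code: Select all
# See which options the fah partition is currently mounted with.
grep ' /home ' /proc/mounts

# Example /etc/fstab line for testing with delayed allocation disabled.
# nodelalloc turns delayed allocation off; auto_da_alloc (default since
# kernel 2.6.30) is the kernel's own workaround for the rename/truncate case.
# /dev/sda3   /home   ext4   defaults,nodelalloc   1 2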
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby tear » Sun Jan 17, 2010 12:43 pm

Chris,

I think one of your observations deserves to be emphasized and defined as a separate issue:
chrisretusn wrote:Another observation: when the a2 core fails, it leaves all of its data behind.


I've seen it at my end as well -- it basically makes the affected queue slot unusable (unless the files are manually removed). Now, imagine what happens after 10 such failures...


tear


P.S.
And while we're at it, A3 (and A1, perhaps?) cores should get examined too
One man's ceiling is another man's floor.
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Mon Jan 18, 2010 3:02 am

tear wrote:Chris,

I think one of your observations deserves to be emphasized and defined as a separate issue:
chrisretusn wrote:Another observation: when the a2 core fails, it leaves all of its data behind.

I agree. My logs above show this problem, specifically this excerpt, with non-relevant portions removed and comments added. This log is the one in which I redirected stderr and stdout to a separate fah.log; it shows a bit more detail.
Code: Select all
[22:22:01] Thank you for your contribution to Folding@Home.
[22:22:01] + Number of Units Completed: 55
[22:22:08] - Warning: Could not delete all work unit files (1): Core file absent
/* The WU that completed was using queue slot 01. Looking at the directory listing in my post above, all of the slot 01 files have been deleted except logfile_01.txt and wudata_01_prev.cpt. Somewhere between the completion of that WU and this log entry, the core data files were already deleted. It seems to me there is a mix-up about when to delete the work files.
*/
[22:23:45] + Received work.
[22:23:45] + Processing work unit
[22:23:45] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:23:45] Core found.
[22:23:45] Working on queue slot 02 [January 16 22:23:45 UTC]
[22:23:45] + Working ...
[22:23:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[22:23:45] Folding@Home Gromacs SMP Core
[22:23:45] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:23:45]
[22:23:45] Preparing to commence simulation
[22:23:45] - Ensuring status. Please wait.
[22:23:46] - Digital signature verAssembly optimizations on if available.
[22:23:46] Entering Entering M.D.
[22:23:52] Using Gromacs checkpoints
[22:23:56]  M.D.
[22:24:02] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NODEID=0 argc=23
NODEID=1 argc=23
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=2 argc=23
NODEID=3 argc=23
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_02.cpt generated: Sat Dec 19 22:31:39 2009

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147986 atoms, while the current system consists of 147170 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day
/* Here we have a problem: the checkpoint file from a previous WU was read; it had not been deleted. Of course it did not match the newly downloaded WU, so an error was generated. What would happen if it did match?

That previous WU (slot 02) failed after 100%. It had been shut down at least once during processing (running under the ext4 fs). When it was restarted it passed its checks and continued on until completion. When the wuresults_02.dat file was read prior to sending the results, a problem was encountered:
#### Previous WU
[01:30:52] Sending work to server
[01:30:52] - Error: Could not get length of results file work/wuresults_02.dat
[01:30:52] - Error: Could not read unit 02 file. Removing from queue.
####
*/
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[22:24:12] CoreStatus = FF (255)
[22:24:12] Sending work to server
[22:24:12] Project: 2662 (Run 0, Clone 428, Gen 52)
[22:24:12] - Error: Could not get length of results file work/wuresults_02.dat
[22:24:12] - Error: Could not read unit 02 file. Removing from queue.
/* Since there was an error, slot 02 was removed from the queue, but its files remain. The next WU went to slot 03, where only logfile_03.txt and wudata_03_prev.cpt existed, so that WU went on to process and is currently at 68%. */


tear wrote:I've seen it at my end as well -- it basically makes the affected queue slot unusable (unless the files are manually removed). Now, imagine what happens after 10 such failures...

And while we're at it, A3 (and A1, perhaps?) cores should get examined too

Looking at what I had in my work directory (see my previous post), slots 02, 05, 06 and 09 would not have been usable. I am working on slot 03 now; the next WU (slot 04) would start fine. After that finished, a new job would be downloaded and started in slot 05, which would error out and be removed from the queue (with its files left behind); then another new job would be downloaded, and slot 06 would repeat what happened with 05. Then yet another job would be downloaded for slot 07. If all of the slots were in that state, nothing would get done except the repeated downloading of new WUs. Not good.

I have since removed all of those extra files. Here is a file list as it stands now.
Code: Select all
-rw-r--r-- 1 folder users        0 2010-01-17 19:19 core78.sta
-rw-r--r-- 1 folder users   476137 2009-12-17 23:12 current.xyz
-rwxr-x--- 1 folder users      221 2010-01-15 15:22 logfile_00.txt*
-rwxr-x--- 1 folder users      221 2010-01-17 06:22 logfile_01.txt*
-rwxr-x--- 1 folder users      560 2010-01-17 06:24 logfile_02.txt*
-rwxr-x--- 1 folder users     2490 2010-01-18 10:30 logfile_03.txt*
-rwxr-x--- 1 folder users      221 2010-01-07 08:41 logfile_04.txt*
-rwxr-x--- 1 folder users     3651 2010-01-09 09:30 logfile_05.txt*
-rwxr-x--- 1 folder users      560 2010-01-10 05:27 logfile_06.txt*
-rwxr-x--- 1 folder users      221 2010-01-11 21:10 logfile_07.txt*
-rwxr-x--- 1 folder users      221 2010-01-13 21:15 logfile_08.txt*
-rwxr-x--- 1 folder users      559 2010-01-13 21:17 logfile_09.txt*
-rw-r--r-- 1 folder users  3526352 2010-01-15 14:45 wudata_00_prev.cpt
-rw-r--r-- 1 folder users  3526352 2010-01-17 05:52 wudata_01_prev.cpt
-rw-r--r-- 1 folder users  3553400 2009-12-19 22:16 wudata_02_prev.cpt
-rw-r--r-- 1 folder users    75160 2010-01-18 10:34 wudata_03.ckp
-rw-r--r-- 1 folder users  3533816 2010-01-18 10:34 wudata_03.cpt
-rw-r--r-- 1 folder users  4855336 2010-01-17 06:25 wudata_03.dat
-rw-r--r-- 1 folder users        0 2010-01-17 19:02 wudata_03.dyn
-rw-r--r-- 1 folder users   104280 2010-01-18 10:46 wudata_03.edr
-rw-r--r-- 1 folder users   135054 2010-01-18 10:46 wudata_03.log
-rw-r--r-- 1 folder users 24045785 2010-01-17 19:19 wudata_03.tpr
-rw-r--r-- 1 folder users 14128800 2010-01-18 06:50 wudata_03.trr
-rw-r--r-- 1 folder users  3066668 2010-01-18 10:04 wudata_03.xtc
-rw-r--r-- 1 folder users  3533816 2010-01-18 10:19 wudata_03_prev.cpt
-rw-r--r-- 1 folder users  3533336 2010-01-07 07:32 wudata_04_prev.cpt
-rw-r--r-- 1 folder users  3535712 2010-01-09 09:16 wudata_05_prev.cpt
-rw-r--r-- 1 folder users  3533840 2010-01-10 04:33 wudata_06_prev.cpt
-rw-r--r-- 1 folder users  3533840 2010-01-11 20:14 wudata_07_prev.cpt
-rw-r--r-- 1 folder users  3553400 2010-01-13 20:48 wudata_08_prev.cpt
-rw-r--r-- 1 folder users  3533816 2009-12-31 15:56 wudata_09_prev.cpt
-rwxr-x--- 1 folder users      512 2010-01-18 10:30 wuinfo_03.dat*


To continue with the ext4 problem: I am running on jfs now, and here is a continuation of my current log showing when a power failure occurred. While the WU is not yet finished, I am confident that it will complete and send its data. Note: I have my systems on UPS's that automatically shut down my machines.
Code: Select all
[07:50:41] Completed 57500 out of 250000 steps  (23%)
[08:15:44] Completed 60000 out of 250000 steps  (24%)
[08:37:19] - Autosending finished units... [January 17 08:37:19 UTC]
[08:37:19] Trying to send all finished work units
[08:37:19] + No unsent completed units remaining.
[08:37:19] - Autosend completed
[08:39:33] Completed 62500 out of 250000 steps  (25%)
[09:03:08] Completed 65000 out of 250000 steps  (26%)
[09:29:54] Completed 67500 out of 250000 steps  (27%)
[10:08:45] Completed 70000 out of 250000 steps  (28%)
[10:40:11] Completed 72500 out of 250000 steps  (29%)
[11:01:58] ***** Got a SIGTERM signal (15)
[11:01:58] Killing all core threads

Folding@Home Client Shutdown.


Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step


 Average load imbalance: 4.5 %
 Part of the total run time spent waiting due to load imbalance: 2.7 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:  45342.678  45342.678    100.0
                       12h35:42
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     85.400      3.589      0.284     84.591

gcq#0: Thanx for Using GROMACS - Have a Nice Day

Attempting to use an MPI routine after finalizing MPICH

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [January 17 11:19:08 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp

[11:19:08] - Ask before connecting: No
[11:19:08] - User name: chrisretusn (Team 2291)
[11:19:08] - User ID: 83719AB3EDA1FA2
[11:19:08] - Machine ID: 1
[11:19:08]
[11:19:08] Loaded queue successfully.
[11:19:08] - Autosending finished units... [January 17 11:19:08 UTC]
[11:19:08] Trying to send all finished work units
[11:19:08] + No unsent completed units remaining.
[11:19:08] - Autosend completed
[11:19:08]
[11:19:08] + Processing work unit
[11:19:08] At least 4 processors must be requested.Core required: FahCore_a2.exe
[11:19:08] Core found.
[11:19:08] Working on queue slot 03 [January 17 11:19:08 UTC]
[11:19:08] + Working ...
[11:19:08] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -verbose -lifeline 3260 -version 624'

[11:19:08]
[11:19:08] *------------------------------*
[11:19:08] Folding@Home Gromacs SMP Core
[11:19:08] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[11:19:08]
[11:19:08] Preparing to commence simulation
[11:19:08] - Ensuring status. Please wait.
[11:19:08] Files status OK
[11:19:10] - Expanded 4854824 -> 24045785 (decompressed 495.2 percent)
[11:19:10] Called DecompressByteArray: compressed_data_size=4854824 data_size=24045785, decompressed_data_size=24045785 diff=0
[11:19:10] - Digital signature verified
[11:19:10]
[11:19:10] Project: 2662 (Run 0, Clone 428, Gen 52)
[11:19:10]
[11:19:10] Assembly optimizations on if available.
[11:19:10] Entering M.D.
[11:19:18] Using Gromacs checkpoints
[11:19:20]
[11:19:20] Entering M.D.
[11:19:26] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
Reading file work/wudata_03.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_03.cpt generated: Sun Jan 17 19:01:59 2010


NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
13250000 steps,  26500.0 ps (continuing from step 13074447,  26148.9 ps).
[11:19:34] Resuming from checkpoint
[11:19:34] Verified work/wudata_03.log
[11:19:34] Verified work/wudata_03.trr
[11:19:34] Verified work/wudata_03.xtc
[11:19:34] Verified work/wudata_03.edr
[11:19:35] Completed 74447 out of 250000 steps  (29%)
[11:25:24] Completed 75000 out of 250000 steps  (30%)
[11:50:36] Completed 77500 out of 250000 steps  (31%)
[12:14:49] Completed 80000 out of 250000 steps  (32%)
[12:39:36] Completed 82500 out of 250000 steps  (33%)
[13:04:32] Completed 85000 out of 250000 steps  (34%)
[13:30:25] Completed 87500 out of 250000 steps  (35%)
[13:54:14] Completed 90000 out of 250000 steps  (36%)
[14:17:03] Completed 92500 out of 250000 steps  (37%)
[14:39:38] Completed 95000 out of 250000 steps  (38%)
[15:02:12] Completed 97500 out of 250000 steps  (39%)
[15:24:46] Completed 100000 out of 250000 steps  (40%)
[15:47:19] Completed 102500 out of 250000 steps  (41%)
[16:09:52] Completed 105000 out of 250000 steps  (42%)
[16:32:25] Completed 107500 out of 250000 steps  (43%)
[16:54:58] Completed 110000 out of 250000 steps  (44%)
[17:17:28] Completed 112500 out of 250000 steps  (45%)
[17:19:08] - Autosending finished units... [January 17 17:19:08 UTC]
[17:19:08] Trying to send all finished work units
[17:19:08] + No unsent completed units remaining.
[17:19:08] - Autosend completed
[17:40:02] Completed 115000 out of 250000 steps  (46%)
[18:02:35] Completed 117500 out of 250000 steps  (47%)
[18:25:08] Completed 120000 out of 250000 steps  (48%)
[18:47:42] Completed 122500 out of 250000 steps  (49%)
[19:10:16] Completed 125000 out of 250000 steps  (50%)
[19:32:48] Completed 127500 out of 250000 steps  (51%)
[19:55:21] Completed 130000 out of 250000 steps  (52%)
[20:17:53] Completed 132500 out of 250000 steps  (53%)
[20:40:27] Completed 135000 out of 250000 steps  (54%)
[21:03:01] Completed 137500 out of 250000 steps  (55%)
[21:24:35] Completed 140000 out of 250000 steps  (56%)
[21:46:06] Completed 142500 out of 250000 steps  (57%)
[22:07:37] Completed 145000 out of 250000 steps  (58%)
[22:29:08] Completed 147500 out of 250000 steps  (59%)
[22:50:39] Completed 150000 out of 250000 steps  (60%)
[23:12:11] Completed 152500 out of 250000 steps  (61%)
[23:19:08] - Autosending finished units... [January 17 23:19:08 UTC]
[23:19:08] Trying to send all finished work units
[23:19:08] + No unsent completed units remaining.
[23:19:08] - Autosend completed
[23:35:01] Completed 155000 out of 250000 steps  (62%)
[23:59:57] Completed 157500 out of 250000 steps  (63%)
[00:25:23] Completed 160000 out of 250000 steps  (64%)
[00:50:14] Completed 162500 out of 250000 steps  (65%)
[01:14:06] Completed 165000 out of 250000 steps  (66%)
[01:38:41] Completed 167500 out of 250000 steps  (67%)
[02:04:23] Completed 170000 out of 250000 steps  (68%)
[02:30:52] Completed 172500 out of 250000 steps  (69%)
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby tear » Mon Jan 18, 2010 1:25 pm

chrisretusn wrote:
tear wrote:Chris,

I think one of your observations deserves to be emphasized and defined as a separate issue:
chrisretusn wrote:Other observations: when the core_a2 core fails, it leaves all of its data behind.

I agree. My logs above show this problem, specifically this excerpt with non-relevant portions removed and comments added. This log is the one in which I redirected stderr and stdout to a separate fah.log; it shows a bit more detail.
Code: Select all
[22:22:01] Thank you for your contribution to Folding@Home.
[22:22:01] + Number of Units Completed: 55
[22:22:08] - Warning: Could not delete all work unit files (1): Core file absent
/* The completed WU was using queue slot 01. Looking at the directory listing in my post above, all of the slot 01 files have been deleted except logfile_01.txt and wudata_01_prev.cpt. Somewhere between the completion of that WU and this log entry, the core data files were already deleted. Seems to me there is a mix-up about when to delete the work files.
*/
[22:23:45] + Received work.
[22:23:45] + Processing work unit
[22:23:45] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:23:45] Core found.
[22:23:45] Working on queue slot 02 [January 16 22:23:45 UTC]
[22:23:45] + Working ...
[22:23:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 3742 -version 624'

[22:23:45] Folding@Home Gromacs SMP Core
[22:23:45] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[22:23:45]
[22:23:45] Preparing to commence simulation
[22:23:45] - Ensuring status. Please wait.
[22:23:46] - Digital signature verAssembly optimizations on if available.
[22:23:46] Entering Entering M.D.
[22:23:52] Using Gromacs checkpoints
[22:23:56]  M.D.
[22:24:02] Using Gromacs checkpoints
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NODEID=0 argc=23
NODEID=1 argc=23
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=2 argc=23
NODEID=3 argc=23
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_02.cpt generated: Sat Dec 19 22:31:39 2009

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: checkpoint.c, line: 1196

Fatal error:
Checkpoint file is for a system of 147986 atoms, while the current system consists of 147170 atoms
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day
/* Here we have a problem: the checkpoint file from the previous WU was read because it was never deleted. Of course it did not match the newly downloaded WU, so an error was generated. What would happen if it did match?

That previous WU (slot 02) failed after 100%; it was shut down at least once during processing (running under an ext4 fs). When it was restarted it passed its checks and continued on until completion. When the wuresults_02.dat file was read prior to sending the results, a problem was encountered:
#### Previous WU
[01:30:52] Sending work to server
[01:30:52] - Error: Could not get length of results file work/wuresults_02.dat
[01:30:52] - Error: Could not read unit 02 file. Removing from queue.
####
*/
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[22:24:12] CoreStatus = FF (255)
[22:24:12] Sending work to server
[22:24:12] Project: 2662 (Run 0, Clone 428, Gen 52)
[22:24:12] - Error: Could not get length of results file work/wuresults_02.dat
[22:24:12] - Error: Could not read unit 02 file. Removing from queue.
/* Since there was an error, slot 02 was removed from the queue, but its files remain. The next WU went to slot 03, where only logfile_03.txt and wudata_03_prev.cpt existed, so that WU went on to process and is currently at 68%. */


I've seen it at my end as well -- it basically makes the affected queue slot unusable (unless
the files are manually removed). Now, imagine what happens after 10 such failures...

And while we're at it, A3 (and A1, perhaps?) cores should get examined too

Looking at what I had in my work directory (see my previous post), slots 02, 05, 06 and 09 would not have been usable. I am working slot 03 now, and the next WU would start fine. After that one finished, a new job would be downloaded and started in slot 05; that would have errored out and been removed from the queue (the files would remain), then a new job would be downloaded and slot 06 would repeat what happened with 05. Then another job would be downloaded for slot 07. If all of the slots were full of leftover files, nothing would get done except repeated downloading of new WU's. Not good.


Agreed. I think the log you pasted cleanly depicts the failure we have been discussing.

I forgot to mention one more bit (just so everyone's on the same page).

"core fails" means failure not only due to unclean shutdown but also due to core crash (bad WU, segfault, etc.);
it may also affect 24/7 folders with rock-solid setups.



tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: - Error: Could not get length of results file work/wuresult

Postby tear » Mon Jan 18, 2010 1:34 pm

Chris,

On a different note; pardon me for my ignorance, but can you please provide a
list of steps that you follow to reproduce the "filesystem" problem? I'm having a
hard time determining them from your reports.

I'd appreciate it.


Thanks,
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: - Error: Could not get length of results file work/wuresult

Postby VijayPande » Mon Jan 18, 2010 3:02 pm

chrisretusn wrote:
bruce wrote:Is it reasonable to assume that this is only a problem if (A) Linux, and (B) FahCore_a2, and (C) ext4? If that's true, development will have a lot better chance of isolating the problem.

Yes, I think that is reasonable. I am using jfs now and the problem has gone away, though I think I remember someone mentioning problems with XFS, either with write times or with losing WU's on shutdown.

I think it has a lot to do with the ext4 file system's delayed allocation (data=writeback) problem.

Delayed allocation and the zero-length file problem | Thoughts by Ted

Delayed allocation and the zero-length file problem | The Linux Foundation

Linus Torvalds Upset over Ext3 and Ext4 - Linux Magazine Online



ok, thanks for the info. The details you found are very useful (the more details the better for tracking bugs down). I'll pass this to the devs.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
 
Posts: 2651
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Tue Jan 19, 2010 8:55 am

tear wrote:Chris,

On a different note; pardon me for my ignorance, but can you please provide a
list of steps that you follow to reproduce the "filesystem" problem? I'm having a
hard time determining them from your reports.

I'd appreciate it.


Thanks,
tear

Slackware64 13.0 with 32-bit libs installed. Processor is AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
1. File system = ext4
2. Run fah6 with the -smp option
3. On a job running the a2 core, after it has gained a few percent, stop fah6; this can be done with Ctrl-C if run from the CLI, or with kill or killall.
4. Restart fah6. You can wait a while before restarting if you like; it makes no difference. (One way to script steps 3 and 4 is sketched after the log excerpt below.)
5. Let the job go to completion.
6. When the wuresults_##.dat file is read so the results can be sent to Stanford, you will get the following in the log:
Code: Select all
[01:30:46] Completed 250001 out of 250001 steps  (100%)
[01:30:52] CoreStatus = 1 (1)
[01:30:52] Sending work to server
[01:30:52] Project: 2677 (Run 25, Clone 46, Gen 66)
[01:30:52] - Error: Could not get length of results file work/wuresults_05.dat
[01:30:52] - Error: Could not read unit 05 file. Removing from queue.
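
If anyone wants to script steps 3 and 4, here is a rough sketch of how that could be done. It is not what I actually ran; the install directory is simply taken from the log above and the wait times are arbitrary.
Code: Select all
#!/usr/bin/env python
# Rough sketch of steps 3 and 4: stop the running client the way rc.folding
# does (killall + sync), wait a while, then restart it.  The install directory
# and client arguments are assumptions taken from the logs above.
import subprocess
import time

FAH_DIR = "/home/folder/fah6-1"    # assumed install/launch directory

# Step 3: stop fah6 with SIGTERM, the default signal killall sends.
subprocess.call(["killall", "fah6"])
time.sleep(10)                     # give the core threads time to exit
subprocess.call(["sync"])          # flush dirty pages, as rc.folding does

# Optional waiting period before the restart; the length makes no difference.
time.sleep(15 * 60)

# Step 4: restart the client with the same arguments as before.
subprocess.Popen(["./fah6", "-verbosity", "9", "-smp"], cwd=FAH_DIR)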


I also noted the amount of time it took to write the data to disk under ext4. Log extracts of successful runs were posted in my original post but I will post them again here so you don't have to go back:
Code: Select all
[08:02:32] Project: 2662 (Run 0, Clone 467, Gen 44)
[00:15:11] - Writing 26056329 bytes of core data to disk...
[00:15:13] ... Done.
[00:24:38] - Shutting down core
-- 9 minutes

[00:53:10] Project: 2662 (Run 0, Clone 148, Gen 65)
[16:18:19] - Writing 26011656 bytes of core data to disk...
[16:18:21] ... Done.
[16:29:51] - Shutting down core
-- 13 minutes

[17:00:39] Project: 2669 (Run 3, Clone 182, Gen 186)
[08:52:20] - Writing 26007282 bytes of core data to disk...
[08:52:21] ... Done.
[09:03:52] - Shutting down core
-- 11 minutes

[09:32:31] Project: 2677 (Run 39, Clone 80, Gen 64)
[23:36:09] - Writing 49211080 bytes of core data to disk...
[23:36:11] ... Done.
[23:51:59] - Shutting down core
-- 15 minutes
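
For anyone who wants to pull those write-to-shutdown gaps out of a client log automatically, here is a small sketch (the log file name is an assumption) that measures the time between the "Writing ... core data to disk" line and the next "Shutting down core" line:
Code: Select all
#!/usr/bin/env python
# Sketch: measure the delay between "Writing ... core data to disk" and the
# following "Shutting down core" line in a FAH client log.  The log file name
# is an assumption; timestamps are [HH:MM:SS] at the start of each line.
import re
from datetime import datetime, timedelta

LOG_FILE = "FAHlog.txt"                 # assumed log file name

stamp_re = re.compile(r"^\[(\d\d:\d\d:\d\d)\]")

def stamp(line):
    m = stamp_re.match(line)
    return datetime.strptime(m.group(1), "%H:%M:%S") if m else None

write_time = None
with open(LOG_FILE) as log:
    for line in log:
        if "Writing" in line and "core data to disk" in line:
            write_time = stamp(line)
        elif "Shutting down core" in line and write_time is not None:
            gap = stamp(line) - write_time
            if gap < timedelta(0):      # the clock wrapped past midnight
                gap += timedelta(days=1)
            print("write -> shutdown: %s" % gap)
            write_time = None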


I came to the conclusion that the problem was ext4 after some research and also some threads here in the forum, specifically this thread: Folding Forum • View topic - Core hangs: jbd2_log_wait_commit.

Since switching over to jfs I have not had any more problems with losing WU's. You can also see the difference in write times under jfs in my post HERE.

A little background: power outages are a normal part of life here, and all of my machines are protected by a UPS. I also recently upgraded to Slackware64 13.0 from Slackware 12.2. I started out using two instances of the 6.24beta client; those jobs used the 78 and b4 cores. I had no problems until I switched to using the -smp option. The only core that has run since going -smp is the a2 core. Prior to upgrading to Slackware64 I was using two instances of the 6.02 client, without problems. Before upgrading to Slackware64 my file system was jfs; on upgrading I changed it to ext4. I am now running with jfs on /home and ext4 on /; later this evening I will be reformatting / back to jfs.

I am considering formatting an unused partition to ext3, xfs and then ext4 again to see what happens. The process is time-consuming since on my machines a WU takes about a day and a half to finish, plus the prospect of losing a WU does not exactly thrill me. :)

Hope this information is what you were looking for.
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby tear » Tue Jan 19, 2010 7:13 pm

Chris,

Yes, that does the trick, thanks!

I do have a question to ask you -- I went through the steps you listed and did not notice a reboot of the PC.
Does this mean the problem can be reproduced without restarting the operating system?
(That would truly be shocking and could actually indicate a bug in the kernel/filesystem.)

Re excessive fsyncs (long delays after unit completion)
They affect XFS (own experience) similarly to ext4.


tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: - Error: Could not get length of results file work/wuresult

Postby linuxfah » Tue Jan 19, 2010 11:30 pm

Chris,

Thanks for the excellent detailed information in this thread.

I think I might swap one of my folding systems over to JFS from XFS to see how it handles the folding client.
linuxfah
 
Posts: 330
Joined: Tue Sep 22, 2009 9:28 pm

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Wed Jan 20, 2010 2:19 am

tear wrote:Chris,

Yes, that does the trick, thanks!

Great!

I do have a question to ask you -- I went through the steps you listed and did not notice a reboot of the PC.
Does this mean the problem can be reproduced without restarting the operating system?
(That would truly be shocking and could actually indicate a bug in the kernel/filesystem.)

Yes, I reproduced this without restarting the operating system. In one instance I ran fah6 from the CLI and terminated it with Ctrl-C, waited about 15 minutes, and then restarted the client. Some of my testing was done by starting and stopping via my rc.folding script.

As I mentioned earlier, we have frequent power outages where I live. (If it wasn't for those power outages I might not have noticed this problem.) So in some of my testing the system was shut down (by the UPS), then manually restarted when power came back on. I also tested with rebooting, to check whether adding an rc.local_shutdown script to kill fah6 before the rc.6 script kicked in made a difference.

Re excessive fsyncs (long delays after unit completion)
They affect XFS (own experience) similarly to ext4.

XFS also uses delayed allocation, like ext4. That said, I have not heard of data loss problems with XFS like those with ext4. I can live with slow write times (well, not really), but data loss I cannot tolerate.

I am of the opinion that it's a combination of things: part file system, part programming, and also file size. I had zero problems running the other cores prior to switching to -smp on ext4. Only when I switched to -smp and its larger work units did the problem manifest itself.
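
On the programming side, the usual defence against the delayed-allocation/zero-length-file problem is for the application to flush its data before the old file is replaced: write to a temporary file, fsync it, rename it over the results file, then fsync the directory. Just to illustrate the general pattern (this is only a sketch with made-up file names; I have no idea whether the a2 core does anything like this):
Code: Select all
#!/usr/bin/env python
# Sketch of the write-temp / fsync / rename pattern that avoids the ext4
# delayed-allocation "zero-length file" problem.  File names are made up for
# illustration; this is not the core's actual code.
import os

def safe_write(path, data):
    tmp = path + ".tmp"
    f = open(tmp, "wb")
    try:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())       # force the data blocks to disk first
    finally:
        f.close()
    os.rename(tmp, path)           # atomically replace the old file
    # fsync the containing directory so the rename itself survives a crash
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

if __name__ == "__main__":
    safe_write("wuresults_example.dat", b"...results bytes...")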
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: - Error: Could not get length of results file work/wuresult

Postby chrisretusn » Wed Jan 20, 2010 2:34 am

linuxfah wrote:Chris,

Thanks for the excellent detailed information in this thread.

I think I might swap one of my folding systems over to JFS from XFS to see how it handles the folding client.

You're welcome. I've been toying with the idea of trying XFS, possibly ext3.
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines
