WUs fail after 100%, leftover work files, ext4 cause?

Re: - Error: Could not get length of results file work/wuresult

Postby tear » Thu Jan 21, 2010 9:28 pm

Nah, it's not delalloc that affects performance; it's the I/O barriers. Try mounting ext4
with barrier=0 (default is 1 for ext4) and you should see improved performance or, alternatively, try
mounting ext3 with barrier=1 (default is 0 for ext3) and you should see degraded performance.
Barriers are critical for journalling filesystems' integrity (http://lwn.net/Articles/283161/); also see
viewtopic.php?f=55&t=12248&p=120089#p120078
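
To see which barrier setting a mounted filesystem is using, the mount options can be read from a mounts table such as /proc/mounts. What follows is a minimal sketch, not a definitive tool: the function name `barrier_options` and the sample table are made up for illustration, and whether a kernel reports the barrier option in that table at all varies by kernel version and filesystem, so an empty result only suggests that the filesystem's default applies.

```python
def barrier_options(mounts_text, mount_point):
    """Return barrier-related mount options for mount_point.

    mounts_text is text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        # Options are comma-separated; keep the ones mentioning barriers
        # (for example "barrier=0", "barrier=1" or "nobarrier").
        found = [opt for opt in fields[3].split(",") if "barrier" in opt]
    return found


# Hypothetical sample table, for illustration only.
SAMPLE = """\
/dev/sda1 / ext3 rw,errors=continue,data=ordered 0 0
/dev/sdb6 /mnt/sdb6 ext4 rw,barrier=0,data=ordered 0 0
"""

if __name__ == "__main__":
    print(barrier_options(SAMPLE, "/mnt/sdb6"))
    # On a live Linux system one could pass open("/proc/mounts").read() instead.
```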

excessive fsyncs* = lots of metadata (journal) and data updates = lots of seeks into metadata (journal) and data areas

*) fsyncs actually defeat the purpose of delalloc, as they force the filesystem to perform the allocation (when called)
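
The cost of calling fsync after every write can be measured directly. Here is a minimal sketch that assumes nothing about how the FAH core actually writes its files: it writes the same number of blocks twice, once with a single fsync at the end and once with an fsync after every block. The function name, block size and block count are arbitrary choices for illustration, and the timings depend on the filesystem, mount options and hardware.

```python
import os
import tempfile
import time


def timed_write(directory, blocks, block_size, fsync_each):
    """Write `blocks` blocks of zeros to a temp file in `directory`.

    Returns (elapsed_seconds, bytes_written).
    """
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    written = 0
    start = time.perf_counter()
    try:
        for _ in range(blocks):
            written += os.write(fd, payload)
            if fsync_each:
                os.fsync(fd)  # force this block to disk before continuing
        os.fsync(fd)  # one final flush in either mode
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return elapsed, written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        lazy, _ = timed_write(scratch, 200, 4096, fsync_each=False)
        eager, _ = timed_write(scratch, 200, 4096, fsync_each=True)
        print(f"one fsync at end: {lazy:.3f}s, fsync per block: {eager:.3f}s")
```

To compare filesystems, point `directory` at a path on the partition of interest instead of the default temporary directory.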


tear
One man's ceiling is another man's floor.
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains


Postby chrisretusn » Fri Jan 22, 2010 1:52 am

Thanks for the links. Makes sense.

This article also made a lot of sense to me.
Ext4 data loss; explanations and workarounds - The H Open Source: News and Features

What is puzzling to me is the data loss with fah6 on ext4. When one kills fah6, it signals the four core_a2 processes to quit. With ext4 I watched each core_a2 instance as it shut down; each exited quickly in turn except for one, which stayed active for 10+ minutes. It is clear that writes were taking place: the disk activity light was on and there were plenty of I/O writes. The puzzling part is that once the cores have finished shutting down and fah6 exits, wudata_##.dat is somehow corrupted. This is not discovered until after a WU has finished. I am not sure whether wudata_##.dat is verified on a restart; it is not listed as being verified in the logs. It could be that new data is corrupting the file, but since changing to jfs this has not been an issue.

I also assume (because it has not been mentioned as a problem) that with xfs only the delay is present and the data loss problem does not exist.

What is it that makes the difference?
Folding on Slackware Linux.
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines


Postby tear » Fri Jan 22, 2010 5:54 am

There's no corruption happening on XFS; disk gets thrashed but data are intact.

By saying wudata you mean wuresults file, right? What's interesting is that wuresults
file gets created at 100% completion (it doesn't exist when you interrupt the client
mid-simulation). So what I think gets corrupted are files that keep simulation state.
OTOH, core appears to be performing some kind of verification* of checkpoints
upon restart so that makes the issue doubly puzzling.

*) (side note) I'm wondering about reliability of the method used

I have an F12 system around -- I'll create an ext4 partition (I've been married to XFS
since 1999) and see what gives.


tear

Postby tear » Fri Jan 22, 2010 8:08 am

Did the first part (reached 10%, then restarted) with ext4 + current Fedora 12. Log below.

Observations:
1) There is no delay upon hitting ^C
2) Checkpoint is _not_ created upon hitting ^C, as you can see below (checkpointing interval is at 15 minutes == second checkpoint == restores @9%)
3) I'm using fah6_alt 6.24 (glibc incompatibility) but will try original 6.24 in chroot soon
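
Observation 2 can be checked with a little arithmetic. The sketch below assumes the 15-minute checkpoint timer starts when the simulation starts, taken here as 06:14:22 (the "Completed 0" line in the log below). That starting point is an assumption, not something the client documents. It then finds the last checkpoint before the ^C at 06:47:24.

```python
from datetime import datetime, timedelta


def last_checkpoint(start, interrupt, interval_minutes):
    """Return (count, time) of the last periodic checkpoint at or before
    `interrupt`, assuming one fires every `interval_minutes` after `start`.
    Returns (0, None) if no checkpoint has fired yet."""
    interval = timedelta(minutes=interval_minutes)
    count = (interrupt - start) // interval
    if count < 1:
        return 0, None
    return count, start + count * interval


if __name__ == "__main__":
    fmt = "%H:%M:%S"
    start = datetime.strptime("06:14:22", fmt)      # assumed timer start
    interrupt = datetime.strptime("06:47:24", fmt)  # ^C in the log
    count, when = last_checkpoint(start, interrupt, 15)
    print(count, when.strftime(fmt))
    # Step restored on restart, from the log: 24728 of 250000.
    print(24728 * 100 // 250000, "%")
```

Under that assumption the second checkpoint lands at 06:44:22, which fits a restart from step 24728 (9%) rather than from the 10% mark reached just before the interrupt. It also agrees to within a second with the minutes and seconds of the checkpoint file's "generated" stamp (23:44:23, presumably local time).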

Chris, did you use a custom checkpointing interval during your tests?


tear

Code:
$ ./fah6

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

12 cores detected


--- Opening Log file [January 22 06:13:54 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /mnt/sdb6/tmp/fah
Executable: ./fah6
Arguments: -smp 12 -forceasm -verbosity 9

[06:13:54] - Ask before connecting: No
[06:13:54] - User name: Anonymous (Team 0)
[06:13:54] - User ID: 69129298638CA19F
[06:13:54] - Machine ID: 1
[06:13:54]
[06:13:54] Could not open work queue, generating new queue...
[06:13:54] - Preparing to get new work unit...
[06:13:54] - Autosending finished units... [January 22 06:13:54 UTC]
[06:13:54] + Attempting to get work packet
[06:13:54] Trying to send all finished work units
[06:13:54] + No unsent completed units remaining.
[06:13:54] - Will indicate memory of 12106 MB
[06:13:54] - Autosend completed
[06:13:54] - Connecting to assignment server
[06:13:54] Connecting to http://assign.stanford.edu:8080/
[06:13:55] Posted data.
[06:13:55] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[06:13:55] + News From Folding@Home: Welcome to Folding@Home
[06:13:55] Loaded queue successfully.
[06:13:55] Connecting to http://171.64.65.56:8080/
[06:13:59] Posted data.
[06:13:59] Initial: 0000; - Receiving payload (expected size: 4845336)
[06:14:02] - Downloaded at ~1577 kB/s
[06:14:02] - Averaged speed for that direction ~1577 kB/s
[06:14:02] + Received work.
[06:14:02] + Closed connections
[06:14:02]
[06:14:02] + Processing work unit
[06:14:02] Core required: FahCore_a2.exe
[06:14:02] Core found.
[06:14:02] Working on queue slot 01 [January 22 06:14:02 UTC]
[06:14:02] + Working ...
[06:14:02] - Calling './mpiexec -np 12 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 6557 -version 624'
[06:14:02]
[06:14:02] *------------------------------*
[06:14:02] Folding@Home Gromacs SMP Core
[06:14:02] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[06:14:02]
[06:14:02] Preparing to commence simulation
[06:14:02] - Ensuring status. Please wait.
[06:14:12] - Assembly optimizations manually forced on.
[06:14:12] - Not checking prior termination.
[06:14:13] - Expanded 4844824 -> 24025869 (decompressed 495.9 percent)
[06:14:13] Called DecompressByteArray: compressed_data_size=4844824 data_size=24025869, decompressed_data_size=24025869 diff=0
[06:14:13] - Digital signature verified
[06:14:13]
[06:14:13] Project: 2677 (Run 6, Clone 57, Gen 73)
[06:14:13]
[06:14:13] Assembly optimizations on if available.
[06:14:13] Entering M.D.
NNODES=12, MYRANK=0, HOSTNAME=tentacle
NODEID=0 argc=20
NNODES=12, MYRANK=5, HOSTNAME=tentacle
NODEID=5 argc=20
NNODES=12, MYRANK=8, HOSTNAME=tentacle
NNODES=12, MYRANK=10, HOSTNAME=tentacle
NNODES=12, MYRANK=11, HOSTNAME=tentacle
NNODES=12, MYRANK=1, HOSTNAME=tentacle
NODEID=1 argc=20
NNODES=12, MYRANK=2, HOSTNAME=tentacle
NODEID=2 argc=20
NNODES=12, MYRANK=3, HOSTNAME=tentacle
NODEID=3 argc=20
NNODES=12, MYRANK=4, HOSTNAME=tentacle
NODEID=4 argc=20
NNODES=12, MYRANK=6, HOSTNAME=tentacle
NODEID=6 argc=20
NNODES=12, MYRANK=7, HOSTNAME=tentacle
NODEID=7 argc=20
NNODES=12, MYRANK=9, HOSTNAME=tentacle
NODEID=8 argc=20
NODEID=9 argc=20
NODEID=10 argc=20
NODEID=11 argc=20
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 3D domain decomposition 2 x 2 x 3
starting mdrun 'IBX in water'
18500002 steps,  37000.0 ps (continuing from step 18250002,  36500.0 ps).
[06:14:22] Completed 0 out of 250000 steps  (0%)
[06:17:27] Completed 2500 out of 250000 steps  (1%)
[06:20:17] Completed 5000 out of 250000 steps  (2%)
[06:23:21] Completed 7500 out of 250000 steps  (3%)
[06:26:27] Completed 10000 out of 250000 steps  (4%)
[06:29:20] Completed 12500 out of 250000 steps  (5%)
[06:32:18] Completed 15000 out of 250000 steps  (6%)
[06:35:38] Completed 17500 out of 250000 steps  (7%)
[06:38:48] Completed 20000 out of 250000 steps  (8%)
[06:41:52] Completed 22500 out of 250000 steps  (9%)
[06:44:43] Completed 25000 out of 250000 steps  (10%)
^C[06:47:24] ***** Got an Activate signal (2)
application called MPI_Abort(MPI_COMM_WORLD, 102) - process 0
[06:47:24] Killing all core threads

Folding@Home Client Shutdown.
$
$
$ [0]0:Return code = 102
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[0]8:Return code = 0, signaled with Quit
[0]9:Return code = 0, signaled with Quit
[0]10:Return code = 0, signaled with Quit
[0]11:Return code = 0, signaled with Quit

$ ./fah6

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

12 cores detected


--- Opening Log file [January 22 06:47:44 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /mnt/sdb6/tmp/fah
Executable: ./fah6
Arguments: -smp 12 -forceasm -verbosity 9

[06:47:44] - Ask before connecting: No
[06:47:44] - User name: Anonymous (Team 0)
[06:47:44] - User ID: 69129298638CA19F
[06:47:44] - Machine ID: 1
[06:47:44]
[06:47:44] Loaded queue successfully.
[06:47:44]
[06:47:44] + Processing work unit
[06:47:44] - Autosending finished units... [January 22 06:47:44 UTC]
[06:47:44] Core required: FahCore_a2.exe
[06:47:44] Trying to send all finished work units
[06:47:44] Core found.
[06:47:44] + No unsent completed units remaining.
[06:47:44] - Autosend completed
[06:47:44] Working on queue slot 01 [January 22 06:47:44 UTC]
[06:47:44] + Working ...
[06:47:44] - Calling './mpiexec -np 12 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 10247 -version 624'
[06:47:44]
[06:47:44] *------------------------------*
[06:47:44] Folding@Home Gromacs SMP Core
[06:47:44] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[06:47:44]
[06:47:44] Preparing to commence simulation
[06:47:44] - Ensuring status. Please wait.
[06:47:45] Called DecompressByteArray: compressed_data_size=4844824 data_size=24025869, decompressed_data_size=24025869 diff=0
[06:47:45] - Digital signature verified
[06:47:45]
[06:47:45] Project: 2677 (Run 6, Clone 57, Gen 73)
[06:47:45]
[06:47:45] Assembly optimizations on if available.
[06:47:45] Entering M.D.
[06:47:51] Using Gromacs checkpoints
[06:47:55]  M.D.
[06:48:01] Using Gromacs checkpoints
NNODES=12, MYRANK=2, HOSTNAME=tentacle
NNODES=12, MYRANK=3, HOSTNAME=tentacle
NODEID=3 argc=23
NNODES=12, MYRANK=7, HOSTNAME=tentacle
NODEID=7 argc=23
NNODES=12, MYRANK=10, HOSTNAME=tentacle
NNODES=12, MYRANK=0, HOSTNAME=tentacle
NODEID=0 argc=23
NNODES=12, MYRANK=1, HOSTNAME=tentacle
NODEID=1 argc=23
NODEID=2 argc=23
NNODES=12, MYRANK=4, HOSTNAME=tentacle
NODEID=4 argc=23
NNODES=12, MYRANK=5, HOSTNAME=tentacle
NODEID=5 argc=23
NNODES=12, MYRANK=6, HOSTNAME=tentacle
NODEID=6 argc=23
NNODES=12, MYRANK=8, HOSTNAME=tentacle
NODEID=8 argc=23
NNODES=12, MYRANK=9, HOSTNAME=tentacle
NODEID=9 argc=23
NNODES=12, MYRANK=11, HOSTNAME=tentacle
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=10 argc=23
NODEID=11 argc=23
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_01.cpt generated: Thu Jan 21 23:44:23 2010


NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 3D domain decomposition 2 x 2 x 3
starting mdrun 'IBX in water'
18500002 steps,  37000.0 ps (continuing from step 18274730,  36549.5 ps).
[06:48:05] d work/wudata_01.log
[06:48:05] Verified work/wudata_01.trr
[06:48:05] Verified work/wudata_01.xtc
[06:48:05] Verified work/wudata_01.edr
[06:48:05] Completed 24728 out of 250000 steps  (9%)
[06:48:24] Completed 25000 out of 250000 steps  (10%)
[06:51:05] Completed 27500 out of 250000 steps  (11%)
[06:53:45] Completed 30000 out of 250000 steps  (12%)
[06:56:39] Completed 32500 out of 250000 steps  (13%)
[06:59:32] Completed 35000 out of 250000 steps  (14%)
[07:02:30] Completed 37500 out of 250000 steps  (15%)

Postby tear » Fri Jan 22, 2010 3:08 pm

... and it completed.

9 minute fsync delay after unit completion, 5 minute fsync delay after returning results,
so it seems pretty much consistent with your observations.

No corruption. Log below.

Chris, can you tell me what md5sum and size of the client binary you used are?
EDIT: and another question -- how easy is it to reproduce at your end? (100% of attempts? less?)

Thanks,
tear

Code:
[10:46:49] Completed 245000 out of 250000 steps  (98%)
[10:49:21] Completed 247500 out of 250000 steps  (99%)
[10:51:53] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 3.6 %
 Part of the total run time spent waiting due to load imbalance: 2.0 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 3 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:  14629.505  14629.505    100.0
                       4h03:49
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    799.413     33.565      2.661      9.020

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[10:51:54] DynamicWrapper: Finished Work Unit: sleep=10000
[10:52:04]
[10:52:04] Finished Work Unit:
[10:52:04] - Reading up to 21175200 from "work/wudata_01.trr": Read 21175200
[10:52:05] trr file hash check passed.
[10:52:05] - Reading up to 27157052 from "work/wudata_01.xtc": Read 27157052
[10:52:05] xtc file hash check passed.
[10:52:05] edr file hash check passed.
[10:52:05] logfile size: 191208
[10:52:05] Leaving Run
[10:52:06] - Writing 48673596 bytes of core data to disk...
[10:52:07]   ... Done.
[11:01:29] - Shutting down core
[11:01:29]
[11:01:29] Folding@home Core Shutdown: FINISHED_UNIT
Attempting to use an MPI routine after finalizing MPICH
[11:02:10] CoreStatus = 64 (100)
[11:02:10] Unit 1 finished with 93 percent of time to deadline remaining.
[11:02:10] Updated performance fraction: 0.933302
[11:02:10] Sending work to server
[11:02:10] Project: 2677 (Run 6, Clone 57, Gen 73)


[11:02:10] + Attempting to send results [January 22 11:02:10 UTC]
[11:02:10] - Reading file work/wuresults_01.dat from core
[11:02:10]   (Read 48673596 bytes from disk)
[11:02:10] Connecting to http://171.64.65.56:8080/
[11:04:58] Posted data.
[11:04:58] Initial: 0000; - Uploaded at ~268 kB/s
[11:05:07] - Averaged speed for that direction ~268 kB/s
[11:05:07] + Results successfully sent
[11:05:07] Thank you for your contribution to Folding@Home.
[11:05:07] + Starting local stats count at 1
[11:10:23] - Warning: Could not delete all work unit files (1): Core file absent
[11:10:23] Trying to send all finished work units
[11:10:23] + No unsent completed units remaining.

Postby bruce » Fri Jan 22, 2010 10:20 pm

Is anybody running the new client yet? Kasson said he fixed a small but important bug but didn't say what it was.
viewtopic.php?f=24&p=127026#p127026
bruce
 
Posts: 21752
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.


Postby chrisretusn » Sat Jan 23, 2010 2:48 pm

tear wrote: By saying wudata you mean wuresults file, right?
Yes. Mind was thinking data.

tear wrote: What's interesting is that wuresults file gets created at 100% completion (it doesn't exist when you interrupt the client mid-simulation). So what I think gets corrupted are files that keep simulation state. OTOH, core appears to be performing some kind of verification* of checkpoints upon restart so that makes the issue doubly puzzling.

Yes, that is the puzzling part.
tear wrote:Did the first part (reached 10%, then restarted) with ext4 + current Fedora 12. Log below.

Observations:
1) There is no delay upon hitting ^C
2) Checkpoint is _not_ created upon hitting ^C, as you can see below (checkpointing interval is at 15 minutes == second checkpoint == restores @9%)
3) I'm using fah6_alt 6.24 (glibc incompatibility) but will try original 6.24 in chroot soon

Chris, did you use a custom checkpointing interval during your tests?
No, using the defaults.

Your results are interesting; it appears that all your core_a2 processes stopped cleanly. When I tried that with ext4, one of the core_a2 processes stayed active for several minutes. I am using the client provided on the downloads page.

tear wrote:... and it completed.

9 minute fsync delay after unit completion, 5 minute fsync delay after returning results
so it seems pretty much consistent with your observations.

No corruption. Log below.

Yes the delays are indeed consistent, I am happy to see no corruption. Of course this raises more questions.

tear wrote: Chris, can you tell me what md5sum and size of the client binary you used are?
EDIT: and another question -- how easy is it to reproduce at your end? (100% of attempts? less?)

I got this client from the download page. The md5 of the fah6 I was using is 7adee68e2d48b4d7a99e0df829dcf352 and the size is 912040.
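
For anyone who wants to compare their own binary against those two values, here is a minimal sketch that computes both. The function name is made up for illustration, and the standard md5sum and ls utilities report the same information.

```python
import hashlib
import os
import sys


def md5_and_size(path, chunk_size=1 << 20):
    """Return (hex MD5 digest, size in bytes) of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large binaries need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


if __name__ == "__main__":
    # Usage: pass one or more file paths, e.g. the path to the fah6 binary.
    for target in sys.argv[1:]:
        checksum, size = md5_and_size(target)
        print(target, checksum, size)
```

Run against the client binary, the output can be compared with the digest and size quoted above.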

As for reproducing it: every time I tested but one, the WU failed after it reached 100% if the client had been killed. The one exception was a Ctrl-C kill of the client from the CLI; in that case the error occurred on startup. Here is the log, with the middle snipped out. This was before I started redirecting stdout and stderr to a separate log file to capture all messages.
Code:
[14:30:47] Completed 75000 out of 250000 steps  (30%)
[14:43:44] ***** Got an Activate signal (2)
[14:43:44] Killing all core threads

Folding@Home Client Shutdown.
<snip>
[14:46:43] Entering M.D.
[14:46:49] Using Gromacs checkpoints
[14:46:55] CoreStatus = 0 (0)
[14:46:55] Sending work to server
[14:46:55] Project: 2662 (Run 2, Clone 214, Gen 50)
[14:46:55] - Error: Could not get length of results file work/wuresults_02.dat
[14:46:55] - Error: Could not read unit 02 file. Removing from queue.


bruce wrote: Is anybody running the new client yet? Kasson said he fixed a small but important bug but didn't say what it was.
viewtopic.php?f=24&p=127026#p127026

I am running the new client now; it is at 24%. This is the first run with the new client, and I am going to let it run to completion. My current plan, once this job is finished, is to move the client over to an ext4 partition and see what happens.

Postby chrisretusn » Sun Jan 24, 2010 11:32 pm

On the new 6.29 client, it completed its first WU with no problems on the jfs file system, as expected. I can't see any difference between the old 6.24beta and the 6.29 client. I have moved it over to an ext4 partition and am currently running a core_a2. For this run I am going to let this run without stopping, unless a power failure forces the issue.

Edit: I did notice something new with the new 6.29 client. Noticed this entry in the logs:

Code:
[03:55:48] - Preparing to get new work unit...
[03:55:48] Cleaning up work directory  <<------------- This is new, cannot find it in any old logs.
[03:55:49] + Attempting to get work packet

Postby chrisretusn » Mon Jan 25, 2010 7:12 am

chrisretusn wrote:For this run I am going to let this run without stopping, unless a power failure forces the issue.

Oh well, we had a power outage. It will be interesting to see what happens when it's over. Here is the log.
Code:
[05:25:59] Completed 37500 out of 250000 steps  (15%)
[05:34:17] ***** Got a SIGTERM signal (15)
[05:34:17] Killing all core threads

Folding@Home Client Shutdown.


Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step

application called MPI_Abort(MPI_COMM_WORLD, 103) - process 0
[0]0:Return code = 103
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [January 25 06:15:23 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/stuff/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp -oneunit

[06:15:23] - Ask before connecting: No
[06:15:23] - User name: chrisretusn (Team 2291)
[06:15:23] - User ID: XXXX
[06:15:23] - Machine ID: 1
[06:15:23]
[06:15:23] Loaded queue successfully.
[06:15:23] - Autosending finished units... [January 25 06:15:23 UTC]
[06:15:23] Trying to send all finished work units
[06:15:23] + No unsent completed units remaining.
[06:15:23] - Autosend completed
[06:15:23]
[06:15:23] + Processing work unit
[06:15:23] At least 4 processors must be requested; read 2.
[06:15:23] Core required: FahCore_a2.exe
[06:15:23] Core found.
[06:15:23] Working on queue slot 07 [January 25 06:15:23 UTC]
[06:15:23] + Working ...
[06:15:23] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 07 -checkpoint 15 -verbose -lifeline 3596 -version 629'

[06:15:23]
[06:15:23] *------------------------------*
[06:15:23] Folding@Home Gromacs SMP Core
[06:15:23] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[06:15:23]
[06:15:23] Preparing to commence simulation
[06:15:23] - Ensuring status. Please wait.
[06:15:23] Working with standard loops on this execution.
[06:15:23] - Files status OK
[06:15:25] - Expanded Called DecompressByteArray: compressed_data_size=Called DecompressByteArray: compressed_data_size=4826763 data_size=- Digital signature verified
[06:15:25]
[06:15:25] Project: 2677 (Run - Digital signature verAssembly optimizations on if available.
[06:15:25] Entering Entering M.D.
[06:15:31] Using Gromacs checkpoints
[06:15:34]  M.D.
[06:15:40] Using Gromacs checkpoints
NNODES=4, MYRANK=3, HOSTNAME=racermach
NNODES=4, MYRANK=2, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=0, HOSTNAME=racermach
NODEID=0 argc=23
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
Reading file work/wudata_07.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_07.cpt generated: Mon Jan 25 13:34:18 2010


NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22869 system in water'
19000002 steps,  38000.0 ps (continuing from step 18788385,  37576.8 ps).
[06:15:53] d work/wudata_07.log
[06:15:53] Verified work/wudata_07.trr
[06:15:53] Verified work/wudata_07.xtc
[06:15:53] Verified work/wudata_07.edr
[06:15:54] Completed 38383 out of 250000 steps  (15%)

Postby chrisretusn » Tue Jan 26, 2010 11:41 pm

The WU running on ext4 with the 6.29 client has finished; even with a stoppage it completed successfully, which is great! One thing consistent with ext4 is the write time for wuresults_##.dat; it took 11 minutes. With my next WU I am going back to the 6.24beta client: I will stop the client somewhere in the middle, restart it, and see whether the WU completes or fails as in my other test.

Code:
[16:49:49] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 3.6 %
 Part of the total run time spent waiting due to load imbalance: 2.3 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time: 124439.498 124439.498    100.0
                       1d10h33:59
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     88.461      3.712      0.294     81.672

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[16:49:53] DynamicWrapper: Finished Work Unit: sleep=10000
[16:50:03]
[16:50:03] Finished Work Unit:
[16:50:03] - Reading up to 21198528 from "work/wudata_07.trr": Read 21198528
[16:50:03] trr file hash check passed.
[16:50:03] - Reading up to 27180336 from "work/wudata_07.xtc": Read 27180336
[16:50:04] xtc file hash check passed.
[16:50:04] edr file hash check passed.
[16:50:04] logfile size: 194825
[16:50:04] Leaving Run
[16:50:07] - Writing 48723825 bytes of core data to disk...
[16:50:10]   ... Done.
[17:01:02] - Shutting down core
[17:01:02]
[17:01:02] Folding@home Core Shutdown: FINISHED_UNIT
Attempting to use an MPI routine after finalizing MPICH
[17:01:39] CoreStatus = 64 (100)
[17:01:39] Unit 7 finished with 42 percent of time to deadline remaining.
[17:01:39] Updated performance fraction: 0.436429
[17:01:39] Sending work to server
[17:01:39] Project: 2677 (Run 17, Clone 59, Gen 75)


[17:01:39] + Attempting to send results [January 26 17:01:39 UTC]
[17:01:39] - Reading file work/wuresults_07.dat from core
[17:01:39]   (Read 48723825 bytes from disk)
[17:01:39] Connecting to http://171.64.65.56:8080/
[17:41:28] Posted data.
[17:41:29] Initial: 0000; - Uploaded at ~19 kB/s
[17:41:30] - Averaged speed for that direction ~19 kB/s
[17:41:30] + Results successfully sent
[17:41:30] Thank you for your contribution to Folding@Home.
[17:41:30] + Number of Units Completed: 60

[17:47:37] - Warning: Could not delete all work unit files (7): Core file absent
[17:47:37] Trying to send all finished work units
[17:47:37] + No unsent completed units remaining.
[17:47:37] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[17:47:37] Cleaning up work directory
[17:47:37] ***** Got a SIGTERM signal (15)
[17:47:37] Killing all core threads

Folding@Home Client Shutdown.

Postby chrisretusn » Sat Jan 30, 2010 1:30 pm

Well, I have had two runs so far with the older 6.24beta client on ext4 and no failures. The only difference between the previous test that failed and now is that root is jfs, while the partition I am folding on is ext4. In the previous test both root and the partition used for folding were ext4. Not sure if it's worth putting root back on ext4 to see what happens.

Postby tear » Sat Jan 30, 2010 4:31 pm

Chris,


Thanks for keeping us in the loop. I've been a little busy lately (vacation can't last forever); I did,
however, run 3 tests with regular 6.24 and 3 tests with _alt 6.24 -- all worked fine.

By the way, I'm happy to see you're trying to make it break again (the majority of folks would have run
out of steam by now) to reach a solid conclusion.

Would rootfs be a factor? Hard to say. I know the client uses a lock-file-wannabe (/tmp/fah/f<machineid>)
so that _could_ be a variable (though I seriously doubt it; it would be a butterfly effect, but hey,
we've seen those).


Thanks!
tear

Postby chrisretusn » Sat Jan 30, 2010 10:27 pm

I was thinking it might be the root file system, and I noticed that lock file in /tmp/fah, but like you I just can't see how that would cause problems. I sort of want to get busy folding with the new 6.29 client, but this is driving me nuts. So my plan is to go back to ext4 system-wide and see what happens. The only other factor that is different is that I am using the -oneunit switch, and I really doubt that would make any difference.

The one thing that remains constant is the long write times with ext4.

[07:32:11] - Writing 48677542 bytes of core data to disk...
[07:32:13] ... Done.
[07:43:09] - Shutting down core

[15:33:07] - Writing 26386077 bytes of core data to disk...
[15:33:08] ... Done.
[15:41:34] - Shutting down core
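
The size of that gap can be pulled out of a log mechanically. This is a minimal sketch that measures the time between each "... Done." line and the following "- Shutting down core" line, using the two excerpts above as sample input. The function name is made up for illustration, and reading the gap as write or flush time is this thread's working theory rather than something the log itself states.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[07:32:13] ... Done."
STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*(.*)$")


def flush_delays(log_text):
    """Return the seconds elapsed between each '... Done.' line and the
    following '- Shutting down core' line in a client log."""
    delays = []
    done_at = None
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if message.startswith("... Done."):
            done_at = when
        elif message.startswith("- Shutting down core") and done_at is not None:
            gap = when - done_at
            if gap < timedelta(0):
                gap += timedelta(days=1)  # log crossed midnight
            delays.append(int(gap.total_seconds()))
            done_at = None
    return delays


LOG = """\
[07:32:11] - Writing 48677542 bytes of core data to disk...
[07:32:13] ... Done.
[07:43:09] - Shutting down core

[15:33:07] - Writing 26386077 bytes of core data to disk...
[15:33:08] ... Done.
[15:41:34] - Shutting down core
"""

if __name__ == "__main__":
    for seconds in flush_delays(LOG):
        print(f"{seconds // 60} min {seconds % 60} s")
```

For the two excerpts this gives gaps of 10 min 56 s and 8 min 26 s.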

Postby chrisretusn » Sat Feb 06, 2010 3:51 am

My root and folding partitions are both ext4 now. The first WU used the A2 core; it completed successfully after being stopped while in progress. It took about 11 minutes for work/wuresults_01.dat to be written to disk.
Code:
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [February 1 04:57:50 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/stuff/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp -oneunit

[04:57:50] - Ask before connecting: No
[04:57:50] - User name: chrisretusn (Team 2291)
[04:57:50] - User ID: 83719AB3EDA1FA2
[04:57:50] - Machine ID: 1
[04:57:50]
[04:57:50] Loaded queue successfully.
[04:57:50] - Autosending finished units... [February 1 04:57:50 UTC]
[04:57:50] Trying to send all finished work units
[04:57:50] + No unsent completed units remaining.
[04:57:50] - Autosend completed
[04:57:50] - Preparing to get new work unit...
[04:57:50] + Attempting to get work packet
[04:57:50] - Will indicate memory of 3816 MB
[04:57:50] - Connecting to assignment server
[04:57:50] Connecting to http://assign.stanford.edu:8080/
[04:57:51] Posted data.
[04:57:51] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[04:57:51] + News From Folding@Home: Welcome to Folding@Home
[04:57:51] Loaded queue successfully.
[04:57:51] Connecting to http://171.64.65.56:8080/
[04:57:57] Posted data.
[04:57:57] Initial: 0000; - Receiving payload (expected size: 4844000)
[04:59:27] - Downloaded at ~52 kB/s
[04:59:27] - Averaged speed for that direction ~50 kB/s
[04:59:27] + Received work.
[04:59:27] + Closed connections
[04:59:27]
[04:59:27] + Processing work unit
[04:59:27] At least 4 processors must be requested.Core required: FahCore_a2.exe
[04:59:27] Core found.
[04:59:27] Working on queue slot 01 [February 1 04:59:27 UTC]
[04:59:27] + Working ...
[04:59:27] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 3210 -version 624'

[04:59:27]
[04:59:27] *------------------------------*
[04:59:27] Folding@Home Gromacs SMP Core
[04:59:27] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[04:59:27]
[04:59:27] Preparing to commence simulation
[04:59:27] - Looking at optimizations...
[04:59:27] - Working with standard loops on this execution.
[04:59:27] - Files status OK
[04:59:29] - Expanded Called DecompressByteArray: compressed_data_size=Called DecompressByteArray: compressed_data_size=4843488 data_size=24027949, decompressed_data_size=24027949 diff=0
[04:59:29] - Digital signature veAssembly optimizations on if available.
[04:59:29] Entering M.D.
[04:59:29] ing M.D.
[04:59:38] lone 34, Gen 79)
[04:59:38]
[04:59:38] Entering M.D.
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=2, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NNODES=4, MYRANK=0, HOSTNAME=racermach
NODEID=0 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
NODEID=1 argc=20
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
20000002 steps,  40000.0 ps (continuing from step 19750002,  39500.0 ps).
[04:59:52] Completed 0 out of 250000 steps  (0%)
[05:25:23] Completed 2500 out of 250000 steps  (1%)
[05:51:33] Completed 5000 out of 250000 steps  (2%)
[06:14:49] Completed 7500 out of 250000 steps  (3%)
[06:37:36] Completed 10000 out of 250000 steps  (4%)
[07:00:22] Completed 12500 out of 250000 steps  (5%)
[07:23:11] Completed 15000 out of 250000 steps  (6%)
[07:46:00] Completed 17500 out of 250000 steps  (7%)
[08:08:48] Completed 20000 out of 250000 steps  (8%)
[08:31:36] Completed 22500 out of 250000 steps  (9%)
[08:54:23] Completed 25000 out of 250000 steps  (10%)
[09:17:03] Completed 27500 out of 250000 steps  (11%)
[09:38:42] Completed 30000 out of 250000 steps  (12%)
[10:00:21] Completed 32500 out of 250000 steps  (13%)
[10:21:59] Completed 35000 out of 250000 steps  (14%)
[10:43:38] Completed 37500 out of 250000 steps  (15%)
[10:57:50] - Autosending finished units... [February 1 10:57:50 UTC]
[10:57:50] Trying to send all finished work units
[10:57:50] + No unsent completed units remaining.
[10:57:50] - Autosend completed
[11:05:17] Completed 40000 out of 250000 steps  (16%)
[11:27:08] Completed 42500 out of 250000 steps  (17%)
[11:49:54] Completed 45000 out of 250000 steps  (18%)
[12:12:39] Completed 47500 out of 250000 steps  (19%)
[12:35:27] Completed 50000 out of 250000 steps  (20%)
[12:58:12] Completed 52500 out of 250000 steps  (21%)
[13:20:59] Completed 55000 out of 250000 steps  (22%)
[13:43:44] Completed 57500 out of 250000 steps  (23%)
[14:06:31] Completed 60000 out of 250000 steps  (24%)
[14:29:16] Completed 62500 out of 250000 steps  (25%)
[14:51:12] Completed 65000 out of 250000 steps  (26%)
[15:12:50] Completed 67500 out of 250000 steps  (27%)
[15:34:29] Completed 70000 out of 250000 steps  (28%)
[15:57:03] Completed 72500 out of 250000 steps  (29%)
[16:19:50] Completed 75000 out of 250000 steps  (30%)
[16:42:41] Completed 77500 out of 250000 steps  (31%)
[16:57:50] - Autosending finished units... [February 1 16:57:50 UTC]
[16:57:50] Trying to send all finished work units
[16:57:50] + No unsent completed units remaining.
[16:57:50] - Autosend completed
[17:04:20] Completed 80000 out of 250000 steps  (32%)
[17:25:57] Completed 82500 out of 250000 steps  (33%)
[17:47:36] Completed 85000 out of 250000 steps  (34%)
[18:10:25] Completed 87500 out of 250000 steps  (35%)
[18:33:14] Completed 90000 out of 250000 steps  (36%)
[18:56:01] Completed 92500 out of 250000 steps  (37%)
[19:18:49] Completed 95000 out of 250000 steps  (38%)
[19:41:37] Completed 97500 out of 250000 steps  (39%)
[20:04:26] Completed 100000 out of 250000 steps  (40%)
[20:27:11] Completed 102500 out of 250000 steps  (41%)
[20:50:00] Completed 105000 out of 250000 steps  (42%)
[21:12:48] Completed 107500 out of 250000 steps  (43%)
[21:35:36] Completed 110000 out of 250000 steps  (44%)
[21:58:21] Completed 112500 out of 250000 steps  (45%)
[22:21:11] Completed 115000 out of 250000 steps  (46%)
[22:43:58] Completed 117500 out of 250000 steps  (47%)
[22:57:50] - Autosending finished units... [February 1 22:57:50 UTC]
[22:57:50] Trying to send all finished work units
[22:57:50] + No unsent completed units remaining.
[22:57:50] - Autosend completed
[23:07:21] Completed 120000 out of 250000 steps  (48%)
[23:09:15] ***** Got a SIGTERM signal (15)
[23:09:15] Killing all core threads

Folding@Home Client Shutdown.


Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step



Received the TERM signal, stopping at the next step

application called MPI_Abort(MPI_COMM_WORLD, 103) - process 0
[0]0:Return code = 103
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [February 1 23:09:54 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/stuff/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp -oneunit

[23:09:54] - Ask before connecting: No
[23:09:54] - User name: chrisretusn (Team 2291)
[23:09:54] - User ID: 83719AB3EDA1FA2
[23:09:54] - Machine ID: 1
[23:09:54]
[23:09:54] Loaded queue successfully.
[23:09:54] - Autosending finished units... [February 1 23:09:54 UTC]
[23:09:54] Trying to send all finished work units
[23:09:54] + No unsent completed units remaining.
[23:09:54] - Autosend completed
[23:09:54]
[23:09:54] + Processing work unit
[23:09:54] At least 4 processors must be requested.Core required: FahCore_a2.exe
[23:09:54] Core found.
[23:09:54] Working on queue slot 01 [February 1 23:09:54 UTC]
[23:09:54] + Working ...
[23:09:54] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -verbose -lifeline 7875 -version 624'

[23:09:54]
[23:09:54] *------------------------------*
[23:09:54] Folding@Home Gromacs SMP Core
[23:09:54] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[23:09:54]
[23:09:54] Preparing to commence simulation
[23:09:54] - Ensuring status. Please wait.
[23:09:55] Called DecompressByteArray: compressed_data_size=4843488 data_size=24027949, decompressed_data_size=24027949 diff=0
[23:09:55] - Digital signature verified
[23:09:55]
[23:09:55] Project: 2677 (Run 0, Clone 34, Gen 79)
[23:09:55]
[23:09:55] Assembly optimizations on if available.
[23:09:55] Entering M.D.
[23:10:02] Using Gromacs checkpoints
[23:10:06]
[23:10:06] Entering M.D.
[23:10:12] Using Gromacs checkpoints
NNODES=4, MYRANK=2, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NNODES=4, MYRANK=0, HOSTNAME=racermach
NODEID=0 argc=23
NNODES=4, MYRANK=1, HOSTNAME=racermach
NODEID=1 argc=23
NODEID=2 argc=23
NODEID=3 argc=23
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68

Reading checkpoint file work/wudata_01.cpt generated: Tue Feb  2 07:09:16 2010


NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
20000002 steps,  40000.0 ps (continuing from step 19870203,  39740.4 ps).
[23:10:19] Resuming from checkpoint
[23:10:19] Verified work/wudata_01.log
[23:10:19] Verified work/wudata_01.trr
[23:10:20] Verified work/wudata_01.xtc
[23:10:20] Verified work/wudata_01.edr
[23:10:21] Completed 120201 out of 250000 steps  (48%)
[23:34:22] Completed 122500 out of 250000 steps  (49%)
[00:02:22] Completed 125000 out of 250000 steps  (50%)
[00:28:02] Completed 127500 out of 250000 steps  (51%)
[00:53:18] Completed 130000 out of 250000 steps  (52%)
[01:18:26] Completed 132500 out of 250000 steps  (53%)
[01:46:03] Completed 135000 out of 250000 steps  (54%)
[02:12:03] Completed 137500 out of 250000 steps  (55%)
[02:38:19] Completed 140000 out of 250000 steps  (56%)
[03:04:10] Completed 142500 out of 250000 steps  (57%)
[03:29:54] Completed 145000 out of 250000 steps  (58%)
[03:53:54] Completed 147500 out of 250000 steps  (59%)
[04:18:45] Completed 150000 out of 250000 steps  (60%)
[04:45:50] Completed 152500 out of 250000 steps  (61%)
[05:08:45] Completed 155000 out of 250000 steps  (62%)
[05:09:54] - Autosending finished units... [February 2 05:09:54 UTC]
[05:09:54] Trying to send all finished work units
[05:09:54] + No unsent completed units remaining.
[05:09:54] - Autosend completed
[05:34:17] Completed 157500 out of 250000 steps  (63%)
[06:00:53] Completed 160000 out of 250000 steps  (64%)
[06:25:06] Completed 162500 out of 250000 steps  (65%)
[06:50:23] Completed 165000 out of 250000 steps  (66%)
[07:14:54] Completed 167500 out of 250000 steps  (67%)
[07:38:39] Completed 170000 out of 250000 steps  (68%)
[08:02:19] Completed 172500 out of 250000 steps  (69%)
[08:25:25] Completed 175000 out of 250000 steps  (70%)
[08:50:06] Completed 177500 out of 250000 steps  (71%)
[09:15:03] Completed 180000 out of 250000 steps  (72%)
[09:45:44] Completed 182500 out of 250000 steps  (73%)
[10:14:58] Completed 185000 out of 250000 steps  (74%)
[10:37:52] Completed 187500 out of 250000 steps  (75%)
[11:00:48] Completed 190000 out of 250000 steps  (76%)
[11:09:54] - Autosending finished units... [February 2 11:09:54 UTC]
[11:09:54] Trying to send all finished work units
[11:09:54] + No unsent completed units remaining.
[11:09:54] - Autosend completed
[11:23:45] Completed 192500 out of 250000 steps  (77%)
[11:48:58] Completed 195000 out of 250000 steps  (78%)
[12:16:32] Completed 197500 out of 250000 steps  (79%)
[12:40:15] Completed 200000 out of 250000 steps  (80%)
[13:04:22] Completed 202500 out of 250000 steps  (81%)
[13:27:24] Completed 205000 out of 250000 steps  (82%)
[13:50:14] Completed 207500 out of 250000 steps  (83%)
[14:13:05] Completed 210000 out of 250000 steps  (84%)
[14:35:55] Completed 212500 out of 250000 steps  (85%)
[14:58:46] Completed 215000 out of 250000 steps  (86%)
[15:21:36] Completed 217500 out of 250000 steps  (87%)
[15:44:27] Completed 220000 out of 250000 steps  (88%)
[16:07:15] Completed 222500 out of 250000 steps  (89%)
[16:30:03] Completed 225000 out of 250000 steps  (90%)
[16:52:50] Completed 227500 out of 250000 steps  (91%)
[17:09:54] - Autosending finished units... [February 2 17:09:54 UTC]
[17:09:54] Trying to send all finished work units
[17:09:54] + No unsent completed units remaining.
[17:09:54] - Autosend completed
[17:15:38] Completed 230000 out of 250000 steps  (92%)
[17:38:25] Completed 232500 out of 250000 steps  (93%)
[18:01:13] Completed 235000 out of 250000 steps  (94%)
[18:24:00] Completed 237500 out of 250000 steps  (95%)
[18:46:48] Completed 240000 out of 250000 steps  (96%)
[19:09:35] Completed 242500 out of 250000 steps  (97%)
[19:32:24] Completed 245000 out of 250000 steps  (98%)
[19:55:14] Completed 247500 out of 250000 steps  (99%)
[20:18:04] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 3.9 %
 Part of the total run time spent waiting due to load imbalance: 2.4 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:  76068.166  76068.166    100.0
                       21h07:48
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     88.595      3.721      0.295     81.395

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[20:18:07] DynamicWrapper: Finished Work Unit: sleep=10000
[20:18:17]
[20:18:17] Finished Work Unit:
[20:18:17] - Reading up to 21177360 from "work/wudata_01.trr": Read 21177360
[20:18:17] trr file hash check passed.
[20:18:17] - Reading up to 27147728 from "work/wudata_01.xtc": Read 27147728
[20:18:17] xtc file hash check passed.
[20:18:17] edr file hash check passed.
[20:18:17] logfile size: 194380
[20:18:17] Leaving Run
[20:18:20] - Writing 48669604 bytes of core data to disk...
[20:18:22]   ... Done.
[20:29:15] - Shutting down core
[20:29:15]
[20:29:15] Folding@home Core Shutdown: FINISHED_UNIT
Attempting to use an MPI routine after finalizing MPICH
[20:29:51] CoreStatus = 64 (100)
[20:29:51] Unit 1 finished with 45 percent of time to deadline remaining.
[20:29:51] Updated performance fraction: 0.444399
[20:29:51] Sending work to server
[20:29:51] Project: 2677 (Run 0, Clone 34, Gen 79)


[20:29:51] + Attempting to send results [February 2 20:29:51 UTC]
[20:29:51] - Reading file work/wuresults_01.dat from core
[20:29:52]   (Read 48669604 bytes from disk)
[20:29:52] Connecting to http://171.64.65.56:8080/
[21:09:35] Posted data.
[21:09:35] Initial: 0000; - Uploaded at ~19 kB/s
[21:09:42] - Averaged speed for that direction ~19 kB/s
[21:09:42] + Results successfully sent
[21:09:42] Thank you for your contribution to Folding@Home.
[21:09:42] + Number of Units Completed: 64

[21:15:49] - Warning: Could not delete all work unit files (1): Core file absent
[21:15:49] Trying to send all finished work units
[21:15:49] + No unsent completed units remaining.
[21:15:49] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[21:15:49] + Attempting to get work packet
[21:15:49] ***** Got a SIGTERM signal (15)
[21:15:49] - Will indicate memory of 3816 MB
[21:15:49] Killing all core threads
[21:15:49] - Connecting to assignment server

Folding@Home Client Shutdown.
[21:15:49] Connecting to http://assign.stanford.edu:8080/
[21:15:49] Connecting to http://assign.stanford.edu:8080/
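The roughly 11-minute wait at the end of the log above can be measured rather than eyeballed. Below is a minimal Python sketch, not part of the client: the function name elapsed_between and the marker strings are my own choices for illustration. It assumes every relevant line starts with the "[HH:MM:SS]" prefix seen in these logs and that at most one midnight rollover happens between the two markers.

```python
import re
from datetime import timedelta

# Matches the "[HH:MM:SS]" prefix the client puts on most log lines.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def elapsed_between(lines, start_marker, end_marker):
    """Time from the first timestamped line containing start_marker to the
    next timestamped line containing end_marker, or None if not found."""
    start = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # untimestamped output (e.g. from mpiexec) is skipped
        h, mi, s = map(int, m.groups())
        now = timedelta(hours=h, minutes=mi, seconds=s)
        if start is None:
            if start_marker in line:
                start = now
        elif end_marker in line:
            gap = now - start
            if gap < timedelta(0):
                # The log records only time of day; assume one midnight rollover.
                gap += timedelta(days=1)
            return gap
    return None


if __name__ == "__main__":
    # Lines copied from the log above.
    sample = [
        "[20:18:17] Leaving Run",
        "[20:18:20] - Writing 48669604 bytes of core data to disk...",
        "[20:18:22]   ... Done.",
        "[20:29:15] - Shutting down core",
    ]
    print(elapsed_between(sample, "bytes of core data to disk", "Shutting down core"))
```

Run against a saved log (for example, the lines of FAHlog.txt read with open(...)), this gives the time between the start of the result write and the core shutdown, which makes it easier to compare the same step across filesystems.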

As fate would have it, Stanford has decided to give me A1 cores, so the next WU used the A1 core. This WU was also stopped while in progress, and it failed after 100% completion. Note the fatal error after the "[05:23:14] Killing all core threads" log entry, and also the log entries at the end. The core itself completed successfully, but in the end the WU still failed before the results were sent to Stanford.
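To compare how the A2 and A1 shutdowns ended, the per-process return codes that mpiexec prints after "Killing all core threads" can be pulled out of each log. This is a rough Python sketch with hypothetical helper names (rank_exits, unclean). It assumes the summary lines always have the "[0]N:Return code = X" shape seen in these logs, and it treats N as the process index without knowing what the bracketed number means.

```python
import re

# mpiexec summary lines look like "[0]3:Return code = 1" or
# "[0]1:Return code = 0, signaled with Quit".
RETURN_CODE = re.compile(
    r"^\[(\d+)\](\d+):Return code = (\d+)(?:, signaled with (\w+))?"
)


def rank_exits(lines):
    """Collect (process_index, return_code, signal_or_None) per summary line."""
    exits = []
    for line in lines:
        m = RETURN_CODE.match(line.strip())
        if m:
            _, index, code, signal = m.groups()  # bracketed number is ignored
            exits.append((int(index), int(code), signal))
    return exits


def unclean(exits):
    """Process indices that exited with a nonzero return code."""
    return [index for index, code, _ in exits if code != 0]


if __name__ == "__main__":
    # Shutdown lines copied from the A2 log and the A1 log in this post.
    a2 = [
        "[0]0:Return code = 103",
        "[0]1:Return code = 0, signaled with Quit",
        "[0]2:Return code = 0, signaled with Quit",
        "[0]3:Return code = 0, signaled with Quit",
    ]
    a1 = [
        "Fatal error in MPI_Sendrecv: Error message texts are not available",
        "[0]0:Return code = 102",
        "[0]1:Return code = 0, signaled with Quit",
        "[0]2:Return code = 0, signaled with Quit",
        "[0]3:Return code = 1",
    ]
    print("A2 nonzero exits:", unclean(rank_exits(a2)))
    print("A1 nonzero exits:", unclean(rank_exits(a1)))
```

This only shows which processes reported a nonzero code in each shutdown; it does not say why the A1 unit later failed.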
Code:
Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [February 3 01:12:31 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/stuff/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp

[01:12:31] - Ask before connecting: No
[01:12:31] - User name: chrisretusn (Team 2291)
[01:12:31] - User ID: 83719AB3EDA1FA2
[01:12:31] - Machine ID: 1
[01:12:31]
[01:12:31] Loaded queue successfully.
[01:12:31] - Autosending finished units... [February 3 01:12:31 UTC]
[01:12:31] Trying to send all finished work units
[01:12:31] + No unsent completed units remaining.
[01:12:31] - Autosend completed
[01:12:31] - Preparing to get new work unit...
[01:12:31] + Attempting to get work packet
[01:12:31] - Will indicate memory of 3816 MB
[01:12:31] - Connecting to assignment server
[01:12:31] Connecting to http://assign.stanford.edu:8080/
[01:12:33] Posted data.
[01:12:33] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[01:12:33] + News From Folding@Home: Welcome to Folding@Home
[01:12:33] Loaded queue successfully.
[01:12:33] Connecting to http://171.64.65.64:8080/
[01:12:37] Posted data.
[01:12:37] Initial: 0000; - Receiving payload (expected size: 2432699)
[01:13:22] - Downloaded at ~52 kB/s
[01:13:22] - Averaged speed for that direction ~51 kB/s
[01:13:22] + Received work.
[01:13:22] + Closed connections
[01:13:22]
[01:13:22] + Processing work unit
[01:13:22] Work type a1 not eligible for variable processors
[01:13:22] Core required: FahCore_a1.exe
[01:13:22] Core not found.
[01:13:22] - Core is not present or corrupted.
[01:13:22] - Attempting to download new core...
[01:13:22] + Downloading new core: FahCore_a1.exe
[01:13:22] Downloading core (/~pande/Linux/AMD64/Core_a1.fah from http://www.stanford.edu)
[01:13:23] Initial: AFDE; + 10240 bytes downloaded
[01:13:24] Initial: B54E; + 20480 bytes downloaded
[01:13:24] Initial: D6C2; + 30720 bytes downloaded
[01:13:24] Initial: 9F08; + 40960 bytes downloaded
[01:13:25] Initial: C6C3; + 51200 bytes downloaded
[01:13:25] Initial: EBA8; + 61440 bytes downloaded
[01:13:25] Initial: 3141; + 71680 bytes downloaded
[01:13:25] Initial: D218; + 81920 bytes downloaded
[01:13:25] Initial: F7AC; + 92160 bytes downloaded
[01:13:25] Initial: 820B; + 102400 bytes downloaded
[01:13:26] Initial: 1B1E; + 112640 bytes downloaded
[01:13:26] Initial: C249; + 122880 bytes downloaded
[01:13:26] Initial: 5EBD; + 133120 bytes downloaded
[01:13:26] Initial: CD6C; + 143360 bytes downloaded
[01:13:26] Initial: 221C; + 153600 bytes downloaded
[01:13:27] Initial: DB18; + 163840 bytes downloaded
[01:13:27] Initial: 237E; + 174080 bytes downloaded
[01:13:27] Initial: AEEC; + 184320 bytes downloaded
[01:13:27] Initial: 4C66; + 194560 bytes downloaded
[01:13:27] Initial: AE1E; + 204800 bytes downloaded
[01:13:28] Initial: A37E; + 215040 bytes downloaded
[01:13:28] Initial: 8193; + 225280 bytes downloaded
[01:13:28] Initial: 9F05; + 235520 bytes downloaded
[01:13:28] Initial: AAA5; + 245760 bytes downloaded
[01:13:28] Initial: 6400; + 256000 bytes downloaded
[01:13:28] Initial: 6E3D; + 266240 bytes downloaded
[01:13:29] Initial: EA6B; + 276480 bytes downloaded
[01:13:29] Initial: 820A; + 286720 bytes downloaded
[01:13:29] Initial: DE6D; + 296960 bytes downloaded
[01:13:29] Initial: B97B; + 307200 bytes downloaded
[01:13:29] Initial: 9D5D; + 317440 bytes downloaded
[01:13:30] Initial: 91D7; + 327680 bytes downloaded
[01:13:30] Initial: BB3B; + 337920 bytes downloaded
[01:13:30] Initial: 611B; + 348160 bytes downloaded
[01:13:30] Initial: B290; + 358400 bytes downloaded
[01:13:30] Initial: B0AA; + 368640 bytes downloaded
[01:13:31] Initial: 6A85; + 378880 bytes downloaded
[01:13:31] Initial: BF10; + 389120 bytes downloaded
[01:13:31] Initial: A818; + 399360 bytes downloaded
[01:13:31] Initial: 90E1; + 409600 bytes downloaded
[01:13:31] Initial: 2869; + 419840 bytes downloaded
[01:13:32] Initial: CAFE; + 430080 bytes downloaded
[01:13:32] Initial: 414B; + 440320 bytes downloaded
[01:13:32] Initial: 9B7A; + 450560 bytes downloaded
[01:13:32] Initial: 33AA; + 460800 bytes downloaded
[01:13:32] Initial: B1D5; + 471040 bytes downloaded
[01:13:32] Initial: 0206; + 481280 bytes downloaded
[01:13:33] Initial: 11F4; + 491520 bytes downloaded
[01:13:33] Initial: 31B5; + 501760 bytes downloaded
[01:13:33] Initial: 46B2; + 512000 bytes downloaded
[01:13:33] Initial: 3113; + 522240 bytes downloaded
[01:13:33] Initial: 525A; + 532480 bytes downloaded
[01:13:34] Initial: 66F9; + 542720 bytes downloaded
[01:13:34] Initial: 9672; + 552960 bytes downloaded
[01:13:34] Initial: 9058; + 563200 bytes downloaded
[01:13:34] Initial: 49ED; + 573440 bytes downloaded
[01:13:34] Initial: 515D; + 583680 bytes downloaded
[01:13:35] Initial: CAC0; + 593920 bytes downloaded
[01:13:35] Initial: 0B15; + 604160 bytes downloaded
[01:13:35] Initial: 5A89; + 614400 bytes downloaded
[01:13:35] Initial: 0F31; + 624640 bytes downloaded
[01:13:35] Initial: 2BC3; + 634880 bytes downloaded
[01:13:35] Initial: 3C06; + 645120 bytes downloaded
[01:13:36] Initial: 89C7; + 655360 bytes downloaded
[01:13:36] Initial: 6C54; + 665600 bytes downloaded
[01:13:36] Initial: 8D4D; + 675840 bytes downloaded
[01:13:36] Initial: EA59; + 686080 bytes downloaded
[01:13:36] Initial: C563; + 696320 bytes downloaded
[01:13:37] Initial: 8D45; + 706560 bytes downloaded
[01:13:37] Initial: 9BD0; + 716800 bytes downloaded
[01:13:37] Initial: 130C; + 727040 bytes downloaded
[01:13:37] Initial: CDA1; + 737280 bytes downloaded
[01:13:37] Initial: 7681; + 747520 bytes downloaded
[01:13:37] Initial: 1110; + 757760 bytes downloaded
[01:13:38] Initial: EE35; + 768000 bytes downloaded
[01:13:38] Initial: E5E1; + 778240 bytes downloaded
[01:13:38] Initial: 4B97; + 788480 bytes downloaded
[01:13:38] Initial: 4D75; + 798720 bytes downloaded
[01:13:38] Initial: E268; + 808960 bytes downloaded
[01:13:39] Initial: FAC6; + 819200 bytes downloaded
[01:13:39] Initial: A625; + 829440 bytes downloaded
[01:13:39] Initial: A12A; + 839680 bytes downloaded
[01:13:39] Initial: 83A3; + 849920 bytes downloaded
[01:13:39] Initial: 3BEA; + 860160 bytes downloaded
[01:13:40] Initial: 5298; + 870400 bytes downloaded
[01:13:40] Initial: 4811; + 880640 bytes downloaded
[01:13:40] Initial: EB07; + 890880 bytes downloaded
[01:13:40] Initial: 83FC; + 901120 bytes downloaded
[01:13:40] Initial: FA4E; + 911360 bytes downloaded
[01:13:40] Initial: 2945; + 921600 bytes downloaded
[01:13:46] Initial: 6BC9; + 931840 bytes downloaded
[01:13:46] Initial: E495; + 942080 bytes downloaded
[01:13:46] Initial: 1050; + 952320 bytes downloaded
[01:13:47] Initial: 2070; + 962560 bytes downloaded
[01:13:47] Initial: 1083; + 972800 bytes downloaded
[01:13:47] Initial: 96E5; + 983040 bytes downloaded
[01:13:47] Initial: 3EEE; + 993280 bytes downloaded
[01:13:47] Initial: 84AC; + 1003520 bytes downloaded
[01:13:47] Initial: 3B6B; + 1013760 bytes downloaded
[01:13:48] Initial: 3030; + 1024000 bytes downloaded
[01:13:48] Initial: 4B95; + 1034240 bytes downloaded
[01:13:48] Initial: D9BC; + 1044480 bytes downloaded
[01:13:48] Initial: C5B8; + 1054720 bytes downloaded
[01:13:48] Initial: A5EF; + 1064960 bytes downloaded
[01:13:48] Initial: 28DC; + 1075200 bytes downloaded
[01:13:49] Initial: 0943; + 1085440 bytes downloaded
[01:13:49] Initial: 338A; + 1095680 bytes downloaded
[01:13:49] Initial: ADFC; + 1105920 bytes downloaded
[01:13:49] Initial: ED39; + 1116160 bytes downloaded
[01:13:49] Initial: D284; + 1126400 bytes downloaded
[01:13:49] Initial: 0057; + 1136640 bytes downloaded
[01:13:49] Initial: 3E65; + 1146880 bytes downloaded
[01:13:49] Initial: FCB5; + 1157120 bytes downloaded
[01:13:49] Initial: A7D8; + 1167360 bytes downloaded
[01:13:49] Initial: A564; + 1177600 bytes downloaded
[01:13:49] Initial: 7654; + 1187840 bytes downloaded
[01:13:49] Initial: 0848; + 1198080 bytes downloaded
[01:13:49] Initial: 471E; + 1208320 bytes downloaded
[01:13:49] Initial: A7F3; + 1218560 bytes downloaded
[01:13:49] Initial: FA59; + 1228800 bytes downloaded
[01:13:49] Initial: FBF2; + 1239040 bytes downloaded
[01:13:49] Initial: F54E; + 1249280 bytes downloaded
[01:13:49] Initial: 3023; + 1259520 bytes downloaded
[01:13:49] Initial: AB37; + 1269760 bytes downloaded
[01:13:49] Initial: 0896; + 1280000 bytes downloaded
[01:13:49] Initial: 756D; + 1290240 bytes downloaded
[01:13:50] Initial: C1E7; + 1300480 bytes downloaded
[01:13:50] Initial: 9AAC; + 1310720 bytes downloaded
[01:13:50] Initial: E5AF; + 1320960 bytes downloaded
[01:13:50] Initial: BBE3; + 1331200 bytes downloaded
[01:13:50] Initial: 3596; + 1341440 bytes downloaded
[01:13:51] Initial: 924C; + 1351680 bytes downloaded
[01:13:51] Initial: 30B7; + 1361920 bytes downloaded
[01:13:51] Initial: AEB7; + 1372160 bytes downloaded
[01:13:51] Initial: 7D25; + 1382400 bytes downloaded
[01:13:51] Initial: 0FEB; + 1392640 bytes downloaded
[01:13:51] Initial: 3131; + 1402880 bytes downloaded
[01:13:52] Initial: 755F; + 1413120 bytes downloaded
[01:13:52] Initial: 4800; + 1423360 bytes downloaded
[01:13:52] Initial: 1282; + 1433600 bytes downloaded
[01:13:53] Initial: B2A3; + 1443840 bytes downloaded
[01:13:53] Initial: 21E9; + 1454080 bytes downloaded
[01:13:53] Initial: 789E; + 1464320 bytes downloaded
[01:13:53] Initial: 8542; + 1474560 bytes downloaded
[01:13:53] Initial: 3A56; + 1484800 bytes downloaded
[01:13:53] Initial: D4FE; + 1490945 bytes downloaded
[01:13:53] Verifying core Core_a1.fah...
[01:13:53] Signature is VALID
[01:13:53]
[01:13:53] Trying to unzip core FahCore_a1.exe
[01:13:53] Decompressed FahCore_a1.exe (3625104 bytes) successfully
[01:13:53] + Core successfully engaged
[01:14:10]
[01:14:10] + Processing work unit
[01:14:10] Work type a1 not eligible for variable processors
[01:14:10] Core required: FahCore_a1.exe
[01:14:10] Core found.
[01:14:10] Working on queue slot 02 [February 3 01:14:10 UTC]
[01:14:10] + Working ...
[01:14:10] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 22960 -version 624'

[01:14:10]
[01:14:10] *------------------------------*
[01:14:10] Folding@Home Gromacs SMP Core
[01:14:10] Version 1.74 (November 27, 2006)
[01:14:10]
[01:14:10] Preparing to commence simulation
[01:14:10] - Ensuring status. Please wait.
[01:14:11] - Starting from initial work packet
[01:14:11]
[01:14:11] Project: 2653 (Run 27, Clone 189, Gen 139)
[01:14:11]
[01:14:11] Assembly optimizations on if available.
[01:14:11] Entering M.D.
[01:14:28]  work packet
[01:14:28]
[01:14:28] Project: 2653 (Run 27, Clone 189, Gen 139)
[01:14:28]
[01:14:28]  (Run 27, Clone 189, Gen 139)
[01:14:28]
[01:14:28] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=2, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NODEID=1 argc=15
NODEID=2 argc=15
NODEID=0 argc=15
NODEID=3 argc=15
starting mdrun 'Protein in POPC'
500000 steps,   1000.0 ps.

[01:14:36] Protein: ProteiWriting local files
[01:14:36] cal files
[01:14:36] boost OK.
[01:14:36] boost OK.
[01:14:37] cal files
[01:14:37] Completed 0 out of 500000 steps  (0 percent)
[01:29:38] Timered checkpoint triggered.
[01:44:09] Writing local files
[01:44:09] Completed 5000 out of 500000 steps  (1 percent)
[01:59:09] Timered checkpoint triggered.
[02:12:41] Writing local files
[02:12:41] Completed 10000 out of 500000 steps  (2 percent)
[02:27:42] Timered checkpoint triggered.
[02:41:12] Writing local files
[02:41:12] Completed 15000 out of 500000 steps  (3 percent)
[02:56:13] Timered checkpoint triggered.
[03:09:43] Writing local files
[03:09:43] Completed 20000 out of 500000 steps  (4 percent)
[03:24:44] Timered checkpoint triggered.
[03:38:16] Writing local files
[03:38:17] Completed 25000 out of 500000 steps  (5 percent)
[03:53:17] Timered checkpoint triggered.
[04:06:48] Writing local files
[04:06:48] Completed 30000 out of 500000 steps  (6 percent)
[04:21:49] Timered checkpoint triggered.
[04:35:25] Writing local files
[04:35:25] Completed 35000 out of 500000 steps  (7 percent)
[04:50:26] Timered checkpoint triggered.
[05:05:26] Timered checkpoint triggered.
[05:08:04] Writing local files
[05:08:04] Completed 40000 out of 500000 steps  (8 percent)
[05:23:05] Timered checkpoint triggered.
[05:23:14] ***** Got a SIGTERM signal (15)
[05:23:14] Killing all core threads

Folding@Home Client Shutdown.
[cli_3]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[0]0:Return code = 102
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 1
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit http://www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.


Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected


--- Opening Log file [February 3 05:23:19 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/stuff/fah6-1
Executable: ./fah6
Arguments: -verbosity 9 -smp

[05:23:19] - Ask before connecting: No
[05:23:19] - User name: chrisretusn (Team 2291)
[05:23:19] - User ID: 83719AB3EDA1FA2
[05:23:19] - Machine ID: 1
[05:23:19]
[05:23:20] Loaded queue successfully.
[05:23:20] - Autosending finished units... [February 3 05:23:20 UTC]
[05:23:20] Trying to send all finished work units
[05:23:20] + No unsent completed units remaining.
[05:23:20] - Autosend completed
[05:23:20]
[05:23:20] + Processing work unit
[05:23:20] Work type a1 not eligible for variable processors
[05:23:20] Core required: FahCore_a1.exe
[05:23:20] Core found.
[05:23:20] Working on queue slot 02 [February 3 05:23:20 UTC]
[05:23:20] + Working ...
[05:23:20] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 26287 -version 624'

[05:23:20]
[05:23:20] *------------------------------*
[05:23:20] Folding@Home Gromacs SMP Core
[05:23:20] Version 1.74 (November 27, 2006)
[05:23:20]
[05:23:20] Preparing to commence simulation
[05:23:20] - Ensuring status. Please wait.
[05:23:20]
[05:23:21] Project: 2653 (Run 27, Clone 189, Gen 139)
[05:23:21]
[05:23:21] Assembly optimizations on if available.
[05:23:21] Entering M.D.
[05:23:38] Expanded 2432187 -> 1288362
[05:23:38] Project: 2653 (Run 27, Clone 1
[05:23:38] Project: 2653Entering M.D.
[05:23:38] e 189, Gen 139)
[05:23:38]
[05:23:38] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=racermach
NNODES=4, MYRANK=1, HOSTNAME=racermach
NNODES=4, MYRANK=3, HOSTNAME=racermach
NNODES=4, MYRANK=2, HOSTNAME=racermach
NODEID=1 argc=15
NODEID=0 argc=15
NODEID=2 argc=15
NODEID=3 argc=15
[05:23:44] Calling FAH init
(single precision)
[05:23:45] in POPC
[05:23:45] Writing local files
[05:23:45]  checkpoint)
[05:23:45] Read checkpoint
[05:23:45] Protein: Protein in POPC
[05:23:45] Writing local files
starting mdrun 'Protein in POPC'
500000 steps,   1000.0 ps.

[05:23:45] Completed 42190 out of 500000 steps  (8 percent)
[05:23:45] Extra SSE boost OK.
[05:38:45] Timered checkpoint triggered.
[05:39:37] Writing local files
[05:39:37] Completed 45000 out of 500000 steps  (9 percent)
[05:54:37] Timered checkpoint triggered.
[06:07:43] Writing local files
[06:07:43] Completed 50000 out of 500000 steps  (10 percent)
[06:22:42] Timered checkpoint triggered.
[06:35:52] Writing local files
[06:35:52] Completed 55000 out of 500000 steps  (11 percent)
[06:50:52] Timered checkpoint triggered.
[07:03:57] Writing local files
[07:03:58] Completed 60000 out of 500000 steps  (12 percent)
[07:18:58] Timered checkpoint triggered.
[07:32:03] Writing local files
[07:32:03] Completed 65000 out of 500000 steps  (13 percent)
[07:47:03] Timered checkpoint triggered.
[08:00:08] Writing local files
[08:00:08] Completed 70000 out of 500000 steps  (14 percent)
[08:15:08] Timered checkpoint triggered.
[08:28:12] Writing local files
[08:28:12] Completed 75000 out of 500000 steps  (15 percent)
[08:43:12] Timered checkpoint triggered.
[08:58:12] Timered checkpoint triggered.
[08:58:15] Writing local files
[08:58:15] Completed 80000 out of 500000 steps  (16 percent)
[09:13:16] Timered checkpoint triggered.
[09:28:15] Timered checkpoint triggered.
[09:33:24] Writing local files
[09:33:24] Completed 85000 out of 500000 steps  (17 percent)
[09:48:23] Timered checkpoint triggered.
[10:03:23] Timered checkpoint triggered.
[10:07:55] Writing local files
[10:07:55] Completed 90000 out of 500000 steps  (18 percent)
[10:22:55] Timered checkpoint triggered.
[10:37:55] Timered checkpoint triggered.
[10:39:34] Writing local files
[10:39:34] Completed 95000 out of 500000 steps  (19 percent)
[10:54:35] Timered checkpoint triggered.
[11:09:34] Timered checkpoint triggered.
[11:11:33] Writing local files
[11:11:34] Completed 100000 out of 500000 steps  (20 percent)
[11:23:20] - Autosending finished units... [February 3 11:23:20 UTC]
[11:23:20] Trying to send all finished work units
[11:23:20] + No unsent completed units remaining.
[11:23:20] - Autosend completed
[11:26:33] Timered checkpoint triggered.
[11:41:33] Timered checkpoint triggered.
[11:44:14] Writing local files
[11:44:14] Completed 105000 out of 500000 steps  (21 percent)
[11:59:13] Timered checkpoint triggered.
[12:14:13] Timered checkpoint triggered.
[12:24:22] Writing local files
[12:24:22] Completed 110000 out of 500000 steps  (22 percent)
[12:39:22] Timered checkpoint triggered.
[12:54:22] Timered checkpoint triggered.
[13:08:41] Writing local files
[13:08:41] Completed 115000 out of 500000 steps  (23 percent)
[13:23:40] Timered checkpoint triggered.
[13:38:40] Timered checkpoint triggered.
[13:46:33] Writing local files
[13:46:33] Completed 120000 out of 500000 steps  (24 percent)
[14:01:32] Timered checkpoint triggered.
[14:16:32] Timered checkpoint triggered.
[14:21:17] Writing local files
[14:21:17] Completed 125000 out of 500000 steps  (25 percent)
[14:36:17] Timered checkpoint triggered.
[14:51:17] Timered checkpoint triggered.
[14:55:18] Writing local files
[14:55:19] Completed 130000 out of 500000 steps  (26 percent)
[15:10:18] Timered checkpoint triggered.
[15:24:18] Writing local files
[15:24:18] Completed 135000 out of 500000 steps  (27 percent)
[15:39:18] Timered checkpoint triggered.
[15:52:26] Writing local files
[15:52:27] Completed 140000 out of 500000 steps  (28 percent)
[16:07:27] Timered checkpoint triggered.
[16:20:34] Writing local files
[16:20:34] Completed 145000 out of 500000 steps  (29 percent)
[16:35:34] Timered checkpoint triggered.
[16:48:47] Writing local files
[16:48:48] Completed 150000 out of 500000 steps  (30 percent)
[17:03:47] Timered checkpoint triggered.
[17:16:57] Writing local files
[17:16:57] Completed 155000 out of 500000 steps  (31 percent)
[17:23:20] - Autosending finished units... [February 3 17:23:20 UTC]
[17:23:20] Trying to send all finished work units
[17:23:20] + No unsent completed units remaining.
[17:23:20] - Autosend completed
[17:31:57] Timered checkpoint triggered.
[17:45:07] Writing local files
[17:45:07] Completed 160000 out of 500000 steps  (32 percent)
[18:00:07] Timered checkpoint triggered.
[18:13:13] Writing local files
[18:13:13] Completed 165000 out of 500000 steps  (33 percent)
[18:28:13] Timered checkpoint triggered.
[18:41:19] Writing local files
[18:41:19] Completed 170000 out of 500000 steps  (34 percent)
[18:56:19] Timered checkpoint triggered.
[19:09:32] Writing local files
[19:09:32] Completed 175000 out of 500000 steps  (35 percent)
[19:24:31] Timered checkpoint triggered.
[19:37:44] Writing local files
[19:37:44] Completed 180000 out of 500000 steps  (36 percent)
[19:52:44] Timered checkpoint triggered.
[20:05:56] Writing local files
[20:05:56] Completed 185000 out of 500000 steps  (37 percent)
[20:20:57] Timered checkpoint triggered.
[20:34:09] Writing local files
[20:34:09] Completed 190000 out of 500000 steps  (38 percent)
[20:49:08] Timered checkpoint triggered.
[21:02:15] Writing local files
[21:02:15] Completed 195000 out of 500000 steps  (39 percent)
[21:17:16] Timered checkpoint triggered.
[21:30:27] Writing local files
[21:30:27] Completed 200000 out of 500000 steps  (40 percent)
[21:45:27] Timered checkpoint triggered.
[21:58:34] Writing local files
[21:58:35] Completed 205000 out of 500000 steps  (41 percent)
[22:13:34] Timered checkpoint triggered.
[22:26:44] Writing local files
[22:26:44] Completed 210000 out of 500000 steps  (42 percent)
[22:41:45] Timered checkpoint triggered.
[22:54:57] Writing local files
[22:54:57] Completed 215000 out of 500000 steps  (43 percent)
[23:09:57] Timered checkpoint triggered.
[23:23:05] Writing local files
[23:23:06] Completed 220000 out of 500000 steps  (44 percent)
[23:23:20] - Autosending finished units... [February 3 23:23:20 UTC]
[23:23:20] Trying to send all finished work units
[23:23:20] + No unsent completed units remaining.
[23:23:20] - Autosend completed
[23:38:05] Timered checkpoint triggered.
[23:51:17] Writing local files
[23:51:17] Completed 225000 out of 500000 steps  (45 percent)
[00:06:16] Timered checkpoint triggered.
[00:21:16] Timered checkpoint triggered.
[00:33:11] Writing local files
[00:33:11] Completed 230000 out of 500000 steps  (46 percent)
[00:48:12] Timered checkpoint triggered.
[01:03:12] Timered checkpoint triggered.
[01:05:58] Writing local files
[01:05:58] Completed 235000 out of 500000 steps  (47 percent)
[01:20:58] Timered checkpoint triggered.
[01:35:58] Timered checkpoint triggered.
[01:37:52] Writing local files
[01:37:52] Completed 240000 out of 500000 steps  (48 percent)
[01:52:52] Timered checkpoint triggered.
[02:07:52] Timered checkpoint triggered.
[02:08:11] Writing local files
[02:08:11] Completed 245000 out of 500000 steps  (49 percent)
[02:23:11] Timered checkpoint triggered.
[02:38:11] Timered checkpoint triggered.
[02:44:31] Writing local files
[02:44:31] Completed 250000 out of 500000 steps  (50 percent)
[02:59:31] Timered checkpoint triggered.
[03:14:31] Timered checkpoint triggered.
[03:18:07] Writing local files
[03:18:07] Completed 255000 out of 500000 steps  (51 percent)
[03:33:07] Timered checkpoint triggered.
[03:48:07] Timered checkpoint triggered.
[03:51:34] Writing local files
[03:51:34] Completed 260000 out of 500000 steps  (52 percent)
[04:06:34] Timered checkpoint triggered.
[04:21:35] Timered checkpoint triggered.
[04:27:15] Writing local files
[04:27:16] Completed 265000 out of 500000 steps  (53 percent)
[04:42:15] Timered checkpoint triggered.
[04:57:15] Timered checkpoint triggered.
[05:00:15] Writing local files
[05:00:15] Completed 270000 out of 500000 steps  (54 percent)
[05:15:15] Timered checkpoint triggered.
[05:23:20] - Autosending finished units... [February 4 05:23:20 UTC]
[05:23:20] Trying to send all finished work units
[05:23:20] + No unsent completed units remaining.
[05:23:20] - Autosend completed
[05:30:15] Timered checkpoint triggered.
[05:35:11] Writing local files
[05:35:11] Completed 275000 out of 500000 steps  (55 percent)
[05:50:12] Timered checkpoint triggered.
[06:05:11] Timered checkpoint triggered.
[06:15:38] Writing local files
[06:15:38] Completed 280000 out of 500000 steps  (56 percent)
[06:30:38] Timered checkpoint triggered.
[06:45:38] Timered checkpoint triggered.
[06:52:50] Writing local files
[06:52:50] Completed 285000 out of 500000 steps  (57 percent)
[07:07:50] Timered checkpoint triggered.
[07:22:50] Timered checkpoint triggered.
[07:25:03] Writing local files
[07:25:03] Completed 290000 out of 500000 steps  (58 percent)
[07:40:03] Timered checkpoint triggered.
[07:55:03] Timered checkpoint triggered.
[07:55:42] Writing local files
[07:55:42] Completed 295000 out of 500000 steps  (59 percent)
[08:10:42] Timered checkpoint triggered.
[08:25:42] Timered checkpoint triggered.
[08:26:20] Writing local files
[08:26:20] Completed 300000 out of 500000 steps  (60 percent)
[08:41:20] Timered checkpoint triggered.
[08:56:20] Timered checkpoint triggered.
[08:57:01] Writing local files
[08:57:01] Completed 305000 out of 500000 steps  (61 percent)
[09:12:01] Timered checkpoint triggered.
[09:27:01] Timered checkpoint triggered.
[09:27:42] Writing local files
[09:27:42] Completed 310000 out of 500000 steps  (62 percent)
[09:42:42] Timered checkpoint triggered.
[09:57:42] Timered checkpoint triggered.
[09:58:48] Writing local files
[09:58:48] Completed 315000 out of 500000 steps  (63 percent)
[10:13:48] Timered checkpoint triggered.
[10:28:48] Timered checkpoint triggered.
[10:29:28] Writing local files
[10:29:28] Completed 320000 out of 500000 steps  (64 percent)
[10:44:27] Timered checkpoint triggered.
[10:59:28] Timered checkpoint triggered.
[11:00:09] Writing local files
[11:00:09] Completed 325000 out of 500000 steps  (65 percent)
[11:15:09] Timered checkpoint triggered.
[11:23:20] - Autosending finished units... [February 4 11:23:20 UTC]
[11:23:20] Trying to send all finished work units
[11:23:20] + No unsent completed units remaining.
[11:23:20] - Autosend completed
[11:30:09] Timered checkpoint triggered.
[11:31:33] Writing local files
[11:31:33] Completed 330000 out of 500000 steps  (66 percent)
[11:46:32] Timered checkpoint triggered.
[12:01:32] Timered checkpoint triggered.
[12:04:08] Writing local files
[12:04:08] Completed 335000 out of 500000 steps  (67 percent)
[12:19:08] Timered checkpoint triggered.
[12:34:09] Timered checkpoint triggered.
[12:37:57] Writing local files
[12:37:57] Completed 340000 out of 500000 steps  (68 percent)
[12:52:57] Timered checkpoint triggered.
[13:07:57] Timered checkpoint triggered.
[13:14:35] Writing local files
[13:14:36] Completed 345000 out of 500000 steps  (69 percent)
[13:29:36] Timered checkpoint triggered.
[13:44:35] Timered checkpoint triggered.
[13:49:03] Writing local files
[13:49:03] Completed 350000 out of 500000 steps  (70 percent)
[14:04:03] Timered checkpoint triggered.
[14:19:03] Timered checkpoint triggered.
[14:24:25] Writing local files
[14:24:25] Completed 355000 out of 500000 steps  (71 percent)
[14:39:25] Timered checkpoint triggered.
[14:54:24] Timered checkpoint triggered.
[14:57:47] Writing local files
[14:57:47] Completed 360000 out of 500000 steps  (72 percent)
[15:12:47] Timered checkpoint triggered.
[15:25:59] Writing local files
[15:25:59] Completed 365000 out of 500000 steps  (73 percent)
[15:41:00] Timered checkpoint triggered.
[15:54:12] Writing local files
[15:54:12] Completed 370000 out of 500000 steps  (74 percent)
[16:09:12] Timered checkpoint triggered.
[16:22:22] Writing local files
[16:22:22] Completed 375000 out of 500000 steps  (75 percent)
[16:37:23] Timered checkpoint triggered.
[16:50:32] Writing local files
[16:50:32] Completed 380000 out of 500000 steps  (76 percent)
[17:05:32] Timered checkpoint triggered.
[17:18:41] Writing local files
[17:18:41] Completed 385000 out of 500000 steps  (77 percent)
[17:23:20] - Autosending finished units... [February 4 17:23:20 UTC]
[17:23:20] Trying to send all finished work units
[17:23:20] + No unsent completed units remaining.
[17:23:20] - Autosend completed
[17:33:41] Timered checkpoint triggered.
[17:46:49] Writing local files
[17:46:49] Completed 390000 out of 500000 steps  (78 percent)
[18:01:49] Timered checkpoint triggered.
[18:15:02] Writing local files
[18:15:02] Completed 395000 out of 500000 steps  (79 percent)
[18:30:02] Timered checkpoint triggered.
[18:43:16] Writing local files
[18:43:16] Completed 400000 out of 500000 steps  (80 percent)
[18:58:16] Timered checkpoint triggered.
[19:11:29] Writing local files
[19:11:30] Completed 405000 out of 500000 steps  (81 percent)
[19:26:29] Timered checkpoint triggered.
[19:39:40] Writing local files
[19:39:40] Completed 410000 out of 500000 steps  (82 percent)
[19:54:40] Timered checkpoint triggered.
[20:07:50] Writing local files
[20:07:50] Completed 415000 out of 500000 steps  (83 percent)
[20:22:49] Timered checkpoint triggered.
[20:36:03] Writing local files
[20:36:03] Completed 420000 out of 500000 steps  (84 percent)
[20:51:03] Timered checkpoint triggered.
[21:04:16] Writing local files
[21:04:17] Completed 425000 out of 500000 steps  (85 percent)
[21:19:16] Timered checkpoint triggered.
[21:32:24] Writing local files
[21:32:24] Completed 430000 out of 500000 steps  (86 percent)
[21:47:24] Timered checkpoint triggered.
[22:00:29] Writing local files
[22:00:30] Completed 435000 out of 500000 steps  (87 percent)
[22:15:29] Timered checkpoint triggered.
[22:28:39] Writing local files
[22:28:39] Completed 440000 out of 500000 steps  (88 percent)
[22:43:38] Timered checkpoint triggered.
[22:56:44] Writing local files
[22:56:44] Completed 445000 out of 500000 steps  (89 percent)
[23:11:44] Timered checkpoint triggered.
[23:23:20] - Autosending finished units... [February 4 23:23:20 UTC]
[23:23:20] Trying to send all finished work units
[23:23:20] + No unsent completed units remaining.
[23:23:20] - Autosend completed
[23:24:52] Writing local files
[23:24:52] Completed 450000 out of 500000 steps  (90 percent)
[23:39:52] Timered checkpoint triggered.
[23:54:52] Timered checkpoint triggered.
[23:55:32] Writing local files
[23:55:32] Completed 455000 out of 500000 steps  (91 percent)
[00:10:32] Timered checkpoint triggered.
[00:25:32] Timered checkpoint triggered.
[00:27:34] Writing local files
[00:27:34] Completed 460000 out of 500000 steps  (92 percent)
[00:42:34] Timered checkpoint triggered.
[00:57:34] Timered checkpoint triggered.
[00:58:59] Writing local files
[00:58:59] Completed 465000 out of 500000 steps  (93 percent)
[01:13:59] Timered checkpoint triggered.
[01:29:00] Timered checkpoint triggered.
[01:29:46] Writing local files
[01:29:46] Completed 470000 out of 500000 steps  (94 percent)
[01:44:47] Timered checkpoint triggered.
[01:59:29] Writing local files
[01:59:29] Completed 475000 out of 500000 steps  (95 percent)
[02:14:29] Timered checkpoint triggered.
[02:29:10] Writing local files
[02:29:10] Completed 480000 out of 500000 steps  (96 percent)
[02:44:10] Timered checkpoint triggered.
[02:59:10] Timered checkpoint triggered.
[03:00:31] Writing local files
[03:00:31] Completed 485000 out of 500000 steps  (97 percent)
[03:15:32] Timered checkpoint triggered.
[03:30:31] Timered checkpoint triggered.
[03:32:49] Writing local files
[03:32:49] Completed 490000 out of 500000 steps  (98 percent)
[03:47:50] Timered checkpoint triggered.
[04:02:50] Timered checkpoint triggered.
[04:05:31] Writing local files
[04:05:31] Completed 495000 out of 500000 steps  (99 percent)
[04:20:32] Timered checkpoint triggered.
[04:35:32] Timered checkpoint triggered.
[04:39:01] Writing local files
[04:39:01] Completed 500000 out of 500000 steps  (100 percent)
[04:39:01] Writing final coordinates.



   M E G A - F L O P S   A C C O U N T I N G

   Parallel run - timing based on wallclock.
   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
   NF=No Forces

 Computing:                        M-Number         M-Flops  % of Flops
-----------------------------------------------------------------------
 VdW(T)                      1245631.593803 67264106.065362    16.5
 RF Coul                      494141.410722 16306666.553826     4.0
 RF Coul [W3]                   2046.646306   200571.337988     0.0
 RF Coul + VdW(T)             565557.765991 36761254.789415     9.0
 RF Coul + VdW(T) [W3]        255169.818395 33172076.391350     8.2
 RF Coul + VdW(T) [W3-W3]     707807.363122 226498356.199040    55.7
 Outer nonbonded loop         218478.844218  2184788.442180     0.5
 1,4 nonbonded interactions     1006.726389    90605.375010     0.0
 NS-Pairs                     457984.462472  9617673.711912     2.4
 Reset In Box                   3547.876090    31930.884810     0.0
 Shift-X                       70861.817824   425170.906944     0.1
 CG-CoM                         1615.097396    46837.824484     0.0
 Sum Forces                   106434.190335   106434.190335     0.0
 Bonds                         12119.172792   521124.430056     0.1
 Angles                        14273.631358  2326601.911354     0.6
 Propers                        4912.769841  1125024.293589     0.3
 Impropers                       919.284488   191211.173504     0.0
 RB-Dihedrals                  12937.281049  3195508.419103     0.8
 Virial                        35527.507033   639495.126594     0.2
 Update                        35478.063445  1099819.966795     0.3
 Stop-CM                       35477.985950   354779.859500     0.1
 P-Coupling                    35478.063445   212868.380670     0.1
 Calc-Ekin                     35478.140940   957909.805380     0.2
 Constraint-V                  35478.063445   212868.380670     0.1
 Constraint-Vir                23086.035297   554064.847128     0.1
 Settle                         7695.345099  2485596.466977     0.6
-----------------------------------------------------------------------
 Total                                      406583345.733976   100.0
-----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time: 170117.000 170117.000    100.0
                       1d23h15:17
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     19.224      2.390      0.508     47.255
[04:39:02] Past main M.D. loop
[04:39:02] Will end MPI now
[04:40:02]
[04:40:02] Finished Work Unit:
[04:40:02] - Reading up to 3720000 from "work/wudata_02.arc": Read 3720000
[04:40:02] - Reading up to 1774748 from "work/wudata_02.xtc": Read 1774748
[04:40:02] goefile size: 0
[04:40:02] logfile size: 26387
[04:40:02] Leaving Run
[04:40:04] - Writing 5525535 bytes of core data to disk...
[04:40:04]   ... Done.
[04:41:30] - Shutting down core
[04:41:30]
[04:41:30] Folding@home Core Shutdown: FINISHED_UNIT
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 18
[0]3:Return code = 0, signaled with Quit
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit http://www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.

[04:43:27] CoreStatus = 12 (18)
[04:43:27] Client-core communications error: ERROR 0x12
[04:43:27] Deleting current work unit & continuing...
[04:47:49] - Warning: Could not delete all work unit files (2): Core returned invalid code
[04:47:49] Trying to send all finished work units
[04:47:49] + No unsent completed units remaining.

My current WU is also an A1 core. My plan is to let this one run without stopping it and see whether the results are sent after it completes.

These failures after 100% completion are the most frustrating. All of that work for nothing. Testing is a bit frustrating too, because of the time it takes to complete a WU.

Also of note: the A1 core WU that failed left behind several work files, which will undoubtedly cause problems later when slot 02 comes around again.
Code: Select all
-rwxr-x--- 1 folder users  406 2010-02-05 12:47 logfile_02.txt* <<-- This one is normally left behind.
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.arc
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.bed
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.goe
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.pdo
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.sas
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.xtc
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02.xvg
-rw-r--r-- 1 folder users  13M 2010-02-05 12:43 wudata_02.xyz
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02CP.arc
-rw-r--r-- 1 folder users    0 2010-02-05 12:45 wudata_02CP.arc.b
-rw-r--r-- 1 folder users 3.4M 2009-12-19 22:16 wudata_02_prev.cpt <<-- This one is normally left behind.
-rwxr-x--- 1 folder users  512 2010-02-05 12:45 wuinfo_02.dat*
-rw-r--r-- 1 folder users 5.3M 2010-02-05 12:40 wuresults_02.dat <<-- This one I believe will cause the next WU to use slot 02 to fail.
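
For anyone who wants to check for the same kind of leftovers, here is a minimal sketch that lists whatever is still present for one queue slot and marks the zero-byte files. The function name `list_slot_leftovers` and the default `work` directory are assumptions made for illustration; nothing here is part of the client itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of the fah6 client): list the files that
# remain in the work directory for one queue slot.
# Usage: list_slot_leftovers SLOT [WORKDIR]
list_slot_leftovers() {
    slot="$1"
    workdir="${2:-work}"
    for f in "$workdir"/*_"$slot"*; do
        # If the glob matched nothing, the pattern itself is returned; skip it.
        [ -e "$f" ] || continue
        if [ -s "$f" ]; then
            echo "non-empty: $f"
        else
            echo "empty: $f"
        fi
    done
}
```

Run from the client directory as `list_slot_leftovers 02`. It only reports; it deliberately does not delete anything.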

I should point out that there are differences between the A2 and A1 runs in my setup. One, which you may have noticed, is that with the A1 core the -oneunit flag is missing; I took it out. Another is a change to my rc.folding script: I commented out the two lines that I had added during my previous testing.
Code: Select all
folding_stop() {
  killall fah6
# sleep 15 <<<<---- commented out
# sync <<<<------commented out
  echo "Folding@Home Stopped"
}
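
A possible variation on the stop function above, as a rough sketch only: instead of a fixed `sleep 15`, poll until `fah6` has actually exited (or a timeout passes) before calling `sync`. The helper name `wait_for_exit` and the 600 second limit are assumptions chosen for illustration, and `pgrep` is assumed to be installed. Whether this avoids the failures described in this thread is untested.

```shell
#!/bin/sh
# Hypothetical helper: wait until no process with the given exact name
# remains, or until the timeout (in seconds) runs out.
# Returns 0 if the process is gone, 1 on timeout.
wait_for_exit() {
    name="$1"
    timeout="${2:-600}"
    waited=0
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

folding_stop() {
    killall fah6
    # Let the client finish shutting down before flushing to disk.
    if ! wait_for_exit fah6 600; then
        echo "fah6 still running after timeout"
    fi
    sync
    echo "Folding@Home Stopped"
}
```

The timeout is there so that a shutdown script does not hang forever if a core never exits.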


I am also thinking the subject of this thread should be renamed to "WU's failing after 100% completion, left over work files"
chrisretusn
 
Posts: 196
Joined: Sat Feb 02, 2008 10:12 am
Location: Philippines

Re: WU's fail after 100%, left over work files, ext4 cause?

Postby tmillic » Sat Feb 13, 2010 12:35 am

I checked my filesystem after reading this and was disappointed to find that I'm not running ext4, yet I'm having similar problems.
Code: Select all
user@pelican:~/folding$ df -T
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
aufs          aufs     2031172    200964   1830208  10% /
tmpfs        tmpfs     2031172         0   2031172   0% /lib/init/rw
udev         tmpfs       10240       172     10068   2% /dev
tmpfs        tmpfs     2031172         4   2031168   1% /dev/shm
/dev/sr0   iso9660      490566    490566         0 100% /live/image
tmpfs        tmpfs     2031172    200964   1830208  10% /live/cow
tmpfs        tmpfs     2031172         0   2031172   0% /live
tmpfs        tmpfs     2031172        84   2031088   1% /tmp
/dev/ram1     ext2       96828     92690      4138  96% /home
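
To look at only the filesystem that holds the client directory rather than the whole table, a small sketch along these lines may help. The function name `fs_type` is made up for illustration, and the `-T` option is assumed to be available (it is in GNU df).

```shell
#!/bin/sh
# Hypothetical helper: print the filesystem type that holds a given path.
# -P keeps each entry on a single line; -T adds the type column.
fs_type() {
    df -PT "${1:-.}" | awk 'NR == 2 { print $2 }'
}
```

Calling `fs_type ~/folding` should print the type of whichever mount contains that directory.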
tmillic
 
Posts: 89
Joined: Wed Sep 17, 2008 4:36 pm
Location: Fayetteville, Arkansas
