Project: 2609 (Run 0, Clone 492, Gen 0)

Postby klasseng » Wed Jan 02, 2008 1:11 am

My machine downloaded this WU, crashed, deleted it, downloaded it again, crashed, deleted it again, and is now running successfully on a 3062 WU.

One instance of the cycle is quoted below:

[18:15:40] + Results successfully sent
[18:15:40] Thank you for your contribution to Folding@Home.
[18:15:40] + Number of Units Completed: 67

[18:19:49] - Warning: Could not delete all work unit files (7): Core returned invalid code
[18:19:49] Trying to send all finished work units
[18:19:49] + No unsent completed units remaining.
[18:19:49] - Preparing to get new work unit...
[18:19:49] + Attempting to get work packet
[18:19:49] - Will indicate memory of 4000 MB
[18:19:49] - Connecting to assignment server
[18:19:49] Connecting to http://assign.stanford.edu:8080/
[18:19:50] Posted data.
[18:19:50] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[18:19:50] + News From Folding@Home: Welcome to Folding@Home
[18:19:50] Loaded queue successfully.
[18:19:50] Connecting to http://171.64.65.56:8080/
[18:19:53] Posted data.
[18:19:53] Initial: 0000; - Receiving payload (expected size: 132347)
[18:19:58] - Downloaded at ~25 kB/s
[18:19:58] - Averaged speed for that direction ~110 kB/s
[18:19:58] + Received work.
[18:19:58] Trying to send all finished work units
[18:19:58] + No unsent completed units remaining.
[18:19:58] + Closed connections
[18:19:58]
[18:19:58] + Processing work unit
[18:19:58] Core required: FahCore_a1.exe
[18:19:58] Core found.
[18:19:58] Working on Unit 08 [January 1 18:19:58]
[18:19:58] + Working ...
[18:19:58] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 4834 -version 600'

[18:19:58]
[18:19:58] *------------------------------*
[18:19:58] Folding@Home Gromacs SMP Core
[18:19:58] Version 1.74 (September 24, 2007)
[18:19:58]
[18:19:58] Preparing to commence simulation
[18:19:58] - Ensuring status. Please wait.
[18:19:58] - Starting from initial work packet
[18:19:58]
[18:19:58] Project: 2609 (Run 0, Clone 492, Gen 0)
[18:19:58]
[18:19:58] Assembly optimizations on if available.
[18:19:58] Entering M.D.
[18:20:15] 8 percent)
[18:20:16] cket
[18:20:16]
[18:20:16] Project: 2609 (Run 0, Clone 492, Gen 0)
[18:20:16]
[18:20:16] Entering M.D.
[18:20:16] one 492, Gen 0)
[18:20:16]
[18:20:16] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=klasseng.local
NNODES=4, MYRANK=2, HOSTNAME=klasseng.local
NNODES=4, MYRANK=1, HOSTNAME=klasseng.local
NNODES=4, MYRANK=3, HOSTNAME=klasseng.local
NODEID=1 argc=15
NODEID=0 argc=15
NODEID=2 argc=15
NODEID=3 argc=15
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2004, The GROMACS development team,
check out http://www.gromacs.org for more information.

This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[18:20:22] mdrunner cpfilename:
[18:20:22] Rejecting checkpoint
FahCore_a1.exe(24288,0x1801600) malloc: *** vm_allocate(size=3222790144) failed (error code=3)
FahCore_a1.exe(24288,0x1801600) malloc: *** error: can't allocate region
FahCore_a1.exe(24288,0x1801600) malloc: *** set a breakpoint in szone_error to debug
-------------------------------------------------------
Program Core_A1.exe, VERSION 3.3
Source code file: smalloc.c, line: 113

Fatal error:
calloc for ir->opts.nrdf (nelem=-1341786129, elsize=4, file tpxio.c, line 492)
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day
: Cannot allocate memory
Error on node 0, will try to stop all the nodes
[18:20:22] Gromacs error.
[18:20:22]
[18:20:22] Folding@home Core Shutdown: UNKNOWN_ERROR
[0]0:Return code = 121
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[18:20:27] CoreStatus = 79 (121)
[18:20:27] Client-core communications error: ERROR 0x79
[18:20:27] Deleting current work unit & continuing...
[18:24:53] - Warning: Could not delete all work unit files (8): Core returned invalid code
[18:24:53] Trying to send all finished work units
[18:24:53] + No unsent completed units remaining.
peace,
Grant
klasseng
 
Posts: 125
Joined: Thu Dec 27, 2007 7:08 am
Location: Canada

Re: Project: 2609 (Run 0, Clone 492, Gen 0)

Postby bruce » Wed Jan 02, 2008 1:35 pm

CoreStatus = 79 isn't really very helpful in figuring out what's wrong. :(
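For what it's worth, the log quoted in the first post carries a little more than the status code. Here is a rough sketch in Python that checks the numbers against each other. It assumes the core multiplies nelem by elsize in 32-bit arithmetic and that the allocator rounds requests up to 4096-byte pages; neither is confirmed anywhere in the thread. If those assumptions hold, the failed vm_allocate size is exactly what a garbage (negative) element count would produce, which is consistent with a bad value being read from the work unit file rather than a real 3 GB requirement.

```python
# Values copied from the log quoted in the first post.
NELEM = -1341786129             # nelem in the calloc error
ELSIZE = 4                      # elsize in the calloc error
VM_ALLOCATE_SIZE = 3222790144   # size in the malloc failure line
PAGE_SIZE = 4096                # assumption: 4 KiB pages


def wrapped_request(nelem, elsize, bits=32):
    """Byte count of nelem * elsize wrapped to an unsigned `bits`-bit value.

    Assumption: the size computation overflows the way 32-bit C arithmetic would.
    """
    return (nelem * elsize) % (1 << bits)


def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


requested = wrapped_request(NELEM, ELSIZE)
print(requested)                                            # 3222790076
print(round_up(requested, PAGE_SIZE) == VM_ALLOCATE_SIZE)   # True
print(0x79)                                                 # 121, the "Return code = 121" from node 0
```

So "ERROR 0x79" and "Return code = 121" are the same number, and the interesting part is the negative nelem a few lines earlier.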

I have two suggestions but they may not be helpful.

First, look in the WORK directory for any files that are not related to the current WU and delete them (probably the easiest way is to sort by date). Second, check the amount of RAM that FAH is reporting to the server. You'll probably need -verbosity 9 to see the value, but if it's nonsensical, set a realistic value in the Advanced configuration settings (restart the client with either -config or -configonly).
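If it helps, the reported RAM can be pulled out of the client log with a few lines of Python. This is only a sketch: the pattern is taken from the "Will indicate memory of 4000 MB" line in the log quoted above, and the function name is made up for illustration.

```python
import re

# Matches the client log line, e.g. "[18:19:49] - Will indicate memory of 4000 MB".
MEMORY_LINE = re.compile(r"Will indicate memory of (\d+) MB")


def reported_memory_mb(log_text):
    """Return every memory value (in MB) the client said it would report."""
    return [int(m.group(1)) for m in MEMORY_LINE.finditer(log_text)]


sample = (
    "[18:19:49] + Attempting to get work packet\n"
    "[18:19:49] - Will indicate memory of 4000 MB\n"
    "[18:19:49] - Connecting to assignment server\n"
)
print(reported_memory_mb(sample))  # [4000]
```

To run it on a real log, read the file into a string first and pass that in; the log file's name and location depend on your install, so I have not hard-coded one here.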
bruce
 
Posts: 20122
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Project: 2609 (Run 0, Clone 492, Gen 0)

Postby klasseng » Wed Jan 02, 2008 9:50 pm

Bruce:

"CoreStatus = 79" is all that the CLI client gives us . . . maybe those who write it should make the error reports more useful.

------
Support Customer: "No matter how many times I type 11, nothing happens"
Support Staff: " . . . . why are you typing 11?"
Support Customer: "When the machine crashed it said, 'Error type 11'"
-------

Since FAH6 already deleted this 2609 WU, I couldn't work with it. It went on and completed a 2605 OK.

Just now it tried Project: 2609 (Run 0, Clone 492, Gen 0) again. fah6 crashed out and proceeded to delete the current WU.

1. I did a control-c and exited fah6.

2. I moved all files except:
- client.cfg
- fah6
- FahCore_a1.exe
and
- mpiexec
to a new folder "aborted stuff" just to get it out of the way.

3. I ran with -configonly and changed the memory setting to 2000 MB (it had been at 4000 MB - I have 9000 MB)

4. I restarted fah6 but it grabbed a new 2605 . . . so we won't know if the change in the memory setting did the trick until it tries another 2609.
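For anyone who wants to repeat step 2 without dragging files around by hand, here is a rough Python sketch of it. The keep list and the "aborted stuff" folder name are the ones from my steps above; the function name is made up, and it assumes fah6 has already been stopped.

```python
import shutil
from pathlib import Path

# Files left in place, as listed in step 2.
KEEP = {"client.cfg", "fah6", "FahCore_a1.exe", "mpiexec"}
ASIDE = "aborted stuff"


def move_aside(client_dir, keep=KEEP, aside=ASIDE):
    """Move everything in client_dir except `keep` into client_dir/aside.

    Returns the names that were moved. Stop the client before calling this.
    """
    client_dir = Path(client_dir)
    target = client_dir / aside
    target.mkdir(exist_ok=True)
    moved = []
    # sorted() takes a snapshot of the listing before anything is moved.
    for entry in sorted(client_dir.iterdir()):
        if entry.name in keep or entry.name == aside:
            continue
        shutil.move(str(entry), str(target / entry.name))
        moved.append(entry.name)
    return moved
```

Call it as move_aside("/path/to/your/fah/folder") with whatever your client folder actually is. It moves the work folder and the queue along with everything else, which is what I did by hand.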

peace,
Grant
klasseng
 
Posts: 125
Joined: Thu Dec 27, 2007 7:08 am
Location: Canada

