Project: 3062 (Run 2, Clone 101, Gen 11) crashing at 40%

Moderators: Site Moderators, FAHC Science Team

dschief
Posts: 146
Joined: Tue Dec 04, 2007 5:56 am
Hardware configuration: ASUS P5K-E, Q6600/ 8 gig ram Win-7

2X ASUS z97-K 16 G Ram Win-7_64

Project: 3062 (Run 2, Clone 101, Gen 11) crashing at 40%

Post by dschief »

This WU has died twice at 40%, and the client just downloaded the same WU again. I thought the server was set up to switch to a different WU after 2 failed attempts?


[23:26:28] Project: 3062 (Run 2, Clone 101, Gen 11)
[23:26:28] 
[23:26:28] Assembly optimizations on if available.
[23:26:28] Entering M.D.
[23:26:44]  percent)
[23:26:45] - Sta
[23:26:45] Project: 3062 (Run 2, Clone 101, Gen 11)
[23:26:45] 
[23:26:45] Entering M.D.
[23:26:45] ne 101, Gen 11)
[23:26:45] 
[23:26:45] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NODEID=1 argc=15
NODEID=0 argc=15
NODEID=2 argc=15
NODEID=3 argc=15
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit http://www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.

starting mdrun 'p3062_lambda5_99sb'
5000000 steps,  10000.0 ps.

[23:26:51] files
[23:26:51] Completed 0 out of 5000000 steps  (0 percent)
[23:26:51] a SSE boost OK.
[23:41:51] nt triggered.
[23:42:52] Writing local files
[23:42:52] Completed 50000 out of 5000000 steps  (1 percent)
[23:57:52] Timered checkpoint triggered.
[23:58:51] Writing local files
[23:58:51] Completed 100000 out of 5000000 steps  (2 percent)

snip

[09:50:43] Completed 1950000 out of 5000000 steps  (39 percent)
[10:05:44] Timered checkpoint triggered.
[10:06:42] Writing local files
[10:06:42] Completed 2000000 out of 5000000 steps  (40 percent)
[10:21:42] Timered checkpoint triggered.
[10:22:21] Warning:  long 1-4 interactions
[0]0:Return code = 0, signaled with Segmentation fault
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[10:22:25] CoreStatus = 0 (0)
[10:22:25] Client-core communications error: ERROR 0x0
[10:22:25] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 18
[0]3:Return code = 0, signaled with Quit
[10:26:46] - Warning: Could not delete all work unit files (7): Core returned invalid code
[10:26:46] Trying to send all finished work units
[10:26:46] + No unsent completed units remaining.
[10:26:46] - Preparing to get new work unit...
[10:26:46] + Attempting to get work packet
[10:26:46] - Will indicate memory of 2013 MB
[10:26:46] - Connecting to assignment server
[10:26:46] Connecting to http://assign.stanford.edu:8080/
[10:26:47] Posted data.
[10:26:47] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[10:26:47] + News From Folding@Home: Welcome to Folding@Home
[10:26:47] Loaded queue successfully.
[10:26:47] Connecting to http://171.64.65.63:8080/
[10:26:48] Posted data.
[10:26:48] Initial: 0000; - Receiving payload (expected size: 610425)
[10:26:52] - Downloaded at ~149 kB/s
[10:26:52] - Averaged speed for that direction ~151 kB/s
[10:26:52] + Received work.
[10:26:52] + Closed connections
[10:26:57] 
[10:26:57] + Processing work unit
[10:26:57] Core required: FahCore_a1.exe
[10:26:57] Core found.
[10:26:57] Working on Unit 08 [May 7 10:26:57]
[10:26:57] + Working ...
[10:26:57] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -verbose -lifeline 17884 -version 601'

[10:26:57] 
[10:26:57] *------------------------------*
[10:26:57] Folding@Home Gromacs SMP Core
[10:26:57] Version 1.74 (November 27, 2006)
[10:26:57] 
[10:26:57] Preparing to commence simulation
[10:26:57] - Ensuring status. Please wait.
[10:27:14] - Looking at optimizations...
[10:27:14] - Working with standard loops on this execution.
[10:27:14] - Previous termination of core was improper.
[10:27:14] - Going to use standard loops.
[10:27:14] - Files status OK
[10:27:14] arting from initial work packet
[10:27:14] 
[10:27:14] Project: 3062 (Run 2, Clone 101, Gen 11)
[10:27:14] 
[10:27:14] Entering M.D.
[10:27:14] cket
[10:27:14] 
[10:27:14] Project: 3062 (Run 2, Clone 101, Gen 11)
[10:27:14] 
[10:27:14] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NODEID=2 argc=15
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=1 argc=15

starting mdrun 'p3062_lambda5_99sb'
5000000 steps,  10000.0 ps.

[10:27:20]  files
[10:27:20] Extra SSE boost OK.
[10:27:20] Writing local files
[10:27:20] Completed 0 out of 5000000 steps  (0 percent)
[10:42:20] Timered checkpoint triggered.
[10:43:18] Writing local files
[10:43:18] Completed 50000 out of 5000000 steps  (1 percent)
[10:58:18] Timered checkpoint triggered.
[10:59:14] Writing local files
[10:59:14] Completed 100000 out of 5000000 steps  (2 percent)

snip

[20:49:32] Completed 1950000 out of 5000000 steps  (39 percent)
[21:04:32] Timered checkpoint triggered.
[21:05:31] Writing local files
[21:05:31] Completed 2000000 out of 5000000 steps  (40 percent)
[21:20:31] Timered checkpoint triggered.
[21:21:11] Warning:  long 1-4 interactions
[0]0:Return code = 0, signaled with Segmentation fault
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[21:21:15] CoreStatus = 0 (0)
[21:21:15] Client-core communications error: ERROR 0x0
[21:21:15] Deleting current work unit & continuing...
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[21:25:36] - Warning: Could not delete all work unit files (8): Core returned invalid code
[21:25:36] Trying to send all finished work units
[21:25:36] + No unsent completed units remaining.
[21:25:36] - Preparing to get new work unit...
[21:25:36] + Attempting to get work packet
[21:25:36] - Will indicate memory of 2013 MB
[21:25:36] - Connecting to assignment server
[21:25:36] Connecting to http://assign.stanford.edu:8080/
[21:25:36] Posted data.
[21:25:36] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[21:25:36] + News From Folding@Home: Welcome to Folding@Home
[21:25:36] Loaded queue successfully.
[21:25:36] Connecting to http://171.64.65.63:8080/
[21:25:37] Posted data.
[21:25:37] Initial: 0000; - Receiving payload (expected size: 610425)
[21:25:41] - Downloaded at ~149 kB/s
[21:25:41] - Averaged speed for that direction ~151 kB/s
[21:25:41] + Received work.
[21:25:41] + Closed connections
[21:25:46] 
[21:25:46] + Processing work unit
[21:25:46] Core required: FahCore_a1.exe
[21:25:46] Core found.
[21:25:46] Working on Unit 09 [May 7 21:25:46]
[21:25:46] + Working ...
[21:25:46] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 17884 -version 601'

[21:25:46] 
[21:25:46] *------------------------------*
[21:25:46] Folding@Home Gromacs SMP Core
[21:25:46] Version 1.74 (November 27, 2006)
[21:25:46] 
[21:25:46] Preparing to commence simulation
[21:25:46] - Ensuring status. Please wait.
[21:26:03] - Looking at optimizations...
[21:26:03] - Working with standard loops on this execution.
[21:26:03] - Created dyn
[21:26:03] - Files status OK
[21:26:03] as improper.
[21:26:03] - Going to use sta- Expanded 609913 -> 3263133 (decompressed 535.0 percent)
[21:26:04] cket
[21:26:04] 
[21:26:04] Project: 3062 (Run 2, Clone 101, Gen 11)
[21:26:04] 
[21:26:04] Entering M.D.
[21:26:04] cket
[21:26:04] 
[21:26:04] Project: 3062 (Run 2, Clone 101, Gen 11)
[21:26:04] 
[21:26:04] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=1, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=3, HOSTNAME=localhost.localdomain
NNODES=4, MYRANK=2, HOSTNAME=localhost.localdomain
NODEID=3 argc=15
NODEID=0 argc=15
NODEID=1 argc=15
NODEID=2 argc=15
      
starting mdrun 'p3062_lambda5_99sb'
5000000 steps,  10000.0 ps.

[21:26:10] t OK.
[21:26:10] n: p3062_laExtra SSE boost OK.
[21:26:10] ambda5_99sbExtra SSE boost OK.
[21:26:10] 
[21:26:10] Extra SSE boost OK.
[21:26:10] Writing local files
[21:26:10] Completed 0 out of 5000000 steps  (0 percent)
[21:41:10] Timered checkpoint triggered.
[21:42:02] Writing local files
[21:42:02] Completed 50000 out of 5000000 steps  (1 percent)
Snipped, fixed code tags, updated title. -7im
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project: 3062 (Run 2, Clone 101, Gen 11) crashing at 40%

Post by 7im »

3 failed attempts, not 2.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 3062 (Run 2, Clone 101, Gen 11) crashing at 40%

Post by bruce »

7im wrote: 3
And I've seen reports of more than three.

Try qfix. If it finds anything to fix, the client will not download the WU again, no matter how many times the server might have given you the same assignment.
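
If you want to confirm from the log that the same WU really was reassigned after each crash, something like the following rough Python sketch could help. It is not part of the client or of qfix; the function name `failures_by_wu` is made up for illustration, and the strings it looks for ("Project: ... (Run, Clone, Gen)", "Completed ... percent", "Client-core communications error") are simply the ones that appear in the log posted above. Other client versions may word these lines differently.

```python
import re
from collections import defaultdict

# Patterns copied from the log lines in this thread (assumed stable for this client version).
WU_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PCT_RE = re.compile(r"Completed \d+ out of \d+ steps\s+\((\d+) percent\)")
FAIL_MARKER = "Client-core communications error"


def failures_by_wu(log_text):
    """Map each (project, run, clone, gen) to the last percent logged before each crash."""
    failures = defaultdict(list)
    current_wu = None   # WU most recently announced in the log
    last_pct = None     # last "Completed ... (N percent)" seen for that WU
    for line in log_text.splitlines():
        wu_match = WU_RE.search(line)
        pct_match = PCT_RE.search(line)
        if wu_match:
            # A Project line marks the start (or restart) of a WU.
            current_wu = tuple(int(g) for g in wu_match.groups())
            last_pct = None
        elif pct_match:
            last_pct = int(pct_match.group(1))
        elif FAIL_MARKER in line and current_wu is not None:
            # Record where the WU was when the core died.
            failures[current_wu].append(last_pct)
            last_pct = None
    return dict(failures)


if __name__ == "__main__":
    import sys
    # Usage: python check_log.py FAHlog.txt
    with open(sys.argv[1], errors="replace") as fh:
        for wu, percents in failures_by_wu(fh.read()).items():
            print(wu, "failed", len(percents), "time(s) at", percents)
```

Any WU that shows up with two or more entries was handed out again after it crashed, which is the situation described in the first post.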