Project: 2675 (Run 3, Clone 123, Gen 79)


Post by bollix47 »

FYI

I had to stop the WU to do some updating, so I used Ctrl-C. After restarting, the following error was reported and the WU was trashed. I did wait until it had just finished a step before stopping.

[13:43:22] Completed 85008 out of 250000 steps  (34%)
[cli_0]: aborting job:
application called MPI_Abor[13:43:26] ***** Got an Activate signal (2)
t(MPI_COMM_WORLD, 102) - process 0
[13:43:26] Killing all core threads

Folding@Home Client Shutdown.
bollix@Georgina:~/fah/smp$ [0]0:Return code = 102
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit

bollix@Georgina:~/fah/smp$ ./fah6 -advmethods -oneunit

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

8 cores detected


--- Opening Log file [March 2 13:43:54 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bollix/fah/smp
Executable: ./fah6
Arguments: -advmethods -oneunit -smp 8 -verbosity 9 

[13:43:54] - Ask before connecting: No
[13:43:54] - User name: bollix47 (Team 39340)
[13:43:54] - User ID: xxxxxxxxxxxxxxxxx
[13:43:54] - Machine ID: 1
[13:43:54] 
[13:43:55] Loaded queue successfully.
[13:43:55] - Autosending finished units... [March 2 13:43:55 UTC]
[13:43:55] Trying to send all finished work units
[13:43:55] + No unsent completed units remaining.
[13:43:55] - Autosend completed
[13:43:55] 
[13:43:55] + Processing work unit
[13:43:55] Core required: FahCore_a2.exe
[13:43:55] Core found.
[13:43:55] Working on queue slot 02 [March 2 13:43:55 UTC]
[13:43:55] + Working ...
[13:43:55] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 30 -verbose -lifeline 11069 -version 624'

[13:43:55] 
[13:43:55] *------------------------------*
[13:43:55] Folding@Home Gromacs SMP Core
[13:43:55] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[13:43:55] 
[13:43:55] Preparing to commence simulation
[13:43:55] - Ensuring status. Please wait.
[13:43:56] Called DecompressByteArray: compressed_data_size=4842006 data_size=23994061, decompressed_data_size=23994061 diff=0
[13:43:56] - Digital signature verified
[13:43:56] 
[13:43:56] Project: 2675 (Run 3, Clone 123, Gen 79)
[13:43:56] 
[13:43:56] Assembly optimizations on if available.
[13:43:56] Entering M.D.
[13:44:02] Will resume from checkpoint file
[13:44:05] ng M.D.
[13:44:11] Will resume from checkpoint file
NNODES=8, MYRANK=0, HOSTNAME=Georgina
NNODES=8, MYRANK=1, HOSTNAME=Georgina
NNODES=8, MYRANK=2, HOSTNAME=Georgina
NNODES=8, MYRANK=5, HOSTNAME=Georgina
NNODES=8, MYRANK=4, HOSTNAME=Georgina
NNODES=8, MYRANK=6, HOSTNAME=Georgina
NNODES=8, MYRANK=7, HOSTNAME=Georgina
NNODES=8, MYRANK=3, HOSTNAME=Georgina
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=6 argc=20
NODEID=7 argc=20
NODEID=4 argc=20
NODEID=5 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                          :-)  VERSION 4.0.3_pre  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 58

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 3D domain decomposition 2 x 2 x 2
starting mdrun '22878 system in water'
20000002 steps,  40000.0 ps (continuing from step 19750002,  39500.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.3_pre
Source code file: md.c, line: 1107

Fatal error:
Checkpoint error on step 19835020

-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[13:44:14] Resuming from checkpoint
[13:44:14] fcSaveRestoreState: I/O failed dir=0, var=0000000001ED7640, varsize=393012
[13:44:14] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[0]4:Return code = 0, signaled with Quit
[0]5:Return code = 0, signaled with Quit
[0]6:Return code = 0, signaled with Quit
[0]7:Return code = 0, signaled with Quit
[13:44:22] CoreStatus = FF (255)
[13:44:22] Sending work to server
[13:44:22] Project: 2675 (Run 3, Clone 123, Gen 79)
[13:44:22] - Error: Could not get length of results file work/wuresults_02.dat
[13:44:22] - Error: Could not read unit 02 file. Removing from queue.
[13:44:22] Trying to send all finished work units
[13:44:22] + No unsent completed units remaining.
[13:44:22] + -oneunit flag given and have now finished a unit. Exiting.- Preparing to get new work unit...
[13:44:22] ***** Got a SIGTERM signal (15)
[13:44:22] + Attempting to get work packet
[13:44:22] Killing all core threads

Folding@Home Client Shutdown.
[13:44:22] - Will indicate memory of 12026 MB
[13:44:22] - Connecting to assignment server
[13:44:22] - Connecting to assignment server
bollix@Georgina:~/fah/smp$ ls -l qfix
-rw-r--r-- 1 bollix bollix 10152 2008-04-26 09:37 qfix
bollix@Georgina:~/fah/smp$ chmod +x qfix
bollix@Georgina:~/fah/smp$ ./qfix
entry 3, status 0, address 0.0.0.0
entry 4, status 0, address 0.0.0.0
entry 5, status 0, address 0.0.0.0
entry 6, status 0, address 0.0.0.0
entry 7, status 0, address 0.0.0.0
entry 8, status 0, address 0.0.0.0
entry 9, status 0, address 0.0.0.0
entry 0, status 0, address 0.0.0.0
entry 1, status 0, address 171.64.65.64:8080
entry 2, status 0, address 171.64.65.56:8080
File is OK
bollix@Georgina:~/fah/smp$ ./fah6

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

8 cores detected


--- Opening Log file [March 2 13:45:49 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/bollix/fah/smp
Executable: ./fah6
Arguments: -smp 8 -verbosity 9 

[13:45:49] - Ask before connecting: No
[13:45:49] - User name: bollix47 (Team 39340)
[13:45:49] - User ID: xxxxxxxxxxxxxxxx
[13:45:49] - Machine ID: 1
[13:45:49] 
[13:45:49] Loaded queue successfully.
[13:45:49] - Autosending finished units... [March 2 13:45:49 UTC]
[13:45:49] Trying to send all finished work units
[13:45:49] + No unsent completed units remaining.
[13:45:49] - Autosend completed
[13:45:49] - Preparing to get new work unit...
[13:45:49] + Attempting to get work packet
[13:45:49] - Will indicate memory of 12026 MB
[13:45:49] - Connecting to assignment server
[13:45:49] Connecting to http://assign.stanford.edu:8080/
[13:45:49] Posted data.
[13:45:49] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[13:45:49] + News From Folding@Home: Welcome to Folding@Home
[13:45:49] Loaded queue successfully.
[13:45:49] Connecting to http://171.64.65.63:8080/
[13:45:52] Posted data.
[13:45:52] Initial: 0000; - Receiving payload (expected size: 1650945)
[13:45:55] - Downloaded at ~537 kB/s
[13:45:55] - Averaged speed for that direction ~389 kB/s
[13:45:55] + Received work.
[13:45:55] + Closed connections
[13:45:55] 
[13:45:55] + Processing work unit
[13:45:55] Work type a1 not eligible for variable processors
[13:45:55] Core required: FahCore_a1.exe
[13:45:55] Core found.
[13:45:55] Working on queue slot 03 [March 2 13:45:55 UTC]
[13:45:55] + Working ...
[13:45:55] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -priority 96 -checkpoint 30 -verbose -lifeline 11138 -version 624'

[13:45:55] 
[13:45:55] *------------------------------*
[13:45:55] Folding@Home Gromacs SMP Core
[13:45:55] Version 1.74 (November 27, 2006)
[13:45:55] 
[13:45:55] Preparing to commence simulation
[13:45:55] - Ensuring status. Please wait.
[13:46:12] - Looking at optimizations...
[13:46:12] - Working with standard loops on this execution.
[13:46:12] - Previous termination of core was improper.
[13:46:12] - Going to use standard loops.
[13:46:12] - Files status OK
[13:46:13] tarting from initial work pa- Starting from initial work pa- Starting from Entering M.D.
[13:46:13] acket
[13:46:13] 
[13:46:13] Project: 3Entering M.D.
[13:46:13] one 271, Gen 8)
[13:46:13] 
[13:46:13] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=Georgina
NNODES=4, MYRANK=3, HOSTNAME=Georgina
NNODES=4, MYRANK=0, HOSTNAME=Georgina
NNODES=4, MYRANK=1, HOSTNAME=Georgina
NODEID=1 argc=15
NODEID=0 argc=15
NODEID=2 argc=15
NODEID=3 argc=15
      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.

        This inclusion of Gromacs code in the Folding@Home Core is under
        a special license (see http://folding.stanford.edu/gromacs.html)
         specially granted to Stanford by the copyright holders. If you
          are interested in using Gromacs, visit www.gromacs.org where
                you can download a free version of Gromacs under
         the terms of the GNU General Public License (GPL) as published
       by the Free Software Foundation; either version 2 of the License,
                     or (at your option) any later version.

[13:46:19] Protein: 66728 p3065_lambda5_99sb_big
[13:46:19] Writing local files
starting mdrun '66728 p3065_lambda5_99sb_big'
2500000 steps,   5000.0 ps.

[13:46:19] Extra SSE boost OK.
[13:46:19] 
[13:46:19] Extra SSE boost OK.
[13:46:19] Writing local files
[13:46:19] Completed 0 out of 2500000 steps  (0 percent)


Maybe I just got unlucky with the timing of the shutdown?

This problem may have nothing to do with the WU itself and may instead be caused by the 2.04 core, so if a mod would like to move this thread somewhere more appropriate, please do.
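
For anyone who wants to check their own console output for the same failure, here is a rough Python sketch that pulls out the markers seen in the log above: the GROMACS "Checkpoint error on step" line, the client's "CoreStatus" line, and any non-zero per-process return code. The function name `scan_log` and the result keys are made up for illustration, and the patterns are copied only from this one log, so other client or core versions may word these messages differently.

```python
import re

# Patterns copied from the log lines in this post. Assumption: other
# client/core versions may format these messages differently.
CHECKPOINT_ERR = re.compile(r"Checkpoint error on step (\d+)")
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")
RETURN_CODE = re.compile(r"\[\d+\](\d+):Return code = (\d+)")


def scan_log(text):
    """Collect failure markers from a block of client console output.

    Hypothetical helper for illustration; not part of any FAH tool.
    """
    found = {
        "checkpoint_error_steps": [],  # step numbers from GROMACS fatal errors
        "core_status": [],             # (hex string, decimal value) pairs
        "nonzero_returns": [],         # (process rank, return code) pairs
    }
    for line in text.splitlines():
        m = CHECKPOINT_ERR.search(line)
        if m:
            found["checkpoint_error_steps"].append(int(m.group(1)))
        m = CORE_STATUS.search(line)
        if m:
            found["core_status"].append((m.group(1), int(m.group(2))))
        m = RETURN_CODE.search(line)
        if m and int(m.group(2)) != 0:
            found["nonzero_returns"].append((int(m.group(1)), int(m.group(2))))
    return found


# A few lines taken verbatim from the log above.
sample = """\
Checkpoint error on step 19835020
[0]0:Return code = 255
[0]1:Return code = 0, signaled with Quit
[13:44:22] CoreStatus = FF (255)
"""
result = scan_log(sample)
print(result)
```

This only reports what the log says. It does not explain why the checkpoint could not be restored.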

Thanks,

Mike