Normal thekraken Behavior? [Yes]

Posted: Sat Mar 17, 2012 3:53 pm
by patonb
Just noticed that when running bigadv, my client starts up, then about 30ish minutes into it the core reboots and continues on as normal.

Code:

 :42:03] Called DecompressByteArray: compressed_data_size=57245215 data_size=71846524, decompressed_data_size=71846524 diff=0
[12:42:04] - Digital signature verified
[12:42:04] 
[12:42:04] Project: 6903 (Run 4, Clone 4, Gen 50)
[12:42:04] 
[12:42:04] Assembly optimizations on if available.
[12:42:04] Entering M.D.
[12:42:13] Mapping NT from 24 to 24 
[12:42:44] Completed 0 out of 250000 steps  (0%)
[13:14:13] ng M.D.
[13:14:20] Using Gromacs checkpoints
[13:14:28] Mapping NT from 24 to 24 
[13:15:12] Resuming from checkpoint
[13:15:15] Verified work/wudata_07.log
[13:15:16] Verified work/wudata_07.trr
[13:15:16] Verified work/wudata_07.xtc
[13:15:16] Verified work/wudata_07.edr
[13:15:29] Completed 1360 out of 250000 steps  (0%)
[13:40:02] Completed 2500 out of 250000 steps  (1%)
[14:33:21] Completed 5000 out of 250000 steps  (2%)
[15:26:17] Completed 7500 out of 250000 steps  (3%)

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 4:17 pm
by MtM
You've got too small a core snippet there :) Show the last frame completed before the restart, the restart itself, and the first frame after it. The snippet here shows only that it restarted at 1360 steps.

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 5:54 pm
by patonb
That's the entire log. It's a bigadv unit. It started chugging at 0%.

The restart is right there where it re-enters M.D. and maps the cores again. Notice it's there twice. I know it reboots because my system shows the CPU drop to 0%, then after a few minutes it pegs back to near 100% and spits out the checkpoint stuff.
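
For anyone who would rather pull the restart out of the log than spot it by eye, here is a rough Python sketch. It assumes the "[HH:MM:SS] message" layout seen in the logs in this thread and that the "Resuming from checkpoint" line is intact; the function name find_restarts is made up for illustration and is not part of the client or thekraken.

```python
import re

# Matches client log lines such as "[13:15:12] Resuming from checkpoint".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*(.*)$")
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def find_restarts(log_text):
    """Return (resume_time, first_step_reported) for each checkpoint resume."""
    restarts = []
    pending = None  # timestamp of a resume still waiting for its step line
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw.strip())
        if line is None:
            continue  # blank or untimestamped line (e.g. GROMACS banner text)
        stamp, message = line.groups()
        if "Resuming from checkpoint" in message:
            pending = stamp
            continue
        step = STEP_RE.search(message)
        if pending is not None and step is not None:
            restarts.append((pending, int(step.group(1))))
            pending = None
    return restarts


# Lines taken from the log posted at the top of this thread.
sample = """\
[12:42:44] Completed 0 out of 250000 steps  (0%)
[13:14:20] Using Gromacs checkpoints
[13:15:12] Resuming from checkpoint
[13:15:16] Verified work/wudata_07.edr
[13:15:29] Completed 1360 out of 250000 steps  (0%)
[13:40:02] Completed 2500 out of 250000 steps  (1%)
"""
print(find_restarts(sample))
```

If the resume line is truncated in the log, as sometimes happens, this sketch will not see that restart.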

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 6:19 pm
by ChelseaOilman
Well, not quite an entire log. I can't tell if you're running with the -verbosity 9 flag and we're seeing everything. I also can't tell if you're running tear's Kraken. If you're not, I would. What I see looks normal, though.

Here's the start of a 6903 WU on my 4P machine:
[09:27:52] Project: 6903 (Run 3, Clone 4, Gen 53)
[09:27:52]
[09:27:52] Assembly optimizations on if available.
[09:27:52] Entering M.D.
[09:28:00] Mapping NT from 48 to 48
[09:28:04] Completed 0 out of 250000 steps (0%)
[09:41:10] Completed 2500 out of 250000 steps (1%)
[09:46:06] int
[09:46:21] Verified work/wudata_07.log
[09:46:22] Verified work/wudata_07.trr
[09:46:22] Verified work/wudata_07.xtc
[09:46:22] Verified work/wudata_07.edr
[09:46:22] Completed 2900 out of 250000 steps (1%)
[09:56:26] Completed 5000 out of 250000 steps (2%)

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 6:55 pm
by patonb
Okay... Yeah, I didn't set verb 9, and I definitely UNLEASHED THEKRAKEN!

Gotta love it, you're three times as fast! Damn 4P.

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 7:29 pm
by ChelseaOilman
If you look in the terminal window you'll see more info than what prints out in the log. You can see what's happening during those pauses.
[10:38:29] Folding@Home Gromacs SMP Core
[10:38:29] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[10:38:29]
[10:38:29] Preparing to commence simulation
[10:38:29] - Assembly optimizations manually forced on.
[10:38:29] - Not checking prior termination.
[10:38:36] - Expanded 57239090 -> 71846524 (decompressed 50.4 percent)
[10:38:36] Called DecompressByteArray: compressed_data_size=57239090 data_size=71846524, decompressed_data_size=71846524 diff=0
[10:38:36] - Digital signature verified
[10:38:36]
[10:38:36] Project: 6903 (Run 4, Clone 19, Gen 38)
[10:38:36]
[10:38:36] Assembly optimizations on if available.
[10:38:36] Entering M.D.
:-) G R O M A C S (-:

Groningen Machine for Chemical Simulation

:-) VERSION 4.5.3 (-:

Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,

Berk Hess, David van der Spoel, and Erik Lindahl.

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2010, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.


:-) Gromacs (-:

Reading file work/wudata_06.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
[10:38:45] Mapping NT from 48 to 48
Starting 48 threads
Making 2D domain decomposition 8 x 6 x 1
starting mdrun 'Overlay'
9750000 steps, 39000.0 ps (continuing from step 9500000, 38000.0 ps).
[10:38:50] Completed 0 out of 250000 steps (0%)
[10:52:36] Completed 2500 out of 250000 steps (1%)
:-) G R O M A C S (-:

Groningen Machine for Chemical Simulation

:-) VERSION 4.5.3 (-:

Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,

Berk Hess, David van der Spoel, and Erik Lindahl.

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2010, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.


:-) Gromacs (-:

Reading file work/wudata_06.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
Starting 48 threads

Reading checkpoint file work/wudata_06.cpt generated: Sat Mar 17 04:53:52 2012


Making 2D domain decomposition 8 x 6 x 1
starting mdrun 'Overlay'
9750000 steps, 39000.0 ps (continuing from step 9502730, 38010.9 ps).
[10:56:33] int
[10:57:28] Verified work/wudata_06.log
[10:57:29] Verified work/wudata_06.trr
[10:57:29] Verified work/wudata_06.xtc
[10:57:29] Verified work/wudata_06.edr
[10:57:30] Completed 2730 out of 250000 steps (1%)

NOTE: Turning on dynamic load balancing

[11:09:18] Completed 5000 out of 250000 steps (2%)
[11:22:24] Completed 7500 out of 250000 steps (3%)
4 x 6174 CPUs @ 2,519 MHz with tears OC BIOS

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 9:04 pm
by Grandpa_01
It is normal behaviour for the kraken, at least it is on my 4P rigs. I am not sure whether the kraken, fah, or linux is turning on DLB, but I do know that when DLB turns on and the kraken is running, fah restarts; fah does not restart when DLB starts if the kraken is not running. I am guessing that the kraken has to re-wrap the core, but I do not know. Perhaps tear can better answer what is happening if he comes around. I do know that your time per frame will drop when both of them are running. :D

Re: Normal Behavior?

Posted: Sat Mar 17, 2012 10:13 pm
by patonb
As long as it's normal, then I'm good.

Re: Normal Behavior?

Posted: Sun Mar 18, 2012 12:17 am
by musky
This is the correct behavior with thekraken using the autorestart functionality. The idea is that the core restarts after the first checkpoint is written, which usually causes dynamic load balancing (DLB) to engage. DLB makes a significant difference in performance. If you installed thekraken with "thekraken -c autorestart=1 -i", that is what is going on. You can verify by going into your folding directory and typing "cat thekraken.cfg". If you see "autorestart=1" at the bottom of that file, autorestart is enabled.
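
The same check can be scripted. This is only a sketch: it assumes thekraken.cfg is a plain text file of key=value lines, which is inferred from the "autorestart=1" line mentioned above rather than from any documentation, and the helper name autorestart_enabled is hypothetical.

```python
def autorestart_enabled(cfg_text):
    """Return True if the last autorestart=... line in cfg_text is set to 1.

    Assumes a simple key=value layout; lines without '=' are ignored.
    """
    value = None
    for line in cfg_text.splitlines():
        key, sep, val = line.strip().partition("=")
        if sep and key.strip() == "autorestart":
            value = val.strip()  # a later line overrides an earlier one
    return value == "1"


# Example with made-up file contents ("somekey" is a placeholder, not a real setting).
print(autorestart_enabled("somekey=0\nautorestart=1\n"))
```

To run it against a real install, read the file first, for example with open("thekraken.cfg").read() from inside the folding directory.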

Re: Normal Behavior?

Posted: Sun Mar 18, 2012 12:45 am
by patonb
Yup, it was your guide in the vbox, so if it's wrong it's your fault.

Re: Normal Behavior?

Posted: Sun Mar 18, 2012 1:18 am
by bollix47
Seems normal.

From thekraken guide:

Code:

6.3. Autorestart feature

        Background: GROMACS employs Dynamic Load Balancing (DLB)
        feature that aims at improving performance.

        GROMACS configuration used by FahCores enables DLB the moment
        cumulative performance loss due to load imbalance exceeds 5%.

        When enabled, DLB reduces times of bigadv units by noticeable
        amount of time. Reports include reduction of 30s with P6903
        and 45 seconds with P6904 (sometimes more).

        Depending on WU and system configuration (or even system state),
        DLB gets enabled in a way that may appear random (sometimes it's
        several minutes into WU; at other times it may be as late
        as 90% into WU, sometimes it doesn't engage at all).

        It has been determined that restarting WU from a checkpoint
        significantly increases probability of almost-instantaneous
        DLB engagement (with P6903 and P6904 units).

        Autorestart feature, when enabled, makes The Kraken restart
        FahCore upon completed write of first checkpoint (15 minutes
        in typical configuration).

        To enable autorestart feature add '-c autorestart=1' parameter
        to the command line, when installing, e.g. 'thekraken -i -c autorestart=1'.
        If already installed, uninstall, then install with '-c autorestart=1'.
        Stopping the client is not required.

        NOTE: when enabled, FahCore will appear to have "started twice"
              or restarted without user interaction; this is expected
              and normal

        NOTE: autorestart feature isn't guaranteed; DLB may not always engage

        NOTE: DLB engagement on units other than P6903 and P6904
              is rare


Re: Normal thekraken Behavior? [Yes]

Posted: Mon Apr 01, 2013 5:46 pm
by Gehacktesmacher
Hi!

I am running Ubuntu 12.04 LTS with thekraken on a 4P system.
I've got a strange issue when WUs finish.

Code:

[14:48:46] Completed 235000 out of 250000 steps  (94%)
[14:59:00] Completed 237500 out of 250000 steps  (95%)
[15:09:14] Completed 240000 out of 250000 steps  (96%)
[15:19:30] Completed 242500 out of 250000 steps  (97%)
[15:29:44] Completed 245000 out of 250000 steps  (98%)
[15:39:58] Completed 247500 out of 250000 steps  (99%)
[15:50:16] Completed 250000 out of 250000 steps  (100%)
[15:50:31] DynamicWrapper: Finished Work Unit: sleep=10000
[15:50:41]
[15:50:41] Finished Work Unit:
[15:50:41] - Reading up to 64407792 from "work/wudata_01.trr": Read 64407792
[15:50:42] trr file hash check passed.
[15:50:42] - Reading up to 31686692 from "work/wudata_01.xtc": Read 31686692
[15:50:42] xtc file hash check passed.
[15:50:42] edr file hash check passed.
[15:50:42] logfile size: 188597
[15:50:42] Leaving Run
[15:50:43] - Writing 96443957 bytes of core data to disk...
[15:51:12] Done: 96443445 -> 91694831 (compressed to 6.0 percent)
[15:51:12]   ... Done.
[16:38:01] - Shutting down core
[16:38:01]
[16:38:01] Folding@home Core Shutdown: FINISHED_UNIT
[16:43:41] CoreStatus = 64 (100)
[16:43:41] Unit 1 finished with 81 percent of time to deadline remaining.
[16:43:41] Updated performance fraction: 0.852459
[16:43:41] Sending work to server
[16:43:41] Project: 8103 (Run 1, Clone 69, Gen 4)


[16:43:41] + Attempting to send results [April 1 16:43:41 UTC]
[16:43:41] - Reading file work/wuresults_01.dat from core
[16:43:41]   (Read 91695343 bytes from disk)
[16:43:41] Connecting to http://128.143.231.201:8080/
[16:46:28] - Couldn't send HTTP request to server
[16:46:28] + Could not connect to Work Server (results)
[16:46:28]     (128.143.231.201:8080)
[16:46:28] + Retrying using alternative port
[16:46:28] Connecting to http://128.143.231.201:80/
[16:59:14] Posted data.
[16:59:14] Initial: 0000; + Results successfully sent
[16:59:14] Thank you for your contribution to Folding@Home.
[16:59:14] + Number of Units Completed: 3

[17:16:24] Trying to send all finished work units
[17:16:24] + No unsent completed units remaining.
[17:16:24] - Preparing to get new work unit...
[17:16:24] Cleaning up work directory
[17:16:24] + Attempting to get work packet
[17:16:24] Passkey found
[17:16:24] - Will indicate memory of 64346 MB
[17:16:24] - Connecting to assignment server
[17:16:24] Connecting to http://assign.stanford.edu:8080/
[17:16:25] Posted data.
[17:16:25] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[17:16:26] + News From Folding@Home: Welcome to Folding@Home
[17:16:26] Loaded queue successfully.
[17:16:26] Sent data
[17:16:26] Connecting to http://128.143.199.96:8080/
[17:16:27] Posted data.
[17:16:27] Initial: 0000; - Receiving payload (expected size: 1764558)
[17:16:29] - Downloaded at ~861 kB/s
[17:16:29] - Averaged speed for that direction ~1059 kB/s
[17:16:29] + Received work.
[17:16:29] Trying to send all finished work units
[17:16:29] + No unsent completed units remaining.
[17:16:29] + Closed connections
I don't have any idea why it takes almost an hour between "[15:51:12] ... Done." and "[16:38:01] - Shutting down core",
and about 15 minutes between "[16:59:14] + Number of Units Completed: 3" and "[17:16:24] + Attempting to get work packet".

I switched from Ubuntu 12.04 running on ESXi to a native Ubuntu installation. While I was running Ubuntu in a VM I didn't see this problem.
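
Stalls like these are easy to miss in a long log, so here is a small Python sketch that lists every pause longer than a chosen threshold between consecutive timestamped lines. It assumes the "[HH:MM:SS]" prefix used in the log above; the function name gaps_longer_than and the 15-minute default are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

STAMP_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")


def gaps_longer_than(log_text, minutes=15):
    """Return (previous_stamp, next_stamp, gap) for each pause over the threshold."""
    threshold = timedelta(minutes=minutes)
    found = []
    prev_time = prev_stamp = None
    for raw in log_text.splitlines():
        match = STAMP_RE.match(raw.strip())
        if match is None:
            continue  # skip blank or untimestamped lines
        stamp = match.group(1)
        now = datetime.strptime(stamp, "%H:%M:%S")
        if prev_time is not None:
            gap = now - prev_time
            if gap < timedelta(0):
                gap += timedelta(days=1)  # the log rolled past midnight
            if gap > threshold:
                found.append((prev_stamp, stamp, gap))
        prev_time, prev_stamp = now, stamp
    return found


# Four lines taken from the log above.
sample = """\
[15:51:12]   ... Done.
[16:38:01] - Shutting down core
[16:59:14] + Number of Units Completed: 3
[17:16:24] Trying to send all finished work units
"""
for before, after, gap in gaps_longer_than(sample, minutes=30):
    print(before, "->", after, gap)
```

With the 30-minute threshold only the pause between "Done." and "Shutting down core" is reported; lowering the threshold also picks up the shorter waits.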

Re: Normal thekraken Behavior? [Yes]

Posted: Mon Apr 01, 2013 6:17 pm
by PantherX
What is the file system that you are using?

Re: Normal thekraken Behavior? [Yes]

Posted: Mon Apr 01, 2013 8:21 pm
by Nathan_P
By the looks of things, something with barriers enabled. There was a piece of code that you could run to disable them, but I can't find it.

Re: Normal thekraken Behavior? [Yes]

Posted: Mon Apr 01, 2013 8:24 pm
by bollix47
Perhaps this one?

That auto-fix only works if the current file system is ext3 or ext4. If it's something else, you would have to edit /etc/fstab manually and add barrier=0 to the options for the disk containing the folding files.
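
To show what that manual edit amounts to, here is a hedged Python sketch that adds barrier=0 to the options column (the fourth field) of a single fstab line. It works on a string only and does not touch /etc/fstab. The function name add_barrier_option and the UUID and mount point in the example are made up.

```python
def add_barrier_option(fstab_line, option="barrier=0"):
    """Return fstab_line with `option` appended to its mount options.

    An fstab entry has six whitespace-separated fields:
    device, mount point, fs type, options, dump, pass.
    Comments, short lines, and lines that already set a barrier
    option are returned unchanged.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line
    options = fields[3].split(",")
    if any(o in ("barrier", "nobarrier") or o.startswith("barrier=") for o in options):
        return fstab_line  # an explicit barrier setting is already present
    options.append(option)
    fields[3] = ",".join(options)
    return " ".join(fields)  # note: column alignment collapses to single spaces


# Hypothetical entry for the disk that holds the folding files.
print(add_barrier_option("UUID=1234-abcd /home ext4 defaults 0 2"))
```

For the hypothetical entry this prints "UUID=1234-abcd /home ext4 defaults,barrier=0 0 2". The new option takes effect after a remount or reboot.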