Normal thekraken Behavior? [Yes]

Postby patonb » Sat Mar 17, 2012 3:53 pm

I just noticed that when running bigadv, my client starts up, and then about 30 minutes in, the core restarts and continues on as normal.

Code:
 :42:03] Called DecompressByteArray: compressed_data_size=57245215 data_size=71846524, decompressed_data_size=71846524 diff=0
[12:42:04] - Digital signature verified
[12:42:04]
[12:42:04] Project: 6903 (Run 4, Clone 4, Gen 50)
[12:42:04]
[12:42:04] Assembly optimizations on if available.
[12:42:04] Entering M.D.
[12:42:13] Mapping NT from 24 to 24
[12:42:44] Completed 0 out of 250000 steps  (0%)
[13:14:13] ng M.D.
[13:14:20] Using Gromacs checkpoints
[13:14:28] Mapping NT from 24 to 24
[13:15:12] Resuming from checkpoint
[13:15:15] Verified work/wudata_07.log
[13:15:16] Verified work/wudata_07.trr
[13:15:16] Verified work/wudata_07.xtc
[13:15:16] Verified work/wudata_07.edr
[13:15:29] Completed 1360 out of 250000 steps  (0%)
[13:40:02] Completed 2500 out of 250000 steps  (1%)
[14:33:21] Completed 5000 out of 250000 steps  (2%)
[15:26:17] Completed 7500 out of 250000 steps  (3%)
WooHoo = L5639 @ 3.3Ghz Evga SR-2 6x2gb Corsair XMS3 CM 212+ Corsair 1050hx Blackhawk Ultra EVGA 560ti

Foldie = i7 950@ 4.0Ghz x58a-ud3r 216-216 @ 850/2000 3x2gb OCZ Gold NH-u12 Heatsink Corsair hx520 Antec 900
patonb
 
Posts: 943
Joined: Thu Oct 23, 2008 2:42 am

Re: Normal Behavior?

Postby MtM » Sat Mar 17, 2012 4:17 pm

That log snippet is too small :) Show the last frame completed before the restart, the restart itself, and the first frame after it. The snippet here shows only that it restarted at 1360 steps.
MtM
 
Posts: 3054
Joined: Fri Jun 27, 2008 2:20 pm
Location: The Netherlands

Re: Normal Behavior?

Postby patonb » Sat Mar 17, 2012 5:54 pm

That's the entire log. It's a bigadv unit, and it started chugging at 0%.

The restart is right there where it re-enters M.D. and maps the cores again. Notice it's there twice. I know it restarts because my system shows the CPU drop to 0%, then after a few minutes it pegs back to near 100% and spits out the checkpoint stuff.
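
For anyone who wants to spot this kind of restart without reading the log by eye, here is a minimal Python sketch. It assumes the "Resuming from checkpoint" line appears intact in the log; if that line is cut off, this check will miss the restart. The function name `find_restarts` is made up for illustration, and the sample lines are copied from the log above.

```python
import re

# Matches progress lines such as
# "[13:15:29] Completed 1360 out of 250000 steps  (0%)"
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")


def find_restarts(lines):
    """Return the step counts at which the core resumed from a checkpoint."""
    restarts = []
    resumed = False
    for line in lines:
        if "Resuming from checkpoint" in line:
            resumed = True
            continue
        match = PROGRESS.search(line)
        if match and resumed:
            # The first progress line after a resume reports the checkpoint step.
            restarts.append(int(match.group(1)))
            resumed = False
    return restarts


log = """\
[12:42:13] Mapping NT from 24 to 24
[12:42:44] Completed 0 out of 250000 steps  (0%)
[13:14:28] Mapping NT from 24 to 24
[13:15:12] Resuming from checkpoint
[13:15:29] Completed 1360 out of 250000 steps  (0%)
[13:40:02] Completed 2500 out of 250000 steps  (1%)
""".splitlines()

print(find_restarts(log))  # [1360]
```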

Re: Normal Behavior?

Postby ChelseaOilman » Sat Mar 17, 2012 6:19 pm

Well, not quite an entire log. I can't tell if you're running with the -verbosity 9 flag and we're seeing everything. I also can't tell if you're running tear's Kraken. If you're not, I would. What I see looks normal, though.

Here's the start of a 6903 WU on my 4P machine:
[09:27:52] Project: 6903 (Run 3, Clone 4, Gen 53)
[09:27:52]
[09:27:52] Assembly optimizations on if available.
[09:27:52] Entering M.D.
[09:28:00] Mapping NT from 48 to 48
[09:28:04] Completed 0 out of 250000 steps (0%)
[09:41:10] Completed 2500 out of 250000 steps (1%)
[09:46:06] int
[09:46:21] Verified work/wudata_07.log
[09:46:22] Verified work/wudata_07.trr
[09:46:22] Verified work/wudata_07.xtc
[09:46:22] Verified work/wudata_07.edr
[09:46:22] Completed 2900 out of 250000 steps (1%)
[09:56:26] Completed 5000 out of 250000 steps (2%)
ChelseaOilman
 
Posts: 1646
Joined: Sun Dec 02, 2007 3:47 pm
Location: Colorado @ 10,000 feet

Re: Normal Behavior?

Postby patonb » Sat Mar 17, 2012 6:55 pm

Okay... Yeah, I didn't set -verbosity 9, and I definitely UNLEASHED THEKRAKEN!

Gotta love it, you're three times as fast! Damn 4P.

Re: Normal Behavior?

Postby ChelseaOilman » Sat Mar 17, 2012 7:29 pm

If you look in the terminal window you'll see more info than what prints out in the log. You can see what's happening during those pauses.
[10:38:29] Folding@Home Gromacs SMP Core
[10:38:29] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[10:38:29]
[10:38:29] Preparing to commence simulation
[10:38:29] - Assembly optimizations manually forced on.
[10:38:29] - Not checking prior termination.
[10:38:36] - Expanded 57239090 -> 71846524 (decompressed 50.4 percent)
[10:38:36] Called DecompressByteArray: compressed_data_size=57239090 data_size=71846524, decompressed_data_size=71846524 diff=0
[10:38:36] - Digital signature verified
[10:38:36]
[10:38:36] Project: 6903 (Run 4, Clone 19, Gen 38)
[10:38:36]
[10:38:36] Assembly optimizations on if available.
[10:38:36] Entering M.D.
:-) G R O M A C S (-:

Groningen Machine for Chemical Simulation

:-) VERSION 4.5.3 (-:

Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,

Berk Hess, David van der Spoel, and Erik Lindahl.

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2010, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.


:-) Gromacs (-:

Reading file work/wudata_06.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
[10:38:45] Mapping NT from 48 to 48
Starting 48 threads
Making 2D domain decomposition 8 x 6 x 1
starting mdrun 'Overlay'
9750000 steps, 39000.0 ps (continuing from step 9500000, 38000.0 ps).
[10:38:50] Completed 0 out of 250000 steps (0%)
[10:52:36] Completed 2500 out of 250000 steps (1%)
:-) G R O M A C S (-:

Groningen Machine for Chemical Simulation

:-) VERSION 4.5.3 (-:

Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,

Berk Hess, David van der Spoel, and Erik Lindahl.

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2010, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.


:-) Gromacs (-:

Reading file work/wudata_06.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
Starting 48 threads

Reading checkpoint file work/wudata_06.cpt generated: Sat Mar 17 04:53:52 2012


Making 2D domain decomposition 8 x 6 x 1
starting mdrun 'Overlay'
9750000 steps, 39000.0 ps (continuing from step 9502730, 38010.9 ps).
[10:56:33] int
[10:57:28] Verified work/wudata_06.log
[10:57:29] Verified work/wudata_06.trr
[10:57:29] Verified work/wudata_06.xtc
[10:57:29] Verified work/wudata_06.edr
[10:57:30] Completed 2730 out of 250000 steps (1%)

NOTE: Turning on dynamic load balancing

[11:09:18] Completed 5000 out of 250000 steps (2%)
[11:22:24] Completed 7500 out of 250000 steps (3%)
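
To put numbers on the frame times before and after the restart, the progress lines can be parsed with a short script. This is only a sketch: the function name `frame_times` and the 2500-step frame size are assumptions read off the log above, and the first frame also includes core startup, so the comparison is rough.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[10:52:36] Completed 2500 out of 250000 steps (1%)"
LINE = re.compile(r"\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of \d+ steps")


def frame_times(lines, frame_steps=2500):
    """Return (end_step, seconds) for each full frame between progress lines."""
    points = []
    for line in lines:
        m = LINE.search(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%H:%M:%S")
            points.append((stamp, int(m.group(2))))

    result = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if s1 - s0 != frame_steps:
            continue  # skip partial frames, e.g. around the restart
        delta = t1 - t0
        if delta < timedelta(0):
            delta += timedelta(days=1)  # the log crossed midnight
        result.append((s1, int(delta.total_seconds())))
    return result


log = """\
[10:38:50] Completed 0 out of 250000 steps (0%)
[10:52:36] Completed 2500 out of 250000 steps (1%)
[10:57:30] Completed 2730 out of 250000 steps (1%)
[11:09:18] Completed 5000 out of 250000 steps (2%)
[11:22:24] Completed 7500 out of 250000 steps (3%)
""".splitlines()

print(frame_times(log))  # [(2500, 826), (7500, 786)]
```

On these lines the frame before the restart took 826 seconds and the first full frame after it took 786 seconds.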


4 x 6174 CPUs @ 2,519 MHz with tears OC BIOS

Re: Normal Behavior?

Postby Grandpa_01 » Sat Mar 17, 2012 9:04 pm

It is normal behaviour for thekraken, at least on my 4P rigs. I am not sure whether thekraken, FAH, or Linux is turning on DLB, but I do know that when DLB turns on while thekraken is running, FAH restarts, and that FAH does not restart when DLB starts if thekraken is not running. I am guessing that thekraken has to re-wrap the core, but I do not know; perhaps tear can better explain what is happening if he comes around. I do know that your time per frame will drop when both of them are running. :D
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
Grandpa_01
 
Posts: 1757
Joined: Wed Mar 04, 2009 7:36 am

Re: Normal Behavior?

Postby patonb » Sat Mar 17, 2012 10:13 pm

As long as it's normal, then I'm good.

Re: Normal Behavior?

Postby musky » Sun Mar 18, 2012 12:17 am

This is the correct behavior for thekraken when the autorestart functionality is enabled. The idea is that the core restarts after the first checkpoint is written, which usually causes dynamic load balancing (DLB) to engage. DLB makes a significant difference in performance. If you installed thekraken with "thekraken -c autorestart=1 -i", that is what is going on. You can verify by going into your folding directory and typing "cat thekraken.cfg". If you see "autorestart=1" at the bottom of that file, autorestart is enabled.
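
If you would rather script that check, here is a minimal Python sketch. It assumes only what is described above: a file named thekraken.cfg in the folding directory with a line reading "autorestart=1". The rest of the file format is not known here, and both function names are made up for illustration.

```python
from pathlib import Path


def autorestart_enabled(cfg_text):
    """Return True if any line of the config text is exactly 'autorestart=1'."""
    return any(line.strip() == "autorestart=1" for line in cfg_text.splitlines())


def check_folding_dir(folding_dir):
    """Check thekraken.cfg in the given directory.

    Returns True or False, or None if the file is not there.
    """
    cfg = Path(folding_dir) / "thekraken.cfg"
    if not cfg.is_file():
        return None
    return autorestart_enabled(cfg.read_text())


# Run from inside the folding directory.
print(check_folding_dir("."))
```

A result of None just means no thekraken.cfg was found in that directory.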
musky
 
Posts: 13
Joined: Wed Aug 11, 2010 1:17 am

Re: Normal Behavior?

Postby patonb » Sun Mar 18, 2012 12:45 am

Yup, it was your guide in the vbox, so if it's wrong, it's your fault.

Re: Normal Behavior?

Postby bollix47 » Sun Mar 18, 2012 1:18 am

Seems normal.

From thekraken guide:

Code:
6.3. Autorestart feature

        Background: GROMACS employs Dynamic Load Balancing (DLB)
        feature that aims at improving performance.

        GROMACS configuration used by FahCores enables DLB the moment
        cumulative performance loss due to load imbalance exceeds 5%.

        When enabled, DLB reduces times of bigadv units by noticeable
        amount of time. Reports include reduction of 30s with P6903
        and 45 seconds with P6904 (sometimes more).

        Depending on WU and system configuration (or even system state),
        DLB gets enabled in a way that may appear random (sometimes it's
        several minutes into WU; at other times it may be as late
        as 90% into WU, sometimes it doesn't engage at all).

        It has been determined that restarting WU from a checkpoint
        significantly increases probability of almost-instantaneous
        DLB engagement (with P6903 and P6904 units).

        Autorestart feature, when enabled, makes The Kraken restart
        FahCore upon completed write of first checkpoint (15 minutes
        in typical configuration).

        To enable autorestart feature add '-c autorestart=1' parameter
        to the command line, when installing, e.g. 'thekraken -i -c autorestart=1'.
        If already installed, uninstall, then install with '-c autorestart=1'.
        Stopping the client is not required.

        NOTE: when enabled, FahCore will appear to have "started twice"
              or restarted without user interaction; this is expected
              and normal

        NOTE: autorestart feature isn't guaranteed; DLB may not always engage

        NOTE: DLB engagement on units other than P6903 and P6904
              is rare

bollix47
 
Posts: 3475
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Normal thekraken Behavior? [Yes]

Postby Gehacktesmacher » Mon Apr 01, 2013 5:46 pm

Hi!

I am running Ubuntu 12.04 LTS with thekraken on a 4P system.
I get a strange issue when WUs finish.

Code:
[14:48:46] Completed 235000 out of 250000 steps  (94%)
[14:59:00] Completed 237500 out of 250000 steps  (95%)
[15:09:14] Completed 240000 out of 250000 steps  (96%)
[15:19:30] Completed 242500 out of 250000 steps  (97%)
[15:29:44] Completed 245000 out of 250000 steps  (98%)
[15:39:58] Completed 247500 out of 250000 steps  (99%)
[15:50:16] Completed 250000 out of 250000 steps  (100%)
[15:50:31] DynamicWrapper: Finished Work Unit: sleep=10000
[15:50:41]
[15:50:41] Finished Work Unit:
[15:50:41] - Reading up to 64407792 from "work/wudata_01.trr": Read 64407792
[15:50:42] trr file hash check passed.
[15:50:42] - Reading up to 31686692 from "work/wudata_01.xtc": Read 31686692
[15:50:42] xtc file hash check passed.
[15:50:42] edr file hash check passed.
[15:50:42] logfile size: 188597
[15:50:42] Leaving Run
[15:50:43] - Writing 96443957 bytes of core data to disk...
[15:51:12] Done: 96443445 -> 91694831 (compressed to 6.0 percent)
[15:51:12]   ... Done.
[16:38:01] - Shutting down core
[16:38:01]
[16:38:01] Folding@home Core Shutdown: FINISHED_UNIT
[16:43:41] CoreStatus = 64 (100)
[16:43:41] Unit 1 finished with 81 percent of time to deadline remaining.
[16:43:41] Updated performance fraction: 0.852459
[16:43:41] Sending work to server
[16:43:41] Project: 8103 (Run 1, Clone 69, Gen 4)


[16:43:41] + Attempting to send results [April 1 16:43:41 UTC]
[16:43:41] - Reading file work/wuresults_01.dat from core
[16:43:41]   (Read 91695343 bytes from disk)
[16:43:41] Connecting to http://128.143.231.201:8080/
[16:46:28] - Couldn't send HTTP request to server
[16:46:28] + Could not connect to Work Server (results)
[16:46:28]     (128.143.231.201:8080)
[16:46:28] + Retrying using alternative port
[16:46:28] Connecting to http://128.143.231.201:80/
[16:59:14] Posted data.
[16:59:14] Initial: 0000; + Results successfully sent
[16:59:14] Thank you for your contribution to Folding@Home.
[16:59:14] + Number of Units Completed: 3

[17:16:24] Trying to send all finished work units
[17:16:24] + No unsent completed units remaining.
[17:16:24] - Preparing to get new work unit...
[17:16:24] Cleaning up work directory
[17:16:24] + Attempting to get work packet
[17:16:24] Passkey found
[17:16:24] - Will indicate memory of 64346 MB
[17:16:24] - Connecting to assignment server
[17:16:24] Connecting to http://assign.stanford.edu:8080/
[17:16:25] Posted data.
[17:16:25] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[17:16:26] + News From Folding@Home: Welcome to Folding@Home
[17:16:26] Loaded queue successfully.
[17:16:26] Sent data
[17:16:26] Connecting to http://128.143.199.96:8080/
[17:16:27] Posted data.
[17:16:27] Initial: 0000; - Receiving payload (expected size: 1764558)
[17:16:29] - Downloaded at ~861 kB/s
[17:16:29] - Averaged speed for that direction ~1059 kB/s
[17:16:29] + Received work.
[17:16:29] Trying to send all finished work units
[17:16:29] + No unsent completed units remaining.
[17:16:29] + Closed connections


I have no idea why it takes about 47 minutes between "[15:51:12] ... Done." and "[16:38:01] - Shutting down core",
and about 17 minutes between "[16:59:14] + Number of Units Completed: 3" and "[17:16:24] + Attempting to get work packet".

I switched from Ubuntu 12.04 running on ESXi to a native Ubuntu installation. While I was running Ubuntu in a VM, I did not get this problem.
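
To find pauses like these in a long log without scanning it by hand, something like the following Python sketch could be used. The function name `find_gaps` and the 15-minute threshold are arbitrary choices for illustration. The sample lines are copied from the log above, picked so that each pair of neighbours is also adjacent in time in the full log.

```python
import re
from datetime import datetime, timedelta

# Matches the "[HH:MM:SS]" stamp at the start of a client log line.
STAMP = re.compile(r"^\[(\d\d:\d\d:\d\d)\]")


def find_gaps(lines, threshold=timedelta(minutes=15)):
    """Return (gap, line_before, line_after) for each pause of at least `threshold`."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # blank or unstamped line
        stamp = datetime.strptime(m.group(1), "%H:%M:%S")
        if prev_time is not None:
            delta = stamp - prev_time
            if delta < timedelta(0):
                delta += timedelta(days=1)  # the log crossed midnight
            if delta >= threshold:
                gaps.append((delta, prev_line, line))
        prev_time, prev_line = stamp, line
    return gaps


sample = """\
[15:51:12]   ... Done.
[16:38:01] - Shutting down core
[16:43:41] CoreStatus = 64 (100)
[16:46:28] - Couldn't send HTTP request to server
[16:59:14] + Number of Units Completed: 3
[17:16:24] Trying to send all finished work units
""".splitlines()

for gap, before, after in find_gaps(sample):
    print(gap, "|", before, "->", after)
```

On this sample it reports two pauses, 0:46:49 and 0:17:10. The 12-minute stretch while the results were uploading falls under the threshold.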
Gehacktesmacher
 
Posts: 7
Joined: Mon Aug 02, 2010 12:18 am

Re: Normal thekraken Behavior? [Yes]

Postby PantherX » Mon Apr 01, 2013 6:17 pm

What is the file system that you are using?
PantherX
Site Moderator
 
Posts: 6321
Joined: Wed Dec 23, 2009 9:33 am

Re: Normal thekraken Behavior? [Yes]

Postby Nathan_P » Mon Apr 01, 2013 8:21 pm

By the looks of things, it is a file system with barriers enabled. There was a script that you could run to disable them, but I can't find it.
Nathan_P
 
Posts: 1423
Joined: Wed Apr 01, 2009 9:22 pm
Location: Jersey, Channel islands

Re: Normal thekraken Behavior? [Yes]

Postby bollix47 » Mon Apr 01, 2013 8:24 pm

Perhaps this one?

Using that auto-fix will only work if the current file system is ext3 or ext4. If it's something else then you would have to edit /etc/fstab manually and add barrier=0 to the options for the disk containing the folding files.
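
For the manual route, here is a rough Python sketch of what the edit to one /etc/fstab entry looks like. It only builds the changed line from a string and does not touch /etc/fstab itself. The function name `disable_barriers` and the device and mount point in the example are hypothetical. Whether barrier=0 is accepted depends on the file system, so check before relying on it, and the change only takes effect after the file system is remounted.

```python
def disable_barriers(fstab_line):
    """Return the fstab entry with barrier=0 in its options field.

    Comments, blank lines, and incomplete entries are returned unchanged.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line
    # The fourth field holds the comma-separated mount options.
    options = [o for o in fields[3].split(",") if not o.startswith("barrier=")]
    options.append("barrier=0")
    fields[3] = ",".join(options)
    return "\t".join(fields)


# Hypothetical entry for the disk holding the folding files.
entry = "/dev/sdb1 /home/fah ext4 defaults 0 2"
print(disable_barriers(entry))
```

Keep a backup copy of /etc/fstab before changing it by hand. Barriers protect the file system against corruption on power loss, so turning them off trades some safety for speed.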
Location: Canada
