130.237.232.237


Postby bollix47 » Mon Jan 16, 2012 2:32 pm

Not absolutely sure, but there appears to be a problem at the server end. The client takes the expected amount of time to upload but never gets an acknowledgement.

I've restarted at least 3 times, always with the same result. The server's net load looks high, so it could be happening to others.

No problem uploading other WUs to other servers.


Code:
[12:15:46] Finished Work Unit:
[12:15:46] - Reading up to 121622496 from "work/wudata_03.trr": Read 121622496
[12:15:47] trr file hash check passed.
[12:15:47] - Reading up to 108809740 from "work/wudata_03.xtc": Read 108809740
[12:15:47] xtc file hash check passed.
[12:15:47] edr file hash check passed.
[12:15:47] logfile size: 202879
[12:15:47] Leaving Run
[12:15:48] - Writing 230808107 bytes of core data to disk...
[12:16:30] Done: 230807595 -> 222471202 (compressed to 3.3 percent)
[12:16:30]   ... Done.
[12:16:54] - Shutting down core
[12:16:54]
[12:16:54] Folding@home Core Shutdown: FINISHED_UNIT
[12:16:56] CoreStatus = 64 (100)
[12:16:56] Unit 3 finished with 82 percent of time to deadline remaining.
[12:16:56] Updated performance fraction: 0.944497
[12:16:56] Sending work to server
[12:16:56] Project: 6903 (Run 11, Clone 14, Gen 39)


[12:16:56] + Attempting to send results [January 16 12:16:56 UTC]
[12:16:56] - Reading file work/wuresults_03.dat from core
[12:16:57]   (Read 222471714 bytes from disk)
[12:16:57] Connecting to http://130.237.232.237:8080/
[12:51:27] ***** Got an Activate signal (2)
[12:51:27] Killing all core threads
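When the upload just sits at "Connecting to ..." like this, a quick external check helps tell a dead server from a hung client. A minimal sketch, assuming nc (netcat) is installed; the IP and ports are the ones from the log above:

```shell
#!/bin/sh
# Probe the work server's two result ports (the ones the client tries),
# so you can separate "server down" from "client hung".
# Assumes 'nc' (netcat) is available; -w 3 caps each attempt at 3 seconds.
probe() {
    host="$1"; port="$2"
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port unreachable"
    fi
}

probe 130.237.232.237 8080
probe 130.237.232.237 80
```

If both ports report unreachable from more than one machine, it's almost certainly the server end rather than your client.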


Re: 130.237.232.237

Postby tear » Mon Jan 16, 2012 3:25 pm

Yeah, it seems it's messed up. Retrieving a unit is impossible as well.

Code:
[15:16:24] Loaded queue successfully.
[15:16:24] Sent data
[15:16:24] Connecting to http://130.237.232.237:8080/



I also received a 1.6 MB unit from that server several hours ago
which was tagged as P6903 (definitely too small). It crashed shortly
after the FahCore launch.

I smell a b0rk. A Darth doody. Sithed pants.

Code:
[08:06:04] Connecting to http://130.237.232.237:8080/
[08:06:05] Posted data.
[08:06:05] Initial: 0000; - Receiving payload (expected size: 1652190)
[08:06:11] - Downloaded at ~268 kB/s
[08:06:11] - Averaged speed for that direction ~303 kB/s
[08:06:11] + Received work.
(...)
[08:06:12] Working on queue slot 08 [January 16 08:06:12 UTC]
[08:06:12] + Working ...
[08:06:12] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 08 -np 48 -checkpoint 15 -forceasm -verbose -lifeline 2272 -version 634'

[08:06:12]
[08:06:12] *------------------------------*
[08:06:12] Folding@Home Gromacs SMP Core
[08:06:12] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[08:06:12]
[08:06:12] Preparing to commence simulation
[08:06:12] - Assembly optimizations manually forced on.
[08:06:12] - Not checking prior termination.
[08:06:12] - Expanded 1651678 -> 12713984 (decompressed 769.7 percent)
[08:06:12] Called DecompressByteArray: compressed_data_size=1651678 data_size=12713984, decompressed_data_size=12713984 diff=0
[08:06:12] - Digital signature verified
[08:06:12]
[08:06:12] Project: 6903 (Run 5, Clone 23, Gen 7)
[08:06:12]
[08:06:12] Assembly optimizations on if available.
[08:06:12] Entering M.D.
[08:06:19] Mapping NT from 48 to 48
(...)


UPDATE: WU return hangs as well -- the same way as at your end, bollix

Re: 130.237.232.237

Postby GreyWhiskers » Mon Jan 16, 2012 4:12 pm

NOVEMBER 14, 2011
Planned changes to "Big Advanced" (BA) projects, effective January 16, 2012

Big Advanced (BA) is an experimental type of Folding@home WU intended for the most powerful machines in FAH. However, as time goes on, technology advances, and the characteristics associated with the most powerful machines change. Due to these advances in hardware capabilities, we will need to periodically change the BA minimum requirements. Thus, we are shortening the deadlines of the BA projects. As a result, assignments will have a 16-core minimum. To give donors some advance warning, we are announcing this now, but the change will take place in 2 months: no earlier than Monday, January 16, 2012.

We understand that any changes to how FAH works are a disruption for donors, and we have been trying to minimize such changes. For that reason, we are not changing the points system at this time.

However, we want to emphasize that the BA program is experimental and that donors should expect changes in the future, potentially without a lot of notice (although we will try our best to give as much notice as we can). In particular, as hardware evolves, it is expected that we will need to change the nature of the BA WUs again in the future.

Posted at 02:57 PM


Could this be a harbinger of PG's announcement on BA's change to BA16? Monday, Jan 16 (today) was announced as the no-earlier-than date for elimination of BA8 and BA12. Looking at server stats this morning, I don't see any BA16 servers obviously identified, but this could be the time.

Re: 130.237.232.237

Postby bollix47 » Mon Jan 16, 2012 7:51 pm

The WU finally uploaded, but the client (running on 12c/24t) was unable to connect to get another bigadv, so I switched to regular SMP and will try again later.

Re: 130.237.232.237

Postby Nathan_P » Mon Jan 16, 2012 8:00 pm

GreyWhiskers wrote:
[the November 14 BA16 announcement and question, quoted in full above]



If you pull up psummary, none of the -bigadv projects are even listed as being available.

Re: 130.237.232.237

Postby jondi_hanluc » Tue Jan 17, 2012 12:14 am

Same problem here on a p6904: it seems to upload, but there's no confirmation message.
I tried restarting the client and it uploaded again, with the same result. I'll just have to leave it and see what happens; not much more I can do.

EDIT: 8 hrs later still stuck in queue :(
Last edited by jondi_hanluc on Tue Jan 17, 2012 8:01 am, edited 2 times in total.

Re: 130.237.232.237

Postby Leonardo » Tue Jan 17, 2012 1:12 am

Same problem here. The computer completed the work unit and sat idle for 45 minutes. I shut down folding, rebooted the machine, and even though it's a 32-core rig, it downloaded an A3 P6903. It completed that WU and promptly downloaded another A3, P6099. I don't think any of my finished work units failed to upload; I've just had no confirmation since the incident I described.

My guess: server issues involving reconfiguration for the new 16-core/shortened-deadline minimum hardware requirements.

Re: 130.237.232.237

Postby rhavern » Tue Jan 17, 2012 6:00 am

Same sort of issue here. Here's the log from the finish through my manually shutting down and restarting. Note that this machine has only been running for two weeks and has never encountered an a3 before.

Code:
[23:53:21] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 9.2 %
 Part of the total run time spent waiting due to load imbalance: 3.9 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time: 153720.126 153720.126    100.0
                       1d18h42:00
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1411.795     74.260      0.559     42.951

Thanx for Using GROMACS - Have a Nice Day

[23:53:42] DynamicWrapper: Finished Work Unit: sleep=10000
[23:53:52]
[23:53:52] Finished Work Unit:
[23:53:52] - Reading up to 121622496 from "work/wudata_05.trr": Read 121622496
[23:53:53] trr file hash check passed.
[23:53:53] - Reading up to 108720676 from "work/wudata_05.xtc": Read 108720676
[23:53:54] xtc file hash check passed.
[23:53:54] edr file hash check passed.
[23:53:54] logfile size: 208162
[23:53:54] Leaving Run
[23:53:57] - Writing 230724326 bytes of core data to disk...
[23:55:01] Done: 230723814 -> 222366262 (compressed to 3.3 percent)
[23:55:02]   ... Done.
[23:55:22] - Shutting down core
[23:55:22]
[23:55:22] Folding@home Core Shutdown: FINISHED_UNIT
[23:55:24] CoreStatus = 64 (100)
[23:55:24] Unit 5 finished with 85 percent of time to deadline remaining.
[23:55:24] Updated performance fraction: 0.779874
[23:55:24] Sending work to server
[23:55:24] Project: 6903 (Run 10, Clone 12, Gen 72)


[23:55:24] + Attempting to send results [January 16 23:55:24 UTC]
[23:55:24] - Reading file work/wuresults_05.dat from core
[23:55:24]   (Read 222366774 bytes from disk)
[23:55:24] Connecting to http://130.237.232.237:8080/
[00:26:31] - Couldn't send HTTP request to server
[00:26:31] + Could not connect to Work Server (results)
[00:26:31]     (130.237.232.237:8080)
[00:26:31] + Retrying using alternative port
[00:26:31] Connecting to http://130.237.232.237:80/
[00:26:31] - Couldn't send HTTP request to server
[00:26:31] + Could not connect to Work Server (results)
[00:26:31]     (130.237.232.237:80)
[00:26:31] - Error: Could not transmit unit 05 (completed January 16) to work server.
[00:26:31] - 1 failed uploads of this unit.
[00:26:31]   Keeping unit 05 in queue.
[00:26:31] Trying to send all finished work units
[00:26:31] Project: 6903 (Run 10, Clone 12, Gen 72)


[00:26:31] + Attempting to send results [January 17 00:26:31 UTC]
[00:26:31] - Reading file work/wuresults_05.dat from core
[00:26:31]   (Read 222366774 bytes from disk)
[00:26:31] Connecting to http://130.237.232.237:8080/
[00:26:32] - Couldn't send HTTP request to server
[00:26:32] + Could not connect to Work Server (results)
[00:26:32]     (130.237.232.237:8080)
[00:26:32] + Retrying using alternative port
[00:26:32] Connecting to http://130.237.232.237:80/
[00:26:32] - Couldn't send HTTP request to server
[00:26:32] + Could not connect to Work Server (results)
[00:26:32]     (130.237.232.237:80)
[00:26:32] - Error: Could not transmit unit 05 (completed January 16) to work server.
[00:26:32] - 2 failed uploads of this unit.


[00:26:32] + Attempting to send results [January 17 00:26:32 UTC]
[00:26:32] - Reading file work/wuresults_05.dat from core
[00:26:32]   (Read 222366774 bytes from disk)
[00:26:32] Connecting to http://130.237.165.141:8080/
[00:26:32] - Couldn't send HTTP request to server
[00:26:32] + Could not connect to Work Server (results)
[00:26:32]     (130.237.165.141:8080)
[00:26:32] + Retrying using alternative port
[00:26:32] Connecting to http://130.237.165.141:80/
[00:42:21] - Couldn't send HTTP request to server
[00:42:21] + Could not connect to Work Server (results)
[00:42:21]     (130.237.165.141:80)
[00:42:21]   Could not transmit unit 05 to Collection server; keeping in queue.
[00:42:21] + Sent 0 of 1 completed units to the server
[00:42:21] - Preparing to get new work unit...
[00:42:21] Cleaning up work directory
[00:42:21] + Attempting to get work packet
[00:42:21] Passkey found
[00:42:21] - Will indicate memory of 16075 MB
[00:42:21] - Connecting to assignment server
[00:42:21] Connecting to http://assign.stanford.edu:8080/
[00:42:22] Posted data.
[00:42:22] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[00:42:22] + News From Folding@Home: Welcome to Folding@Home
[00:42:22] Loaded queue successfully.
[00:42:22] Sent data
[00:42:22] Connecting to http://128.143.199.96:8080/
[00:42:23] Posted data.
[00:42:23] Initial: 0000; - Receiving payload (expected size: 1767764)
[00:42:25] - Downloaded at ~863 kB/s
[00:42:25] - Averaged speed for that direction ~4971 kB/s
[00:42:25] + Received work.
[00:42:25] Trying to send all finished work units
[00:42:25] Project: 6903 (Run 10, Clone 12, Gen 72)


[00:42:25] + Attempting to send results [January 17 00:42:25 UTC]
[00:42:25] - Reading file work/wuresults_05.dat from core
[00:42:25]   (Read 222366774 bytes from disk)
[00:42:25] Connecting to http://130.237.232.237:8080/
[03:36:07] - Autosending finished units... [January 17 03:36:07 UTC]
[03:36:07] Trying to send all finished work units
[03:36:07] - Already sending work
[03:36:07] + Sent 0 of 1 completed units to the server
[03:36:07] - Autosend completed
^C[05:52:45] ***** Got an Activate signal (2)
[05:52:46] Killing all core threads

Folding@Home Client Shutdown.
rick@Server5:~/fah$ ./fah6

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

24 cores detected


--- Opening Log file [January 17 05:52:59 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/rick/fah
Executable: ./fah6
Arguments: -verbosity 9 -smp -bigadv

[05:52:59] - Ask before connecting: No
[05:52:59] - User name: rhavern (Team 33)
[05:52:59] - User ID: 7B5589202D84D214
[05:52:59] - Machine ID: 1
[05:52:59]
[05:52:59] Loaded queue successfully.
[05:52:59]
[05:52:59] + Processing work unit
[05:52:59] Core required: FahCore_a3.exe
[05:52:59] - Autosending finished units... [05:52:59]
[05:52:59] Core not found.
[05:52:59] Trying to send all finished work units
[05:52:59] - Core is not present or corrupted.
[05:52:59] Project: 6903 (Run 10, Clone 12, Gen 72)
[05:52:59] - Attempting to download new core...


[05:52:59] + Attempting to send results [January 17 05:52:59 UTC]
[05:52:59] + Downloading new core: FahCore_a3.exe
[05:52:59] - Reading file work/wuresults_05.dat from core
[05:52:59] Downloading core (/~pande/Linux/AMD64/Core_a3.fah from www.stanford.edu)
[05:52:59]   (Read 222366774 bytes from disk)
[05:52:59] Connecting to http://130.237.232.237:8080/
[05:53:12] Initial: AFDE; + 10240 bytes downloaded
<snip>
[05:53:13] Initial: 9274; + 2683199 bytes downloaded
[05:53:13] Verifying core Core_a3.fah...
[05:53:13] Signature is VALID
[05:53:13]
[05:53:13] Trying to unzip core FahCore_a3.exe
[05:53:13] Decompressed FahCore_a3.exe (6272504 bytes) successfully
[05:53:13] + Core successfully engaged
[05:53:18]
[05:53:18] + Processing work unit
[05:53:18] Core required: FahCore_a3.exe
[05:53:18] Core found.
[05:53:18] Working on queue slot 06 [January 17 05:53:18 UTC]
[05:53:18] + Working ...
[05:53:18] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 06 -np 24 -checkpoint 15 -verbose -lifeline 6888 -version 634'

[05:53:19]
[05:53:19] *------------------------------*
[05:53:19] Folding@Home Gromacs SMP Core
[05:53:19] Version 2.27 (Dec. 15, 2010)
[05:53:19]
[05:53:19] Preparing to commence simulation
[05:53:19] - Looking at optimizations...
[05:53:19] - Created dyn
[05:53:19] - Files status OK
[05:53:19] - Expanded 1767252 -> 1951112 (decompressed 110.4 percent)
[05:53:19] Called DecompressByteArray: compressed_data_size=1767252 data_size=1951112, decompressed_data_size=1951112 diff=0
[05:53:19] - Digital signature verified
[05:53:19]
[05:53:19] Project: 6941 (Run 0, Clone 83, Gen 452)
[05:53:19]
[05:53:19] Assembly optimizations on if available.
[05:53:19] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_06.tpr, VERSION 4.5.3-dev-20101113-8af87 (single precision)
Starting 24 threads
[05:53:25] Mapping NT from 24 to 24
Making 2D domain decomposition 6 x 4 x 1
starting mdrun 'Mutant_scan'
226500016 steps, 453000.0 ps (continuing from step 226000016, 452000.0 ps).
[05:53:26] Completed 0 out of 500000 steps  (0%)



Re: 130.237.232.237

Postby Trotador » Tue Jan 17, 2012 6:39 am

I'm also unable to upload a p6903 WU to this server or the alternatives.

Re: 130.237.232.237

Postby -alias- » Tue Jan 17, 2012 11:50 am

I have the same experience with one P6901 on this server, and with P6973 and P6993 on server 128.143.199.96. It looks like I will lose them all, and I think it may be associated with the big-16 transition in progress. I do connect for the upload of those WUs, but there it stops.

Re: 130.237.232.237

Postby $ilent » Tue Jan 17, 2012 11:58 am

I am also getting problems trying to upload a 6903 unit to this server, and afterwards all I could get was SMP work units. I hope I don't lose this WU, worth about 250,000 points.

Code:
[08:27:00] + Attempting to send results [January 17 08:27:00 UTC]
[08:59:11] - Couldn't send HTTP request to server
[08:59:11] + Could not connect to Work Server (results)
[08:59:11]     (130.237.232.237:8080)
[08:59:11] + Retrying using alternative port
[08:59:32] - Couldn't send HTTP request to server
[08:59:32] + Could not connect to Work Server (results)
[08:59:32]     (130.237.232.237:80)
[08:59:32] - Error: Could not transmit unit 04 (completed January 17) to work server.
[08:59:32]   Keeping unit 04 in queue.
[08:59:32] Project: 6903 (Run 4, Clone 10, Gen 45)


[08:59:32] + Attempting to send results [January 17 08:59:32 UTC]

Re: 130.237.232.237

Postby bollix47 » Tue Jan 17, 2012 12:00 pm

-alias- wrote: [quoted in full above]


Have you tried stopping the client and sending the regular SMP WUs manually? The reason: the client may be trying to send the bigadv first, getting hung, and never getting around to sending the others. I had this problem yesterday and had to send a few manually until the server was adjusted or the regular SMP projects got higher in the queue.

Basically, find the slot ## for the P6973 and P6993 and use the send command.

On my setup that would look like:
Code:
./fah6 -send ##
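The one-liner above generalizes to a small loop when several slots are stuck. A hedged sketch: slots 04 and 07 below are examples only; read your own from log lines like "Working on queue slot 06", and note the DRY_RUN guard only prints the commands by default rather than invoking ./fah6.

```shell
#!/bin/sh
# Manual flush of stuck result slots, as described in the post above:
# stop the client first, then push each regular-SMP slot by hand.
# DRY_RUN=1 (the default) echoes the commands; DRY_RUN=0 runs them.
send_slots() {
    for slot in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: ./fah6 -send $slot"
        else
            ./fah6 -send "$slot"
        fi
    done
}

# Example slot numbers -- replace with your own stuck slots.
send_slots 04 07
```

Run it with DRY_RUN=0 from your fah directory once the client is stopped, then restart the client.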

Re: 130.237.232.237

Postby -alias- » Tue Jan 17, 2012 1:04 pm

Thanks bollix47
I tried it, but the result is the same; maybe the server is down, or it won't accept anything right now. I wonder how long a WU can be preserved in the queue?

Edit: Checked the server status and it seems to me that this server only does classic WUs. http://fah-web.stanford.edu/logs/130.23 ... 7.log.html
Last edited by -alias- on Tue Jan 17, 2012 1:14 pm, edited 1 time in total.

Re: 130.237.232.237

Postby bollix47 » Tue Jan 17, 2012 1:11 pm

Strange, since I'm having the same problem today, yet I just sent a WU to .96 manually and it worked fine (used ./fah6 -send 07 to send from slot 7).

Anyway, you have 10 slots before you start losing work, but they will fill rather quickly since you're using a bigadv computer to do regular SMP WUs. Some might send once they have a lower queue number than the bigadv WU that's holding things up, but a client restart may be necessary.

Re: 130.237.232.237

Postby bollix47 » Tue Jan 17, 2012 2:38 pm

Ignore the classic designation ... it has been a bigadv server for some time.

You should back up your entire fah folder before you get to the bigadv slot or you may lose it. :wink:
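A minimal sketch of that backup, assuming the v6 client layout shown in the logs (a single fah folder holding queue.dat and work/). Stop the client first so the queue isn't written mid-copy; the FAH_DIR default is an assumption, so adjust it to your setup.

```shell
#!/bin/sh
# Snapshot the whole fah folder to a timestamped tarball next to it,
# e.g. fah-backup-20120117-060000.tar.gz, before the stuck bigadv slot
# comes around again.
FAH_DIR="${FAH_DIR:-$HOME/fah}"

backup_fah() {
    dir="$1"
    stamp=$(date +%Y%m%d-%H%M%S)
    out="${dir%/}-backup-${stamp}.tar.gz"
    # Archive from the parent directory so paths inside stay relative.
    tar -czf "$out" -C "$(dirname "$dir")" "$(basename "$dir")" &&
        echo "backed up to $out"
}

if [ -d "$FAH_DIR" ]; then
    backup_fah "$FAH_DIR"
fi
```

Restoring is just extracting the tarball back into the parent directory with the client stopped.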
