Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 4:42 am
by lobuxracer
You are not alone. HTTP 503 here too.

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 6:22 am
by susato
Netload's back at 200. Every time Dr. K. bumps it, it heads right back to 200 connections and trouble. Good to hear about the upcoming equipment upgrade.

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 6:48 am
by ThunderRd
susato wrote:Netload's back at 200. Every time Dr. K. bumps it, it heads right back to 200 connections and trouble. Good to hear about the upcoming equipment upgrade.
Actually it's at 202 ATM.

An additional server would be welcome. I've got 5 machines waiting for work now, 4 of them for over 12 hours.

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 11:11 am
by snapshot
I've got two WUs waiting for upload with their bonus points frittering away.....

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 12:34 pm
by Tobit
65.56 needs another kick please. At the time of this post, net load is over 200 again. :(

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 1:20 pm
by Datsun 1600
How can I get permanently assigned to Server 171.64.65.54? I have no problems with that one, but as soon as I get assigned a WU from Server 171.64.65.56 and go to return it, it is usually overloaded. GRRRRRRRRRRR

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 2:12 pm
by VijayPande
I'm sorry about this. We're monitoring the situation and have a longer term solution (new servers, work distributed amongst them), but that will take some time to get on line (sorry it's taking so long).

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 3:55 pm
by 7up1n3
Getting 503s here too now.

Code:

[13:11:12] Completed 2000000 out of 2000000 steps  (100%)
[13:11:12] DynamicWrapper: Finished Work Unit: sleep=10000
[13:11:23] 
[13:11:23] Finished Work Unit:
[13:11:23] - Reading up to 687408 from "work/wudata_05.trr": Read 687408
[13:11:23] trr file hash check passed.
[13:11:23] - Reading up to 42672364 from "work/wudata_05.xtc": Read 42672364
[13:11:23] xtc file hash check passed.
[13:11:23] edr file hash check passed.
[13:11:23] logfile size: 279301
[13:11:23] Leaving Run
[13:11:23] - Writing 43641409 bytes of core data to disk...
[13:11:24]   ... Done.
[13:11:25] - Shutting down core
[13:11:25] 
[13:11:25] Folding@home Core Shutdown: FINISHED_UNIT
[13:11:29] CoreStatus = 64 (100)
[13:11:29] Unit 5 finished with 91 percent of time to deadline remaining.
[13:11:29] Updated performance fraction: 0.673357
[13:11:29] Sending work to server
[13:11:29] Project: 6701 (Run 56, Clone 11, Gen 58)


[13:11:29] + Attempting to send results [October 5 13:11:29 UTC]
[13:11:29] - Reading file work/wuresults_05.dat from core
[13:11:29]   (Read 43641409 bytes from disk)
[13:11:29] Connecting to http://171.64.65.56:8080/
[13:11:29] - Couldn't send HTTP request to server
[13:11:29]   (Got status 503)
[13:11:29] + Could not connect to Work Server (results)
[13:11:29]     (171.64.65.56:8080)
[13:11:29] + Retrying using alternative port
[13:11:29] Connecting to http://171.64.65.56:80/
[13:11:29] - Couldn't send HTTP request to server
[13:11:29]   (Got status 503)
[13:11:29] + Could not connect to Work Server (results)
[13:11:29]     (171.64.65.56:80)
[13:11:29] - Error: Could not transmit unit 05 (completed October 5) to work server.
[13:11:29] - 1 failed uploads of this unit.
[13:11:29]   Keeping unit 05 in queue.
[13:11:29] Trying to send all finished work units
[13:11:29] Project: 6701 (Run 56, Clone 11, Gen 58)


[13:11:29] + Attempting to send results [October 5 13:11:29 UTC]
[13:11:29] - Reading file work/wuresults_05.dat from core
[13:11:29]   (Read 43641409 bytes from disk)
[13:11:29] Connecting to http://171.64.65.56:8080/
[13:11:29] - Couldn't send HTTP request to server
[13:11:29]   (Got status 503)
[13:11:29] + Could not connect to Work Server (results)
[13:11:29]     (171.64.65.56:8080)
[13:11:29] + Retrying using alternative port
[13:11:29] Connecting to http://171.64.65.56:80/
[13:11:29] - Couldn't send HTTP request to server
[13:11:29]   (Got status 503)
[13:11:29] + Could not connect to Work Server (results)
[13:11:29]     (171.64.65.56:80)
[13:11:29] - Error: Could not transmit unit 05 (completed October 5) to work server.
[13:11:29] - 2 failed uploads of this unit.


[13:11:29] + Attempting to send results [October 5 13:11:29 UTC]
[13:11:29] - Reading file work/wuresults_05.dat from core
[13:11:29]   (Read 43641409 bytes from disk)
[13:11:29] Connecting to http://171.67.108.25:8080/
[13:11:29] - Couldn't send HTTP request to server
[13:11:29]   (Got status 503)
[13:11:29] + Could not connect to Work Server (results)
[13:11:29]     (171.67.108.25:8080)
[13:11:29] + Retrying using alternative port
[13:11:29] Connecting to http://171.67.108.25:80/
[13:11:29] - Couldn't send HTTP request to server
[13:11:29]   (Got status 503)
[13:11:29] + Could not connect to Work Server (results)
[13:11:29]     (171.67.108.25:80)
[13:11:29]   Could not transmit unit 05 to Collection server; keeping in queue.
[13:11:29] + Sent 0 of 1 completed units to the server
[13:11:29] - Preparing to get new work unit...
[13:11:29] Cleaning up work directory
[13:11:31] + Attempting to get work packet
[13:11:31] Passkey found
[13:11:31] - Will indicate memory of 12285 MB
[13:11:31] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[13:11:31] - Connecting to assignment server
[13:11:31] Connecting to http://assign.stanford.edu:8080/
[13:11:32] Posted data.
[13:11:32] Initial: 40AB; - Successful: assigned to (171.64.65.54).
[13:11:32] + News From Folding@Home: Welcome to Folding@Home
[13:11:32] Loaded queue successfully.
[13:11:32] Sent data
[13:11:32] Connecting to http://171.64.65.54:8080/
[13:11:33] Posted data.
[13:11:33] Initial: 0000; - Receiving payload (expected size: 1765609)
[13:11:35] - Downloaded at ~862 kB/s
[13:11:35] - Averaged speed for that direction ~615 kB/s
[13:11:35] + Received work.
[13:11:35] Trying to send all finished work units
[13:11:35] Project: 6701 (Run 56, Clone 11, Gen 58)


[13:11:35] + Attempting to send results [October 5 13:11:35 UTC]
[13:11:35] - Reading file work/wuresults_05.dat from core
[13:11:35]   (Read 43641409 bytes from disk)
[13:11:35] Connecting to http://171.64.65.56:8080/
[13:11:36] - Couldn't send HTTP request to server
[13:11:36]   (Got status 503)
[13:11:36] + Could not connect to Work Server (results)
[13:11:36]     (171.64.65.56:8080)
[13:11:36] + Retrying using alternative port
[13:11:36] Connecting to http://171.64.65.56:80/
[13:11:36] - Couldn't send HTTP request to server
[13:11:36]   (Got status 503)
[13:11:36] + Could not connect to Work Server (results)
[13:11:36]     (171.64.65.56:80)
[13:11:36] - Error: Could not transmit unit 05 (completed October 5) to work server.
[13:11:36] - 3 failed uploads of this unit.


[13:11:36] + Attempting to send results [October 5 13:11:36 UTC]
[13:11:36] - Reading file work/wuresults_05.dat from core
[13:11:36]   (Read 43641409 bytes from disk)
[13:11:36] Connecting to http://171.67.108.25:8080/
[13:11:36] - Couldn't send HTTP request to server
[13:11:36]   (Got status 503)
[13:11:36] + Could not connect to Work Server (results)
[13:11:36]     (171.67.108.25:8080)
[13:11:36] + Retrying using alternative port
[13:11:36] Connecting to http://171.67.108.25:80/
[13:11:36] - Couldn't send HTTP request to server
[13:11:36]   (Got status 503)
[13:11:36] + Could not connect to Work Server (results)
[13:11:36]     (171.67.108.25:80)
[13:11:36]   Could not transmit unit 05 to Collection server; keeping in queue.
[13:11:36] + Sent 0 of 1 completed units to the server
[13:11:36] + Closed connections
[13:11:36] 
[13:11:36] + Processing work unit
[13:11:36] Core required: FahCore_a3.exe
[13:11:36] Core found.
[13:11:36] Working on queue slot 06 [October 5 13:11:36 UTC]
[13:11:36] + Working ...
[13:11:36] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 06 -np 8 -checkpoint 15 -verbose -lifeline 2668 -version 630'

[13:11:36] 
[13:11:36] *------------------------------*
[13:11:36] Folding@Home Gromacs SMP Core
[13:11:36] Version 2.22 (Mar 12, 2010)
[13:11:36] 
[13:11:36] Preparing to commence simulation
[13:11:36] - Looking at optimizations...
[13:11:36] - Created dyn
[13:11:36] - Files status OK
[13:11:36] - Expanded 1765097 -> 2251569 (decompressed 127.5 percent)
[13:11:36] Called DecompressByteArray: compressed_data_size=1765097 data_size=2251569, decompressed_data_size=2251569 diff=0
[13:11:36] - Digital signature verified
[13:11:36] 
[13:11:36] Project: 6055 (Run 1, Clone 160, Gen 45)
[13:11:36] 
[13:11:36] Assembly optimizations on if available.
[13:11:36] Entering M.D.
[13:11:43] Completed 0 out of 500000 steps  (0%)
What happens to WUs that fail to upload like this? Do they expire when the deadline passes? Is the bonus frittered away?
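
The fallback order visible in the log above (work server on port 8080, then port 80, then the collection server on the same two ports) can be sketched roughly as follows. This is only an illustration of the sequence the log shows, not the client's actual code; `upload_targets`, `try_upload`, and the `send` callable are hypothetical names.

```python
def upload_targets(work_server, collection_server, ports=(8080, 80)):
    """Yield (host, port) pairs in the order the log shows them being tried."""
    for host in (work_server, collection_server):
        for port in ports:
            yield host, port


def try_upload(send, work_server, collection_server):
    """Try each target in turn until one accepts the upload.

    `send` is a hypothetical callable taking (host, port) and returning an
    HTTP status code. Returns the (host, port) that answered 200, or None
    if every attempt failed (the unit then stays in the queue).
    """
    for host, port in upload_targets(work_server, collection_server):
        if send(host, port) == 200:
            return host, port
    return None


if __name__ == "__main__":
    # Simulate the situation in the log: every server answers 503.
    result = try_upload(lambda host, port: 503, "171.64.65.56", "171.67.108.25")
    print(result)
```

With every target returning 503, as in the log, the function returns None and the unit is kept for a later retry.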

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 4:48 pm
by PantherX
7up1n3 wrote:...What happens to WUs that fail to upload like this? Do they expire when the deadline passes? Is the bonus frittered away?
In this case, your bonus points will be reduced according to the delay caused by the server. If you cross the Preferred Deadline, you will be assigned base credits only. If you cross the Final Deadline, you will not get any credit, and the client will delete the WU and move on. Sorry, but that is the way it currently works. Hopefully, when new SMP servers are added, this will be history and we will all be happy.
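
The three outcomes described above can be written as a small decision function. This is a minimal sketch of the rule as stated in the post, not the actual server logic; the function name, the tier labels, and the example dates are made up, and how the bonus shrinks with delay is deliberately left out because the thread does not give a formula.

```python
from datetime import datetime


def credit_tier(received_at, preferred_deadline, final_deadline):
    """Classify a returned WU by the time the server received it.

    All three arguments are server-side timestamps (assumption).
    Returns one of the hypothetical labels:
      "base+bonus" - returned by the Preferred Deadline
      "base"       - past the Preferred Deadline, within the Final Deadline
      "none"       - past the Final Deadline; the client discards the WU
    """
    if received_at <= preferred_deadline:
        return "base+bonus"
    if received_at <= final_deadline:
        return "base"
    return "none"


if __name__ == "__main__":
    # Hypothetical deadlines, for illustration only.
    preferred = datetime(2010, 10, 9)
    final = datetime(2010, 10, 11)
    print(credit_tier(datetime(2010, 10, 6), preferred, final))
    print(credit_tier(datetime(2010, 10, 10), preferred, final))
    print(credit_tier(datetime(2010, 10, 12), preferred, final))
```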

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 6:01 pm
by codysluder
VijayPande wrote:I'm sorry about this. We're monitoring the situation and have a longer term solution (new servers, work distributed amongst them), but that will take some time to get on line (sorry it's taking so long).
I understand how long it takes to get new servers on-line but there's a real conflict between how long that's going to take and how long it takes for our QRBonus to decay. Are there any short-term solutions that can help? Why can't at least some of the work be distributed as has been suggested here?
Datsun 1600 wrote:How can I get permanently assigned to Server 171.64.65.54? I have no problems with that one, but as soon as I get assigned a WU from Server 171.64.65.56 and go to return it, it is usually overloaded. GRRRRRRRRRRR

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 6:08 pm
by 7up1n3
PantherX wrote:
7up1n3 wrote:...What happens to WUs that fail to upload like this? Do they expire when the deadline passes? Is the bonus frittered away?
In this case, your bonus points will be reduced according to the delay caused by the server. If you cross the Preferred Deadline, you will be assigned base credits only. If you cross the Final Deadline, you will not get any credit, and the client will delete the WU and move on. Sorry, but that is the way it currently works. Hopefully, when new SMP servers are added, this will be history and we will all be happy.
Wow. It shouldn't be that difficult to read the code detailing beginning and end of the processing time, and rewarding the contribution accordingly so that users aren't penalized for server side issues.

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 7:08 pm
by codysluder
You're assuming that all failure-to-upload problems are server-side issues. That may be true much of the time, and it's certainly true right now, but I'll bet that folks would find a way to cheat. I have not tried it, but what happens if you adjust your clock to show that the WU finished one minute after you downloaded it, then correct the clock before you upload the WU, blaming all of your processing time on a server outage?

The points cannot be based on any clock other than the server clock.

Stanford said that they recognized the possibility of server problems and that they'd do their best to maintain the servers but it was a risk that you'd have to accept. Are you saying that you don't believe that they are making a sincere effort to correct the problems? If so, I disagree with you.

Re: 171.64.65.56 ???

Posted: Tue Oct 05, 2010 7:14 pm
by codysluder
Assuming that the problems with this server will continue until a new server comes on-line, some folks might actually benefit by temporarily suspending SMP. Multiple uniprocessor clients normally earn less PPD than SMP, but you don't have to reduce the PPD for SMP very much before multiple clients win out because of their reliability. At least it's an alternative to consider.
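
To put a number on that trade-off, here is a rough sketch of the break-even point. The function name and the PPD figures are invented for illustration; plug in your own measurements.

```python
def smp_break_even_fraction(smp_ppd, uni_ppd_per_client, n_clients):
    """Fraction of nominal SMP PPD below which uniprocessor clients win.

    smp_ppd            - PPD the SMP client earns when uploads succeed promptly
    uni_ppd_per_client - PPD of one uniprocessor client
    n_clients          - number of uniprocessor clients run instead of SMP
    """
    if smp_ppd <= 0:
        raise ValueError("smp_ppd must be positive")
    return (uni_ppd_per_client * n_clients) / smp_ppd


if __name__ == "__main__":
    # Hypothetical figures: 10000 PPD on SMP vs. 8 clients at 1000 PPD each.
    fraction = smp_break_even_fraction(10000, 1000, 8)
    print(f"Uniprocessor clients win once SMP drops below {fraction:.0%} of nominal PPD")
```

With those made-up numbers, the uniprocessor clients come out ahead as soon as upload delays cost the SMP client more than a fifth of its usual PPD.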

Re: 171.64.65.56 ???

Posted: Wed Oct 06, 2010 12:27 pm
by Tobit
Please kick it again, net load is over 200 yet again.

Re: 171.64.65.56 ???

Posted: Wed Oct 06, 2010 5:46 pm
by 7up1n3
codysluder wrote:You're assuming that all failure to upload problems are server-side issues.
I'm not assuming that at all. But I am assuming that, when server issues do arise, a system could be implemented to address them. This has been done in the past, with mass point credits as users are "caught up", and would simply need to be adjusted to accommodate the bonus system.