problems with all SMP servers

ThunderRd · Post by **ThunderRd** » Wed Dec 17, 2008 8:47 am

I don't know if this is related to the server problems of late but I have had massive problems uploading for about three days now.

The SMP servers are showing that they are in good shape, but I'm getting a lot of 503s. I have about 20 SMP WUs waiting to be uploaded; my machines connect, begin to upload, and then just fade out and stop. I can and do connect to the servers in my browser. There isn't any problem with security software. This farm has been running for a year and a half and, other than the occasional hitch, has been fine. I have checked my ISP and can't seem to find anything there. Here is a sample:

Code:

[02:19:06] Timered checkpoint triggered.
[02:19:06] - Autosending finished units... [December 17 02:19:06 UTC]
[02:19:06] Trying to send all finished work units
[02:19:06] Project: 3064 (Run 2, Clone 138, Gen 33)


[02:19:06] + Attempting to send results [December 17 02:19:06 UTC]
[02:19:06] - Reading file work/wuresults_00.dat from core
[02:19:06]   (Read 1846876 bytes from disk)
[02:19:06] Connecting to http://171.64.65.63:8080/
[02:19:28] Writing local files
[02:19:28] Completed 12500 out of 250000 steps  (5 percent)
[02:34:30] Timered checkpoint triggered.
[02:49:31] Timered checkpoint triggered.
[03:00:29] - Couldn't send HTTP request to server
[03:00:29] + Could not connect to Work Server (results)
[03:00:29]     (171.64.65.63:8080)
[03:00:29] + Retrying using alternative port
[03:00:29] Connecting to http://171.64.65.63:80/
[03:04:31] Timered checkpoint triggered.
[03:04:41] Writing local files
[03:04:41] Completed 15000 out of 250000 steps  (6 percent)
[03:19:42] Timered checkpoint triggered.
[03:34:43] Timered checkpoint triggered.
[03:49:43] Timered checkpoint triggered.
[03:49:52] Writing local files
[03:49:53] Completed 17500 out of 250000 steps  (7 percent)
[04:04:53] Timered checkpoint triggered.
[04:19:54] Timered checkpoint triggered.
[04:26:31] - Couldn't send HTTP request to server
[04:26:31] + Could not connect to Work Server (results)
[04:26:31]     (171.64.65.63:80)
[04:26:31] - Error: Could not transmit unit 00 (completed December 16) to work server.
[04:26:31] - 4 failed uploads of this unit.


[04:26:31] + Attempting to send results [December 17 04:26:31 UTC]
[04:26:31] - Reading file work/wuresults_00.dat from core
[04:26:31]   (Read 1846876 bytes from disk)
[04:26:31] Connecting to http://171.67.108.17:8080/
[04:29:01] - Couldn't send HTTP request to server
[04:29:01] + Could not connect to Work Server (results)
[04:29:01]     (171.67.108.17:8080)
[04:29:01] + Retrying using alternative port
[04:29:01] Connecting to http://171.67.108.17:80/
[04:29:22] - Couldn't send HTTP request to server
[04:29:22] + Could not connect to Work Server (results)
[04:29:22]     (171.67.108.17:80)
[04:29:22]   Could not transmit unit 00 to Collection server; keeping in queue.
[04:29:22] Project: 3064 (Run 2, Clone 138, Gen 33)


[04:29:22] + Attempting to send results [December 17 04:29:22 UTC]
[04:29:22] - Reading file work/wuresults_00.dat from core
[04:29:22]   (Read 1846876 bytes from disk)
[04:29:22] Connecting to http://171.64.65.63:8080/
[04:29:43] - Couldn't send HTTP request to server
[04:29:43] + Could not connect to Work Server (results)
[04:29:43]     (171.64.65.63:8080)
[04:29:43] + Retrying using alternative port
[04:29:43] Connecting to http://171.64.65.63:80/
[04:30:04] - Couldn't send HTTP request to server
[04:30:04] + Could not connect to Work Server (results)
[04:30:04]     (171.64.65.63:80)
[04:30:04] - Error: Could not transmit unit 00 (completed December 16) to work server.
[04:30:04] - 5 failed uploads of this unit.


[04:30:04] + Attempting to send results [December 17 04:30:04 UTC]
[04:30:04] - Reading file work/wuresults_00.dat from core
[04:30:04]   (Read 1846876 bytes from disk)
[04:30:04] Connecting to http://171.67.108.17:8080/
[04:34:54] Timered checkpoint triggered.
[04:35:02] Writing local files
[04:35:03] Completed 20000 out of 250000 steps  (8 percent)
[04:50:03] Timered checkpoint triggered.
[04:51:35] - Couldn't send HTTP request to server
[04:51:35] + Could not connect to Work Server (results)
[04:51:35]     (171.67.108.17:8080)
[04:51:35] + Retrying using alternative port
[04:51:35] Connecting to http://171.67.108.17:80/
[04:51:43] - Couldn't send HTTP request to server
[04:51:43] + Could not connect to Work Server (results)
[04:51:43]     (171.67.108.17:80)
[04:51:43]   Could not transmit unit 00 to Collection server; keeping in queue.
[04:51:43] + Sent 0 of 2 completed units to the server
[04:51:43] - Autosend completed
[05:05:04] Timered checkpoint triggered.
[05:20:05] Timered checkpoint triggered.

This is also happening on 171.64.65.64 and 171.67.108.25. What's weird is that the odd WU does get through. I'm not sure what to think; the 503s are definitely not my ISP, but the fading upload problem (i.e., getting connected, the upload starting, and gradually slowing to a halt) may be. This is on 30 SMP boxes simultaneously, so it's not a hardware problem or a configuration problem. I've been waiting it out to see if it would correct itself, but it's looking grim. I've been watching the forum to see if others are having the same problems (especially from my area, Southeast Asia), and I'm feeling like I'm the only one.
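For anyone who wants to tally failures per server from a log like the one above, here is a minimal Python sketch. It assumes the log layout shown in this post, where a "Could not connect to Work Server" line is followed by an indented `(host:port)` line. The function name `count_failed_endpoints` and the `SAMPLE` excerpt are illustrative, not part of the client.

```python
import re
from collections import Counter

# Matches the indented "(host:port)" line printed after a failed connection.
ENDPOINT_RE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\]\s+\((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)\s*$"
)


def count_failed_endpoints(log_text: str) -> Counter:
    """Count 'Could not connect' failures per (host, port) pair."""
    failures = Counter()
    previous_was_failure = False
    for line in log_text.splitlines():
        if "Could not connect to Work Server" in line:
            previous_was_failure = True
            continue
        match = ENDPOINT_RE.match(line)
        if match and previous_was_failure:
            failures[(match.group(1), int(match.group(2)))] += 1
        previous_was_failure = False
    return failures


# Excerpt copied from the log above.
SAMPLE = """\
[03:00:29] - Couldn't send HTTP request to server
[03:00:29] + Could not connect to Work Server (results)
[03:00:29]     (171.64.65.63:8080)
[03:00:29] + Retrying using alternative port
[03:00:29] Connecting to http://171.64.65.63:80/
[04:26:31] - Couldn't send HTTP request to server
[04:26:31] + Could not connect to Work Server (results)
[04:26:31]     (171.64.65.63:80)
"""

if __name__ == "__main__":
    for (host, port), n in sorted(count_failed_endpoints(SAMPLE).items()):
        print(f"{host}:{port} failed {n} time(s)")
```

Running it over the full FAHlog.txt from each box would show whether the failures cluster on one server or are spread evenly across all of them.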

Post by **toTOW** » Wed Dec 17, 2008 10:19 am

Can you connect to http://171.64.65.63:8080/ with your web browser?
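The same reachability check can be scripted, which helps when there are many machines to test. This is a minimal sketch using only Python's standard `socket` module; the helper name `can_connect` and the 10-second default timeout are assumptions for illustration. It only confirms that a TCP connection opens, so it says nothing about whether a multi-megabyte upload will complete.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("171.64.65.63", 8080)` and `can_connect("171.64.65.63", 80)` would test the two ports the client tries in the log above.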

ThunderRd · Post by **ThunderRd** » Wed Dec 17, 2008 2:46 pm

ThunderRd wrote: I can and do connect to the servers in my browser.

Remember that the machines do connect and try to upload, but stop after a while.

I'm thinking more along the lines of a problem after the ISP; perhaps from this area of the world.

Post by **toTOW** » Wed Dec 17, 2008 3:12 pm

Are you sure that your ISP is not limiting your file sizes (for example, through a transparent proxy)?

Post by **bruce** » Wed Dec 17, 2008 4:54 pm

Please check that these WUs have not expired and (if you're running a V5 client) that you do not have "Use IE Settings" set to yes. Have you tried temporarily disabling your firewall/security software?

Error 503 does mean that the server is too busy, but the errors I see in the log are "Couldn't send HTTP request to server", which is probably something entirely different.

Post by **kasson** » Wed Dec 17, 2008 5:05 pm

If you are using a client version less than 6.23, try upgrading. If you have a slow connection, some older clients will time out after 20 minutes.
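To put the 20-minute limit in perspective, here is a rough Python calculation of the slowest average upload rate that would still finish before such a timeout. The file sizes are the result sizes reported in the logs in this thread. The function name is illustrative, and the calculation ignores protocol overhead and connection setup time.

```python
TIMEOUT_SECONDS = 20 * 60  # the 20-minute limit mentioned above


def min_rate_bytes_per_s(size_bytes: int, timeout_s: int = TIMEOUT_SECONDS) -> float:
    """Slowest average rate that still completes the upload before the timeout."""
    return size_bytes / timeout_s


if __name__ == "__main__":
    for size in (1846876, 5522949):  # result sizes taken from the logs in this thread
        print(f"{size} bytes needs at least {min_rate_bytes_per_s(size):.0f} B/s")
    # prints:
    # 1846876 bytes needs at least 1539 B/s
    # 5522949 bytes needs at least 4602 B/s
```

So a connection that sustains only a couple of kB/s would be borderline for the smaller result and too slow for the larger one, if the timeout applies.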

ThunderRd · Post by **ThunderRd** » Thu Dec 18, 2008 1:39 am

@toTOW: As for limiting file sizes, I can only say that if they are, it started all of a sudden, and they claimed they had done nothing new when I spoke to them the other day. Some WUs are getting through, so I don't think that's it.

@bruce: I will check on expiration; I'm sure by now at least some of them have expired. I have made no changes on the boxes, and they have been running for a long time. Of course, no to IE settings, and no, the firewall is not in the way... I can connect and begin to upload.

@Peter: I tried to update the client from 6.22 to 6.23 on 2 machines as a test. It made no difference, same error messages.

EDIT: I have arrived at one of my shops and see that some of the WUs are being uploaded, albeit slowly. Others have died while waiting to upload. I am confused by something, though. When a WU in the queue expires, why does the client stop working on the current one and delete it? Here's a sample (or am I reading the log wrongly?):

Code:

[06:07:32] Completed 162500 out of 250000 steps  (65 percent)  <-- working on this one
[06:18:00] - Autosending finished units... [December 16 06:18:00 UTC]
[06:18:00] Trying to send all finished work units
[06:18:00] Project: 2653 (Run 20, Clone 21, Gen 94)                    <--trying to send this one

[06:18:00] + Attempting to send results [December 16 06:18:00 UTC]
[06:18:00] - Reading file work/wuresults_01.dat from core
[06:18:00]   (Read 5522949 bytes from disk)
[06:18:00] Connecting to http://171.64.65.64:8080/
[06:18:22] - Couldn't send HTTP request to server
[06:18:22] + Could not connect to Work Server (results)
[06:18:22]     (171.64.65.64:8080)
[06:18:22] + Retrying using alternative port
[06:18:22] Connecting to http://171.64.65.64:80/
[06:18:43] - Couldn't send HTTP request to server
[06:18:43] + Could not connect to Work Server (results)
[06:18:43]     (171.64.65.64:80)
[06:18:43] - Error: Could not transmit unit 01 (completed December 14) to work server.
[06:18:43] - 9 failed uploads of this unit.


[06:18:43] + Attempting to send results [December 16 06:18:43 UTC]
[06:18:43] - Reading file work/wuresults_01.dat from core
[06:18:43]   (Read 5522949 bytes from disk)
[06:18:43] Connecting to http://171.67.108.25:8080/
[06:19:04] - Couldn't send HTTP request to server
[06:19:04] + Could not connect to Work Server (results)
[06:19:04]     (171.67.108.25:8080)
[06:19:04] + Retrying using alternative port
[06:19:04] Connecting to http://171.67.108.25:80/
[06:19:25] - Couldn't send HTTP request to server
[06:19:25] + Could not connect to Work Server (results)
[06:19:25]     (171.67.108.25:80)
[06:19:25]   Could not transmit unit 01 to Collection server; keeping in queue.
[06:19:25] + Sent 0 of 1 completed units to the server
[06:19:25] - Autosend completed
[06:22:32] Timered checkpoint triggered.
[06:37:35] Timered checkpoint triggered.
[06:43:56] Writing local files
[06:43:56] Completed 165000 out of 250000 steps  (66 percent)
[06:58:58] Timered checkpoint triggered.
[07:14:00] Timered checkpoint triggered.
[07:20:23] Writing local files
[07:20:23] Completed 167500 out of 250000 steps  (67 percent)
[07:35:26] Timered checkpoint triggered.
[07:50:28] Timered checkpoint triggered.
[07:56:50] Writing local files
[07:56:51] Completed 170000 out of 250000 steps  (68 percent)
[08:11:52] Timered checkpoint triggered.
[08:26:54] Timered checkpoint triggered.
[08:33:18] Writing local files
[08:33:18] Completed 172500 out of 250000 steps  (69 percent)
[08:48:20] Timered checkpoint triggered.
[09:03:22] Timered checkpoint triggered.
[09:09:46] Writing local files
[09:09:46] Completed 175000 out of 250000 steps  (70 percent)
[09:24:48] Timered checkpoint triggered.
[09:39:50] Timered checkpoint triggered.
[09:46:13] Writing local files
[09:46:13] Completed 177500 out of 250000 steps  (71 percent)
[10:01:15] Timered checkpoint triggered.
[10:16:17] Timered checkpoint triggered.
[10:22:38] Writing local files
[10:22:38] Completed 180000 out of 250000 steps  (72 percent)
[10:37:40] Timered checkpoint triggered.
[10:52:42] Timered checkpoint triggered.
[10:59:04] Writing local files
[10:59:04] Completed 182500 out of 250000 steps  (73 percent)
[11:14:06] Timered checkpoint triggered.
[11:29:09] Timered checkpoint triggered.
[11:35:31] Writing local files
[11:35:32] Completed 185000 out of 250000 steps  (74 percent)
[11:50:33] Timered checkpoint triggered.
[12:05:34] Timered checkpoint triggered.
[12:12:06] Writing local files
[12:12:07] Completed 187500 out of 250000 steps  (75 percent)  <--still working on this one
[12:20:19] Unit 1's deadline (December 16 07:06) has passed. <--6 hours later, dead in queue
[12:22:47] CoreStatus = 1 (1)
[12:22:47] Client-core communications error: ERROR 0x1
[12:22:47] Deleting current work unit & continuing...   <---deletes good WU at 74% WHY?         
[12:22:47] - Warning: Could not delete all work unit files (1): Core returned invalid code
[12:22:47] - Autosending finished units... [December 16 12:22:47 UTC]
[12:22:47] Trying to send all finished work units
[12:22:47] + No unsent completed units remaining.
[12:22:47] - Autosend completed
[12:25:09] - Warning: Could not delete all work unit files (2): Core returned invalid code
[12:25:09] Trying to send all finished work units
[12:25:09] + No unsent completed units remaining.
[12:25:09] - Preparing to get new work unit...
[12:25:09] + Attempting to get work packet
[12:25:09] - Will indicate memory of 511 MB
[12:25:09] - Detect CPU. Vendor: GenuineIntel, Family: 15, Model: 6, Stepping: 5
[12:25:09] - Connecting to assignment server
[12:25:09] Connecting to http://assign.stanford.edu:8080/
[12:25:11] Posted data.
[12:25:11] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:25:11] + News From Folding@Home: Welcome to Folding@Home
[12:25:11] Loaded queue successfully.
[12:25:11] Connecting to http://171.64.65.64:8080/
[12:25:14] Posted data.
[12:25:14] Initial: 0000; - Receiving payload (expected size: 2429429)   <--client d/l new WU
[12:29:20] - Downloaded at ~9 kB/s
[12:29:20] - Averaged speed for that direction ~53 kB/s
[12:29:20] + Received work.
[12:29:20] + Closed connections
[12:29:25] 
[12:29:25] + Processing work unit
[12:29:25] Work type a1 not eligible for variable processors
[12:29:25] Core required: FahCore_a1.exe
[12:29:25] Core found.
[12:29:25] Working on queue slot 03 [December 16 12:29:25 UTC]
[12:29:25] + Working ...
[12:29:25] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 03 -checkpoint 15 -forceasm -verbose -lifeline 3980 -version 622'

[12:29:26] -----------------------------*
[12:29:27] Folding@Home Gromacs SMP Core
[12:29:27] Version 1.76 (February 23, 2008)
[12:29:27] 
[12:29:27] Preparing to commence simulation
[12:29:27] - Ensuring status. Please wait.
[12:29:44] - Assembly optimizations manually forced on.
[12:29:45] - Not checking prior termination.
[12:29:59] - Expanded 2428917 -> 12901785 (decompressed 531.1 percent)
[12:29:59] - Starting from initial work packet
[12:30:00] c
[12:30:00] - Failed to delete work/wudata_03.pdo
[12:30:00] Warning:  check for stray files
[12:30:00] - Starting from initial work packet
[12:30:00] Could not touch 
[12:30:00] 
[12:30:00] Project: 2653 (Run 7, Clone 47, Gen 95)
[12:30:00] 
[12:30:13] tions on if available.
[12:30:13] Entering M.D.
[12:30:13] e.
[12:30:13] Entering M.D.
[12:30:27] Protein: Protein in POPC
[12:30:27] Writing local files
[12:30:36] Extra SSE boost OK.
[13:00:43] heckpoint triggered.
[13:15:46] Timered checkpoint triggered.
[13:30:48] Timered checkpoint triggered.
[13:37:02] Writing local files
[13:37:02] Completed 5000 out of 500000 steps  (1 percent)

Is this behavior normal? If it is, it shouldn't be. I don't see why the client should delete the good WU that is being crunched; it should only delete the expired one in the queue. I saw this on every machine with a WU expired in the queue.

Here's something else; this looks like an ISP problem to me:

Code:

[16:53:48] + Attempting to send results [December 16 16:53:48 UTC]
[16:53:48] - Reading file work/wuresults_02.dat from core
[16:53:48]   (Read 2270263 bytes from disk)
[16:53:48] Connecting to http://171.67.108.17:8080/
[17:02:54] Timered checkpoint triggered.
[17:06:53] Posted data.
[17:06:53] Initial: 0000; - Uploaded at ~2 kB/s
[17:06:53] - Averaged speed for that direction ~80 kB/s
[17:06:53] - Server reports packet it received an incomplete payload.
[17:06:53]   (May be due to packet loss during network transmission or a corrupted file.)
[17:06:53]   Could not transmit unit 02 to Collection server; keeping in queue.
[17:06:53] + Sent 0 of 1 completed units to the server
[17:06:53] - Autosend completed
[17:17:55] Timered checkpoint triggered.
[17:21:12] Writing local files
[17:21:12] Completed 140000 out of 500000 steps  (28 percent)

Post by **kasson** » Thu Dec 18, 2008 3:38 am

You are correct--if WU 1 expires in the queue while WU 2 is being worked on, the client should not delete WU 2. I'd have to look at this part of the code, but this is not the desired behavior.
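To make the desired behavior concrete, here is a minimal Python sketch in which each queue slot is judged only against its own deadline. It is not the client's actual code. `QueueSlot`, `slots_to_discard`, and the deadline given to the active unit are hypothetical. Only unit 1's deadline (December 16 07:06) and the time of the check (12:20:19) come from the log above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class QueueSlot:  # hypothetical stand-in for one entry in the client's queue
    index: int
    deadline: datetime
    active: bool = False  # True for the unit currently being computed


def slots_to_discard(queue: List[QueueSlot], now: datetime) -> List[int]:
    """Return indices of slots whose own deadline has passed.

    A slot is never discarded because a different slot expired.
    """
    return [slot.index for slot in queue if slot.deadline < now]


if __name__ == "__main__":
    queue = [
        QueueSlot(1, datetime(2008, 12, 16, 7, 6)),          # deadline from the log
        QueueSlot(2, datetime(2008, 12, 19), active=True),   # assumed deadline
    ]
    print(slots_to_discard(queue, datetime(2008, 12, 16, 12, 20, 19)))
    # prints: [1]
```

Under these assumptions only the expired unit 1 is dropped, and the unit still being computed is left alone.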

ThunderRd · Post by **ThunderRd** » Thu Dec 18, 2008 3:44 am

Thanks for looking at it, Peter. It's the first time it has happened to me that WUs have actually expired while in the queue, so I wasn't sure.

I'm strongly suspecting that the failed uploads are nearer my end than Stanford's end. The ISP won't tell me it's them, though, so I'm a bit at their mercy for now, unless I can prove it.

eliot1785 · Post by **eliot1785** » Thu Dec 18, 2008 8:42 pm

Actually, I have just started having the exact same problem, in NYC.

I have both the SMP and the nVidia GPU2 client installed, and while the nVidia GPU2 client could upload/download new work about an hour ago, my SMP client has been dying on multiple servers for both upload AND download ever since it completed a WU about 11 hours ago.

This is on multiple servers which are all OK as far as the server stats page goes. I get OK if I navigate to them in my browser. I have restarted my router, computer, and wireless connection, and all of my other internet operations are fine, including GPU2, web browser, FTP client, etc.

EDIT: I have also tried a different internet connection entirely on this laptop, and it still didn't work, so I think I have ruled out the ISP. Also, I am seeing the exact same behavior where the data transfer seems to start and just gradually fades out. You will notice that my log ends with "Receiving Payload," but it never actually gets the payload.

Here's a portion of the log. The SIGTERM is mine as I've started/stopped the SMP client many times trying to get it to work:

Code:


--- Opening Log file [December 18 18:31:32 UTC] 


# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23 Beta R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\folding_older
Executable: C:\Program Files\folding_older\Folding@home-Win32-x86.exe
Arguments: -verbosity 9 -smp -deino 

[18:31:32] - Ask before connecting: No
[18:31:32] - User name: Stephen_Eliot_Dewey (Team 165)
[18:31:32] - User ID: 1980460E6EB77A51
[18:31:32] - Machine ID: 1
[18:31:32] 
[18:31:32] Loaded queue successfully.
[18:31:32] Deleting incompletely fetched item (4) from queue position #9
[18:31:32] - Warning: Could not delete all work unit files (9): Core file absent
[18:31:32] - Preparing to get new work unit...
[18:31:32] - Autosending finished units... [December 18 18:31:32 UTC]
[18:31:32] Trying to send all finished work units
[18:31:32] Project: 2665 (Run 2, Clone 448, Gen 77)
[18:31:32] + Attempting to get work packet


[18:31:32] + Attempting to send results [December 18 18:31:32 UTC]
[18:31:32] - Reading file work/wuresults_08.dat from core
[18:31:32] - Will indicate memory of 3581 MB
[18:31:32] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 6
[18:31:32] - Connecting to assignment server
[18:31:32] Connecting to http://assign.stanford.edu:8080/
[18:31:37] Posted data.
[18:31:44] Initial: 40AB; - Successful: assigned to (171.64.65.64).
  (Read 22288943 bytes from disk)
[18:31:44] + News From Folding@Home: Welcome to Folding@Home
[18:31:44] Connecting to http://171.64.65.64:8080/
[18:31:44] Loaded queue successfully.
[18:31:44] Connecting to http://171.64.65.64:8080/
[18:31:53] Posted data.
[18:31:53] Initial: 0000; - Receiving payload (expected size: 2441338)
[18:32:44] - Couldn't send HTTP request to server
[18:32:44] + Could not connect to Work Server (results)
[18:32:44]     (171.64.65.64:8080)
[18:32:44] + Retrying using alternative port
[18:32:44] Connecting to http://171.64.65.64:80/
[18:36:07] - Couldn't send HTTP request to server
[18:36:07] + Could not connect to Work Server (results)
[18:36:07]     (171.64.65.64:80)
[18:36:07] - Error: Could not transmit unit 08 (completed December 18) to work server.
[18:36:07] - 10 failed uploads of this unit.


[18:36:07] + Attempting to send results [December 18 18:36:07 UTC]
[18:36:07] - Reading file work/wuresults_08.dat from core
[18:36:07]   (Read 22288943 bytes from disk)
[18:36:07] Connecting to http://171.67.108.25:8080/
[18:36:27] - Couldn't send HTTP request to server
[18:36:27] + Could not connect to Work Server (results)
[18:36:27]     (171.67.108.25:8080)
[18:36:27] + Retrying using alternative port
[18:36:27] Connecting to http://171.67.108.25:80/
[18:36:27] - Couldn't send HTTP request to server
[18:36:27]   (Got status 503)
[18:36:27] + Could not connect to Work Server (results)
[18:36:27]     (171.67.108.25:80)
[18:36:27]   Could not transmit unit 08 to Collection server; keeping in queue.
[18:36:27] + Sent 0 of 1 completed units to the server
[18:36:27] - Autosend completed
[20:32:10] Killing all core threads
[20:32:10] Killing 2 cores
[20:32:10] Killing core 0
[20:32:10] Killing core 1

Folding@Home Client Shutdown at user request.
[20:32:10] ***** Got a SIGTERM signal (2)
[20:32:10] Killing all core threads
[20:32:10] Killing 2 cores
[20:32:10] Killing core 0
[20:32:10] Killing core 1

Folding@Home Client Shutdown.


--- Opening Log file [December 18 20:32:26 UTC] 


# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23 Beta R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\folding_older
Executable: C:\Program Files\folding_older\Folding@home-Win32-x86.exe
Arguments: -verbosity 9 -smp -deino 

[20:32:26] - Ask before connecting: No
[20:32:26] - User name: Stephen_Eliot_Dewey (Team 165)
[20:32:26] - User ID: 1980460E6EB77A51
[20:32:26] - Machine ID: 1
[20:32:26] 
[20:32:26] Loaded queue successfully.
[20:32:26] Deleting incompletely fetched item (4) from queue position #9
[20:32:26] - Warning: Could not delete all work unit files (9): Core file absent
[20:32:26] - Preparing to get new work unit...
[20:32:26] + Attempting to get work packet
[20:32:26] - Autosending finished units... [December 18 20:32:26 UTC]
[20:32:26] - Will indicate memory of 3581 MB
[20:32:26] Trying to send all finished work units
[20:32:26] Project: 2665 (Run 2, Clone 448, Gen 77)
[20:32:26] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 6


[20:32:26] - Connecting to assignment server
[20:32:26] + Attempting to send results [December 18 20:32:26 UTC][20:32:26] Connecting to http://assign.stanford.edu:8080/

[20:32:26] - Reading file work/wuresults_08.dat from core
[20:32:26]   (Read 22288943 bytes from disk)
[20:32:26] Connecting to http://171.64.65.64:8080/
[20:32:27] Posted data.
[20:32:27] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[20:32:27] + News From Folding@Home: Welcome to Folding@Home
[20:32:27] Loaded queue successfully.
[20:32:27] Connecting to http://171.64.65.64:8080/
[20:32:32] Posted data.
[20:32:32] Initial: 0000; - Receiving payload (expected size: 2397681)

Post by **bruce** » Fri Dec 19, 2008 12:06 am

eliot1785 wrote:
[18:31:44] Connecting to http://171.64.65.64:8080/
[18:31:44] Loaded queue successfully.
[18:31:44] Connecting to http://171.64.65.64:8080/
[18:31:53] Posted data.
[18:31:53] Initial: 0000; - Receiving payload (expected size: 2441338)
[18:32:44] - Couldn't send HTTP request to server
[18:32:44] + Could not connect to Work Server (results)
[18:32:44] (171.64.65.64:8080)
[18:32:44] + Retrying using alternative port
[18:32:44] Connecting to http://171.64.65.64:80/
[18:36:07] - Couldn't send HTTP request to server
[18:36:07] + Could not connect to Work Server (results)
[18:36:07] (171.64.65.64:80)
[18:36:07] - Error: Could not transmit unit 08 (completed December 18) to work server.

[18:36:07] + Attempting to send results [December 18 18:36:07 UTC]
[18:36:07] - Reading file work/wuresults_08.dat from core
[18:36:07] (Read 22288943 bytes from disk)
[18:36:07] Connecting to http://171.67.108.25:8080/
[18:36:27] - Couldn't send HTTP request to server
[18:36:27] + Could not connect to Work Server (results)
[18:36:27] (171.67.108.25:8080)
[18:36:27] (Got status 503)
[18:36:27] Could not transmit unit 08 to Collection server; keeping in queue.
[18:36:27] + Sent 0 of 1 completed units to the server
[18:36:27] - Autosend completed
[20:32:10] Killing all core threads

There are three important things happening in this log.
1) A new WU is being downloaded from 171.64.65.64
2) A result is being uploaded to 171.64.65.64
3) When the result cannot be uploaded, it is retried to the CS at 171.67.108.25
Since the download and the first upload involve two independent connections to the same server, the messages are difficult to follow (even more so before I edited out some of the messages in your log).

>> Item 3 fails because the collection server is saturated. Hence the 503 error message.
>> I'm not sure why item 2, the upload to 171.64.65.64, failed.
>> For item 1, it might be clearer if you tried uploading without downloading. What happens if you try -send all?

Some types of connections have trouble when you saturate both the upload and download bandwidth simultaneously. These were mostly older dial-up modem connections, but it's not impossible on more modern connections. Nevertheless, there's a good chance that the problems are all at Stanford. The best option you have is to just leave the client alone and let it keep trying on its own schedule rather than trying to force it.
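The order of upload attempts visible in the logs (work server on port 8080, then port 80, then the collection server on the same two ports) can be written down as a short Python sketch. This only mirrors what the logs in this thread show and is not the client's actual retry code. The function name `upload_attempts` is illustrative.

```python
from typing import Iterator, Tuple


def upload_attempts(
    work_server: str,
    collection_server: str,
    ports: Tuple[int, ...] = (8080, 80),
) -> Iterator[Tuple[str, int]]:
    """Yield (host, port) pairs in the order the logs show them being tried."""
    for host in (work_server, collection_server):
        for port in ports:
            yield host, port


if __name__ == "__main__":
    # Addresses taken from the quoted log: work server first, then collection server.
    for host, port in upload_attempts("171.64.65.64", "171.67.108.25"):
        print(f"http://{host}:{port}/")
```

Reading a log with this order in mind makes it easier to separate the upload attempts from the interleaved download messages.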

Post by **kasson** » Fri Dec 19, 2008 12:46 am

@eliot: server-side, I see a "Network send failure" on your WU download. The server was successfully assigning other WUs. Sometimes something goes a bit screwy with the HTTP library, though, so I restarted the server binary. Let me know if it works now.

eliot1785 · Post by **eliot1785** » Fri Dec 19, 2008 1:07 am

@kasson,
Unfortunately it's still not working. Same problem as before.
Granted, my internet connection is not the greatest; it will briefly pause every 5 seconds or so when downloading a large file, but it has been working fine with F@H for the past 6 weeks since I changed connections. This is a cable connection and it seems to have good days and bad days, so maybe this is just a bad day, but it would be the worst one so far.

eliot1785 · Post by **eliot1785** » Fri Dec 19, 2008 1:14 am

Also, I should probably withdraw my claim to have tested this on 2 different internet connections. The other one was a Blackberry tether, and I tried to test it again just now (after your server reset), but I realized I may not have been tethering it correctly. So I can't say for sure that I have ruled out my ISP, though I haven't "ruled it in" either, so to speak. But again, GPU2 was working and so is my browser, etc.

ThunderRd · Post by **ThunderRd** » Fri Dec 19, 2008 1:23 am

FWIW my problems are persistent. I had to sneakernet about a dozen WUs to my home last night so they wouldn't die; I hope that I won't have to continue. There isn't enough time in my life to do that.

I will continue to put pressure on the ISP but I'm at a loss for ideas.
