Failing to SEND completed work unit

Posted: Fri Apr 03, 2020 9:01 pm
by paulmd199
I am aware that your servers are overloaded when assigning new work units. Are they likewise overloaded when receiving completed work units?

I have a completed WU that I have been unable to send, despite the client trying all night. Will completed work units keep accumulating until they are received? If so, is there a limit to this? Will the client eventually give up trying to send?

Filtered log excerpt follows:

Code:

20:23:58:WU01:FS03:0x22:Completed 40000 out of 2000000 steps (2%)
20:24:31:WU03:FS03:Sending unit results: id:03 state:SEND error:NO_ERROR project:11749 run:0 clone:5822 gen:14 core:0x22 unit:0x000000208ca304e75e6bb52dc4161efa
20:24:31:WU03:FS03:Uploading 12.56MiB to 140.163.4.231
20:24:31:WU03:FS03:Connecting to 140.163.4.231:8080
20:24:53:WARNING:WU03:FS03:WorkServer connection failed on port 8080 trying 80
20:24:53:WU03:FS03:Connecting to 140.163.4.231:80
20:25:14:WARNING:WU03:FS03:Exception: Failed to send results to work server: Failed to connect to 140.163.4.231:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
20:27:11:WU01:FS03:0x22:Completed 60000 out of 2000000 steps (3%)
20:30:19:WU01:FS03:0x22:Completed 80000 out of 2000000 steps (4%)
20:31:22:WU03:FS03:Sending unit results: id:03 state:SEND error:NO_ERROR project:11749 run:0 clone:5822 gen:14 core:0x22 unit:0x000000208ca304e75e6bb52dc4161efa
20:31:23:WU03:FS03:Uploading 12.56MiB to 140.163.4.231
20:31:23:WU03:FS03:Connecting to 140.163.4.231:8080
20:31:44:WARNING:WU03:FS03:WorkServer connection failed on port 8080 trying 80
20:31:44:WU03:FS03:Connecting to 140.163.4.231:80
20:32:05:WARNING:WU03:FS03:Exception: Failed to send results to work server: Failed to connect to 140.163.4.231:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
20:33:31:WU01:FS03:0x22:Completed 100000 out of 2000000 steps (5%)
20:36:39:WU01:FS03:0x22:Completed 120000 out of 2000000 steps (6%)
20:39:46:WU01:FS03:0x22:Completed 140000 out of 2000000 steps (7%)
20:42:28:WU03:FS03:Sending unit results: id:03 state:SEND error:NO_ERROR project:11749 run:0 clone:5822 gen:14 core:0x22 unit:0x000000208ca304e75e6bb52dc4161efa
20:42:28:WU03:FS03:Uploading 12.56MiB to 140.163.4.231
20:42:28:WU03:FS03:Connecting to 140.163.4.231:8080
20:42:50:WARNING:WU03:FS03:WorkServer connection failed on port 8080 trying 80
20:42:50:WU03:FS03:Connecting to 140.163.4.231:80
20:42:56:WU01:FS03:0x22:Completed 160000 out of 2000000 steps (8%)
20:42:57:WU03:FS03:Upload 0.50%
20:44:05:WU03:FS03:Upload 0.99%
20:44:05:WARNING:WU03:FS03:Exception: Failed to send results to work server: Transfer failed
Mod Edit: Added Code Tags - PantherX
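
For anyone who wants to tally how often an upload has been retried, here is a minimal Python sketch that counts send attempts and failures per work unit. It assumes log lines shaped like the excerpt above (time, optional WARNING, WU tag, slot tag, message); the function name `summarize_uploads` and the regular expression are illustrative and not part of the FAH client.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpt above:
#   HH:MM:SS:[WARNING:]WUnn:FSnn:message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:)?"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):(?P<msg>.*)$"
)


def summarize_uploads(lines):
    """Count send attempts and failed sends per work unit (hypothetical helper)."""
    stats = defaultdict(
        lambda: {"attempts": 0, "failures": 0, "last_error": None}
    )
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a WU/slot log line
        wu, msg = match.group("wu"), match.group("msg")
        if msg.startswith("Sending unit results:"):
            stats[wu]["attempts"] += 1
        elif msg.startswith("Exception: Failed to send results"):
            stats[wu]["failures"] += 1
            # Keep only the text after "...to work server: "
            stats[wu]["last_error"] = msg.split(": ", 2)[-1]
    return dict(stats)


if __name__ == "__main__":
    sample = [
        "20:42:28:WU03:FS03:Sending unit results: id:03 state:SEND error:NO_ERROR",
        "20:42:50:WARNING:WU03:FS03:WorkServer connection failed on port 8080 trying 80",
        "20:42:56:WU01:FS03:0x22:Completed 160000 out of 2000000 steps (8%)",
        "20:44:05:WARNING:WU03:FS03:Exception: Failed to send results to work server: Transfer failed",
    ]
    for wu, info in summarize_uploads(sample).items():
        print(wu, info)
```

Progress lines from the work unit that is still folding (WU01 here) are ignored, so only units that have actually tried to upload appear in the summary.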

Re: Failing to SEND completed work unit

Posted: Fri Apr 03, 2020 9:06 pm
by PantherX
Welcome to the F@H Forum paulmd199,

In a nutshell, some servers only assign WUs, some only receive WUs, and others both send and receive WUs. When you add in the limitations of bandwidth, you can end up in a situation where sending a completed WU takes time.

The client will keep trying to send a completed WU to the server until the WU's expiration date is reached, and then it stops. The expiration date varies by project.

On my system, I have seen a completed WU take 24 hours to be successfully returned :)

Re: Failing to SEND completed work unit

Posted: Fri Apr 03, 2020 9:10 pm
by X-Wing
I have read that their main issue right now is indeed the reception of completed work units (Note: I'm not affiliated with FAH, I'm a community member just like you). I am in the same boat, with the same error message. Hopefully some of the new servers coming online will be able to reduce the pressure a little bit. By my mental math, FAH is on course to have grown their server capacity by roughly 2.5x over the last few months once all currently listed servers come fully online (10 to 26).

Re: Failing to SEND completed work unit

Posted: Fri Apr 03, 2020 9:24 pm
by paulmd199
Thanks to you both, I will just wait this out and try not to obsess over it too much.

Re: Failing to SEND completed work unit

Posted: Fri Apr 03, 2020 10:17 pm
by paulmd199
Hold on, maybe there is an issue. According to my logs, I'm attempting to connect to 140.163.4.231, which, according to https://apps.foldingathome.org/serverstats, is set to assign, not to accept. I don't know enough about how FaH works to say that this is indeed the problem, but I thought it worth pointing out.

140.163.4.231 plfah1-1.mskcc.org WS 9.6.1 rafal.wiewiora 3,600.00/hr 0 1 No Assign 69,801 69,801 OPENMM_22 5.10TiB 2 days 2020-04-03T22:09:07Z

Re: Failing to SEND completed work unit

Posted: Fri Apr 03, 2020 10:46 pm
by X-Wing
According to a thread about the servers I read recently, "Assign" means both assign and accept. "Accept" means accept only.

Re: Failing to SEND completed work unit

Posted: Sat Apr 04, 2020 3:44 pm
by Joe_H
That is correct. A server is in Accept mode when it has no WUs to send out but can still take returns, when it is nearly full and waiting on some returns before transferring data off to other storage, or for some other reason or combination of reasons.

A Work Server in Assign status is both sending and receiving WUs.
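
The rule described here can be written down as a small lookup. This is only a sketch of the two statuses explained in this thread; the names `STATUS_CAPABILITIES`, `can_accept_results`, and `can_assign_work` are made up for illustration, and any other status shown on the server stats page is treated as unknown rather than guessed at.

```python
# Capabilities per Work Server status, as explained in this thread:
#   "Assign" -> sends out new WUs and receives completed ones
#   "Accept" -> receives completed WUs only
# Any other status is deliberately left out (assumption: unknown).
STATUS_CAPABILITIES = {
    "Assign": {"assigns": True, "accepts": True},
    "Accept": {"assigns": False, "accepts": True},
}


def can_accept_results(status):
    """Return True/False for a known status, or None if the status is unknown."""
    caps = STATUS_CAPABILITIES.get(status)
    return None if caps is None else caps["accepts"]


def can_assign_work(status):
    """Return True/False for a known status, or None if the status is unknown."""
    caps = STATUS_CAPABILITIES.get(status)
    return None if caps is None else caps["assigns"]


if __name__ == "__main__":
    # 140.163.4.231 was listed as "Assign" on the server stats page.
    print(can_accept_results("Assign"))
```

So a server listed as Assign, like 140.163.4.231 above, is expected to take uploads; the failures in the log point to a problem with that server rather than to its listed mode.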

Re: Failing to SEND completed work unit

Posted: Sat Apr 04, 2020 4:58 pm
by astrorob
That particular server (140.163.4.231) seems to have had some problem for a couple of days now. I think there was another thread where Joe_H or another moderator said they were aware of the problem. I've still got one WU that won't upload to that server - it gets to about 1% (very, very slowly) and then fails.

Re: Failing to SEND completed work unit

Posted: Sat Apr 04, 2020 8:18 pm
by paulmd199
My issue follows exactly the pattern astrorob laid out, only now I have two completed WUs in this state. Hope it's fixed before they expire.