I'm having an issue with a WU stalling during upload to 3.21.157.11 whenever it gets to 99%. I've restarted, and I've manually closed the connection via TCPView, but all it does is get back to 99% uploaded and then stall again. There are no messages and no failed-upload signals; it just stops. Meanwhile, other WUs are uploading fine to other servers. Any ideas?
05:57:16:WU03:FS01:Trying to send results to collection server
05:57:16:WU03:FS01:Uploading 78.07MiB to 3.21.157.11
05:57:16:WU03:FS01:Connecting to 3.21.157.11:8080
05:57:22:WU03:FS01:Upload 0.72%
...
06:11:46:WU03:FS01:Upload 99.76%
3.21.157.11 / aws2.foldingathome.org is full. According to apps.*serverstat, it has 0 bytes available. The Work Server for that project is also full. They're installing a larger RAID on the WS and they're moving data off of 3.21.157.11, but both are taking longer than planned.
Unfortunately, there isn't a good way for the servers to tell FAHClient "Don't bother."
18:48:38:WU00:FS01:Connecting to 65.254.110.245:80
18:48:38:WU00:FS01:Assigned to work server 3.21.157.11
18:48:38:WU00:FS01:Requesting new work unit for slot 01: RUNNING cpu:32 from 3.21.157.11
18:48:38:WU00:FS01:Connecting to 3.21.157.11:8080
18:48:54:WU01:FS01:0xa7:Completed 250000 out of 250000 steps (100%)
18:48:55:WU01:FS01:0xa7:Saving result file ../logfile_01.txt
18:48:55:WU01:FS01:0xa7:Saving result file dhdl.xvg
18:48:55:WU01:FS01:0xa7:Saving result file frame163.trr
18:48:55:WU01:FS01:0xa7:Saving result file md.log
18:48:55:WU01:FS01:0xa7:Saving result file pullf.xvg
18:48:55:WU01:FS01:0xa7:Saving result file pullx.xvg
18:48:55:WU01:FS01:0xa7:Saving result file science.log
18:48:55:WU01:FS01:0xa7:Saving result file traj_comp.xtc
18:48:55:WU01:FS01:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
18:48:55:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
18:48:55:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:14820 run:491 clone:0 gen:163 core:0xa7 unit:0x000000c22879986c5eaa282d8cba06cd
18:48:55:WU01:FS01:Uploading 6.79MiB to 40.121.152.108
18:48:55:WU01:FS01:Connecting to 40.121.152.108:8080
18:48:58:WU01:FS01:Upload complete
18:48:58:WU01:FS01:Server responded WORK_ACK (400)
18:48:58:WU01:FS01:Final credit estimate, 10473.00 points
18:48:58:WU01:FS01:Cleaning up
18:50:49:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:50:49:WU00:FS01:Connecting to 3.21.157.11:80
18:53:00:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
18:53:01:WU00:FS01:Connecting to 65.254.110.245:80
18:53:01:WU00:FS01:Assigned to work server 3.21.157.11
18:53:01:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 3.21.157.11
18:53:01:WU00:FS01:Connecting to 3.21.157.11:8080
18:55:11:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:55:11:WU00:FS01:Connecting to 3.21.157.11:80
18:57:23:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
18:57:23:WU00:FS01:Connecting to 65.254.110.245:80
18:57:23:WU00:FS01:Assigned to work server 3.21.157.11
18:57:23:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 3.21.157.11
18:57:23:WU00:FS01:Connecting to 3.21.157.11:8080
18:59:34:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:59:34:WU00:FS01:Connecting to 3.21.157.11:80
Should I reboot?
EDIT: Not a problem. It switched over to another server, and I am back in business.
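For anyone else wondering whether a reboot would help: a quick way to see whether the problem is on your side is to test the two ports the client tries (8080, then 80) from the same machine. Below is a minimal sketch using only Python's standard `socket` module. The function names `can_connect` and `check_ports` are made up for this example and are not part of FAHClient. Note that a successful connection only shows the port is reachable; it says nothing about whether the server has free disk space.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=(8080, 80), timeout=5.0):
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Calling `check_ports("3.21.157.11")` returns a dictionary keyed by port. If both ports fail for that server while the same call against a server that is working for you succeeds, the problem is most likely not local, and rebooting is unlikely to change anything.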
This server is still not working. It always fails, and it is five minutes before I get another server that works.
It is not much time, but since the work units are so short, it is a significant percentage of time lost.
16:50:58:WU00:FS01:Connecting to 65.254.110.245:80
16:50:58:WU00:FS01:Assigned to work server 3.21.157.11
16:50:58:WU00:FS01:Requesting new work unit for slot 01: RUNNING cpu:32 from 3.21.157.11
16:50:58:WU00:FS01:Connecting to 3.21.157.11:8080
16:51:28:WU01:FS01:0xa7:Completed 500000 out of 500000 steps (100%)
16:51:29:WU01:FS01:0xa7:Saving result file ../logfile_01.txt
16:51:29:WU01:FS01:0xa7:Saving result file ener.edr
16:51:29:WU01:FS01:0xa7:Saving result file frame80.trr
16:51:29:WU01:FS01:0xa7:Saving result file md.log
16:51:29:WU01:FS01:0xa7:Saving result file science.log
16:51:29:WU01:FS01:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
16:51:29:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:51:29:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16801 run:2 clone:692 gen:80 core:0xa7 unit:0x0000006382ed0b915e9597062b54b733
16:51:29:WU01:FS01:Uploading 8.56MiB to 130.237.11.145
16:51:29:WU01:FS01:Connecting to 130.237.11.145:8080
16:51:35:WU01:FS01:Upload 21.89%
16:51:41:WU01:FS01:Upload 57.65%
16:51:47:WU01:FS01:Upload 91.22%
16:51:49:WU01:FS01:Upload complete
16:51:49:WU01:FS01:Server responded WORK_ACK (400)
16:51:49:WU01:FS01:Final credit estimate, 23671.00 points
16:51:49:WU01:FS01:Cleaning up
16:53:09:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
16:53:09:WU00:FS01:Connecting to 3.21.157.11:80
16:55:20:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
16:55:20:WU00:FS01:Connecting to 65.254.110.245:80
16:55:21:WU00:FS01:Assigned to work server 69.94.66.7
16:55:21:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 69.94.66.7
16:55:21:WU00:FS01:Connecting to 69.94.66.7:8080
16:55:21:WU00:FS01:Downloading 2.83MiB
16:55:23:WU00:FS01:Download complete
16:55:23:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14377 run:146 clone:0 gen:316 core:0xa7 unit:0x00000160455e42075e8ba7674f4a4d4d
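To put a number on the lost time, a log like the one above can be parsed for the gap between the first "Requesting new work unit" line and the matching "Download complete" line. This is a rough sketch, not an official tool. The line format is inferred only from the excerpts in this thread, and the names `parse_line` and `assignment_delays` are invented for the example. It also strips the terminal color codes that some log copies contain.

```python
import re
from datetime import datetime, timedelta

# Matches terminal color codes, either as a real ESC byte or as the literal text "\x1b".
ANSI_RE = re.compile(r"(?:\x1b|\\x1b)\[[0-9;]*m")
# HH:MM:SS, optional WARNING:/ERROR: tag, WU id, slot id, then the message.
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):(?:(?:WARNING|ERROR):)?(WU\d+):(FS\d+):(.*)$"
)


def parse_line(raw):
    """Return (time, wu, slot, message), or None if the line has another shape."""
    match = LINE_RE.match(ANSI_RE.sub("", raw).strip())
    if match is None:
        return None
    stamp, wu, slot, message = match.groups()
    return datetime.strptime(stamp, "%H:%M:%S"), wu, slot, message


def assignment_delays(lines):
    """List (wu, slot, delay, failures) for each completed work-unit download.

    delay runs from the first work request to "Download complete";
    failures counts "Failed to connect" exceptions seen in between.
    """
    started = {}   # (wu, slot) -> time of the first request
    failures = {}  # (wu, slot) -> failed connection count
    results = []
    for raw in lines:
        parsed = parse_line(raw)
        if parsed is None:
            continue
        when, wu, slot, message = parsed
        key = (wu, slot)
        if message.startswith("Requesting new work unit"):
            started.setdefault(key, when)
            failures.setdefault(key, 0)
        elif "Failed to connect" in message and key in started:
            failures[key] += 1
        elif message == "Download complete" and key in started:
            delay = when - started.pop(key)
            if delay < timedelta(0):
                # The log has no dates, so assume the clock passed midnight once.
                delay += timedelta(days=1)
            results.append((wu, slot, delay, failures.pop(key)))
    return results
```

Feeding it the lines of a log file, for example `assignment_delays(open("log.txt"))` with a placeholder file name, gives one entry per downloaded WU. Keep in mind the first request can be sent while the previous WU is still running, as in the excerpt above, so the delay is an upper bound on actual idle time.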
Last edited by Joe_H on Mon Jul 06, 2020 7:21 pm, edited 1 time in total.
Reason: change Quote tags to Code for log segments