3.21.157.11 upload stalls at 99%

cantanga
Posts: 1
Joined: Fri Jun 26, 2020 11:14 pm

Post by cantanga »

I'm having an issue with a WU stalling during upload to 3.21.157.11 whenever it gets to 99%. I've restarted, and I've manually closed the connection via TCPView, but all it does is get back to 99% uploaded and then stall again. There are no messages and no failed-upload signals; it just stops. Meanwhile, other WUs are uploading fine to other servers. Any ideas?

Code: Select all

05:57:16:WU03:FS01:Trying to send results to collection server
05:57:16:WU03:FS01:Uploading 78.07MiB to 3.21.157.11
05:57:16:WU03:FS01:Connecting to 3.21.157.11:8080
05:57:22:WU03:FS01:Upload 0.72%
...
06:11:46:WU03:FS01:Upload 99.76%
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 3.21.157.11 upload stalls at 99%

Post by bruce »

3.21.157.11 / aws2.foldingathome.org is full. According to the server status page (apps.*serverstat), it has 0 bytes available. The Work Server for that project is also full. They're installing a larger RAID on the WS and they're moving data off of 3.21.157.11, but both are taking longer than planned. :(

Unfortunately, there isn't a good way for the servers to tell FAHClient "Don't bother."
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: 3.21.157.11 upload stalls at 99%

Post by JimF »

I don't know if this is related or not, but I am not getting anything from this server either.

Code: Select all

18:48:38:WU00:FS01:Connecting to 65.254.110.245:80
18:48:38:WU00:FS01:Assigned to work server 3.21.157.11
18:48:38:WU00:FS01:Requesting new work unit for slot 01: RUNNING cpu:32 from 3.21.157.11
18:48:38:WU00:FS01:Connecting to 3.21.157.11:8080
18:48:54:WU01:FS01:0xa7:Completed 250000 out of 250000 steps (100%)
18:48:55:WU01:FS01:0xa7:Saving result file ../logfile_01.txt
18:48:55:WU01:FS01:0xa7:Saving result file dhdl.xvg
18:48:55:WU01:FS01:0xa7:Saving result file frame163.trr
18:48:55:WU01:FS01:0xa7:Saving result file md.log
18:48:55:WU01:FS01:0xa7:Saving result file pullf.xvg
18:48:55:WU01:FS01:0xa7:Saving result file pullx.xvg
18:48:55:WU01:FS01:0xa7:Saving result file science.log
18:48:55:WU01:FS01:0xa7:Saving result file traj_comp.xtc
18:48:55:WU01:FS01:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
18:48:55:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
18:48:55:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:14820 run:491 clone:0 gen:163 core:0xa7 unit:0x000000c22879986c5eaa282d8cba06cd
18:48:55:WU01:FS01:Uploading 6.79MiB to 40.121.152.108
18:48:55:WU01:FS01:Connecting to 40.121.152.108:8080
18:48:58:WU01:FS01:Upload complete
18:48:58:WU01:FS01:Server responded WORK_ACK (400)
18:48:58:WU01:FS01:Final credit estimate, 10473.00 points
18:48:58:WU01:FS01:Cleaning up
18:50:49:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:50:49:WU00:FS01:Connecting to 3.21.157.11:80
18:53:00:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
18:53:01:WU00:FS01:Connecting to 65.254.110.245:80
18:53:01:WU00:FS01:Assigned to work server 3.21.157.11
18:53:01:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 3.21.157.11
18:53:01:WU00:FS01:Connecting to 3.21.157.11:8080
18:55:11:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:55:11:WU00:FS01:Connecting to 3.21.157.11:80
18:57:23:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
18:57:23:WU00:FS01:Connecting to 65.254.110.245:80
18:57:23:WU00:FS01:Assigned to work server 3.21.157.11
18:57:23:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 3.21.157.11
18:57:23:WU00:FS01:Connecting to 3.21.157.11:8080
18:59:34:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:59:34:WU00:FS01:Connecting to 3.21.157.11:80
Should I reboot?

EDIT: Not a problem. It switched over to another server, and I am back in business.

Code: Select all

19:01:45:WU00:FS01:Connecting to 65.254.110.245:80
19:01:45:WU00:FS01:Assigned to work server 3.21.157.11
19:01:45:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 3.21.157.11
19:01:45:WU00:FS01:Connecting to 3.21.157.11:8080
19:03:56:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
19:03:56:WU00:FS01:Connecting to 3.21.157.11:80
19:06:07:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
19:06:07:WU00:FS01:Connecting to 65.254.110.245:80
19:06:07:WU00:FS01:Assigned to work server 206.223.170.146
19:06:07:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 206.223.170.146
19:06:07:WU00:FS01:Connecting to 206.223.170.146:8080
19:06:08:WU00:FS01:Downloading 21.90MiB
19:06:10:WU00:FS01:Download complete
19:06:10:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14216 run:148 clone:1 gen:157 core:0xa7 unit:0x000000bdcedfaa925ea344ea04ddf9d4
19:06:10:WU00:FS01:Starting
19:06:10:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 1678 -checkpoint 15 -np 32
19:06:10:WU00:FS01:Started FahCore on PID 16172
19:06:10:WU00:FS01:Core PID:16176
19:06:10:WU00:FS01:FahCore 0xa7 started
19:06:11:WU00:FS01:0xa7:*********************** Log Started 2020-07-05T19:06:10Z ***********************
19:06:11:WU00:FS01:0xa7:************************** Gromacs Folding@home Core ***************************
19:06:11:WU00:FS01:0xa7:       Type: 0xa7
19:06:11:WU00:FS01:0xa7:       Core: Gromacs
19:06:11:WU00:FS01:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 16172 -checkpoint 15 -np
19:06:11:WU00:FS01:0xa7:             32
19:06:11:WU00:FS01:0xa7:************************************ CBang *************************************
19:06:11:WU00:FS01:0xa7:       Date: Nov 5 2019
19:06:11:WU00:FS01:0xa7:       Time: 06:06:57
19:06:11:WU00:FS01:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
19:06:11:WU00:FS01:0xa7:     Branch: master
19:06:11:WU00:FS01:0xa7:   Compiler: GNU 8.3.0
19:06:11:WU00:FS01:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
19:06:11:WU00:FS01:0xa7:   Platform: linux2 4.19.0-5-amd64
19:06:11:WU00:FS01:0xa7:       Bits: 64
19:06:11:WU00:FS01:0xa7:       Mode: Release
19:06:11:WU00:FS01:0xa7:************************************ System ************************************
19:06:11:WU00:FS01:0xa7:        CPU: AMD Ryzen 9 3950X 16-Core Processor
19:06:11:WU00:FS01:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
19:06:11:WU00:FS01:0xa7:       CPUs: 32
19:06:11:WU00:FS01:0xa7:     Memory: 15.56GiB
19:06:11:WU00:FS01:0xa7:Free Memory: 10.51GiB
19:06:11:WU00:FS01:0xa7:    Threads: POSIX_THREADS
19:06:11:WU00:FS01:0xa7: OS Version: 5.3
19:06:11:WU00:FS01:0xa7:Has Battery: false
19:06:11:WU00:FS01:0xa7: On Battery: false
19:06:11:WU00:FS01:0xa7: UTC Offset: -4
19:06:11:WU00:FS01:0xa7:        PID: 16176
19:06:11:WU00:FS01:0xa7:        CWD: /var/lib/fahclient/work
19:06:11:WU00:FS01:0xa7:******************************** Build - libFAH ********************************
19:06:11:WU00:FS01:0xa7:    Version: 0.0.18
19:06:11:WU00:FS01:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:06:11:WU00:FS01:0xa7:  Copyright: 2019 foldingathome.org
19:06:11:WU00:FS01:0xa7:   Homepage: https://foldingathome.org/
19:06:11:WU00:FS01:0xa7:       Date: Nov 5 2019
19:06:11:WU00:FS01:0xa7:       Time: 06:13:26
19:06:11:WU00:FS01:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
19:06:11:WU00:FS01:0xa7:     Branch: master
19:06:11:WU00:FS01:0xa7:   Compiler: GNU 8.3.0
19:06:11:WU00:FS01:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
19:06:11:WU00:FS01:0xa7:   Platform: linux2 4.19.0-5-amd64
19:06:11:WU00:FS01:0xa7:       Bits: 64
19:06:11:WU00:FS01:0xa7:       Mode: Release
19:06:11:WU00:FS01:0xa7:************************************ Build *************************************
19:06:11:WU00:FS01:0xa7:       SIMD: avx_256
19:06:11:WU00:FS01:0xa7:********************************************************************************
19:06:11:WU00:FS01:0xa7:Project: 14216 (Run 148, Clone 1, Gen 157)
19:06:11:WU00:FS01:0xa7:Unit: 0x000000bdcedfaa925ea344ea04ddf9d4
19:06:11:WU00:FS01:0xa7:Reading tar file core.xml
19:06:11:WU00:FS01:0xa7:Reading tar file frame157.tpr
19:06:11:WU00:FS01:0xa7:Digital signatures verified
19:06:11:WU00:FS01:0xa7:Calling: mdrun -s frame157.tpr -o frame157.trr -x frame157.xtc -cpt 15 -nt 32
19:06:11:WU00:FS01:0xa7:Steps: first=9812500 total=62500
19:06:13:WU00:FS01:0xa7:Completed 1 out of 62500 steps (0%)
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 3.21.157.11 upload stalls at 99%

Post by JohnChodera »

We've solved the disk space issues on aws2.foldingathome.org, so this should no longer be an issue!

~ John Chodera // MSKCC
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: 3.21.157.11 upload stalls at 99%

Post by JimF »

This server is still not working. It always fails, and it is five minutes before I get another server that works.
It is not much time, but since the work units are so short, it is a significant percentage of time lost.

Code: Select all

16:50:58:WU00:FS01:Connecting to 65.254.110.245:80
16:50:58:WU00:FS01:Assigned to work server 3.21.157.11
16:50:58:WU00:FS01:Requesting new work unit for slot 01: RUNNING cpu:32 from 3.21.157.11
16:50:58:WU00:FS01:Connecting to 3.21.157.11:8080
16:51:28:WU01:FS01:0xa7:Completed 500000 out of 500000 steps (100%)
16:51:29:WU01:FS01:0xa7:Saving result file ../logfile_01.txt
16:51:29:WU01:FS01:0xa7:Saving result file ener.edr
16:51:29:WU01:FS01:0xa7:Saving result file frame80.trr
16:51:29:WU01:FS01:0xa7:Saving result file md.log
16:51:29:WU01:FS01:0xa7:Saving result file science.log
16:51:29:WU01:FS01:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
16:51:29:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:51:29:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16801 run:2 clone:692 gen:80 core:0xa7 unit:0x0000006382ed0b915e9597062b54b733
16:51:29:WU01:FS01:Uploading 8.56MiB to 130.237.11.145
16:51:29:WU01:FS01:Connecting to 130.237.11.145:8080
16:51:35:WU01:FS01:Upload 21.89%
16:51:41:WU01:FS01:Upload 57.65%
16:51:47:WU01:FS01:Upload 91.22%
16:51:49:WU01:FS01:Upload complete
16:51:49:WU01:FS01:Server responded WORK_ACK (400)
16:51:49:WU01:FS01:Final credit estimate, 23671.00 points
16:51:49:WU01:FS01:Cleaning up
16:53:09:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
16:53:09:WU00:FS01:Connecting to 3.21.157.11:80
16:55:20:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
16:55:20:WU00:FS01:Connecting to 65.254.110.245:80
16:55:21:WU00:FS01:Assigned to work server 69.94.66.7
16:55:21:WU00:FS01:Requesting new work unit for slot 01: READY cpu:32 from 69.94.66.7
16:55:21:WU00:FS01:Connecting to 69.94.66.7:8080
16:55:21:WU00:FS01:Downloading 2.83MiB
16:55:23:WU00:FS01:Download complete
16:55:23:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14377 run:146 clone:0 gen:316 core:0xa7 unit:0x00000160455e42075e8ba7674f4a4d4d
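
For anyone who wants to put a number on the time lost, here is a rough Python sketch that measures the gap between the first "Assigned to work server" line and the first "Downloading" line for one work unit. The function name idle_time and the regex are my own, not part of FAHClient, and it assumes the log line layout seen in the log above (HH:MM:SS:WUnn:FSnn:message, with an optional WARNING:/ERROR: tag). It is only meant as a starting point.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the logs in this thread:
#   HH:MM:SS:[WARNING:|ERROR:]WUnn:FSnn:message
# search() is used so that stray color codes before the timestamp are ignored.
LINE = re.compile(r"(\d{2}:\d{2}:\d{2}):(?:WARNING:|ERROR:)?(WU\d+):(FS\d+):(.*)")

def idle_time(log_lines, wu="WU00"):
    """Time from the first work-server assignment to the start of the download.

    Returns a timedelta, or None if the log never reaches a download.
    """
    first_assigned = None
    for line in log_lines:
        m = LINE.search(line)
        if not m or m.group(2) != wu:
            continue
        stamp = datetime.strptime(m.group(1), "%H:%M:%S")
        msg = m.group(4)
        if msg.startswith("Assigned to work server") and first_assigned is None:
            first_assigned = stamp
        elif msg.startswith("Downloading") and first_assigned is not None:
            delta = stamp - first_assigned
            if delta < timedelta(0):
                # The log only has times of day, so handle a midnight rollover.
                delta += timedelta(days=1)
            return delta
    return None

# Excerpt of the WU00 lines from the log above.
SAMPLE = [
    "16:50:58:WU00:FS01:Connecting to 65.254.110.245:80",
    "16:50:58:WU00:FS01:Assigned to work server 3.21.157.11",
    "16:50:58:WU00:FS01:Connecting to 3.21.157.11:8080",
    "16:53:09:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80",
    "16:53:09:WU00:FS01:Connecting to 3.21.157.11:80",
    "16:55:20:ERROR:WU00:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out",
    "16:55:21:WU00:FS01:Assigned to work server 69.94.66.7",
    "16:55:21:WU00:FS01:Downloading 2.83MiB",
]

print(idle_time(SAMPLE))  # prints 0:04:23
```

On this excerpt the slot waited a little under four and a half minutes before a download started, which is in line with the "five minutes" estimate.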
Last edited by Joe_H on Mon Jul 06, 2020 7:21 pm, edited 1 time in total.
Reason: change Quote tags to Code for log segments