Time Out Uploading to CS 140.163.4.242

Moderators: Site Moderators, FAHC Science Team

Foxbat
Posts: 95
Joined: Wed Dec 05, 2007 10:23 pm
Hardware configuration: Apple Mac Pro 1,1 2x2.66 GHz Dual-Core Xeon w/10 GB RAM | EVGA GTX 960, Zotac GTX 750 Ti | Ubuntu 14.04 LTS
Dell Precision T7400 2x3.0 GHz Quad-Core Xeon w/16 GB RAM | Zotac GTX 970 | Ubuntu 14.04 LTS
Apple iMac Retina 5K 4.00 GHz Core i7 w/8 GB RAM | OS X 10.11.3 (El Capitan)
Location: Michiana, USA

Post by Foxbat »

I just happened to notice that one of my Folders with three slots had four WUs listed in the Queue (one at 100%), so I thought I'd watch the upload progress. Well, it got to a little over 55% and then hung. After timing out, it switched to a different CS and uploaded fine.

Code:

13:43:34:WU00:FS01:0x21:Completed 5000000 out of 5000000 steps (100%)
                   13:43:35:WU01:FS01:Connecting to 171.67.108.45:80
                   13:43:35:WU01:FS01:Assigned to work server 171.67.108.155
                   13:43:35:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP104 [GeForce GTX 1070] from 171.67.108.155
                   13:43:35:WU01:FS01:Connecting to 171.67.108.155:8080
                   13:43:37:WU01:FS01:Downloading 748.49KiB
                   13:43:37:WU01:FS01:Download complete
                   13:43:37:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9657 run:1 clone:87 gen:59 core:0x18 unit:0x00000046ab436c9b56de69ba30ac958f
13:43:39:WU00:FS01:0x21:Saving result file logfile_01.txt
13:43:39:WU00:FS01:0x21:Saving result file checkpointState.xml
13:43:41:WU00:FS01:0x21:Saving result file checkpt.crc
13:43:41:WU00:FS01:0x21:Saving result file log.txt
13:43:41:WU00:FS01:0x21:Saving result file positions.xtc
13:43:43:WU00:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
13:43:44:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
13:43:44:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11406 run:1 clone:26 gen:233 core:0x21 unit:0x0000012a8ca304f25686b1af9e33aa06
13:43:44:WU00:FS01:Uploading 14.00MiB to 140.163.4.242
13:43:44:WU00:FS01:Connecting to 140.163.4.242:8080
                   13:43:44:WU01:FS01:Starting
                   13:43:44:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18 -dir 01 -suffix 01 -version 704 -lifeline 1156 -checkpoint 8 -gpu 0 -gpu-vendor nvidia
                   13:43:44:WU01:FS01:Started FahCore on PID 6645
                   13:43:44:WU01:FS01:Core PID:6649
                   13:43:44:WU01:FS01:FahCore 0x18 started
                   13:43:44:WU01:FS01:0x18:*********************** Log Started 2016-10-28T13:43:44Z ***********************
                   13:43:44:WU01:FS01:0x18:Project: 9657 (Run 1, Clone 87, Gen 59)
                   13:43:44:WU01:FS01:0x18:Unit: 0x00000046ab436c9b56de69ba30ac958f
                   13:43:44:WU01:FS01:0x18:CPU: 0x00000000000000000000000000000000
                   13:43:44:WU01:FS01:0x18:Machine: 1
                   13:43:44:WU01:FS01:0x18:Reading tar file core.xml
                   13:43:44:WU01:FS01:0x18:Reading tar file integrator.xml
                   13:43:44:WU01:FS01:0x18:Reading tar file state.xml
                   13:43:44:WU01:FS01:0x18:Reading tar file system.xml
                   13:43:44:WU01:FS01:0x18:Digital signatures verified
                   13:43:44:WU01:FS01:0x18:Folding@home GPU core18
                   13:43:44:WU01:FS01:0x18:Version 0.0.4
13:43:50:WU00:FS01:Upload 13.39%
                   13:43:52:WU01:FS01:0x18:Completed 0 out of 2000000 steps (0%)
                   13:43:52:WU01:FS01:0x18:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
13:43:57:WU00:FS01:Upload 24.10%
13:44:03:WU00:FS01:Upload 33.03%
                   13:44:06:WU01:FS01:0x18:Completed 20000 out of 2000000 steps (1%)
13:44:09:WU00:FS01:Upload 44.64%
                   13:44:19:WU01:FS01:0x18:Completed 40000 out of 2000000 steps (2%)
                   13:44:33:WU01:FS01:0x18:Completed 60000 out of 2000000 steps (3%)
                   13:44:46:WU01:FS01:0x18:Completed 80000 out of 2000000 steps (4%)
                   13:44:59:WU01:FS01:0x18:Completed 100000 out of 2000000 steps (5%)
                   13:45:14:WU01:FS01:0x18:Completed 120000 out of 2000000 steps (6%)
                   13:45:27:WU01:FS01:0x18:Completed 140000 out of 2000000 steps (7%)
                   13:45:41:WU01:FS01:0x18:Completed 160000 out of 2000000 steps (8%)
                   13:45:54:WU01:FS01:0x18:Completed 180000 out of 2000000 steps (9%)
                   13:46:07:WU01:FS01:0x18:Completed 200000 out of 2000000 steps (10%)
                   13:46:22:WU01:FS01:0x18:Completed 220000 out of 2000000 steps (11%)
                   13:46:35:WU01:FS01:0x18:Completed 240000 out of 2000000 steps (12%)
                   13:46:48:WU01:FS01:0x18:Completed 260000 out of 2000000 steps (13%)
13:46:58:WU00:FS01:Upload 51.33%
13:46:58:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
13:46:58:WU00:FS01:Trying to send results to collection server
13:46:58:WU00:FS01:Uploading 14.00MiB to 128.252.203.2
13:46:58:WU00:FS01:Connecting to 128.252.203.2:8080
                   13:47:02:WU01:FS01:0x18:Completed 280000 out of 2000000 steps (14%)
13:47:04:WU00:FS01:Upload 11.61%
13:47:12:WU00:FS01:Upload 24.10%
                   13:47:15:WU01:FS01:0x18:Completed 300000 out of 2000000 steps (15%)
13:47:18:WU00:FS01:Upload 32.14%
13:47:24:WU00:FS01:Upload 40.17%
                   13:47:29:WU01:FS01:0x18:Completed 320000 out of 2000000 steps (16%)
13:47:31:WU00:FS01:Upload 51.33%
13:47:38:WU00:FS01:Upload 59.37%
                   13:47:43:WU01:FS01:0x18:Completed 340000 out of 2000000 steps (17%)
13:47:44:WU00:FS01:Upload 67.40%
13:47:52:WU00:FS01:Upload 75.89%
                   13:47:56:WU01:FS01:0x18:Completed 360000 out of 2000000 steps (18%)
13:47:59:WU00:FS01:Upload 81.24%
13:48:06:WU00:FS01:Upload 86.60%
                   13:48:09:WU01:FS01:0x18:Completed 380000 out of 2000000 steps (19%)
13:48:15:WU00:FS01:Upload 97.76%
                   13:48:22:WU01:FS01:0x18:Completed 400000 out of 2000000 steps (20%)
13:48:29:WU00:FS01:Upload complete
13:48:29:WU00:FS01:Server responded WORK_ACK (400)
13:48:29:WU00:FS01:Final credit estimate, 136827.00 points
13:48:29:WU00:FS01:Cleaning up
                   13:48:37:WU01:FS01:0x18:Completed 420000 out of 2000000 steps (21%)
                   13:48:51:WU01:FS01:0x18:Completed 440000 out of 2000000 steps (22%)
                   13:49:04:WU01:FS01:0x18:Completed 460000 out of 2000000 steps (23%)
(I indented WU01 from WU00 to give context to system activity at upload time).
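One rough way to spot a stall like this without watching the screen is to pull the "Upload NN%" lines out of the log and look at the time between consecutive updates. The following is only a minimal sketch: the regular expression assumes the line layout shown in the log above, the function name `upload_gaps` is made up for illustration, and it does not handle a log that crosses midnight.

```python
import re
from datetime import datetime

# Matches client lines such as "13:44:09:WU00:FS01:Upload 44.64%"
# (layout assumed from the log pasted above).
UPLOAD_RE = re.compile(r"(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:Upload (\d+\.\d+)%")

def upload_gaps(lines):
    """Return (seconds_since_previous_update, percent) for each upload line
    after the first one."""
    gaps = []
    prev = None
    for line in lines:
        m = UPLOAD_RE.search(line)
        if not m:
            continue  # not an upload progress line
        ts = datetime.strptime(m.group(1), "%H:%M:%S")
        if prev is not None:
            gaps.append(((ts - prev).total_seconds(), float(m.group(2))))
        prev = ts
    return gaps

sample = [
    "13:44:03:WU00:FS01:Upload 33.03%",
    "13:44:09:WU00:FS01:Upload 44.64%",
    "13:44:19:WU01:FS01:0x18:Completed 40000 out of 2000000 steps (2%)",
    "13:46:58:WU00:FS01:Upload 51.33%",
]
for seconds, percent in upload_gaps(sample):
    print(f"{seconds:6.0f} s before reaching {percent}%")
```

On the lines copied from the log above, the first gap is a few seconds and the second is close to three minutes, which is the hang I saw.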

I'm sure this could be a hiccup in the Internet, but I wanted to make sure that this Collection Server wasn't under the weather or running short of resources. If it helps, I'm on AT&T U-verse ADSL (20 Mbps down, 2 Mbps up).
Edit: It appears the subnet this node is on had some issues back in February: viewtopic.php?f=18&t=28596

Edit: I just concentrated on the "Warnings and Errors" on this node's log and see the following:

Code:

08:20:37:WARNING:WU01:FS00:Failed to get assignment from '<myRouterIPv4>:8080': Failed to connect to <myRouterIPv4>:8080: Connection refused
08:20:38:WARNING:WU01:FS00:Failed to get assignment from '<myRouterIPv4>:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
08:20:38:ERROR:WU01:FS00:Exception: Could not get an assignment
08:59:50:WARNING:WU02:FS02:Failed to get assignment from '171.67.108.45:80': 10001: Server responded: HTTP_INTERNAL_SERVER_ERROR
******************************* Date: 2016-10-28 *******************************
12:01:56:WARNING:WU03:FS00:Detected clock skew (1 mins 39 secs), adjusting time estimates
13:46:58:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
I substituted <myRouterIPv4> for my Router's IPv4 address. For some reason the DNS resolved to my AT&T-provided router when trying to resolve the Assignment server. However, looking at my other Folder, it appears my ever-so-reliable :lol: U-verse connection failed around 08:20 yesterday.
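For anyone who wants to pull the same "Warnings and Errors" view out of a saved log file, something like the following should work. It is a minimal sketch, assuming each entry starts with an HH:MM:SS timestamp followed by the level, as in the excerpt above; the function name `warnings_and_errors` is hypothetical and not part of the client.

```python
import re

# A level marker directly after the timestamp, e.g. "13:46:58:WARNING:..."
# (format assumed from the log excerpt above).
LEVEL_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:(WARNING|ERROR):")

def warnings_and_errors(lines):
    """Keep only the lines logged at WARNING or ERROR level."""
    return [line.strip() for line in lines if LEVEL_RE.match(line.strip())]

sample = [
    "13:46:58:WU00:FS01:Upload 51.33%",
    "13:46:58:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed",
    "13:46:58:WU00:FS01:Trying to send results to collection server",
    "08:20:38:ERROR:WU01:FS00:Exception: Could not get an assignment",
]
for entry in warnings_and_errors(sample):
    print(entry)
```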
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Time Out Uploading to CS 140.163.4.242

Post by bruce »

Normally, a Work Server will drop a connection that has been open for more than 15 minutes. Obviously the size of the WU divided by the speed of the connection is going to vary, but this is only a problem when the WU results get too big or your ISP connection is especially slow. FAH scientists normally manage the maximum size of their results file to minimize this issue.
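As a rough illustration of that size-divided-by-speed estimate, here is a small sketch using the figures from the first post (a 14.00 MiB result and a 2 Mbps uplink). It assumes an ideal link with no protocol overhead or congestion, so real uploads will be slower; the function name is made up for the example.

```python
def upload_seconds(size_mib: float, uplink_mbps: float) -> float:
    """Ideal transfer time in seconds, ignoring overhead and congestion."""
    size_bits = size_mib * 1024 * 1024 * 8        # MiB -> bits
    return size_bits / (uplink_mbps * 1_000_000)  # Mbps -> bits per second

TIMEOUT_S = 15 * 60  # the 15-minute limit mentioned above

estimate = upload_seconds(14.00, 2.0)
print(f"ideal upload time: {estimate:.1f} s (limit {TIMEOUT_S} s)")
```

With those numbers the ideal time is under a minute, so a result of this size should fit comfortably inside the limit on a healthy connection.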

Slow connections are becoming rarer, and anyone who has one can use the max-packet-size setting to avoid larger WUs, so you should be able to manage this yourself.

Nevertheless, there will be some exceptions. Whenever an upload is aborted (for any reason, including a timeout) the WU is redirected to a Collection Server and that CS should have a longer timeout than the WS. That's apparently what's happening to you.

If WUs from specific projects are not able to upload to a particular CS, let us know.

Attempting to upload a result to your router just demonstrates that FAH is designed NOT to be used that way. I don't see anything useful there.