171.67.108.158 again

Posted: Wed Aug 14, 2019 8:13 am
by Blue_Bubble
Over the last couple of days, several of my systems at various locations have become stuck in "Downloading".

The common feature appears to be the work server at 171.67.108.158 again:

Code:
19:16:25:WU01:FS00:0xa7:Completed 4900000 out of 5000000 steps (98%)
19:24:55:WU01:FS00:0xa7:Completed 4950000 out of 5000000 steps (99%)
19:24:56:WU00:FS00:Connecting to 65.254.110.245:8080
19:24:56:WU00:FS00:Assigned to work server 171.67.108.158
19:24:56:WU00:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 171.67.108.158
19:24:56:WU00:FS00:Connecting to 171.67.108.158:8080
19:33:20:WU01:FS00:0xa7:Completed 5000000 out of 5000000 steps (100%)
19:33:20:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
19:33:20:WU01:FS00:0xa7:Saving result file ener.edr
19:33:20:WU01:FS00:0xa7:Saving result file frame77.trr
19:33:21:WU01:FS00:0xa7:Saving result file md.log
19:33:21:WU01:FS00:0xa7:Saving result file science.log
19:33:21:WU01:FS00:0xa7:Saving result file traj_comp.xtc
19:33:21:WU01:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
19:33:21:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
19:33:21:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:14153 run:8 clone:67 gen:77 core:0xa7 unit:0x0000005c0002894b5c546d39bf0634fb
19:33:21:WU01:FS00:Uploading 3.12MiB to 155.247.166.219
19:33:21:WU01:FS00:Connecting to 155.247.166.219:8080
19:33:25:WU01:FS00:Upload complete
19:33:25:WU01:FS00:Server responded WORK_ACK (400)
19:33:25:WU01:FS00:Final credit estimate, 6468.00 points
19:33:25:WU01:FS00:Cleaning up
******************************* Date: 2019-08-14 *******************************
07:11:12:FS00:Paused
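
For anyone checking several machines, here is a rough sketch of how a log like the one above could be scanned for work units whose last logged line is a connection attempt, which is what a slot stuck in "Downloading" looks like here. The line format is assumed from this excerpt only, and the function name `find_stalled_connections` is hypothetical, not part of any FAH tooling.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "19:24:56:WU00:FS00:Connecting to 171.67.108.158:8080"
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:(WU\d+):FS\d+:(.*)$")
CONNECT_RE = re.compile(r"^Connecting to (\S+)$")


def find_stalled_connections(log_lines):
    """Return {work_unit: server} for work units whose last line is a connect attempt."""
    last_message = {}
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            # Later lines overwrite earlier ones, so only the final message is kept
            last_message[match.group(1)] = match.group(2)

    stalled = {}
    for work_unit, message in last_message.items():
        connect = CONNECT_RE.match(message)
        if connect:
            stalled[work_unit] = connect.group(1)
    return stalled
```

Run against the log above, this would flag WU00 with 171.67.108.158:8080, since nothing is logged for that work unit after the connection attempt. A work unit that has only just started connecting would be flagged too, so the result is a hint rather than proof of a hang.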

Re: 171.67.108.158 again

Posted: Wed Aug 14, 2019 12:12 pm
by toTOW
The v7 client has always had trouble recovering from network events ... it could be caused by the work server, or not ...

When a slot is stuck in "Downloading", the only way to clear it is to restart (kill) the client or reboot the whole machine.

Re: 171.67.108.158 again

Posted: Wed Aug 14, 2019 6:22 pm
by toTOW
I can confirm that there is something wrong here. I have a client stuck at the same point ... :(

I notified the server owner.

Re: 171.67.108.158 again

Posted: Wed Aug 14, 2019 7:10 pm
by bruce
Can you ping 171.67.108.158?
Does FAH recover if you restart it?

FAHClient generally does not recover from network interruptions if they happen while it's talking to a server.
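
A ping only shows that the host answers ICMP; it says nothing about whether the work server's port accepts connections. As a rough sketch, assuming port 8080 as seen in the log earlier in this thread, a TCP check with a timeout could look like this (the helper name `can_connect` is made up for illustration):

```python
import socket


def can_connect(host, port=8080, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection raises OSError (timeouts included) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("171.67.108.158")` returns False if the connection is refused or times out. A True result only means the port accepted the connection, not that the server is actually handing out work.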

Re: 171.67.108.158 again

Posted: Wed Aug 14, 2019 8:02 pm
by toTOW
It's a dedicated server, so I don't think the network is to blame ... looking at the server status page, 171.67.108.158 is getting many clients assigned to it. I guess it's overloaded and can't handle them all.

[Image: screenshot of the server status page]

But yes, I rebooted after installing kernel updates, and it got work from another FAH server.

Re: 171.67.108.158 again

Posted: Wed Aug 14, 2019 9:54 pm
by toTOW
Joseph acknowledged the issue.
jcoffland wrote: Yes, something is wrong with vspd4. I'm looking into it now.

Re: 171.67.108.158 again

Posted: Thu Aug 15, 2019 6:35 pm
by cmhbob
I'm also affected. Any update?

Re: 171.67.108.158 again

Posted: Thu Aug 15, 2019 7:45 pm
by bruce
Earlier, 171.67.108.158 was effectively down (not responding to HTTP connections) but still receiving a lot of attempted connections from clients, as you've shown in the screen-grab above. Apparently "assign rate" is the frequency at which clients are directed to that WS, not the number of successful assignments.

As far as I can tell, it has been restarted and has been functioning correctly since about 07:40 PM Stanford time yesterday. HTTP connections are being accepted, so hopefully WUs are being assigned.

Re: 171.67.108.158 again

Posted: Thu Aug 15, 2019 11:00 pm
by toTOW
Joseph put the server in Accept-only mode, so no clients should be assigned to it.

cmhbob wrote: I'm also affected. Any update?

Restart the client or reboot your machine.

Re: 171.67.108.158 again

Posted: Fri Aug 16, 2019 2:06 am
by bruce
Notice that the projects from 171.67.108.158 do not appear on https://apps.foldingathome.org/psummary

Re: 171.67.108.158 again

Posted: Fri Aug 16, 2019 3:12 am
by cmhbob
I rebooted earlier and things are running fine now.