
[Bug] Client got stuck after initiating connection to WS

Posted: Sat Apr 18, 2020 4:30 pm
by Frogging101
My FAHClient has been stuck since 21:32:00 UTC, yesterday (April 17, 2020).

The last lines in the log were as follows:

Code:

21:32:00:WU00:FS00:Connecting to 18.218.241.186:80
21:32:00:WU00:FS00:Assigned to work server 13.90.152.57
21:32:00:WU00:FS00:Requesting new work unit for slot 00: READY cpu:24 from 13.90.152.57
21:32:00:WU00:FS00:Connecting to 13.90.152.57:8080
Upon inspection, it appears that this connection is still in the ESTABLISHED state, some 15 hours later:

Code:

Netid  State      Recv-Q Send-Q Local Address:Port                 Peer Address:Port                
tcp    ESTAB      0      0      10.176.100.154:38750              13.90.152.57:8080                users:(("FAHClient",pid=30164,fd=12))
My theory is that this connection is no longer actually alive: it died silently (without an RST) while the client was waiting for a response, which it will now never receive.
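
To illustrate the theory above, here is a minimal Python sketch of how a client can avoid waiting forever on a peer that has gone silent. It is not FAHClient's code, and nothing here reflects how FAHClient handles its sockets. The names `recv_with_timeout` and `start_silent_server` are hypothetical, and the local "silent server" is only a stand-in for a work server that accepts a connection and then never answers.

```python
import socket
import threading


def recv_with_timeout(host, port, request, timeout_s=2.0):
    """Send a request and wait at most timeout_s for a reply.

    Returns the reply bytes, or None if the peer stayed silent.
    (Hypothetical helper for illustration only.)
    """
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        # Ask the kernel to send TCP keepalive probes on this socket, so a
        # dead peer is eventually noticed even when no data is flowing.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.sendall(request)
        try:
            # Without a timeout this recv() would block indefinitely,
            # which matches the stuck ESTABLISHED socket described above.
            return sock.recv(4096)
        except socket.timeout:
            return None


def start_silent_server():
    """Start a local server that accepts one connection and never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    held = []  # keep the accepted socket referenced so it stays open

    def accept_and_hold():
        conn, _ = srv.accept()
        held.append(conn)

    threading.Thread(target=accept_and_hold, daemon=True).start()
    return srv, held
```

With a read timeout in place, the caller gets control back and can retry or pick another server instead of sitting in a blocking read. Keepalive alone only helps after the operating system's probe interval has elapsed, which by default can be hours.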

I could not clear this condition by pausing/unpausing, or using the request-id or request-ws commands. request-ws did connect to an AS and get an "Assigned to work server 150.136.14.110" message, but nothing else happened after that. The original socket to 13.90.152.57:8080 remained open in the ESTABLISHED state throughout these attempts to jog the client.

The FAHControl UI continued to display 13.90.152.57 as the work server, with no next attempt. Here's the full queue-info output:

Code:

  {"id": "00", "state": "DOWNLOAD", "error": "NO_ERROR", "project": 0, "run": 0, "clone": 0, "gen": 0, "core": "unknown", "unit": "0x00000000000000000000000000000000", "percentdone": "0.00%", "eta": "0.00 secs", "ppd": "0", "creditestimate": "0", "waitingon": "", "nextattempt": "0.00 secs", "timeremaining": "unknown time", "totalframes": 0, "framesdone": 0, "assigned": "<invalid>", "timeout": "<invalid>", "deadline": "<invalid>", "ws": "13.90.152.57", "cs": "0.0.0.0", "attempts": 0, "slot": "00", "tpf": "0.00 secs", "basecredit": "0"}                             
I had to restart to get it folding again. Sending SIGINT once did not close the client; the 13.90.152.57:8080 socket remained open. I had to send it again to force exit.

Re: [Bug] Client got stuck after initiating connection to WS

Posted: Sat Apr 18, 2020 7:54 pm
by Jan
I know it sounds like a bit of a weird solution, but here is a report that might help. If you are already on the newest client, 7.6.9, maybe try a reboot. If that doesn't work, here is another approach.

Re: [Bug] Client got stuck after initiating connection to WS

Posted: Sat Apr 18, 2020 9:46 pm
by PantherX
It seems that you have encountered this known bug: https://github.com/FoldingAtHome/fah-issues/issues/983

I will add this link too.