Linux reset procedure on stalled WUs due to FAH server error


MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm


Post by MeeLee »

If you've experienced stalled WUs due to FAH server errors, there are a couple of ways to go about recovering.
I am trying to better understand the procedures and find an optimized way to restart the system without wasting too much time.

If you're like me, folding on Linux with overclocked GPUs and custom fan speeds, you'll know that redoing the whole setup takes a few minutes of your time.
You may have tried to restart FAHClient; however, for some reason this doesn't seem to work. FAHControl is still stuck on the previous WUs and new slots aren't available, causing FAHControl to freeze (in a state where no GPUs are displayed, waiting for web input).
'/etc/init.d/FAHClient log' is occasionally able to recapture the database, but in most cases you'll be greeted with a 'database locked' message, and FAH can't continue.
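
To check for the 'database locked' condition without watching the log by hand, something like the following might help. It is only a sketch: the log path is an assumption about a default install, the search pattern is a guess at the exact wording of the message, and `find_db_lock` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: report whether a FAHClient log mentions a locked database.
# The default log path is an assumption; adjust it for your install.
FAH_LOG="${FAH_LOG:-/var/lib/fahclient/log.txt}"

# Prints the last matching log line and returns 0 if a lock message is found.
# Returns 1 if there is no match and 2 if the log cannot be read.
# The pattern is a guess at the message wording, matched case-insensitively.
find_db_lock() {
    log_file="${1:-$FAH_LOG}"
    [ -r "$log_file" ] || return 2
    grep -i 'database.*locked' "$log_file" | tail -n 1 | grep .
}
```

Running `find_db_lock && echo "lock detected"` after a failed restart would then tell you whether it is worth moving on to one of the procedures below.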

1- The most common procedure is to just restart your PC.
The drawback is twofold.
a. At shutdown, FAHClient hangs, and the PC will take several minutes to shut down unless you 'hard reset' it.
b. At startup, the OC settings, fan speeds, and power cap levels need to be readjusted, taking well over 5 minutes for a single GPU, and 10 minutes for multi-GPU systems.

2- Dump the slots and reintroduce them.
The pros are that GPU fan curves, OC values, and power caps don't need to be readjusted.
This works for occasionally stalled WUs (e.g. in a 4-GPU setup, if 1 or 2 GPUs are stalled).
The con of this method is that you might dump a fully processed, and perhaps perfectly good, WU (@Bruce, any idea if this is the case?).

3- Kill FAHClient.
Rather than the above 2 solutions, I have found it far easier to just kill FAHClient.
Lubuntu offers 'qps' as its task manager (run "sudo qps" in a terminal to get root elevation);
Ubuntu probably uses gnome-system-monitor.
Once it is started elevated (as root), right-click on each fahclient process and kill it.
If you are running headless, you can also use 'sudo top -u fahclient' to find the process IDs (PIDs), and use those to kill the FAH processes ("sudo kill #####", in which '#####' is the PID of a FAH process).
Then go back to the terminal and start FAHClient again (sudo /etc/init.d/FAHClient start).
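
For headless boxes, the kill step above could be scripted roughly like this. It is a sketch, not a tested recipe. It assumes every FAH process runs as the 'fahclient' user (as the 'top -u fahclient' command implies) and that `pgrep` from procps is installed. The helper names are made up. It tries a plain SIGTERM first and only falls back to SIGKILL for processes that ignore it.

```shell
#!/bin/sh
# Sketch only: stop every process owned by the FAH service account.
# FAH_USER is taken from the 'top -u fahclient' command in the post.
FAH_USER="${FAH_USER:-fahclient}"

# List PIDs owned by the given user, one per line (pgrep ships with procps).
list_user_pids() {
    pgrep -u "$1"
}

# Send SIGTERM to the given PIDs, wait up to $1 seconds for them to exit,
# then send SIGKILL to whatever is left.
terminate_pids() {
    grace="$1"; shift
    [ "$#" -gt 0 ] || return 0
    kill -TERM "$@" 2>/dev/null || true
    while [ "$grace" -gt 0 ]; do
        alive=0
        for pid in "$@"; do
            kill -0 "$pid" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$@" 2>/dev/null || true
    return 0
}

# Example (run as root); nothing is killed unless you call it yourself:
#   terminate_pids 10 $(list_user_pids "$FAH_USER")
#   /etc/init.d/FAHClient start
```

Killing by user rather than by process name sidesteps the question of which names to target, at the cost of also hitting anything else that happens to run as that user.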

I have found this works best, without rebooting the system and without the risk of throwing away any processed WUs.
It releases the database lock on the WUs.
FAHControl will (*should) show that all inactive GPU slots are downloading actual WUs, and start them as soon as they're ready.

I have yet to fine-tune my solution as to exactly which process names need to be killed; in my case (4-GPU system) there are 6 FAHClient processes.
The reason I didn't go with htop is that htop shows A LOT of FAHClient PIDs, and it would be hard to determine which ones to kill.
But perhaps once I find out exactly which process names need to be killed, even htop can be used.
goodyca
Posts: 187
Joined: Sun Dec 02, 2007 12:36 pm

Re: Linux reset procedure on stalled WUs due to FAH server e

Post by goodyca »

I am running Fedora 30. What works best for me:

1. pause the client that has a hung WU
2. delete the offending WU from /var/lib/fahclient/work
3. return client to folding
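
Step 2 could be wrapped in a small guard so a typo doesn't wipe the wrong directory. This is a sketch under assumptions: the work path is the one given above, while the idea that each WU sits in its own subdirectory, the name '01', and the `remove_wu` helper are all hypothetical. Pausing and resuming are still done by hand in FAHControl.

```shell
#!/bin/sh
# Sketch only. WORK_DIR comes from the post above; the per-WU subdirectory
# layout (one directory per queued WU) is an assumption.
WORK_DIR="${WORK_DIR:-/var/lib/fahclient/work}"

# Delete one WU directory under the work directory.
# Refuses empty names, dot entries, and anything containing a path separator.
remove_wu() {
    work_dir="$1"
    wu_name="$2"
    case "$wu_name" in
        ''|.|..|*/*) echo "refusing to remove '$wu_name'" >&2; return 1 ;;
    esac
    target="$work_dir/$wu_name"
    [ -d "$target" ] || { echo "no such WU directory: $target" >&2; return 1; }
    rm -rf -- "$target"
}

# Example (run as root, with the affected slot already paused):
#   remove_wu "$WORK_DIR" 01    # '01' is a hypothetical WU directory name
```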
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Linux reset procedure on stalled WUs due to FAH server e

Post by MeeLee »

goodyca wrote:I am running Fedora 30. What works best for me:

1. pause the client that has a hung WU
2. delete the offending WU from /var/lib/fahclient/work
3. return client to folding
This works a few times.
If the client hangs again, you'll have to redo the procedure. You can only redo it so many times before the client hangs for good, and a restart or kill of the service is needed.


I played around with the settings a bit, and QPS shows 2 FAH sessions (one being a config.xml service, the other the client session) + 1 session per GPU (with 4 GPUs, that makes 6 session PIDs that need to be killed).
All of them need to be killed, or restarting the service won't work (the previous OpenCL session would still be using the GPUs).
Killing the XML PID didn't close FAHClient; however, I presume that whenever a new WU is started, it might start as user 'anonymous'.
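
As a quick sanity check on that count, the arithmetic (2 base sessions plus one per GPU) can be compared with what is actually running. This is a sketch based only on the numbers seen on this one machine. The helper names are made up, and `pgrep` and the 'fahclient' user are assumed as before.

```shell
#!/bin/sh
# Expected number of FAH processes: 2 base sessions plus one per GPU,
# following the counts observed in the post above.
expected_fah_procs() {
    echo $((2 + $1))
}

# Count the processes owned by a user (defaults to 'fahclient').
count_user_procs() {
    pgrep -u "${1:-fahclient}" | wc -l
}

# Example: on a 4-GPU box, compare
#   expected_fah_procs 4
# with
#   count_user_procs fahclient
```

If the second number is still above zero after stopping the service, some sessions survived and need to be killed before a restart will work.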

Another option I've been considering (and will try out later) is to run 'sudo /etc/init.d/FAHClient stop', and then kill the remaining OpenCL processes in the task manager (getting the PID info via 'sudo top -u fahclient' and using 'sudo kill #PIDnumber#', or using QPS in my case).