Client fails after suspend

Postby ajgringo619 » Fri Feb 14, 2020 6:23 am

[Running Linux Mint 19.3, GeForce GTX 960 w/Nvidia v440.59]

I've searched high and low for a cure for this problem but have come up empty. The client runs fine after a reboot, after a cold start, or after unpausing while the system has stayed up. However, every time I bring my system out of suspend, the client refuses to run and logs these messages repeatedly:
05:01:19:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
05:01:19:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:11737 run:0 clone:8234 gen:19 core:0x22 unit:0x0000001f8ca304f15e4065468524abb0
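
To confirm that the failure repeats on every attempt, the warning lines can be counted per slot. The following is a minimal sketch that assumes the line layout seen in the single warning above (time, WU id, slot id, result name, decimal and hex code); the function name and pattern are illustrative and not part of FAHClient.

```python
import re

# Pattern inferred from the one warning line quoted above; other FAHClient
# log lines may use a different layout, so treat this as an assumption.
CORE_RETURN = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<result>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def count_core_failures(lines, result="BAD_WORK_UNIT"):
    """Return {slot: count} for lines where a FahCore returned `result`."""
    counts = {}
    for line in lines:
        m = CORE_RETURN.match(line.strip())
        if m and m.group("result") == result:
            slot = m.group("slot")
            counts[slot] = counts.get(slot, 0) + 1
    return counts


# The two log lines from the post above.
sample = [
    "05:01:19:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "05:01:19:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:11737 run:0 clone:8234 gen:19 core:0x22 "
    "unit:0x0000001f8ca304f15e4065468524abb0",
]
print(count_core_failures(sample))
```

Feeding it the full client log after a resume would show whether only the GPU slot (FS00 here) is affected.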

It doesn't matter if I reload the FAHClient daemon or pause/unpause the client: unless I reboot, the client is stuck. Even if I pause the client before suspending, it still chokes when coming back; this also occurs if I suspend after the client finishes a run (put in finish mode).

Is there a file/directory I can purge to fix this? Rebooting all the time is not an option.

One more note: this has been going on for months and at least 4 different versions of the Nvidia drivers.
ajgringo619
 
Posts: 2
Joined: Fri Feb 14, 2020 6:14 am

Re: Client fails after suspend

Postby bruce » Fri Feb 14, 2020 6:15 pm

In Windows, I've observed the following. It is likely that the same applies to Linux, but I have no proof so you'll have to decide if it does or not.

When Windows enters a suspended state, a memory image is stored, along with whatever register states are required to resume CPU processing at the same state that existed before the system was suspended. Each active process then has to refresh its on-screen information so that the desktop can be put back into the pre-Suspend state.

Note that there's no "please checkpoint the GPU" functionality built into the suspend process. FAHClient and the FAHCores have nothing like a Refresh function (they show nothing on the screen). Note that the state that the GPU was in prior to the suspend is not preserved by the suspend function but is regenerated by multiple Refresh functions ... except for whatever the FAHCore has allocated and initialized within the GPU. FAH's active GPU processes ("kernels") can probably be resumed from whatever state they were in as long as power is not removed from the GPU (unproven!).

For example, at the instant that the Suspend is initiated, suppose a GPU task has been submitted to the GPU by FAHCore_2* but it has not been completed ... so the result has not been returned to main RAM. The FAHCore will still be waiting for that result to be returned.

Let's remember that the GPU is treated as a co-processor, not as an independent Operating System.

Could an enhancement be added to the FAHCore code to track the state of the pending in-VRAM processes ("kernels") and to refresh them from data previously stored by Suspend? Probably. Would it be a complex / major enhancement? Probably. Since it would actually be used very rarely, IMHO, FAH_development is unlikely to consider it worth the development costs.
bruce
 
Posts: 19637
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Client fails after suspend

Postby ajgringo619 » Fri Feb 14, 2020 6:49 pm

This makes a lot of sense. In the beginning, I was only doing CPU-based FAH runs and this problem never occurred. But since my CPU is so slow, I switched to GPU-based.
ajgringo619
 
Posts: 2
Joined: Fri Feb 14, 2020 6:14 am

Re: Client fails after suspend

Postby bruce » Fri Feb 14, 2020 6:55 pm

If Linux is initiating the Suspend, there's not much you can do about it. If you're manually initiating the Suspend, you can Pause the GPU slot before Suspending.
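
On a systemd-based distribution, the pause-before-suspend step could be automated with a sleep hook. The sketch below assumes that the v7 `FAHClient` binary is on the PATH and accepts the `--send-pause` and `--send-unpause` options (check `FAHClient --help` on your install), and that the script is placed, executable, in systemd's system-sleep hook directory, whose path varies by distribution. systemd calls such hooks with `pre` or `post` as the first argument. Note that the original poster reported that pausing by hand did not help in their case, so this only automates the suggestion and is not a confirmed fix.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Assumption: FAHClient is on PATH and supports these two options.
FAHCLIENT = "FAHClient"


def command_for(phase):
    """Map a systemd-sleep phase ("pre" or "post") to a FAHClient command."""
    if phase == "pre":
        # About to suspend: ask the running client to pause.
        return [FAHCLIENT, "--send-pause"]
    if phase == "post":
        # Just resumed: ask the running client to unpause.
        return [FAHCLIENT, "--send-unpause"]
    return None  # Unknown or missing phase: do nothing.


if __name__ == "__main__":
    # systemd invokes sleep hooks as: <script> pre|post suspend|hibernate|...
    cmd = command_for(sys.argv[1] if len(sys.argv) > 1 else "")
    if cmd is not None:
        subprocess.run(cmd, check=False)
```

Sent this way, the pause applies to the whole client rather than to the GPU slot alone; targeting only the GPU slot is left out here because the exact option syntax for that is not confirmed.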
bruce
 
Posts: 19637
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Client fails after suspend

Postby gbowman » Thu Feb 20, 2020 6:20 pm

Yes, issues like this would be challenging to fix in the current client. We're putting our development effort into a new client that will resolve many of the various issues that have come up lately, and be easier to update/maintain.
gbowman
Pande Group Member
 
Posts: 208
Joined: Fri Nov 30, 2007 10:51 pm


Return to V7.5.1 Public Release Windows/Linux/MacOS X
