Confirmation before automatic dumping

Moderators: Site Moderators, FAHC Science Team

arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Confirmation before automatic dumping

Post by arisu »

If I switch virtual terminals while folding and an OpenGL context is open, the GPU driver resets itself. When folding is subsequently paused, the client sends SIGINT to the running cores. The GPU core does not respond in time (since the GPU was ripped out from under it, leaving it unresponsive), so the client kills the core forcibly and dumps the WU. But nothing at all was wrong with the WU, and it could easily have continued from the last checkpoint.

How do I stop the client from automatically dumping the WU on a "Core did not shutdown gracefully" warning when the GPU resets? Asking for confirmation, or at least confirmation with an automatic timeout, would help when the core dies for reasons that are clearly not a bad WU. Is there any way to do this in the v8 client, or would I have to patch and recompile it? I've lost quite a few perfectly good WUs because of this.
calxalot
Site Moderator
Posts: 1375
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA
Contact:

Re: Confirmation before automatic dumping

Post by calxalot »

I think this might be a good enhancement request.

Both client and web control would need to support this.

https://github.com/FoldingAtHome/fah-cl ... tet/issues

Workaround is, of course, to manually pause before doing something that resets the GPU.
calxalot
Site Moderator
Posts: 1375
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA
Contact:

Re: Confirmation before automatic dumping

Post by calxalot »

Hmm. In some cases, it might require a reboot to get the GPU driver happy again.
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

It's not always possible to know what will reset the GPU. If it does reset, though, a reboot isn't needed. But yes, there are cases where it crashes and can't reset itself, and in those cases the WU would end up getting dumped pretty quickly.

Maybe I'll write and test a patch and if it works well, I'll submit a PR.
calxalot
Site Moderator
Posts: 1375
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA
Contact:

Re: Confirmation before automatic dumping

Post by calxalot »

The only thing I have heard is that Windows RDP can reset drivers.
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

Linux amdgpu can also reset and recover itself:

Code: Select all

[533240.381281] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32770, for process FahCore_26 pid 479221 thread FahCore_26 pid 479221)
[533240.381345] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000760000000000 from client 10
[533240.381373] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00801A30
[533240.381397] amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: SDMA0 (0xd)
[533240.381419] amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[533240.381437] amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[533240.381456] amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[533240.381475] amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[533240.381494] amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
[533240.488899] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
[533240.489502] amdgpu: failed to add hardware queue to MES, doorbell=0x1202
[533240.489529] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[533240.489556] amdgpu: Failed to restore queue 2
[533240.489575] amdgpu: Failed to restore process queues
[533240.489595] amdgpu: Failed to restore queues of pasid 0x8002
[533240.493583] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[533240.653995] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.654165] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533240.760268] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.760418] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533240.867906] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.868065] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533240.974299] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.974447] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.080514] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.080659] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.188208] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.188369] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.294664] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.294815] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.401317] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.401601] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.508607] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.508873] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.525954] amdgpu 0000:03:00.0: amdgpu: MODE2 reset
[533241.558631] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[533241.559697] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[533241.559788] [drm] VRAM is lost due to GPU reset!
[533241.559794] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[533241.562017] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[533241.562346] [drm] kiq ring mec 3 pipe 1 q 0
[533241.564223] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[533241.564592] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[533241.567010] [drm] DMUB hardware initialized: version=0x08004800
[533241.936214] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[533241.936228] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[533241.936235] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[533241.936241] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[533241.936246] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[533241.936251] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[533241.936257] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[533241.936262] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[533241.936267] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[533241.936273] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[533241.936279] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 1
[533241.936284] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 1
[533241.936290] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[533242.205967] amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
[533242.205987] amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
[533242.206648] [drm] ring gfx_32776.1.1 was added
[533242.206972] [drm] ring compute_32776.2.2 was added
[533242.207338] [drm] ring sdma_32776.3.3 was added
[533242.207362] [drm] ring gfx_32776.1.1 test pass
[533242.207388] [drm] ring gfx_32776.1.1 ib test pass
[533242.207404] [drm] ring compute_32776.2.2 test pass
[533242.207432] [drm] ring compute_32776.2.2 ib test pass
[533242.207449] [drm] ring sdma_32776.3.3 test pass
[533242.207483] [drm] ring sdma_32776.3.3 ib test pass
[533242.208645] amdgpu 0000:03:00.0: amdgpu: GPU reset(9) succeeded!
muziqaz
Posts: 1324
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: Confirmation before automatic dumping

Post by muziqaz »

Recovery only brings back basic driver functionality; it never reloads all the modules the GPU needs to keep working as it did before the crash. This has been witnessed with games (they run like crap on recovered drivers) and with FAH (it outright cannot fold on recovered drivers).
Driver recovery is there for you to save your work (not FAH's) and restart the PC gracefully.
Why can't FAH restart from the previous checkpoint and continue folding? Because the driver is in recovery mode, which is not fully functional, and that stops folding on such a hardware/software combination.
As for pausing when a driver crash happens and letting the user reboot and resume the WU: who is to say the driver crash did not introduce an error into the scientific simulation? And FAH can't check whether the previous bit of the simulation is correct.
I think Joe would just quit as a dev if we expected FAHClient to hold the user's hand through every possible move.

Running 2 compute tasks at the same time is BAD. Run one only
Running FAH on unstable hardware is bad. Please make sure your hardware is 101% stable before folding
Running FAH on unstable drivers is bad. Make sure to fix the drivers before folding.

I still encourage filing the request on GitHub, but a word of caution: some of us are still waiting on fixes for more severe client issues so that it stops dumping work left and right :)
FAH Omega tester
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

muziqaz wrote: Sat Mar 01, 2025 12:26 pm Recovery only brings back basic driver functionality; it never reloads all the modules the GPU needs to keep working as it did before the crash. This has been witnessed with games (they run like crap on recovered drivers) and with FAH (it outright cannot fold on recovered drivers).
Driver recovery is there for you to save your work (not FAH's) and restart the PC gracefully.
Why can't FAH restart from the previous checkpoint and continue folding? Because the driver is in recovery mode, which is not fully functional, and that stops folding on such a hardware/software combination.
As for pausing when a driver crash happens and letting the user reboot and resume the WU: who is to say the driver crash did not introduce an error into the scientific simulation? And FAH can't check whether the previous bit of the simulation is correct.
I think Joe would just quit as a dev if we expected FAHClient to hold the user's hand through every possible move.

Running 2 compute tasks at the same time is BAD. Run one only
Running FAH on unstable hardware is bad. Please make sure your hardware is 101% stable before folding
Running FAH on unstable drivers is bad. Make sure to fix the drivers before folding.

I still encourage filing the request on GitHub, but a word of caution: some of us are still waiting on fixes for more severe client issues so that it stops dumping work left and right :)
That might be true for some GPU drivers, but the Linux amdgpu driver can recover itself to full functionality, including running games and folding. You might be thinking of Windows driver recovery, which is indeed only intended to bring the system to a vaguely usable state. Linux does not automatically restore failed drivers, but some GPU drivers can restart themselves to a completely working state.

In any case, restoring from the last checkpoint protects against any errors in computation from the period when the driver crashed.

Another reason a PR would be good is that there are other ways a WU can fail. I just had nearly 10 WUs dumped in a row, all because I changed the systemd unit slightly and introduced a typo, which somehow caused the GPU core to fail to start correctly. I have a failure rate of 19.6%, and yet less than 2% of those failures were actually problematic (genuinely bad WUs). The rest were simply because the core didn't find a resource it expected for one reason or another (wrong filesystem permissions should not cause a core to dump its WU...).

At the very least, it should be hidden under advanced options (or maybe only enabled if beta WUs are selected, or something). Or perhaps a WU that is dumped immediately should not count as an expired WU in user statistics, since no science is lost (it will be reassigned within minutes).
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

Looks like I lost another 5+ WUs to dumping because of a filesystem permission error (I'm writing an SELinux policy to upstream to Arch), and now I'm under 80% success. At this point I'll probably just create a new user and passkey and start fresh.

I'm absolutely flabbergasted that a core would dump a WU because it's unable to map its own shared objects, before the WU even loads.
muziqaz
Posts: 1324
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: Confirmation before automatic dumping

Post by muziqaz »

arisu wrote: Sun Mar 02, 2025 4:40 am
That might be true for some GPU drivers, but the Linux amdgpu driver can recover itself to full functionality, including running games and folding. You might be thinking of Windows driver recovery, which is indeed only intended to bring the system to a vaguely usable state. Linux does not automatically restore failed drivers, but some GPU drivers can restart themselves to a completely working state.
Well, then what is the software doing crashing after the driver recovers? Clearly the driver does not recover perfectly. FAH requires full and stable software and hardware functionality. This is not a game but a scientific simulation. Any software- or hardware-induced instability might introduce an error into the simulation, which could be catastrophic for a scientist.
In any case, restoring from the last checkpoint protects against any errors in computation from the period when the driver crashed.

Another reason a PR would be good is that there are other ways a WU can fail. I just had nearly 10 WUs dumped in a row, all because I changed the systemd unit slightly and introduced a typo, which somehow caused the GPU core to fail to start correctly. I have a failure rate of 19.6%, and yet less than 2% of those failures were actually problematic (genuinely bad WUs). The rest were simply because the core didn't find a resource it expected for one reason or another (wrong filesystem permissions should not cause a core to dump its WU...).

At the very least, it should be hidden under advanced options (or maybe only enabled if beta WUs are selected, or something). Or perhaps a WU that is dumped immediately should not count as an expired WU in user statistics, since no science is lost (it will be reassigned within minutes).
I would really, really urge you to stop folding on that system. You are doing more harm to the project than good. If you knowingly don't have a stable computer for scientific simulation, please do not use it for FAH. You are actively working on OS-level stuff, and yet you question why FAH is not detecting and anticipating the system-level changes you make to your OS on the fly. WTF?
What is your goal with FAH? To help the project, or to dump WUs left and right?
FAH Omega tester
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

No no, it's the other way around. The core doesn't crash *after* the driver recovers. It crashes *when* the driver goes down in the first place, because it still has an open file descriptor to /dev/kfd, which of course is no longer a valid handle after amdgpu resets. It would have to close and reopen it before issuing more ioctls, which it doesn't do (it has no way to know it needs to). If the core were actually crashing after the driver recovered, that would be a problem, and would indicate that the driver is not in a fully restored state and the system is not stable.
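That stale-handle failure mode can be sketched as a reopen-and-retry wrapper. Everything here is an illustrative assumption, not fah-core code: `call_with_reopen` and its errno list are made up, and the demo uses an ordinary file standing in for /dev/kfd, since the real device obviously can't be exercised here.

```python
import errno
import os
import tempfile

def call_with_reopen(path: str, fd: int, op):
    """Run op(fd); if the kernel reports a stale handle (as happens to a
    /dev/kfd descriptor once amdgpu resets underneath it), reopen the
    device node and retry once.  Returns the (possibly new) fd and the
    operation's result.  Illustrative sketch only."""
    try:
        return fd, op(fd)
    except OSError as exc:
        if exc.errno not in (errno.EBADF, errno.ENODEV, errno.EIO):
            raise  # unrelated error: let it propagate
        try:
            os.close(fd)  # best effort; the old handle may already be dead
        except OSError:
            pass
        fd = os.open(path, os.O_RDWR)  # fresh handle after the reset
        return fd, op(fd)

# Demo with an ordinary file standing in for the device node:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
stale = os.open(tmp.name, os.O_RDWR)
os.close(stale)  # simulate the driver yanking the handle out from under us
new_fd, data = call_with_reopen(tmp.name, stale, lambda f: os.pread(f, 5, 0))
print(data)  # b'hello'
os.close(new_fd)
os.unlink(tmp.name)
```

The real core would also need to re-create its GPU queues and reload its state after reopening, which is presumably why restarting from the last checkpoint is the practical recovery path.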

I'd like to be able to test things without worrying about a WU being dumped and actually wasted when it could easily restart. In this particular case, it's because I may be able to get my employer to run FAH on ~100 machines, but to do so I need to show that I can put the core in a sandbox separate from the client. This is doable with SELinux (or AppArmor), and as an example I am going to upstream a policy to a distro's repo. That means creating a tight policy and incrementally loosening it, so there will be the occasional filesystem error. I had incorrectly assumed that an error like that would just cause the core to fail to start.

After finding that the client got stuck in a loop dumping WUs over and over simply because it couldn't map a shared library, I downloaded, saved, and folded a quick WU, then disconnected from the internet so I could experiment with the client and core on the (expired, already completed and uploaded) WU without worrying about it actually being dumped. So now I don't need to be concerned about the client wasting WUs anymore.
muziqaz
Posts: 1324
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: Confirmation before automatic dumping

Post by muziqaz »

You can post your requests, findings, and code at https://github.com/FoldingAtHome.
Use https://github.com/FoldingAtHome/fah-client-bastet for any client-related things.
Use https://github.com/FoldingAtHome/fah-web-client-bastet for any Web UI things. Keep in mind that even though the repo is called web-client, it is not a client; it is just a web UI for the fah-client. Good luck!
FAH Omega tester
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

Thank you. I already have 8.4.10 checked out and have been experimenting with it (this time with the client blocked from the internet...), but I haven't played with the web client yet. I guess with v8 in beta, some rough edges are to be expected. But it's open source, so now there's something to work with. :D
arisu
Posts: 92
Joined: Mon Feb 24, 2025 11:11 pm

Re: Confirmation before automatic dumping

Post by arisu »

I patched the client to give a 30-second warning before dumping a WU or committing the dump status to the database, and it works for me (at least as long as the hard drive hasn't run out of disk space...).
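The shape of such a grace period is roughly this. A hedged sketch only: `should_dump` and the `keep` event are made-up names, not the client's real API; the actual patch is against the C++ client.

```python
import threading

def should_dump(keep: threading.Event, timeout_seconds: float = 30.0) -> bool:
    """Give the user timeout_seconds to object before a WU is dumped.
    `keep` is a hypothetical hook a UI or signal handler would set; if it
    fires in time, the WU is kept so it can resume from its last
    checkpoint.  Unattended machines fall through to the old behaviour
    (dump) once the timeout lapses."""
    print(f"WARNING: dumping WU in {timeout_seconds:.0f}s unless kept")
    # Event.wait returns True only if the event was set before the timeout.
    return not keep.wait(timeout_seconds)

# Unattended: nobody sets the event, so after the timeout we dump as before.
print(should_dump(threading.Event(), timeout_seconds=0.2))  # prints True
```

The automatic fall-through matters: it preserves the client's unattended behaviour, so a headless machine never blocks forever waiting for a confirmation that will not come.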