WU stuck at 99%

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.


Janpieter_Sollie
Posts: 10
Joined: Fri Apr 10, 2020 2:23 pm

WU stuck at 99%

Post by Janpieter_Sollie »

I have a somewhat frustrating issue:

The WU on my Vega GPU gets stuck at 99%.
Shutting down the process and restarting it reverts the WU to 67%.
I'm using the amdgpu-pro libraries on top of the native amdgpu 5.6 driver, because the chipset on my mainboard does not provide PCIe atomics and is therefore incompatible with ROCm.
The log files do not show anything.
radeontop only shows the graphics pipe, texture addresser and shader interpolator at 100% usage, nothing else.
top does not show any CPU activity at all for the FAHCore driving the GPU.
It seems like the WU is waiting for the GPU to complete a final operation, but it never does.
The kernel log does not show anything, and the GPU sensors report a temperature of about 70°C, which is OK for a GPU at full load.

Any suggestions?
Jan
Posts: 80
Joined: Tue Mar 31, 2020 6:46 pm

Re: WU stuck at 99%

Post by Jan »

I would suggest you still post your logs. How much time have you given the application to finish the final percent?
Janpieter_Sollie
Posts: 10
Joined: Fri Apr 10, 2020 2:23 pm

Re: WU stuck at 99%

Post by Janpieter_Sollie »

I got it running. I waited 5 minutes while it was stuck at 99%, then sent a SIGHUP, waited 1 minute, sent a SIGTERM, waited 1 more minute, and sent another SIGHUP.
It reports (logs filtered):

------------------------------------------------
13:43:56:WU03:FS00:0x22:WARNING:Next signal will force exit
13:46:08:WU03:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
13:46:08:WU03:FS00:Starting
13:46:08:WU03:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 03 -suffix 01 -version 705 -lifeline 3287 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
13:46:08:WU03:FS00:Started FahCore on PID 3666
13:46:08:WU03:FS00:Core PID:3670
13:46:08:WU03:FS00:FahCore 0x22 started
13:46:09:WU03:FS00:0x22:*********************** Log Started 2020-04-15T13:46:08Z ***********************
13:46:09:WU03:FS00:0x22:*************************** Core22 Folding@home Core ***************************
13:46:09:WU03:FS00:0x22:       Type: 0x22
13:46:09:WU03:FS00:0x22:       Core: Core22
13:46:09:WU03:FS00:0x22:    Website: https://foldingathome.org/
13:46:09:WU03:FS00:0x22:  Copyright: (c) 2009-2018 foldingathome.org
13:46:09:WU03:FS00:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
13:46:09:WU03:FS00:0x22:             <rafal.wiewiora@choderalab.org>
13:46:09:WU03:FS00:0x22:       Args: -dir 03 -suffix 01 -version 705 -lifeline 3666 -checkpoint 15
13:46:09:WU03:FS00:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
13:46:09:WU03:FS00:0x22:     Config: <none>
13:46:09:WU03:FS00:0x22:************************************ Build *************************************
13:46:09:WU03:FS00:0x22:    Version: 0.0.2
....
13:46:09:WU03:FS00:0x22:       Date: Dec 6 2019
13:46:09:WU03:FS00:0x22:    OS Arch: AMD64
13:46:09:WU03:FS00:0x22:********************************************************************************
13:46:09:WU03:FS00:0x22:Project: 14543 (Run 0, Clone 1528, Gen 58)
13:46:09:WU03:FS00:0x22:Unit: 0x0000004180fccb045e7fc1bef1d12c9d
13:46:09:WU03:FS00:0x22:Digital signatures verified
13:46:09:WU03:FS00:0x22:Folding@home GPU Core22 Folding@home Core
13:46:09:WU03:FS00:0x22:Version 0.0.2
13:46:09:WU03:FS00:0x22:  Found a checkpoint file
13:46:13:WU03:FS00:0x22:ERROR:exception: Particle coordinate is nan
13:46:13:WU03:FS00:0x22:Saving result file ../logfile_01.txt
13:46:13:WU03:FS00:0x22:Saving result file checkpointState.xml
13:46:13:WU03:FS00:0x22:Saving result file checkpt.crc
13:46:13:WU03:FS00:0x22:Saving result file positions.xtc
13:46:13:WU03:FS00:0x22:Saving result file science.log
13:46:13:WU03:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
13:46:13:WARNING:WU03:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
13:46:13:WU03:FS00:Sending unit results: id:03 state:SEND error:FAULTY project:14543 run:0 clone:1528 gen:58 core:0x22 unit:0x0000004180fccb045e7fc1bef1d12c9d
13:46:13:WU03:FS00:Uploading 18.50MiB to 128.252.203.4
13:46:13:WU03:FS00:Connecting to 128.252.203.4:8080
13:46:23:WU01:FS00:Connecting to 65.254.110.245:8080
13:46:31:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
13:46:31:WU01:FS00:Connecting to 18.218.241.186:80
13:46:34:WU03:FS00:Upload 0.34%
13:46:45:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
13:46:45:ERROR:WU01:FS00:Exception: Could not get an assignment
13:46:45:WU01:FS00:Connecting to 65.254.110.245:8080
13:48:58:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:8080': Failed to connect to 65.254.110.245:8080: Connection timed out
13:48:58:WU01:FS00:Connecting to 18.218.241.186:80
13:49:03:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
13:49:03:ERROR:WU01:FS00:Exception: Could not get an assignment
13:49:03:WU01:FS00:Connecting to 65.254.110.245:8080
--------------------------------------------------------------
So does this mean there's a code error in the WU? A NaN may indeed lock up the OpenCL task on the GPU.
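
For anyone who wants to repeat the signal sequence from the top of this post without typing it by hand, here is a minimal Python sketch. The function name `escalate` and the `SEQUENCE` table are made up for illustration, and the post does not say which process received the signals, so the PID to pass in is left to the reader.

```python
import os
import signal
import time

# Hypothetical table mirroring the post: (seconds to wait first, signal to send).
SEQUENCE = [
    (300, signal.SIGHUP),   # 5 minutes stuck at 99%, then SIGHUP
    (60, signal.SIGTERM),   # 1 minute later, SIGTERM
    (60, signal.SIGHUP),    # 1 minute later, SIGHUP again
]


def escalate(pid, sequence=SEQUENCE, send=os.kill, sleep=time.sleep):
    """Wait, then send each signal in turn; stop early if the process is gone.

    Returns the list of signals that were actually delivered.
    """
    sent = []
    for wait_seconds, sig in sequence:
        sleep(wait_seconds)
        try:
            send(pid, sig)
        except ProcessLookupError:
            break  # the process has already exited
        sent.append(sig)
    return sent
```

Calling `escalate(pid)` blocks for up to seven minutes in total. The `send` and `sleep` parameters exist only so the sequence can be dry-run without touching a real process.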

Mod Edit: Added Code Tags - PantherX
Joe_H
Site Admin
Posts: 7867
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: WU stuck at 99%

Post by Joe_H »

The problem description you have given is consistent with a driver crash and reset during processing. The relevant information would be either in the prior log file from before the restart, or earlier in this log. There is a bug where, if the GPU driver crashes, the folding core stops processing but the client keeps updating the progress as if the core were still running. It will reach 99% and stop there because it never gets the "done" message.

When you restarted, the client went back to the previous checkpoint and tried to resume from there. At that point the core either hit too many errors or hit one it could not continue from. The previous log entries for this WU would show progress to some point after 67%, and then no further updates.
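
To find the point where the core went quiet, one option is to scan the older log for the last progress line written for the WU. The sketch below is only that: the `PROGRESS` pattern assumes the core writes lines of the form "Completed N out of M steps (P%)", which the filtered log in this thread does not show, and `last_progress` is a made-up helper name.

```python
import re

# Assumed shape of a core progress line; adjust the pattern to your own log.
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):(WU\d+):(FS\d+):0x[0-9a-fA-F]+:"
    r"Completed \d+ out of \d+ steps \((\d+)%\)"
)


def last_progress(lines, wu="WU03"):
    """Return (timestamp, percent) of the last progress line for a WU, or None."""
    found = None
    for line in lines:
        match = PROGRESS.match(line)
        if match and match.group(2) == wu:
            found = (match.group(1), int(match.group(4)))
    return found
```

If the timestamp returned is long before the moment the client displayed 99%, that gap is where the core stopped reporting.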

If you are getting multiple NaN errors, that may indicate a GPU that is overheating, or one that is overclocked too far. For the first, check whether a fan has stopped working, too much dust has accumulated, etc. For the second, try reducing any overclock, including factory overclocks, to something closer to the reference values for your GPU.
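
On Linux with the amdgpu kernel driver, the GPU temperature can usually be read from sysfs while a WU is running. The following is a rough sketch: the glob pattern is an assumption about where the hwmon files sit on a given system, and both function names are invented for this example.

```python
import glob


def millidegrees_to_celsius(raw):
    """hwmon temp*_input files report millidegrees Celsius as text."""
    return int(raw.strip()) / 1000.0


def read_gpu_temps(pattern="/sys/class/drm/card*/device/hwmon/hwmon*/temp1_input"):
    """Return {sysfs path: temperature in degrees C} for every readable sensor file.

    The default pattern is an assumption; it may not match every system.
    """
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                temps[path] = millidegrees_to_celsius(handle.read())
        except (OSError, ValueError):
            continue  # unreadable or malformed sensor file, skip it
    return temps
```

Printing `read_gpu_temps()` every few seconds during folding gives a quick view of whether the card is creeping up under sustained load.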

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: WU stuck at 99%

Post by bruce »

What GPU do you have? There's a known driver bug that affects many AMD GPUs. Many have had a different symptom, but I'm beginning to suspect that NaN errors can be another symptom of the same bug.

Your client did recognize the problem and it's attempting to report it to the server.
13:46:13:WU03:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
13:46:13:WARNING:WU03:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
13:46:13:WU03:FS00:Sending unit results: id:03 state:SEND error:FAULTY project:14543 run:0 clone:1528 gen:58 core:0x22 unit:0x0000004180fccb045e7fc1bef1d12c9d
13:46:13:WU03:FS00:Uploading 18.50MiB to 128.252.203.4
13:46:13:WU03:FS00:Connecting to 128.252.203.4:8080
13:46:34:WU03:FS00:Upload 0.34%
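
The "FahCore returned" lines quoted above are easy to pull out of a long log with a small script. This is a minimal sketch based only on the two return lines that appear in this thread; `core_results` is a made-up name and other result formats may exist.

```python
import re

# Matches e.g. "FahCore returned: BAD_WORK_UNIT (114 = 0x72)"
RETURNED = re.compile(r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)")


def core_results(lines):
    """Return a list of (result name, numeric code) for each core exit in the log."""
    results = []
    for line in lines:
        match = RETURNED.search(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results
```

Running it over the log posted earlier in this thread yields INTERRUPTED (102) for the forced exit and BAD_WORK_UNIT (114) for the restart from the checkpoint.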
Janpieter_Sollie
Posts: 10
Joined: Fri Apr 10, 2020 2:23 pm

Re: WU stuck at 99%

Post by Janpieter_Sollie »

And what is this driver bug exactly? Are there any OpenCL test programs I could use to see whether my setup works or fails with this bug?

*EDIT: I upgraded my OpenCL drivers from 19.30 to 19.50 and it seems much more stable now. There was also an old libdrm_amdgpu .so file in /usr/lib64/ which may have been the root cause of the problem, as the driver is supposed to use the libdrm from AMD.
I don't know yet whether this fixes it, but tools like clinfo run a lot faster. So, let's wait for the next WU and we'll see :)
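
For others chasing the same thing, a quick way to spot stray copies of the library is to list every file whose name starts with libdrm_amdgpu.so in the usual library directories. In this sketch only /usr/lib64 comes from the post above; the other entries in `DEFAULT_ROOTS` are guesses about where a distribution or the AMD packages might install it, and `find_library_copies` is a made-up name.

```python
import glob
import os

# Only /usr/lib64 is taken from the post; the rest are assumed locations.
DEFAULT_ROOTS = (
    "/usr/lib64",
    "/usr/lib",
    "/opt/amdgpu/lib64",
    "/opt/amdgpu-pro/lib64",
)


def find_library_copies(name="libdrm_amdgpu.so", roots=DEFAULT_ROOTS):
    """Return {path: resolved real path} for files starting with `name` in each root."""
    copies = {}
    for root in roots:
        for path in sorted(glob.glob(os.path.join(root, name + "*"))):
            copies[path] = os.path.realpath(path)  # follow symlinks to the real file
    return copies
```

If the result shows more than one distinct real file, the dynamic loader may be picking up an older copy than the one shipped with the driver.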