[SOLVED] System crashes when crunching FAH on Radeon VII

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

Postby xanthene » Tue May 19, 2020 3:51 pm

Mod Edit: Moved Topic Upon xanthene's Request - PantherX

Solution:
A known issue in the Radeon Adrenalin drivers caused crashes when Folding@Home ran while the GPU was also accelerating video content. The issue was fixed in Adrenalin release 20.4.2. However, I decided to install Radeon Pro Software for Enterprise 20.Q1.1 instead, since that driver is developed with stability in mind rather than compatibility with the latest games. It does not crash either.

Problem:
Hey everyone,

like many, I joined FAH a couple of months ago to help with the fight against Covid-19.
I recently acquired a Radeon VII, which I installed in my desktop computer to get some additional crunching performance (before that, the crunching was CPU-based). However, every now and then the system will just hang (the screen and mouse freeze, Num Lock no longer responds) and needs a hard reboot. If that happens and I reboot, it is likely to crash again in the same manner not much later.

I suspect that the issue may be caused by certain WUs, since, if I abandon a WU after the computer crashed and a fresh one is downloaded, the card will happily crunch along without any further issues. However, since the computer locks up completely, the FAHClient unfortunately doesn't log any error messages either.

Can anyone help me figure out what is causing the system to hang like that? Is the Radeon VII known for hiccups with some FAH WUs? Are there any stress tests I should run to rule out hardware issues? I don't mind the noise from the card when working, but regularly losing half-written e-mails and interrupting my work due to these hangs is very frustrating.

My card is not overclocked. The system's side panel is open, so heat shouldn't be an issue either. The system has a 550W Corsair PSU, which should also be supplying enough power.

Thanks for the help!
Last edited by xanthene on Fri May 22, 2020 6:19 am, edited 2 times in total.
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: System crashes when crunching FAH on Radeon VII

Postby xanthene » Tue May 19, 2020 7:07 pm

Below is the log. Since posting my original post, I have updated to client v7.6.13, but it has not helped the crashing problem.
*********************** Log Started 2020-05-19T15:37:16Z ***********************
15:37:16:Trying to access database...
15:37:16:Successfully acquired database lock
15:37:16:Read GPUs.txt
15:37:16:Enabled folding slot 01: READY gpu:0:Vega 20 [Radeon VII]
15:37:16:****************************** FAHClient ******************************
15:37:16:        Version: 7.6.13
15:37:16:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:37:16:      Copyright: 2020 foldingathome.org
15:37:16:       Homepage: https://foldingathome.org/
15:37:16:           Date: Apr 27 2020
15:37:16:           Time: 21:21:01
15:37:16:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
15:37:16:         Branch: master
15:37:16:       Compiler: Visual C++ 2008
15:37:16:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
15:37:16:       Platform: win32 10
15:37:16:           Bits: 32
15:37:16:           Mode: Release
15:37:16:         Config: C:\Users\Redacted\AppData\Roaming\FAHClient\config.xml
15:37:16:******************************** CBang ********************************
15:37:16:           Date: Apr 24 2020
15:37:16:           Time: 17:07:55
15:37:16:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
15:37:16:         Branch: master
15:37:16:       Compiler: Visual C++ 2008
15:37:16:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
15:37:16:       Platform: win32 10
15:37:16:           Bits: 32
15:37:16:           Mode: Release
15:37:16:******************************* System ********************************
15:37:16:            CPU: AMD Ryzen 3 3200G with Radeon Vega Graphics
15:37:16:         CPU ID: AuthenticAMD Family 23 Model 24 Stepping 1
15:37:16:           CPUs: 4
15:37:16:         Memory: 15.93GiB
15:37:16:    Free Memory: 13.40GiB
15:37:16:        Threads: WINDOWS_THREADS
15:37:16:     OS Version: 6.2
15:37:16:    Has Battery: false
15:37:16:     On Battery: false
15:37:16:     UTC Offset: 2
15:37:16:            PID: 10268
15:37:16:            CWD: C:\Users\Redacted\AppData\Roaming\FAHClient
15:37:16:  Win32 Service: false
15:37:16:             OS: Windows 10 Enterprise
15:37:16:        OS Arch: AMD64
15:37:16:           GPUs: 1
15:37:16:          GPU 0: Bus:3 Slot:0 Func:0 AMD:5 Vega 20 [Radeon VII]
15:37:16:           CUDA: Not detected: Failed to open dynamic library 'nvcuda.dll': Das
15:37:16:                 angegebene Modul wurde nicht gefunden.
15:37:16:
15:37:16:OpenCL Device 0: Platform:0 Device:0 Bus:3 Slot:0 Compute:1.2 Driver:3004.8
15:37:16:******************************* libFAH ********************************
15:37:16:           Date: Apr 15 2020
15:37:16:           Time: 14:53:14
15:37:16:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
15:37:16:         Branch: master
15:37:16:       Compiler: Visual C++ 2008
15:37:16:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
15:37:16:       Platform: win32 10
15:37:16:           Bits: 32
15:37:16:           Mode: Release
15:37:16:***********************************************************************
15:37:16:<config>
15:37:16:  <!-- Folding Slot Configuration -->
15:37:16:  <cause v='COVID_19'/>
15:37:16:
15:37:16:  <!-- HTTP Server -->
15:37:16:  <allow v='0.0.0.0/0'/>
15:37:16:  <deny v=''/>
15:37:16:
15:37:16:  <!-- Network -->
15:37:16:  <proxy v=':8080'/>
15:37:16:
15:37:16:  <!-- Remote Command Server -->
15:37:16:  <password v='*****'/>
15:37:16:
15:37:16:  <!-- User Information -->
15:37:16:  <passkey v='*****'/>
15:37:16:  <user v='xylofoam'/>
15:37:16:
15:37:16:  <!-- Folding Slots -->
15:37:16:  <slot id='1' type='GPU'/>
15:37:16:</config>
15:37:16:WU00:FS01:Starting
15:37:16:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Redacted\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 10268 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
15:37:16:WU00:FS01:Started FahCore on PID 10624
15:37:16:WU00:FS01:Core PID:10648
15:37:16:WU00:FS01:FahCore 0x22 started
15:37:18:WU00:FS01:0x22:*********************** Log Started 2020-05-19T15:37:17Z ***********************
15:37:18:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
15:37:18:WU00:FS01:0x22:       Type: 0x22
15:37:18:WU00:FS01:0x22:       Core: Core22
15:37:18:WU00:FS01:0x22:    Website: https://foldingathome.org/
15:37:18:WU00:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
15:37:18:WU00:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
15:37:18:WU00:FS01:0x22:             <rafal.wiewiora@choderalab.org>
15:37:18:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 10624 -checkpoint 15
15:37:18:WU00:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
15:37:18:WU00:FS01:0x22:     Config: <none>
15:37:18:WU00:FS01:0x22:************************************ Build *************************************
15:37:18:WU00:FS01:0x22:    Version: 0.0.5
15:37:18:WU00:FS01:0x22:       Date: Apr 22 2020
15:37:18:WU00:FS01:0x22:       Time: 04:42:59
15:37:18:WU00:FS01:0x22: Repository: Git
15:37:18:WU00:FS01:0x22:   Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
15:37:18:WU00:FS01:0x22:     Branch: HEAD
15:37:18:WU00:FS01:0x22:   Compiler: Visual C++ 2008
15:37:18:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
15:37:18:WU00:FS01:0x22:   Platform: win32 10
15:37:18:WU00:FS01:0x22:       Bits: 64
15:37:18:WU00:FS01:0x22:       Mode: Release
15:37:18:WU00:FS01:0x22:************************************ System ************************************
15:37:18:WU00:FS01:0x22:        CPU: AMD Ryzen 3 3200G with Radeon Vega Graphics
15:37:18:WU00:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 24 Stepping 1
15:37:18:WU00:FS01:0x22:       CPUs: 4
15:37:18:WU00:FS01:0x22:     Memory: 15.93GiB
15:37:18:WU00:FS01:0x22:Free Memory: 13.38GiB
15:37:18:WU00:FS01:0x22:    Threads: WINDOWS_THREADS
15:37:18:WU00:FS01:0x22: OS Version: 6.2
15:37:18:WU00:FS01:0x22:Has Battery: false
15:37:18:WU00:FS01:0x22: On Battery: false
15:37:18:WU00:FS01:0x22: UTC Offset: 2
15:37:18:WU00:FS01:0x22:        PID: 10648
15:37:18:WU00:FS01:0x22:        CWD: C:\Users\Redacted\AppData\Roaming\FAHClient\work
15:37:18:WU00:FS01:0x22:         OS: Windows 10 Pro
15:37:18:WU00:FS01:0x22:    OS Arch: AMD64
15:37:18:WU00:FS01:0x22:********************************************************************************
15:37:18:WU00:FS01:0x22:Project: 11749 (Run 0, Clone 6297, Gen 15)
15:37:18:WU00:FS01:0x22:Unit: 0x000000258ca304e75e6bb70b11e063d0
15:37:18:WU00:FS01:0x22:Digital signatures verified
15:37:18:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
15:37:18:WU00:FS01:0x22:Version 0.0.5
15:37:18:WU00:FS01:0x22:  Found a checkpoint file
15:37:35:WU00:FS01:0x22:Completed 450000 out of 2000000 steps (22%)
15:37:35:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
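Since the client cannot write an error message when the whole machine locks up, the last project and progress lines in the log are the main clue to which WU was running. The following is only a minimal sketch of how those lines could be pulled out with Python. The regular expressions are modelled on the two Core22 line formats visible in the log above, and the function name `last_known_state` is made up for this example; other cores or client versions may format these lines differently.

```python
import re

# Patterns modelled on the Core22 lines in the log above (an assumption;
# other cores or client versions may log differently).
PROJECT_RE = re.compile(
    r"WU(\d+):FS(\d+):0x[0-9a-fA-F]+:Project: (\d+) "
    r"\(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
PROGRESS_RE = re.compile(
    r"WU(\d+):FS(\d+):0x[0-9a-fA-F]+:Completed (\d+) out of (\d+) steps \((\d+)%\)"
)


def last_known_state(log_lines):
    """Return the last project and progress seen for each (WU, slot) pair."""
    state = {}
    for line in log_lines:
        match = PROJECT_RE.search(line)
        if match:
            wu, fs, project, run, clone, gen = match.groups()
            # A new project line starts a fresh record for this WU/slot.
            state[(wu, fs)] = {
                "project": int(project),
                "run": int(run),
                "clone": int(clone),
                "gen": int(gen),
                "percent": None,
            }
            continue
        match = PROGRESS_RE.search(line)
        if match:
            wu, fs, _done, _total, percent = match.groups()
            entry = state.setdefault((wu, fs), {"project": None, "percent": None})
            entry["percent"] = int(percent)
    return state


# Two lines copied from the log above.
sample = [
    "15:37:18:WU00:FS01:0x22:Project: 11749 (Run 0, Clone 6297, Gen 15)",
    "15:37:35:WU00:FS01:0x22:Completed 450000 out of 2000000 steps (22%)",
]
print(last_known_state(sample))
```

Running this over the log from before each hang should show whether the same project number keeps turning up.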
Last edited by xanthene on Tue May 19, 2020 8:43 pm, edited 1 time in total.
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: System crashes when crunching FAH on Radeon VII

Postby xanthene » Tue May 19, 2020 7:16 pm

Upon further investigation, I can now say that the offending WU is for project 11749.
On occasion the computer manages to create a crashdump, which I was able to take a look at with the help of WhoCrashed. The stop code is VIDEO_TDR_FAILURE caused by the AMD graphics drivers. Can we somehow debug if the core itself caused the failure, or if the calculation simply took too long and Windows intervened?
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: System crashes when crunching FAH on Radeon VII

Postby PantherX » Tue May 19, 2020 9:03 pm

Welcome to the F@H Forum xanthene,

What version of the AMD Drivers are you running?
PantherX
Site Moderator
 
Posts: 6322
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud

Re: System crashes when crunching FAH on Radeon VII

Postby Crawdaddy79 » Wed May 20, 2020 12:57 am

I think that 550W is right at the threshold for what you need. If you are using a UPS that reads power draw, you can tell exactly what your PC is pulling. My Vega 64 has a lower power limit than the Radeon VII and this PC pulls about 480W when both CPU and GPU are folding.

I am interested in what transpires in this thread because I'm having issues with a specific project on my system as well (16435).
Crawdaddy79
 
Posts: 67
Joined: Sat Mar 21, 2020 4:56 pm

Re: System crashes when crunching FAH on Radeon VII

Postby MeeLee » Wed May 20, 2020 2:24 am

Preferably you'll want the load on your PSU to be between 40% and 80%, the sweet spot being around two-thirds load.
MeeLee
 
Posts: 914
Joined: Tue Feb 19, 2019 11:16 pm

Re: System crashes when crunching FAH on Radeon VII

Postby foldy » Wed May 20, 2020 8:05 am

VIDEO_TDR_FAILURE: You can increase the TDR timeout in the Windows registry.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers
Create DWORD value TdrDelay = 10

That means it has 10 sec for a driver response before Windows will reset the driver.
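For anyone who prefers not to edit the registry by hand, the same setting can be written as a .reg file and imported with administrator rights. The snippet below is just a sketch that builds the file text in Python. The helper name `tdr_delay_reg` is invented for this example, and the value of 10 seconds is simply the one suggested above (the Windows default is 2 seconds). A reboot is usually needed before the new timeout applies.

```python
def tdr_delay_reg(seconds=10):
    """Build .reg file text that sets the TdrDelay DWORD to `seconds`."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must be a positive value that fits in a DWORD")
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    # .reg files store DWORD values as eight hexadecimal digits.
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{key}]\r\n"
        f'"TdrDelay"=dword:{seconds:08x}\r\n'
    )


# Print the text; save it as e.g. tdr_delay.reg to import it.
print(tdr_delay_reg(10))
```

Keep in mind that a longer timeout only gives a slow kernel more time to finish. It does not fix a driver that has really hung.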

I have also heard that the error can occur if the GPU gets too hot, so check that the Radeon VII's temperature is OK. FAH is more demanding on hardware than games are.
foldy
 
Posts: 1936
Joined: Sat Dec 01, 2012 4:43 pm

Re: System crashes when crunching FAH on Radeon VII

Postby xanthene » Wed May 20, 2020 8:48 am

Hello all,

thanks for the feedback up to this point.
PantherX wrote:Welcome to the F@H Forum xanthene,

What version of the AMD Drivers are you running?

Thanks! I'm currently running 20.2.2. Prior to that, I had a driver that was released some time in the second half of 2019, although I do not remember the exact version number.

Crawdaddy79 wrote:I think that 550W is right at the threshold for what you need. If you are using a UPS that reads power draw, you can tell exactly what your PC is pulling. My Vega 64 has a lower power limit than the Radeon VII and this PC pulls about 480W when both CPU and GPU are folding.

MeeLee wrote:Preferably you'll want the load on your PSU to be between 40% and 80%, the sweet spot being around two-thirds load.

That was a good hint. While I don't have a UPS, I do have a kill-a-watt monitor. I finished folding the WU that caused the constant crashing yesterday while the power monitor was attached to the computer. The WU went from 25% to 100% without any further crashes. During that time, the system's total energy use peaked at 240 W, so well within the 550 W bounds of the PSU. According to the AMD drivers, the GPU was at a fairly constant 98% load, peaking at 1679 MHz and 175 W of power usage. During the whole folding time, the CPU was at a fairly constant 22% usage.
I stopped CPU-based folding after installing the Radeon VII, so I don't have GPU and CPU working at the same time.
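As a quick sanity check on those numbers, here is a small sketch of the arithmetic in Python. The function names are made up for this example, and the 40% to 80% band is simply the rule of thumb quoted above, not a manufacturer specification. The wall reading also includes the PSU's own conversion losses, so the load on the PSU's output side is somewhat lower than this figure.

```python
def psu_load(draw_watts, psu_watts):
    """Return the measured draw as a fraction of the PSU's rated output."""
    return draw_watts / psu_watts


def in_preferred_band(load, low=0.40, high=0.80):
    """Check a load fraction against the 40%-80% rule of thumb."""
    return low <= load <= high


# Peak wall draw measured while folding, against the 550 W rating.
load = psu_load(240, 550)
print(f"PSU load: {load:.0%}, within preferred band: {in_preferred_band(load)}")
```

With a 240 W peak on a 550 W unit this works out to roughly 44% load, which is inside the suggested band.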

foldy wrote:I have also heard that the error can occur if the GPU gets too hot, so check that the Radeon VII's temperature is OK. FAH is more demanding on hardware than games are.

Temperature is also something I'm a bit concerned about. I have played around with the fan curve to try to keep the GPU at around 70 degrees Celsius, although the junction temperature still sits at about 90 degrees. I might play with that setting some more later today and see if I can get the system to crash again when using the default fan curve.
Last edited by xanthene on Thu May 21, 2020 7:18 pm, edited 1 time in total.
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: System crashes when crunching FAH on Radeon VII

Postby Crawdaddy79 » Wed May 20, 2020 6:47 pm

Definitely get the newest optional drivers (20.4.2).

If you're concerned about heat, get GPU-Z. It will show you much more information than AMD's software does, including the GPU Hot Spot temperature, which is the highest reading from an array of sensors on the chip. 110C at that test point is "expected and within spec".
Crawdaddy79
 
Posts: 67
Joined: Sat Mar 21, 2020 4:56 pm

Re: System crashes when crunching FAH on Radeon VII

Postby MeeLee » Wed May 20, 2020 7:29 pm

I believe AMD's newer GPUs run a TJunction max of 114C.
That's only the hottest part of the GPU.
There's another readout somewhere, that should read around 70C. I personally prefer to keep the temps as low as possible, using regular air cooled fans.
But if the GPU remains under 75-78C I'd still be happy. 80C is a bit on the high side, but the GPU would survive.
Higher temps aren't good for the capacitors on the board, though.
MeeLee
 
Posts: 914
Joined: Tue Feb 19, 2019 11:16 pm

Re: System crashes when crunching FAH on Radeon VII

Postby xanthene » Wed May 20, 2020 8:53 pm

After playing around some more, I got the next hard crash (no blue screen, just reboot without any error) while computing project 16900. According to the log files, the project completed 13% before crashing. Based on the time stamp when the log file was last written to, the project ran for 40 minutes.
I later restarted the project to monitor the computer personally while sitting in front of it. After completing an additional 2%, the computer crashed again within a mere 6m16s of resuming calculations. At that point, the GPU temp had not reached maximum yet.

Crawdaddy79 wrote:Definitely get the newest optional drivers (20.4.2).

I think this will be the solution. According to the Readme, the following was fixed in this release:
Radeon RX Vega series graphics products may experience a system hang or black screen when running Folding@Home while also running an application using hardware acceleration of video content.
Last edited by xanthene on Thu May 21, 2020 7:16 pm, edited 1 time in total.
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: System crashes when crunching FAH on Radeon VII

Postby xanthene » Thu May 21, 2020 12:24 pm

While downloading the Radeon drivers yesterday, I noticed that AMD now also offers the "Radeon Pro" enterprise drivers for their consumer grade cards (although without any warranty). I ended up downloading and using those drivers.
The enterprise drivers are checked to play nicely with a bunch of workstation software, like SolidWorks. They are based on the Crimson-generation of AMD drivers. I suppose in the Linux-world they might be considered the LTS development branch.
Since the latest Adrenalin drivers are developed mostly with game compatibility and performance in mind, I wonder if it would make sense for FAH to generally recommend the enterprise-grade drivers to people who primarily use their GPUs for folding and not for gaming?
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: System crashes when crunching FAH on Radeon VII

Postby foldy » Thu May 21, 2020 2:18 pm

Did you try the TdrDelay tweak?
foldy
 
Posts: 1936
Joined: Sat Dec 01, 2012 4:43 pm

Re: System crashes when crunching FAH on Radeon VII

Postby xanthene » Thu May 21, 2020 7:15 pm

foldy wrote:Did you try the TdrDelay tweak?

No, I haven't. Since I saw the driver readme and it turned out that the crashes are a known problem, I did not want to fiddle with registry settings.
xanthene
 
Posts: 8
Joined: Tue May 19, 2020 3:25 pm

Re: [SOLVED] System crashes when crunching FAH on Radeon VII

Postby NormalDiffusion » Fri May 22, 2020 7:10 am

You may want to undervolt (UV) the card, as your temps are (for my taste) on the high side.
Also, use a more aggressive fan curve! The card gets (a lot) louder, but the temps get a lot better.

To find the limit for UV, I find the modified FAHBench with core 22 and a real WU quite useful. It's an unofficial build. Choose the real WU (last option), check for errors every 10, and run the benchmark for some time. If you get a NaN error after undervolting, you know you went too low, so raise the voltage a little and run it again.
https://www.file-upload.net/download-13 ... n.zip.html

As for the drivers: for the Rvii, the pro drivers are just an old (and stable) version of the gaming driver. Nothing more...
If I remember right, 20.4.2 are stable with the Rvii and can be used, should you want to use them.
NormalDiffusion
 
Posts: 52
Joined: Sat Apr 18, 2020 2:50 pm
