Bad State detected on GPU (AMD)

If you think it might be a driver problem, see viewforum.php?f=79

Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Bad State detected on GPU (AMD)

Post by Neil-B »

WUs from different projects exercise the CPU cores and your GPU in different ways, so whether it is a hardware, driver, or other issue, it is quite possible for some WUs to fold and others not :(

You appear to be getting a similar type of error from a wide variety of Project WUs, at quite high failure rates, on WUs which other folders are completing. Regrettably, that probably means an issue with your kit or the way it is configured, not with the Project WUs. Tracking down these types of issues can be tricky, but it is worth persevering.

Couple of questions ... Have you always had these issues or have they started to happen more recently? ... What GPU are you running (couldn't spot this config in any of the posted logs)?

You can check your bonus status with the bonus status app https://apps.foldingathome.org/bonus
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
4n0n
Posts: 17
Joined: Thu Apr 09, 2020 5:12 pm

Re: Bad State detected on GPU (AMD)

Post by 4n0n »

Neil-B wrote:You appear to be getting a similar type of error from a wide variety of Project WUs, at quite high failure rates, on WUs which other folders are completing. Regrettably, that probably means an issue with your kit or the way it is configured, not with the Project WUs. Tracking down these types of issues can be tricky, but it is worth persevering.
I'm willing to help track down these issues, but I'll need advice on what to do or try.
For what it is worth: I also think that the PPD on my GPU does not reflect the power I would expect from such relatively new hardware. I make about 140k-200k GPU PPD, while older Nvidia cards are said to make 1.2M to 2.4M GPU PPD.
Neil-B wrote:Couple of questions ... Have you always had these issues or have they started to happen more recently? ... What GPU are you running (couldn't spot this config in any of the posted logs)?
I've been folding for 13 years now. That's a long time, with many platforms and different hardware, but all without a GPU, and also without trouble. Some years ago I paused folding, but I reactivated my machines in March 2020 to fight COVID-19. This was the first time I had a GPU-equipped machine in my hands, and I wanted to use it for folding too. The NaN errors mentioned in this thread were present from the very beginning of folding on that particular GPU machine (March 2020). Decide for yourself whether this is "always" or "more recently".

The machine in question runs Linux Mint 19.3 on an AMD Ryzen 5 3600 with an AMD Radeon RX 5500 XT, using the original proprietary GPU drivers from the official AMD site.
Are you looking for these lines?

19:05:43:****************************** FAHClient ******************************
19:05:43:        Version: 7.6.9
19:05:43:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:05:43:      Copyright: 2020 foldingathome.org
19:05:43:       Homepage: https://foldingathome.org/
19:05:43:           Date: Apr 17 2020
19:05:43:           Time: 18:11:26
19:05:43:       Revision: 398c2b17fa535e0cc6c9d10856b2154c32771646
19:05:43:         Branch: master
19:05:43:       Compiler: GNU 8.3.0
19:05:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43:                 -funroll-loops -fno-pie
19:05:43:       Platform: linux2 4.19.0-5-amd64
19:05:43:           Bits: 64
19:05:43:           Mode: Release
19:05:43:           Args: --child /etc/fahclient/config.xml
19:05:43:                 --pid-file=/var/run/fahclient/fahclient.pid --daemon
19:05:43:         Config: /etc/fahclient/config.xml
19:05:43:******************************** CBang ********************************
19:05:43:           Date: Apr 17 2020
19:05:43:           Time: 18:10:13
19:05:43:Started thread 1 on PID 26178
19:05:43:       Revision: 2fb0be7809c5e45287a122ca5fbc15b5ae859a3b
19:05:43:         Branch: master
19:05:43:       Compiler: GNU 8.3.0
19:05:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43:                 -funroll-loops -fno-pie -fPIC
19:05:43:       Platform: linux2 4.19.0-5-amd64
19:05:43:           Bits: 64
19:05:43:           Mode: Release
19:05:43:******************************* System ********************************
19:05:43:            CPU: AMD Ryzen 5 3600 6-Core Processor
19:05:43:         CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
19:05:43:           CPUs: 12
19:05:43:         Memory: 31.37GiB
19:05:43:    Free Memory: 6.98GiB
19:05:43:        Threads: POSIX_THREADS
19:05:43:     OS Version: 5.6
19:05:43:    Has Battery: false
19:05:43:     On Battery: false
19:05:43:     UTC Offset: 2
19:05:43:            PID: 26178
19:05:43:            CWD: /var/lib/fahclient
19:05:43:             OS: Linux 5.6.6-050606-generic x86_64
19:05:43:        OS Arch: AMD64
19:05:43:           GPUs: 1
19:05:43:          GPU 0: Bus:40 Slot:0 Func:0 AMD:6 Navi 14 [Radeon RX 5500/5500M / Pro
19:05:43:                 5500M]
19:05:43:           CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
19:05:43:                 libcuda.so: cannot open shared object file: No such file or
19:05:43:                 directory
19:05:43:OpenCL Device 0: Platform:0 Device:0 Bus:40 Slot:0 Compute:2.0 Driver:3075.10
19:05:43:******************************* libFAH ********************************
19:05:43:           Date: Apr 15 2020
19:05:43:           Time: 21:43:24
19:05:43:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
19:05:43:         Branch: master
19:05:43:       Compiler: GNU 8.3.0
19:05:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43:                 -funroll-loops -fno-pie
19:05:43:       Platform: linux2 4.19.0-5-amd64
19:05:43:           Bits: 64
19:05:43:           Mode: Release
19:05:43:***********************************************************************
19:05:43:<config>
19:05:43:  <!-- Client Control -->
19:05:43:  <client-threads v='6'/>
19:05:43:  <cycle-rate v='4'/>
19:05:43:  <cycles v='-1'/>
19:05:43:  <disable-sleep-when-active v='true'/>
19:05:43:  <exit-when-done v='false'/>
19:05:43:  <fold-anon v='true'/>
19:05:43:  <idle-seconds v='300'/>
19:05:43:  <open-web-control v='false'/>
19:05:43:
19:05:43:  <!-- Configuration -->
19:05:43:  <config-rotate v='true'/>
19:05:43:  <config-rotate-dir v='configs'/>
19:05:43:  <config-rotate-max v='16'/>
19:05:43:
19:05:43:  <!-- Debugging -->
19:05:43:  <assignment-servers>
19:05:43:    assign1.foldingathome.org assign2.foldingathome.org assign3.foldingathome.org assign4.foldingathome.org 
19:05:43:  </assignment-servers>
19:05:43:  <auth-as v='true'/>
19:05:43:  <capture-directory v='capture'/>
19:05:43:  <capture-on-error v='false'/>
19:05:43:  <capture-packets v='false'/>
19:05:43:  <capture-requests v='false'/>
19:05:43:  <capture-responses v='false'/>
19:05:43:  <capture-sockets v='false'/>
19:05:43:  <debug-sockets v='false'/>
19:05:43:  <exception-locations v='true'/>
19:05:43:  <stack-traces v='false'/>
19:05:43:
19:05:43:  <!-- Error Handling -->
19:05:43:  <max-slot-errors v='10'/>
19:05:43:  <max-unit-errors v='5'/>
19:05:43:
19:05:43:  <!-- Folding Core -->
19:05:43:  <checkpoint v='15'/>
19:05:43:  <core-priority v='idle'/>
19:05:43:  <cpu-usage v='100'/>
19:05:43:  <gpu-usage v='100'/>
19:05:43:  <no-assembly v='false'/>
19:05:43:
19:05:43:  <!-- Folding Slot Configuration -->
19:05:43:  <cause v='COVID_19'/>
19:05:43:  <client-subtype v='LINUX'/>
19:05:43:  <client-type v='normal'/>
19:05:43:  <cpu-species v='X86_AMD'/>
19:05:43:  <cpu-type v='AMD64'/>
19:05:43:  <cpus v='-1'/>
19:05:43:  <disable-viz v='false'/>
19:05:43:  <gpu v='true'/>
19:05:43:  <max-packet-size v='normal'/>
19:05:43:  <os-species v='UNKNOWN'/>
19:05:43:  <os-type v='LINUX'/>
19:05:43:  <project-key v='0'/>
19:05:43:  <smp v='true'/>
19:05:43:
19:05:43:  <!-- GUI -->
19:05:43:  <gui-enabled v='true'/>
19:05:43:
19:05:43:  <!-- HTTP Server -->
19:05:43:  <allow v='127.0.0.1 192.168.10.0/24'/>
19:05:43:  <connection-timeout v='60'/>
19:05:43:  <deny v='0/0'/>
19:05:43:  <http-addresses v='0:7396'/>
19:05:43:  <https-addresses v=''/>
19:05:43:  <max-connect-time v='900'/>
19:05:43:  <max-connections v='800'/>
19:05:43:  <max-request-length v='52428800'/>
19:05:43:  <min-connect-time v='300'/>
19:05:43:
19:05:43:  <!-- Logging -->
19:05:43:  <log v='log.txt'/>
19:05:43:  <log-color v='true'/>
19:05:43:  <log-crlf v='false'/>
19:05:43:  <log-date v='false'/>
19:05:43:  <log-date-periodically v='21600'/>
19:05:43:  <log-domain v='false'/>
19:05:43:  <log-header v='true'/>
19:05:43:  <log-level v='true'/>
19:05:43:  <log-no-info-header v='true'/>
19:05:43:  <log-redirect v='false'/>
19:05:43:  <log-rotate v='true'/>
19:05:43:  <log-rotate-dir v='logs'/>
19:05:43:  <log-rotate-max v='16'/>
19:05:43:  <log-short-level v='false'/>
19:05:43:  <log-simple-domains v='true'/>
19:05:43:  <log-thread-id v='false'/>
19:05:43:  <log-thread-prefix v='true'/>
19:05:43:  <log-time v='true'/>
19:05:43:  <log-to-screen v='true'/>
19:05:43:  <log-truncate v='false'/>
19:05:43:  <verbosity v='5'/>
19:05:43:
19:05:43:  <!-- Process Control -->
19:05:43:  <child v='true'/>
19:05:43:  <daemon v='true'/>
19:05:43:  <fork v='false'/>
19:05:43:  <pid v='false'/>
19:05:43:  <pid-file v='/var/run/fahclient/fahclient.pid'/>
19:05:43:  <respawn v='false'/>
19:05:43:  <service v='false'/>
19:05:43:
19:05:43:  <!-- Slot Control -->
19:05:43:  <idle v='false'/>
19:05:43:  <max-shutdown-wait v='60'/>
19:05:43:  <pause-on-battery v='true'/>
19:05:43:  <pause-on-start v='false'/>
19:05:43:  <paused v='false'/>
19:05:43:  <power v='medium'/>
19:05:43:
19:05:43:  <!-- Work Unit Control -->
19:05:43:  <dump-after-deadline v='true'/>
19:05:43:  <max-queue v='16'/>
19:05:43:  <max-units v='0'/>
19:05:43:  <next-unit-percentage v='99'/>
19:05:43:  <stall-detection-enabled v='false'/>
19:05:43:  <stall-percent v='5'/>
19:05:43:  <stall-timeout v='1800'/>
19:05:43:
19:05:43:  <!-- Folding Slots -->
19:05:43:  <slot id='0' type='CPU'>
19:05:43:    <cpus v='12'/>
19:05:43:    <paused v='true'/>
19:05:43:  </slot>
19:05:43:  <slot id='1' type='GPU'>
19:05:43:    <paused v='true'/>
19:05:43:  </slot>
19:05:43:</config>
19:05:43:Trying to access database...
19:05:43:Successfully acquired database lock
19:05:43:Enabled folding slot 00: PAUSED cpu:12 (by user)
19:05:43:Enabled folding slot 01: PAUSED gpu:0:Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (by user)
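
For anyone skimming a long startup log like the one above for the GPU details, a short script can pull them out. This is only a sketch: the name `hardware_lines` and the default key prefixes are made up for illustration, and it assumes the layout seen in this log (an `HH:MM:SS:` prefix, right-aligned `key: value` pairs, and wrapped values continued on lines that are blank where the key would be).

```python
import re

# Every FAHClient log line shown above starts with an "HH:MM:SS:" prefix.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}:")


def hardware_lines(log_text, prefixes=("GPU", "OpenCL Device", "CUDA")):
    """Collect 'key: value' header lines whose key starts with one of `prefixes`.

    Hypothetical helper for illustration; not part of FAHClient.
    """
    found = {}
    last_key = None
    for raw in log_text.splitlines():
        body = TIMESTAMP.sub("", raw, count=1)
        # Assumption: a wrapped value continues on a line whose first
        # 15 characters (the key column) are blank.
        if last_key is not None and body.strip() and not body[:15].strip():
            found[last_key] += " " + body.strip()
            continue
        last_key = None
        key, sep, value = body.partition(":")
        key = key.strip()
        if sep and key.startswith(prefixes):
            found[key] = value.strip()
            last_key = key
    return found


if __name__ == "__main__":
    sample = "\n".join([
        "19:05:43:           GPUs: 1",
        "19:05:43:          GPU 0: Bus:40 Slot:0 Func:0 AMD:6 Navi 14 [Radeon RX 5500/5500M / Pro",
        "19:05:43:                 5500M]",
    ])
    for key, value in hardware_lines(sample).items():
        print(f"{key}: {value}")
```

Run against the full log, this should list the `GPUs`, `GPU 0`, `CUDA` and `OpenCL Device 0` entries, which is the information asked for earlier in the thread.
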
I just tried FAHBench-cmd, with the following result. It also reports an error:

FAHBench Simulation
-------------------
Plugin directory: "/usr/lib/openmm"
Work unit: dhfr
WU Name: Dihydrofolate reductase
WU Description: A common system for benchmarking molecular dynamics
System XML: /usr/share/fahbench/workunits/dhfr/system.xml
Integrator XML: /usr/share/fahbench/workunits/dhfr/integrator.xml
State XML: /usr/share/fahbench/workunits/dhfr/state.xml
Step chunk: 40
Device ID 0; Platform OpenCL; Platform ID 0
Run length: 60s

Loading plugins from plugin directory
Number of registered plugins: 3
Deserializing input files: system
Deserializing input files: state
Deserializing input files: integrator
Creating context (may take several minutes)
Checking accuracy against reference code
Creating reference context (may take several minutes)
Comparing forces and energy

Something went wrong:
Force RMSE error of 27153.7 with threshold of 5
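
To put that last line in context: the benchmark output suggests that the same system is evaluated on the OpenCL device and on a reference implementation, and the resulting forces are then compared. Below is a minimal sketch of such a check. The exact formula FAHBench uses is an assumption here (root-mean-square of the per-atom force difference), and `force_rmse` and `within_threshold` are hypothetical names, not FAHBench functions.

```python
import math


def force_rmse(test_forces, reference_forces):
    """Root-mean-square error between two lists of (fx, fy, fz) force vectors.

    Assumed definition: sqrt(mean over atoms of |F_test - F_ref|^2).
    """
    if not test_forces or len(test_forces) != len(reference_forces):
        raise ValueError("force lists must be non-empty and of equal length")
    total = 0.0
    for (tx, ty, tz), (rx, ry, rz) in zip(test_forces, reference_forces):
        total += (tx - rx) ** 2 + (ty - ry) ** 2 + (tz - rz) ** 2
    return math.sqrt(total / len(test_forces))


def within_threshold(test_forces, reference_forces, threshold=5.0):
    """Return (rmse, passed). The default threshold of 5 is taken from the
    output above; treating 'equal to the threshold' as a pass is an assumption."""
    rmse = force_rmse(test_forces, reference_forces)
    # A NaN RMSE compares False against anything, so it is reported as a failure.
    return rmse, rmse <= threshold


if __name__ == "__main__":
    reference = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    device = [(1.0, 0.0, 0.5), (0.0, 2.0, -0.5)]
    print(within_threshold(device, reference))
```

An RMSE of 27153.7 against a threshold of 5 is more than 5000 times over the limit, so the GPU forces differ grossly from the reference rather than by rounding noise.
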
4n0n wrote:Are the returned WUs considered "successfully returned" in my case?
Neil-B wrote:You can check your bonus status with the bonus status app https://apps.foldingathome.org/bonus
Thanks for the link. My bonus stats imply that WUs are only classified as either "returned in time" or "timed out". None of my failed WUs seem to be counted as "timed out", so they must have been counted as "returned in time" or not counted at all. So they have no impact on my bonus stats, and I am fine at 99.xy percent.
Neil-B

Re: Bad State detected on GPU (AMD)

Post by Neil-B »

I've CPU folded over the years too, and only recently started a minor foray into GPU folding, so I know how you feel about troubleshooting GPUs ... really hope one of the GPU gurus latches onto this topic :)

I asked about "always" to see whether this might be a config/driver issue that has been with the system for a while, as opposed to something that changed recently, since that might speed up the diagnosis. I've seen this type of question asked by others diagnosing similar issues, so I asked up front so that the information is there when the GPU folders read the thread. Your PPD observations may well help them too :)

Hope someone can get this sorted for you soon!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad State detected on GPU (AMD)

Post by bruce »

As far as FAH's credibility for GPUs is concerned, it will be restored when a new version of FAHCore_22 is released. A great deal has been learned from the collection of these error reports, at the cost of some temporary setbacks. Ordinarily, FAH attempts to collect and fix errors during Beta testing and then makes a second pass at the remaining ones in Advanced testing. As it turns out, the Donor population for Beta and Advanced is smaller than usual, so it has been necessary to distribute a percentage of the WUs to full FAH.

We'll all be pleased when the next version of FAHCore_22 is ready for release.
4n0n wrote:I think I read somewhere that the bonus score is only added if more than 80 percent of the WUs were returned successfully.
I don't remember reading that. From whom or where did you get that information?

My information is that only WUs which are completed and successfully uploaded can be used to generate the trajectory's next Gen. The token points are simply a reward for your effort.
Neil-B

Re: Bad State detected on GPU (AMD)

Post by Neil-B »

It is one of the qualifications stated for the QRB on the website. Sorry for the brevity, I'm on my phone :(
bruce

Re: Bad State detected on GPU (AMD)

Post by bruce »

Aha. Errors don't get bonuses and successful returns only get them as long as you maintain an overall 80% success rate.
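
As a rough illustration of that rule, the eligibility test can be sketched as below. The function name and the way failures are counted are assumptions made for this example; only the 80% figure comes from the thread, and any other QRB requirements listed in the FAQ are not modeled.

```python
def qualifies_for_bonus(successful, failed, min_percent=80):
    """Return True if the share of successful WUs is at least `min_percent`.

    Hypothetical helper: how FAH actually counts failed or expired WUs is
    not modeled here. Integer arithmetic avoids float rounding at the
    exact 80% boundary.
    """
    if successful < 0 or failed < 0:
        raise ValueError("counts must be non-negative")
    total = successful + failed
    if total == 0:
        return False  # nothing returned yet, so no rate to speak of
    return successful * 100 >= min_percent * total


if __name__ == "__main__":
    # A slot that errors out on 3 of every 10 WUs falls below the bar.
    print(qualifies_for_bonus(successful=7, failed=3))
    # One failure in a hundred is comfortably above it.
    print(qualifies_for_bonus(successful=99, failed=1))
```
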
4n0n

Re: Bad State detected on GPU (AMD)

Post by 4n0n »

bruce wrote:We'll all be pleased when the next version of FAHCore_22 is ready for release.
Where did you get your information from? Is there a public release plan or a time estimate?
bruce wrote:I don't remember reading that. From whom or where did you get that information?
In addition to Neil-B's answer, here is the source:
https://foldingathome.org/support/faq/p ... or-the-qrb
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Bad State detected on GPU (AMD)

Post by PantherX »

4n0n wrote:...Where did you get your information from? Is there a public release plan or a time estimate?...
From the researcher:
JohnChodera wrote:We've had to checkpoint these WUs every 25% due to some limitations in the core, but we're working to remedy those ASAP in a forthcoming core release so we can checkpoint closer to 5%.

Thanks for bearing with us!

~ John Chodera // MSKCC
viewtopic.php?f=19&t=35175&p=333835#p333835
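
To see why the checkpoint interval matters for WUs that fail partway through, here is a small worked sketch. It assumes that a restarted WU resumes from its most recent checkpoint and that checkpoints fall on exact multiples of the interval; `progress_lost` is a hypothetical name used only for this example.

```python
def progress_lost(progress_percent, checkpoint_interval_percent):
    """Percentage points of work redone if the WU restarts at `progress_percent`.

    Assumes checkpoints are written at exact multiples of the interval and
    that a restart resumes from the latest one.
    """
    if not 0 <= progress_percent <= 100:
        raise ValueError("progress must be between 0 and 100")
    if checkpoint_interval_percent <= 0:
        raise ValueError("checkpoint interval must be positive")
    last_checkpoint = (
        progress_percent // checkpoint_interval_percent
    ) * checkpoint_interval_percent
    return progress_percent - last_checkpoint


if __name__ == "__main__":
    # A failure at 74% with checkpoints every 25% falls back to 50%.
    print(progress_lost(74, 25))
    # With checkpoints every 5% the same failure falls back to 70%.
    print(progress_lost(74, 5))
```
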

Please note that there is no ETA or timeline. It will be released to the public once the Beta team has tested it, whenever it is made available.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
bruce

Re: Bad State detected on GPU (AMD)

Post by bruce »

4n0n wrote:Is there a public release plan or a time estimate?
FAH never pre-announces a release date. (We don't have a sales department that makes predictions of when new features will be available.) The only factual answers are "When it's ready" and, more commonly, "soon".
PantherX

Re: Bad State detected on GPU (AMD)

Post by PantherX »

FYI, the timeline that the F@H Project uses is this:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

You can vaguely map it to something like this:
Public ↞ Beta ↔ Internal ↔ In Development ↔ Thinking/Planning ↠ Backlog

Hence, "Soon" lines up with "Internal", which means it is only 2 stages away from full release :)