Multiple Issues with AMD GPU Processing?

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple Issues with AMD GPU Processing?

Post by bruce »

mwroggenbuck wrote: I may be dense, but if this is a driver problem, why do my other tasks (and games) work? This seems to be specific to Folding At Home.
Just a thought. I am just trying to problem solve and not point fingers. I do not want to hurt anyone's feelings.
That's a good question. The fact is that the error "Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)" is a known AMD driver bug.

In fact, the bug doesn't always occur. If you look at the 3rd column of https://apps.foldingathome.org/psummary, you'll notice that the proteins that FAH analyzes typically have a lot of atoms -- and that number varies a lot depending on which protein is being analyzed. (I can't compare that with what happens in a game. I don't know that much about code generated by the game industry.)
The error message ending in (-5) means that the sortShortList kernel ran out of resources. Somewhere in the driver there is a process that's doing some sorting, and the driver SHOULD know how much memory is available to perform the required sort, but if the driver gives the code the wrong size for the available memory, it will fail. The peculiar thing about this error is that proteins of certain sizes cause it while both larger and smaller proteins can be processed.
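
For anyone decoding these messages by hand, here is a minimal Python sketch (not part of FAH or OpenMM) that pulls the kernel name and status code out of the message. In the standard OpenCL header, -5 is CL_OUT_OF_RESOURCES. The function name decode_kernel_error and the small lookup table are only an illustration, and the pattern assumes the message looks like the one quoted in this thread.

```python
import re

# Small subset of the status codes defined in the Khronos OpenCL header (cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Matches messages shaped like:
#   Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
KERNEL_ERROR = re.compile(
    r"Error invoking kernel (?P<kernel>\w+): (?P<call>\w+) \((?P<code>-?\d+)\)"
)


def decode_kernel_error(message):
    """Return (kernel, OpenCL call, symbolic status), or None if nothing matches."""
    match = KERNEL_ERROR.search(message)
    if match is None:
        return None
    code = int(match.group("code"))
    name = OPENCL_ERRORS.get(code, f"unknown OpenCL status {code}")
    return match.group("kernel"), match.group("call"), name


print(decode_kernel_error(
    "Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)"
))
```

The symbolic name only says which resource class the driver reported as exhausted. It does not tell you whether the driver or the calling code computed the wrong size.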

I don't think your game is doing the same kind of analysis so I'm not surprised that the driver bug surfaces here while not in your game.
kwthom
Posts: 29
Joined: Sun Mar 29, 2020 11:06 pm
Location: Jaynes Station, AZ

Re: Multiple Issues with AMD GPU Processing?

Post by kwthom »

Wait a sec...

https://apps.foldingathome.org/wu#proje ... 981&gen=19


User    Team   CPUID             Credit  Assigned             Returned             Credited             Days   Code
Jp      0      7E978B5E72C12135  320.21  2020-04-06 20:57:50  2020-04-06 21:35:05  2020-04-06 21:27:47  0.021  Faulty 2
kwthom  35780  AA4C715E10F25523  27.44   2020-04-06 21:27:54  2020-04-06 21:35:08  2020-04-06 21:30:28  0.002  Faulty 2

Hi Jp (team 0), Your WU (P11776 R0 C14981 G19) was added to the stats database on Tue, 07 Apr 2020 04:27:47 GMT for 320.21 points of credit.
Hi kwthom (team 35780), Your WU (P11776 R0 C14981 G19) was added to the stats database on Tue, 07 Apr 2020 04:30:28 GMT for 27.44 points of credit.

If my interpretation is correct, we were both assigned this WU - and we both had issues.

Is this a legit bad WU?
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Multiple Issues with AMD GPU Processing?

Post by PantherX »

kwthom wrote:...If my interpretation is correct, we were both assigned this WU - and we both had issues.

Is this a legit bad WU?
There should be 2 additional copies of the WU floating around. If they too fail and the failure count reaches a specified threshold, the server will automatically stop assigning the WU and it will be marked as a bad WU.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
kwthom
Posts: 29
Joined: Sun Mar 29, 2020 11:06 pm
Location: Jaynes Station, AZ

Re: Multiple Issues with AMD GPU Processing?

Post by kwthom »

Code:

03:01:53:WU03:FS01:Connecting to 65.254.110.245:8080
03:01:53:WARNING:WU03:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
03:01:53:WU03:FS01:Connecting to 18.218.241.186:80
03:01:54:WARNING:WU03:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
03:01:54:ERROR:WU03:FS01:Exception: Could not get an assignment
03:03:30:WU03:FS01:Connecting to 65.254.110.245:8080
03:03:31:WU03:FS01:Assigned to work server 128.252.203.10
03:03:31:WU03:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580] from 128.252.203.10
03:03:31:WU03:FS01:Connecting to 128.252.203.10:8080
03:03:52:WARNING:WU03:FS01:WorkServer connection failed on port 8080 trying 80
03:03:52:WU03:FS01:Connecting to 128.252.203.10:80
03:04:48:WU03:FS01:Downloading 86.24MiB
03:04:54:WU03:FS01:Download 7.18%
03:05:00:WU03:FS01:Download 15.58%
03:05:06:WU03:FS01:Download 21.45%
03:05:12:WU03:FS01:Download 29.79%
03:05:18:WU03:FS01:Download 42.69%
03:05:24:WU03:FS01:Download 54.07%
03:05:30:WU03:FS01:Download 61.31%
03:05:36:WU03:FS01:Download 67.98%
03:05:42:WU03:FS01:Download 75.37%
03:05:48:WU03:FS01:Download 84.72%
03:05:54:WU03:FS01:Download 96.17%
03:05:55:WU03:FS01:Download complete
03:05:55:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:11764 run:0 clone:5592 gen:25 core:0x22 unit:0x0000003280fccb0a5e71130ac303f133
03:05:55:WU03:FS01:Starting
03:05:55:WU03:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\kwtho\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 03 -suffix 01 -version 705 -lifeline 1028 -checkpoint 20 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
03:05:55:WU03:FS01:Started FahCore on PID 4560
03:05:55:WU03:FS01:Core PID:11136
03:05:55:WU03:FS01:FahCore 0x22 started
03:05:56:WU03:FS01:0x22:*********************** Log Started 2020-04-07T03:05:55Z ***********************
03:05:56:WU03:FS01:0x22:*************************** Core22 Folding@home Core ***************************
03:05:56:WU03:FS01:0x22:       Type: 0x22
03:05:56:WU03:FS01:0x22:       Core: Core22
03:05:56:WU03:FS01:0x22:    Website: https://foldingathome.org/
03:05:56:WU03:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
03:05:56:WU03:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
03:05:56:WU03:FS01:0x22:             <rafal.wiewiora@choderalab.org>
03:05:56:WU03:FS01:0x22:       Args: -dir 03 -suffix 01 -version 705 -lifeline 4560 -checkpoint 20
03:05:56:WU03:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
03:05:56:WU03:FS01:0x22:     Config: <none>
03:05:56:WU03:FS01:0x22:************************************ Build *************************************
03:05:56:WU03:FS01:0x22:    Version: 0.0.2
03:05:56:WU03:FS01:0x22:       Date: Dec 6 2019
03:05:56:WU03:FS01:0x22:       Time: 21:30:31
03:05:56:WU03:FS01:0x22: Repository: Git
03:05:56:WU03:FS01:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
03:05:56:WU03:FS01:0x22:     Branch: HEAD
03:05:56:WU03:FS01:0x22:   Compiler: Visual C++ 2008
03:05:56:WU03:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
03:05:56:WU03:FS01:0x22:   Platform: win32 10
03:05:56:WU03:FS01:0x22:       Bits: 64
03:05:56:WU03:FS01:0x22:       Mode: Release
03:05:56:WU03:FS01:0x22:************************************ System ************************************
03:05:56:WU03:FS01:0x22:        CPU: Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
03:05:56:WU03:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
03:05:56:WU03:FS01:0x22:       CPUs: 6
03:05:56:WU03:FS01:0x22:     Memory: 15.93GiB
03:05:56:WU03:FS01:0x22:Free Memory: 11.63GiB
03:05:56:WU03:FS01:0x22:    Threads: WINDOWS_THREADS
03:05:56:WU03:FS01:0x22: OS Version: 6.2
03:05:56:WU03:FS01:0x22:Has Battery: false
03:05:56:WU03:FS01:0x22: On Battery: false
03:05:56:WU03:FS01:0x22: UTC Offset: -7
03:05:56:WU03:FS01:0x22:        PID: 11136
03:05:56:WU03:FS01:0x22:        CWD: C:\Users\kwtho\AppData\Roaming\FAHClient\work
03:05:56:WU03:FS01:0x22:         OS: Windows 10 Pro
03:05:56:WU03:FS01:0x22:    OS Arch: AMD64
03:05:56:WU03:FS01:0x22:********************************************************************************
03:05:56:WU03:FS01:0x22:Project: 11764 (Run 0, Clone 5592, Gen 25)
03:05:56:WU03:FS01:0x22:Unit: 0x0000003280fccb0a5e71130ac303f133
03:05:56:WU03:FS01:0x22:Reading tar file core.xml
03:05:56:WU03:FS01:0x22:Reading tar file integrator.xml
03:05:56:WU03:FS01:0x22:Reading tar file state.xml
03:05:56:WU03:FS01:0x22:Reading tar file system.xml
03:05:57:WU03:FS01:0x22:Digital signatures verified
03:05:57:WU03:FS01:0x22:Folding@home GPU Core22 Folding@home Core
03:05:57:WU03:FS01:0x22:Version 0.0.2
03:06:10:WU03:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
03:06:10:WU03:FS01:0x22:Saving result file ..\logfile_01.txt
03:06:10:WU03:FS01:0x22:Saving result file science.log
03:06:10:WU03:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
03:06:11:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
03:06:11:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:11764 run:0 clone:5592 gen:25 core:0x22 unit:0x0000003280fccb0a5e71130ac303f133
03:06:11:WU03:FS01:Uploading 8.00KiB to 128.252.203.10
03:06:11:WU03:FS01:Connecting to 128.252.203.10:8080
03:06:11:WU01:FS01:Connecting to 65.254.110.245:8080
03:06:11:WARNING:WU01:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
03:06:11:WU01:FS01:Connecting to 18.218.241.186:80
03:06:12:WARNING:WU01:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
03:06:12:ERROR:WU01:FS01:Exception: Could not get an assignment
03:06:12:WU01:FS01:Connecting to 65.254.110.245:8080
03:06:12:WARNING:WU01:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
03:06:12:WU01:FS01:Connecting to 18.218.241.186:80
03:06:12:WARNING:WU01:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
03:06:12:ERROR:WU01:FS01:Exception: Could not get an assignment
03:07:04:WU03:FS01:Upload complete
03:07:04:WU03:FS01:Server responded WORK_ACK (400)
03:07:04:WU03:FS01:Cleaning up
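
A log like the one above can be scanned mechanically before posting it. The sketch below is only an illustration, not an official FAH tool: the regular expressions are inferred solely from the line shapes visible in this log, and summarize_failures is a made-up helper name. Because the client reuses WU slot numbers, a long log would need extra bookkeeping that this sketch leaves out.

```python
import re

# Patterns inferred from the log excerpt above; other client versions may differ.
PROJECT = re.compile(
    r":(WU\d+):(FS\d+):0x\w+:Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
CORE_ERROR = re.compile(r":(WU\d+):(FS\d+):0x\w+:ERROR:(.*)$")
CORE_RETURN = re.compile(r":(WU\d+):(FS\d+):FahCore returned: (\w+)")


def summarize_failures(log_lines):
    """Map (WU slot, folding slot) to details for units whose core logged an ERROR."""
    units = {}
    for raw in log_lines:
        line = raw.rstrip()
        match = PROJECT.search(line)
        if match:
            info = units.setdefault((match.group(1), match.group(2)), {})
            # Project, run, clone, gen as integers.
            info["prcg"] = tuple(int(g) for g in match.groups()[2:])
            continue
        match = CORE_ERROR.search(line)
        if match:
            info = units.setdefault((match.group(1), match.group(2)), {})
            info["error"] = match.group(3).strip()
            continue
        match = CORE_RETURN.search(line)
        if match:
            info = units.setdefault((match.group(1), match.group(2)), {})
            info["returned"] = match.group(3)
    # Keep only the units for which the core itself reported an error.
    return {key: info for key, info in units.items() if "error" in info}
```

Calling summarize_failures(open("log.txt")) on a saved copy of the log would return one entry per failing unit with its project/run/clone/gen, the core's error text, and the return code name. Client-level messages such as "Could not get an assignment" are deliberately ignored because they do not come from the core.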
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple Issues with AMD GPU Processing?

Post by bruce »

Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)

This is a known AMD OpenCL bug. We're waiting either for AMD to accept responsibility for the problem and fix the drivers, or for OpenMM to figure out how to work around it. The OpenCL code is unable to allocate the right amount of memory to the sortShortList kernel for certain proteins. The same WU running on an nVidia GPU or on an AMD Navi GPU does not encounter this problem.
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Multiple Issues with AMD GPU Processing?

Post by mwroggenbuck »

Well, I tried again and failed.

I am NOT getting the sortShortList error.

Unfortunately, my log file is gone (I exited the software after the error), but the job ran to 14%, then it crashed my Radeon control software. The Radeon software restarted, and the FAH log file showed nothing during all this time. When the Advanced Control percent-complete bar (and numbers) reached 15%, that progress was not written to the log window.

This is exactly how it worked before. It would look like it was running, but not log progress to log screen (although the percent complete bar would increment). I know that if I let this go, it would eventually give up.

If it is any help, even though the percent complete bar shows progress, the GPU is not utilized or drawing any significant power. It is like the control software thinks things are fine when nothing is running.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple Issues with AMD GPU Processing?

Post by bruce »

Please post the segment of your log from where the WU that appeared to get to 15% was downloaded and started, up to the point where it crashed. (See my Sig to find your log or use FAHControl after clicking Refresh.)
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Multiple Issues with AMD GPU Processing?

Post by mwroggenbuck »

Unfortunately, I have removed the FAH software. I was going to try again in a few weeks. I went back to World Community Grid and Einstein at home.

When I looked at the log file, I saw no indication of an error for that particular work unit. My only clue is that my Radeon control software (the icon in the task bar) was restarted after the screen hung for a short period of time. Then the log file would just stop showing anything for that GPU slot (the CPU slot was just fine). The web page GUI and advanced control status would increment, but the log file would not.

I can't be more help at this time. I will continue to watch this thread.

There is one more piece of information that perhaps is relevant. The first work unit my system downloaded for the GPU did cause a sortShortList error that was detected immediately. It happened even before it appeared to start processing the work unit. The log file showed a bad work unit error, uploaded some log file, and then immediately obtained another unit. That unit worked until 14% as I described above.

Perhaps the GPU was still in a bad state from the first unit? Nothing was restarted between the two.

Sorry that I cannot be of more help.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple Issues with AMD GPU Processing?

Post by bruce »

Is FAH's log.txt still in your trash bin?

Farewell, and thanks for your comments.
alxbelu
Posts: 109
Joined: Sat Mar 14, 2020 6:28 pm

Re: Multiple Issues with AMD GPU Processing?

Post by alxbelu »

kwthom wrote:Wait a sec...

https://apps.foldingathome.org/wu#proje ... 981&gen=19

...

Is this a legit bad WU?
Check the link again (the WU has been correctly completed and uploaded).

11776 was, however, one of the specific projects that threw the sortShortList error on GCN-based AMD cards, as per this thread: viewtopic.php?f=74&t=32991
Official F@H Twitter (frequently updated): https://twitter.com/foldingathome
Official F@H Facebook: https://www.facebook.com/Foldinghome-136059519794607/

(I'm not affiliated with the F@H Team, just promoting these channels for official updates)
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Multiple Issues with AMD GPU Processing?

Post by mwroggenbuck »

No log file or anything. I uninstalled the program (by the way, your uninstall is excellent: most programs leave a few files or empty directories around, and yours did not).

If I do try again, should I increase the log level? I seem to remember there was a setting for this in the advanced controls. If yes, what level should I use? (I think the default was 3.)

FYI

I happen to be a chemist and a software engineer (retired). I greatly respect what this project is trying to do and I wish you all the best of luck. Like I said, I will try again in the future, probably when AMD publishes new drivers, but I know it is hard to get all hardware configurations to work. I have a home-built system, and something could be just different enough to cause problems.

Thanks for trying to help. :-)
NuovaApe
Posts: 54
Joined: Mon Jun 17, 2019 12:49 pm

Re: Multiple Issues with AMD GPU Processing?

Post by NuovaApe »

Could this be a bug with OpenMM rather than AMD?

Looking at the OpenMM code, sortShortList doesn't get used on nVidia, only AMD.

That's why no nVidia users report it.

There's some code in OpenMM that sorts data:

Code:

// Paraphrase of OpenCLSort::sort (pseudocode, not the actual source)
OpenCLSort::sort()
{
    if (GPU vendor is nVidia)
    {
        // run the bit of code called sortShortList2
    }
    else
    {
        // run the bit of code called sortShortList
    }
}
The bits of code are very different.

If OpenMM were running identical code on both AMD/nVidia GPUs then fine - blame AMD.

Not saying it isn't a bug in AMD drivers. Just saying it's apples and oranges.

This OpenMM code checks for nVidia drivers:

Code:

if (vendor.size() >= 6 && vendor.substr(0, 6) == "NVIDIA")
    useShortList2 = yes, if smallish size
else
    useShortList2 = no    // only use sortShortList
https://github.com/openmm/openmm/blob/6 ... CLSort.cpp
https://github.com/openmm/openmm/blob/m ... ls/sort.cl
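
To make the branch above concrete, here is a small Python restatement of the vendor test as described in this post. It is a sketch of the paraphrase, not a copy of OpenMM: the function name use_short_list2 and the max_short_list cutoff (standing in for "smallish size") are hypothetical, and the real threshold is not given here.

```python
def use_short_list2(vendor, data_size, max_short_list=4096):
    """Return True only for NVIDIA vendors with smallish data, as paraphrased above.

    max_short_list is a placeholder value chosen for illustration only.
    """
    # Same prefix test as the quoted C++: at least 6 characters, starting with "NVIDIA".
    is_nvidia = len(vendor) >= 6 and vendor[:6] == "NVIDIA"
    return is_nvidia and data_size <= max_short_list


# Typical OpenCL vendor strings reported by the two drivers.
for vendor in ("NVIDIA Corporation", "Advanced Micro Devices, Inc."):
    print(vendor, "->", use_short_list2(vendor, data_size=1000))
```

The point of the sketch is only that the choice depends on the vendor string, so the two vendors can end up running different kernels for the same WU.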
Last edited by NuovaApe on Tue Apr 07, 2020 10:41 pm, edited 1 time in total.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Multiple Issues with AMD GPU Processing?

Post by bruce »

alxbelu wrote:
kwthom wrote:https://apps.foldingathome.org/wu#proje ... 981&gen=19
Is this a legit bad WU?
Check the link again (the WU has been correctly completed and uploaded).
Unfortunately, I don't know how to tell whether the successful completion was on nVidia, but it's a pretty good guess.

The two failures might have been the AMD problem or an unstable overclock (and there are other possibilities).

I think it retries up to 5 times before it succeeds or it's declared a "bad WU" and is aborted.
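
As a rough illustration of that retry rule (an assumption about the behavior, not the actual server code), the logic could be sketched in Python like this. The limit of 5 is just the number guessed above, and wu_status is a made-up name.

```python
def wu_status(results, max_failures=5):
    """Classify a WU from its returned results, in order of arrival.

    results: list of booleans, True for a successful return and False for a faulty one.
    max_failures: assumed retry limit; the real server setting may differ.
    """
    failures = 0
    for ok in results:
        if ok:
            return "completed"
        failures += 1
        if failures >= max_failures:
            return "bad WU"
    # Not yet successful and still below the limit: hand it to another donor.
    return "reassign"


# Two faulty returns followed by a success, like the WU discussed in this thread.
print(wu_status([False, False, True]))
```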
alxbelu
Posts: 109
Joined: Sat Mar 14, 2020 6:28 pm

Re: Multiple Issues with AMD GPU Processing?

Post by alxbelu »

bruce wrote:
alxbelu wrote:
kwthom wrote:https://apps.foldingathome.org/wu#proje ... 981&gen=19
Is this a legit bad WU?
Check the link again (the WU has been correctly completed and uploaded).
Unfortunately, I don't know how to tell whether the successful completion was on nVidia, but it's a pretty good guess.

The two failures might have been the AMD problem or an unstable overclock (and there are other possibilities).
Looking at the successful user's history, I'd say the odds are that it was an Nvidia GPU: https://apps.foldingathome.org/cpu?q=sdumont.petrobras

As for the two failures, one is the OP and looks like the known AMD/GCN issue.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Location: Land Of The Long White Cloud

Re: Multiple Issues with AMD GPU Processing?

Post by PantherX »

mwroggenbuck wrote:...If I do try again, should I increase the log level? I seem to remember there was a setting for this in the advanced controls. If yes, what level should I use (I think the default was 3)...
Please leave the default log level of 3. Setting it any higher will actually hinder us since it will produce information that isn't needed to troubleshoot this issue.