Re: AMD GPU Error sortShortList on some projects

Posted: Tue Mar 24, 2020 10:56 pm
by MrFrizzy
muziqaz wrote:Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Just to add to this discussion, I have a 5700 XT and have had no failures on any of the projects mentioned in this thread. Perhaps the source of the error isn't present on Navi cards?

Successful projects (tracked in the spreadsheet in my sig): 11741-11752, 11755, 11759, 11762-11764, 11776-11778, 11780, 11781

Driver: 20.2.2

Similar post here: viewtopic.php?f=81&t=32771
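
For anyone who wants to compare a project list like the one above against their own results, here is a minimal sketch. It only expands the comma-separated ranges into a set of project numbers; the function name `expand_projects` and the variable `NAVI_OK` are made up for illustration and are not part of any FAH tool.

```python
from typing import Set


def expand_projects(spec: str) -> Set[int]:
    """Expand a list such as '11741-11752, 11755' into a set of project numbers."""
    projects: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. '11762-11764' covers 11762, 11763, 11764.
            low, high = part.split("-")
            projects.update(range(int(low), int(high) + 1))
        else:
            projects.add(int(part))
    return projects


# The list of projects reported as successful on the 5700 XT in the post above.
NAVI_OK = expand_projects(
    "11741-11752, 11755, 11759, 11762-11764, 11776-11778, 11780, 11781"
)

# Projects that other posters in this thread report as failing on their cards.
reported_failing = {11759, 11776, 11781}
print(sorted(reported_failing & NAVI_OK))
```

The overlap it prints is the set of projects that completed on the 5700 XT but are reported as failing on other AMD cards elsewhere in this thread.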

Re: AMD GPU Error sortShortList on some projects

Posted: Tue Mar 24, 2020 11:27 pm
by muziqaz
MrFrizzy wrote:
muziqaz wrote:Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Just to add to this discussion, I have a 5700 XT and have had no failures on any of the projects mentioned in this thread. Perhaps the source of the error isn't present on Navi cards?

Successful projects (tracked in the spreadsheet in my sig): 11741-11752, 11755, 11759, 11762-11764, 11776-11778, 11780, 11781

Driver: 20.2.2

Similar post here: viewtopic.php?f=81&t=32771
Thank you for the information. It seems that GCN-based cards are affected.
Big Navi can't come quick enough :D

Re: AMD GPU Error sortShortList on some projects

Posted: Tue Mar 24, 2020 11:33 pm
by alxbelu
muziqaz wrote:
MrFrizzy wrote:
muziqaz wrote:Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Just to add to this discussion, I have a 5700 XT and have had no failures on any of the projects mentioned in this thread. Perhaps the source of the error isn't present on Navi cards?

Successful projects (tracked in the spreadsheet in my sig): 11741-11752, 11755, 11759, 11762-11764, 11776-11778, 11780, 11781

Driver: 20.2.2

Similar post here: viewtopic.php?f=81&t=32771
Thank you for the information. It seems that GCN-based cards are affected.
Big Navi can't come quick enough :D
Yep, and yep! (I was planning on upgrading my desktop this year; my 290X just turned 6 and deserves retirement, but I guess we'll see whether launches actually happen as planned this year.)

Re: AMD GPU Error sortShortList on some projects

Posted: Thu Mar 26, 2020 1:13 am
by _r2w_ben
muziqaz wrote:Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Thank you for understanding
The restriction needs to be added to p14533. One was assigned at 2020-03-25T23:29:18Z.

Re: AMD GPU Error sortShortList on some projects

Posted: Thu Mar 26, 2020 7:37 am
by muziqaz
_r2w_ben wrote:
muziqaz wrote:Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Thank you for understanding
The restriction needs to be added to p14533. One was assigned at 2020-03-25T23:29:18Z.
Thanks for the info. It has been passed on to the researchers.

Re: AMD GPU Error sortShortList on some projects

Posted: Thu Mar 26, 2020 8:14 pm
by MrFrizzy
muziqaz wrote:
_r2w_ben wrote: The restriction needs to be added to p14533. One was assigned at 2020-03-25T23:29:18Z.
Thanks for the info. It has been passed on to the researchers.
On the 5700 XT, I processed the only project 14533 work unit I received to 100% and sent the results to the server, only to have the server dump them. This is a different outcome from the kernel error reported earlier, so I think the distinction is worth pointing out. Whatever is causing the kernel error, it does not affect all AMD cards.

As pointed out in an earlier post, I can process all of the COVID-19-related core22 projects just fine; not one has errored out for any reason other than me messing with my overclock. See the spreadsheet in my sig: I have tracked 95 successful COVID-19 core22 projects (85 are shown). If any of the devs/researchers need more information, I can provide PRCG numbers for all projects with timestamps, or even the full logs (I archive all of them before the client can clean them out).

I would suggest not blocking all AMD cards on these projects, and allowing species 6 to continue folding.

Re: AMD GPU Error sortShortList on some projects

Posted: Thu Mar 26, 2020 8:30 pm
by muziqaz
At the moment, projects which are known to fail on AMD are being blocked. The rest of them are freely available (relatively speaking).

Re: AMD GPU Error sortShortList on some projects

Posted: Sat Mar 28, 2020 8:13 pm
by _r2w_ben
The restriction needs to be added to p11781. One was assigned at 2020-03-28T20:10:11Z.

Re: AMD GPU Error sortShortList on some projects

Posted: Sat Mar 28, 2020 8:59 pm
by IkkeDus
I also see this problem.
AMD R9 280X 3GB (ID: 6798 SUB: 3001)

Project: 11776

It often seems to get stuck after the "...0x22:Version 0.0.2" log line. If I leave it alone, it stays there for hours. If I pause and unpause it, it either gets stuck there again or finishes with the error. At least it then retries fetching another WU.

Code:

20:42:10:WU02:FS02:Starting
20:42:10:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Ray\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 02 -suffix 01 -version 705 -lifeline 8668 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 2 -gpu 2
20:42:10:WU02:FS02:Started FahCore on PID 8908
20:42:10:WU02:FS02:Core PID:8932
20:42:10:WU02:FS02:FahCore 0x22 started
20:42:11:WU02:FS02:0x22:*********************** Log Started 2020-03-28T20:42:10Z ***********************
20:42:11:WU02:FS02:0x22:*************************** Core22 Folding@home Core ***************************
20:42:11:WU02:FS02:0x22:       Type: 0x22
20:42:11:WU02:FS02:0x22:       Core: Core22
20:42:11:WU02:FS02:0x22:    Website: https://foldingathome.org/
20:42:11:WU02:FS02:0x22:  Copyright: (c) 2009-2018 foldingathome.org
20:42:11:WU02:FS02:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
20:42:11:WU02:FS02:0x22:             <rafal.wiewiora@choderalab.org>
20:42:11:WU02:FS02:0x22:       Args: -dir 02 -suffix 01 -version 705 -lifeline 8908 -checkpoint 15
20:42:11:WU02:FS02:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 2 -gpu 2
20:42:11:WU02:FS02:0x22:     Config: <none>
20:42:11:WU02:FS02:0x22:************************************ Build *************************************
20:42:11:WU02:FS02:0x22:    Version: 0.0.2
20:42:11:WU02:FS02:0x22:       Date: Dec 6 2019
20:42:11:WU02:FS02:0x22:       Time: 21:30:31
20:42:11:WU02:FS02:0x22: Repository: Git
20:42:11:WU02:FS02:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
20:42:11:WU02:FS02:0x22:     Branch: HEAD
20:42:11:WU02:FS02:0x22:   Compiler: Visual C++ 2008
20:42:11:WU02:FS02:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
20:42:11:WU02:FS02:0x22:   Platform: win32 10
20:42:11:WU02:FS02:0x22:       Bits: 64
20:42:11:WU02:FS02:0x22:       Mode: Release
20:42:11:WU02:FS02:0x22:************************************ System ************************************
20:42:11:WU02:FS02:0x22:        CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
20:42:11:WU02:FS02:0x22:     CPU ID: GenuineIntel Family 6 Model 23 Stepping 10
20:42:11:WU02:FS02:0x22:       CPUs: 4
20:42:11:WU02:FS02:0x22:     Memory: 4.00GiB
20:42:11:WU02:FS02:0x22:Free Memory: 2.13GiB
20:42:11:WU02:FS02:0x22:    Threads: WINDOWS_THREADS
20:42:11:WU02:FS02:0x22: OS Version: 6.2
20:42:11:WU02:FS02:0x22:Has Battery: false
20:42:11:WU02:FS02:0x22: On Battery: false
20:42:11:WU02:FS02:0x22: UTC Offset: 1
20:42:11:WU02:FS02:0x22:        PID: 8932
20:42:11:WU02:FS02:0x22:        CWD: C:\Users\\AppData\Roaming\FAHClient\work
20:42:11:WU02:FS02:0x22:         OS: Windows 10 Pro
20:42:11:WU02:FS02:0x22:    OS Arch: AMD64
20:42:11:WU02:FS02:0x22:********************************************************************************
20:42:11:WU02:FS02:0x22:Project: 11776 (Run 0, Clone 1781, Gen 6)
20:42:11:WU02:FS02:0x22:Unit: 0x0000000f287234c95e73c47b56c80b8a
20:42:11:WU02:FS02:0x22:Reading tar file core.xml
20:42:11:WU02:FS02:0x22:Reading tar file integrator.xml
20:42:11:WU02:FS02:0x22:Reading tar file state.xml
20:42:12:WU02:FS02:0x22:Reading tar file system.xml
20:42:14:WU02:FS02:0x22:Digital signatures verified
20:42:14:WU02:FS02:0x22:Folding@home GPU Core22 Folding@home Core
20:42:14:WU02:FS02:0x22:Version 0.0.2
20:42:45:WU02:FS02:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
20:42:45:WU02:FS02:0x22:Saving result file ..\logfile_01.txt
20:42:45:WU02:FS02:0x22:Saving result file science.log
20:42:45:WU02:FS02:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
20:42:45:WARNING:WU02:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:42:45:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:11776 run:0 clone:1781 gen:6 core:0x22 unit:0x0000000f287234c95e73c47b56c80b8a
20:42:45:WU02:FS02:Uploading 15.00KiB to 40.114.52.201
20:42:45:WU02:FS02:Connecting to 40.114.52.201:8080
20:42:46:WU03:FS02:Connecting to 65.254.110.245:8080
20:42:46:WARNING:WU03:FS02:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:42:46:WU03:FS02:Connecting to 18.218.241.186:80
20:42:47:WARNING:WU03:FS02:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:42:47:ERROR:WU03:FS02:Exception: Could not get an assignment
20:42:47:WU03:FS02:Connecting to 65.254.110.245:8080
20:42:47:WARNING:WU03:FS02:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:42:47:WU03:FS02:Connecting to 18.218.241.186:80
20:42:48:WARNING:WU03:FS02:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:42:48:ERROR:WU03:FS02:Exception: Could not get an assignment
20:43:06:WARNING:WU02:FS02:WorkServer connection failed on port 8080 trying 80
20:43:06:WU02:FS02:Connecting to 40.114.52.201:80
20:43:14:WU02:FS02:Upload 100.00%
20:43:30:WU02:FS02:Upload complete
20:43:30:WU02:FS02:Server responded WORK_ACK (400)
20:43:30:WU02:FS02:Cleaning up
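
When comparing logs like this one across machines, it can help to pull out just the failing work unit and the OpenCL status code. The sketch below is an illustration only: the regular expressions are written against the log lines shown above, and the function name `find_kernel_failures` is made up. The mapping of -5 to CL_OUT_OF_RESOURCES comes from the standard OpenCL headers; what that status means for this particular kernel is not established in this thread.

```python
import re

# Written against lines like:
# 20:42:45:WU02:FS02:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
KERNEL_ERROR = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<wu>WU\d+):(?P<slot>FS\d+):0x22:"
    r"ERROR:exception: Error invoking kernel (?P<kernel>\w+): "
    r"(?P<call>\w+) \((?P<code>-?\d+)\)"
)
# Written against lines like:
# 20:42:11:WU02:FS02:0x22:Project: 11776 (Run 0, Clone 1781, Gen 6)
PROJECT = re.compile(
    r"0x22:Project: (?P<project>\d+) "
    r"\(Run (?P<run>\d+), Clone (?P<clone>\d+), Gen (?P<gen>\d+)\)"
)

# A small subset of the status codes defined in the OpenCL headers (CL/cl.h).
OPENCL_STATUS = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}


def find_kernel_failures(lines):
    """Return one dict per kernel error, tagged with the most recent PRCG seen."""
    failures = []
    prcg = None
    for line in lines:
        match = PROJECT.search(line)
        if match:
            prcg = tuple(int(match.group(k)) for k in ("project", "run", "clone", "gen"))
            continue
        match = KERNEL_ERROR.match(line)
        if match:
            code = int(match.group("code"))
            failures.append({
                "time": match.group("time"),
                "slot": match.group("slot"),
                "prcg": prcg,
                "kernel": match.group("kernel"),
                "call": match.group("call"),
                "code": code,
                "status": OPENCL_STATUS.get(code, "unknown"),
            })
    return failures


# Three lines copied from the log above.
SAMPLE = [
    "20:42:11:WU02:FS02:0x22:Project: 11776 (Run 0, Clone 1781, Gen 6)",
    "20:42:14:WU02:FS02:0x22:Version 0.0.2",
    "20:42:45:WU02:FS02:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)",
]
failures = find_kernel_failures(SAMPLE)
for failure in failures:
    print(failure)
```

Feeding it a full log file (for example `find_kernel_failures(open(path, encoding="utf-8"))`, where `path` is wherever you archived the log) should give one entry per failed unit, which is easier to paste into a report than the whole log.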

Re: AMD GPU Error sortShortList on some projects

Posted: Sat Mar 28, 2020 10:48 pm
by muziqaz
Just an update: some people at AMD are aware of this issue and are looking into it :)
Hopefully we will have it solved sooner rather than later :)
Thank you for your patience.

Re: AMD GPU Error sortShortList on some projects

Posted: Sun Mar 29, 2020 12:11 am
by bruce
First, a temporary solution from FAH: those projects will not be assigned to that group of GPUs.
Second, a permanent solution: new AMD drivers or a new FAHCore from FAH will be prepared that fixes the original problem. (Then the temporary solution will be removed.)
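
To make the temporary solution concrete, here is a rough sketch of what a per-project restriction could look like. This is not how the FAH assignment servers are actually implemented; the `Gpu` class, the family labels, and `may_assign` are all hypothetical. The project numbers are simply the ones reported in this thread, and the only grouping assumed is what has been said here: GCN-based cards are affected and Navi is not.

```python
from dataclasses import dataclass

# Projects reported in this thread as failing or needing the restriction.
RESTRICTED_PROJECTS = {11759, 11776, 11781, 14533}

# Hypothetical family labels; the thread only says GCN-based cards are
# affected and Navi cards are not.
AFFECTED_FAMILIES = {"gcn"}


@dataclass(frozen=True)
class Gpu:
    vendor: str  # e.g. "amd" or "nvidia"
    family: str  # e.g. "gcn" or "navi" (illustrative labels only)


def may_assign(project: int, gpu: Gpu) -> bool:
    """Return False only for restricted projects on affected AMD families."""
    if gpu.vendor != "amd":
        return True
    if gpu.family not in AFFECTED_FAMILIES:
        return True
    return project not in RESTRICTED_PROJECTS


rx580 = Gpu(vendor="amd", family="gcn")
rx5700xt = Gpu(vendor="amd", family="navi")
print(may_assign(11776, rx580))     # restricted project on an affected family
print(may_assign(11776, rx5700xt))  # same project on a Navi card
```

Once fixed drivers or a fixed FAHCore are available, the equivalent of emptying `RESTRICTED_PROJECTS` would correspond to removing the temporary solution.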

Re: AMD GPU Error sortShortList on some projects

Posted: Sun Mar 29, 2020 10:41 am
by alxbelu
That's great news! Thanks for the update!

Re: AMD GPU Error sortShortList on some projects

Posted: Sat Apr 04, 2020 9:41 pm
by _r2w_ben
The restriction needs to be added to p11759. One was assigned at 2020-04-04T20:59:25Z.

Re: AMD GPU Error sortShortList on some projects

Posted: Sat Apr 04, 2020 9:48 pm
by bruce
MrFrizzy wrote:
muziqaz wrote:Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Just to add to this discussion, I have a 5700 XT and have had no failures on any of the projects mentioned in this thread. Perhaps the source of the error isn't present on Navi cards?

Right. Navi is the one exception.

Re: AMD GPU Error sortShortList on some projects

Posted: Tue Apr 07, 2020 4:47 am
by Hey_Allen
It appears that the AMD GPUs are still getting this family of projects.

Project 11776 just failed on my RX 580, and I've had a few work units end with the status "Failure 2" as reported on the stats page.
I've seen a few instances where a ~20-credit job was submitted and, when I caught it and examined it, I found a failed unit.

Code:

20:05:04:WU02:FS01:Connecting to 65.254.110.245:8080
20:05:04:WU02:FS01:Assigned to work server 140.163.4.231
20:05:04:WU02:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590] from 140.163.4.231
20:05:04:WU02:FS01:Connecting to 140.163.4.231:8080
20:05:25:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
20:05:25:WU02:FS01:Connecting to 140.163.4.231:80
20:05:46:ERROR:WU02:FS01:Exception: Failed to connect to 140.163.4.231:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
******************************* Date: 2020-04-06 *******************************
01:27:04:WU02:FS01:Connecting to 65.254.110.245:8080
01:27:04:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
01:27:04:WU02:FS01:Connecting to 18.218.241.186:80
01:27:04:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:27:04:ERROR:WU02:FS01:Exception: Could not get an assignment
01:51:27:WU02:FS01:Connecting to 65.254.110.245:8080
01:51:28:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
01:51:28:WU02:FS01:Connecting to 18.218.241.186:80
01:51:29:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:51:29:ERROR:WU02:FS01:Exception: Could not get an assignment
01:53:04:WU02:FS01:Connecting to 65.254.110.245:8080
01:53:04:WU02:FS01:Assigned to work server 40.114.52.201
01:53:04:WU02:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590] from 40.114.52.201
01:53:04:WU02:FS01:Connecting to 40.114.52.201:8080
01:53:32:WU02:FS01:Downloading 79.12MiB
01:53:38:WU02:FS01:Download 7.74%
01:53:44:WU02:FS01:Download 19.59%
01:53:50:WU02:FS01:Download 30.10%
01:53:56:WU02:FS01:Download 43.84%
01:54:02:WU02:FS01:Download 57.90%
01:54:08:WU02:FS01:Download 71.96%
01:54:14:WU02:FS01:Download 86.57%
01:54:19:WU02:FS01:Download complete
01:54:19:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11776 run:0 clone:31304 gen:7 core:0x22 unit:0x0000000b287234c95e7931c2b282407f
01:54:20:WU02:FS01:Starting
01:54:20:WU02:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Josh\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 02 -suffix 01 -version 705 -lifeline 14036 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
01:54:20:WU02:FS01:Started FahCore on PID 2444
01:54:20:WU02:FS01:Core PID:13684
01:54:20:WU02:FS01:FahCore 0x22 started
01:54:20:WU02:FS01:0x22:*********************** Log Started 2020-04-07T01:54:20Z ***********************
01:54:20:WU02:FS01:0x22:*************************** Core22 Folding@home Core ***************************
01:54:20:WU02:FS01:0x22:       Type: 0x22
01:54:20:WU02:FS01:0x22:       Core: Core22
01:54:20:WU02:FS01:0x22:    Website: https://foldingathome.org/
01:54:20:WU02:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
01:54:20:WU02:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
01:54:20:WU02:FS01:0x22:             <rafal.wiewiora@choderalab.org>
01:54:20:WU02:FS01:0x22:       Args: -dir 02 -suffix 01 -version 705 -lifeline 2444 -checkpoint 15
01:54:20:WU02:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
01:54:20:WU02:FS01:0x22:     Config: <none>
01:54:20:WU02:FS01:0x22:************************************ Build *************************************
01:54:20:WU02:FS01:0x22:    Version: 0.0.2
01:54:20:WU02:FS01:0x22:       Date: Dec 6 2019
01:54:20:WU02:FS01:0x22:       Time: 21:30:31
01:54:20:WU02:FS01:0x22: Repository: Git
01:54:20:WU02:FS01:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
01:54:20:WU02:FS01:0x22:     Branch: HEAD
01:54:20:WU02:FS01:0x22:   Compiler: Visual C++ 2008
01:54:20:WU02:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
01:54:20:WU02:FS01:0x22:   Platform: win32 10
01:54:20:WU02:FS01:0x22:       Bits: 64
01:54:20:WU02:FS01:0x22:       Mode: Release
01:54:20:WU02:FS01:0x22:************************************ System ************************************
01:54:20:WU02:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
01:54:20:WU02:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
01:54:20:WU02:FS01:0x22:       CPUs: 12
01:54:20:WU02:FS01:0x22:     Memory: 31.94GiB
01:54:20:WU02:FS01:0x22:Free Memory: 25.86GiB
01:54:20:WU02:FS01:0x22:    Threads: WINDOWS_THREADS
01:54:20:WU02:FS01:0x22: OS Version: 6.2
01:54:20:WU02:FS01:0x22:Has Battery: false
01:54:20:WU02:FS01:0x22: On Battery: false
01:54:20:WU02:FS01:0x22: UTC Offset: -7
01:54:20:WU02:FS01:0x22:        PID: 13684
01:54:20:WU02:FS01:0x22:        CWD: C:\Users\Josh\AppData\Roaming\FAHClient\work
01:54:20:WU02:FS01:0x22:         OS: Windows 10 Pro
01:54:20:WU02:FS01:0x22:    OS Arch: AMD64
01:54:20:WU02:FS01:0x22:********************************************************************************
01:54:20:WU02:FS01:0x22:Project: 11776 (Run 0, Clone 31304, Gen 7)
01:54:20:WU02:FS01:0x22:Unit: 0x0000000b287234c95e7931c2b282407f
01:54:20:WU02:FS01:0x22:Reading tar file core.xml
01:54:20:WU02:FS01:0x22:Reading tar file integrator.xml
01:54:20:WU02:FS01:0x22:Reading tar file state.xml
01:54:21:WU02:FS01:0x22:Reading tar file system.xml
01:54:21:WU02:FS01:0x22:Digital signatures verified
01:54:21:WU02:FS01:0x22:Folding@home GPU Core22 Folding@home Core
01:54:21:WU02:FS01:0x22:Version 0.0.2
01:54:37:WU02:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
01:54:37:WU02:FS01:0x22:Saving result file ..\logfile_01.txt
01:54:37:WU02:FS01:0x22:Saving result file science.log
01:54:37:WU02:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
01:54:37:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:54:37:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11776 run:0 clone:31304 gen:7 core:0x22 unit:0x0000000b287234c95e7931c2b282407f
01:54:37:WU02:FS01:Uploading 8.00KiB to 40.114.52.201