AMD GPU Error sortShortList on some projects

If you think it might be a driver problem, see viewforum.php?f=79

Moderators: Site Moderators, FAHC Science Team

Jan
Posts: 80
Joined: Tue Mar 31, 2020 6:46 pm

Re: AMD GPU Error sortShortList on some projects

Post by Jan »

muziqaz wrote:If one person reports a faulty WU, we question that person's hardware; if two or more return the same WU as faulty, we start questioning the WU :)
Maybe a bit off topic, but this WU is already in generation 50. How can it be faulty? Is my understanding of the whole process still lacking?
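The rule of thumb quoted above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how the FAH servers actually decide anything; the function name `classify_reports`, the `(wu_id, donor)` tuples and the labels are all hypothetical.

```python
from collections import defaultdict

def classify_reports(reports):
    """Apply the forum rule of thumb to (wu_id, donor) failure reports.

    Hypothetical helper: one donor failing a WU points at that donor's
    hardware; two or more distinct donors failing the same WU point at the WU.
    """
    donors_by_wu = defaultdict(set)
    for wu_id, donor in reports:
        donors_by_wu[wu_id].add(donor)
    return {
        wu_id: "suspect WU" if len(donors) >= 2 else "suspect hardware"
        for wu_id, donors in donors_by_wu.items()
    }

# Made-up example data: wu-1 failed for two donors, wu-2 for only one.
example = [("wu-1", "donor-a"), ("wu-1", "donor-b"), ("wu-2", "donor-a")]
result = classify_reports(example)
print(result)
```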
Simplex0
Posts: 69
Joined: Sun Oct 06, 2013 10:35 am

Post by Simplex0 »

Jan wrote:Client-type advanced will not fix any errors. It will simply make your client look for advanced WUs (which are WUs that just made it out of beta testing) in addition to "normal" WUs, and that's it. Afaik.

muziqaz might have a point, as this WU has been returned 2 or 3 times as faulty. Have you had other WUs on your GPUs so far/since then?
Well, in my case I hoped that using the 'advanced' setting would result in fewer errors, because it would download more of the newer and better-coded applications, as indicated by Neil-B earlier in this thread. In my case it did not help.
I have checked the logs today from my computers running the GTX1070 and RTX2080, and there are no errors on work units running on those cards.

I will not run any more folding on my Radeon cards for a while. I will check back in the forum in a few weeks to see if the problem is still there.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Post by Neil-B »

Simplex0 wrote:… as indicated by Neil-B earlier in this thread but in my case it did not help.
.. tbh, I wasn't recommending that as a solution, simply trying to explain to a previous poster one reason why they might be seeing their issue when folding FAH but not when folding ADV … Apologies if it came across that I was promoting this as a solution.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

muziqaz
Posts: 905
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Post by muziqaz »

Jan wrote:
muziqaz wrote:If one person reports a faulty WU, we question that person's hardware; if two or more return the same WU as faulty, we start questioning the WU :)
Maybe a bit off topic, but this WU is already in generation 50. How can it be faulty? Is my understanding of the whole process still lacking?
The project is in gen50; a WU is a single Work Unit you download to process :)
That particular WU is most likely faulty, but that does not mean the whole project, or even the whole generation, is faulty as well :)
Simplex0 wrote:
I will not run any more folding on my Radeon cards for a while. I will check back in the forum in a few weeks to see if the problem is still there.
It is your choice of course, but in my opinion this solution is too drastic. The type of error you are encountering is not very frequent, and is independent of the type of GPU the folder is running. Currently these random failed WUs are known and considered acceptable. Devs are working on better handling of these errors, though.
FAH Beta tester
Jan
Posts: 80
Joined: Tue Mar 31, 2020 6:46 pm

Post by Jan »

muziqaz wrote:The project is in gen50; a WU is a single Work Unit you download to process :)
That particular WU is most likely faulty, but that does not mean the whole project, or even the whole generation, is faulty as well :)
Sure. I just didn't think the new WUs after so many generations could still (or rather: newly) be faulty. I probably don't understand the generating process of these WUs well enough. And now I'm done derailing this thread. :P
Simplex0
Posts: 69
Joined: Sun Oct 06, 2013 10:35 am

Post by Simplex0 »

Neil-B wrote:
Simplex0 wrote:… as indicated by Neil-B earlier in this thread but in my case it did not help.
.. tbh, I wasn't recommending that as a solution, simply trying to explain to a previous poster one reason why they might be seeing their issue when folding FAH but not when folding ADV … Apologies if it came across that I was promoting this as a solution.
No problem. The fact is that sam6861 observed a reduction in errors after using this setting, so it was worth trying.

It is usually like that: you observe a change and come up with an assumption about WHY it happened.
That may eventually turn out to be wrong, but it was still a plausible explanation at the time.
Simplex0
Posts: 69
Joined: Sun Oct 06, 2013 10:35 am

Post by Simplex0 »

muziqaz wrote:It is your choice of course, but in my opinion this solution is too drastic. The type of error you are encountering is not very frequent, and is independent of the type of GPU the folder is running. Currently these random failed WUs are known and considered acceptable. Devs are working on better handling of these errors, though.
The fact is that this type of error was very frequent on my R9 290 cards and close to nonexistent on my Nvidia cards. I have observed a lot of these errors on my AMD cards lately and none on my Nvidia cards.
I wonder whether this type of work unit is sent more frequently to AMD cards specifically? I will try to dig a little deeper next week.

For now I can say that the log files covering 15 days on my computer running Nvidia cards contain 0 cases of Bad state detected, BAD WORK UNIT(114=0x72).
On the computer with AMD R9 290 cards, the log files covering 9 days contain 8 work units that resulted in Bad state detected, BAD WORK UNIT(114=0x72).
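To put the two log windows on the same footing, the counts above can be turned into a per-day rate. This is a rough sketch using only the numbers quoted in this post; the function name `failures_per_day` is hypothetical, and the comparison ignores how many work units each machine actually processed, which the post does not state.

```python
def failures_per_day(failed_wus, days):
    """Average number of failed work units per day of logs."""
    if days <= 0:
        raise ValueError("days must be positive")
    return failed_wus / days

# Figures quoted in the post: 8 bad WUs in 9 days (AMD R9 290),
# 0 bad WUs in 15 days (Nvidia).
amd_rate = failures_per_day(8, 9)
nvidia_rate = failures_per_day(0, 15)

print(f"AMD R9 290: {amd_rate:.2f} bad WUs/day")
print(f"Nvidia:     {nvidia_rate:.2f} bad WUs/day")
```

A fairer comparison would divide by the number of work units completed on each machine rather than by days, if that figure is available from the logs.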

Thank you all for your support
muziqaz
Posts: 905
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Post by muziqaz »

Simplex0 wrote: The fact is that this type of error was very frequent on my R9 290 cards and close to nonexistent on my Nvidia cards. I have observed a lot of these errors on my AMD cards lately and none on my Nvidia cards.
I wonder whether this type of work unit is sent more frequently to AMD cards specifically? I will try to dig a little deeper next week.

For now I can say that the log files covering 15 days on my computer running Nvidia cards contain 0 cases of Bad state detected, BAD WORK UNIT(114=0x72).
On the computer with AMD R9 290 cards, the log files covering 9 days contain 8 work units that resulted in Bad state detected, BAD WORK UNIT(114=0x72).

Thank you all for your support
So maybe it is time to clean the fans of the card, and maybe reduce the clocks ;) The 290 is a VERY old card; it's possible that the VRMs are on their last legs ;)
FAH Beta tester
Simplex0
Posts: 69
Joined: Sun Oct 06, 2013 10:35 am

Post by Simplex0 »

muziqaz wrote:
Simplex0 wrote: The fact is that this type of error was very frequent on my R9 290 cards and close to nonexistent on my Nvidia cards. I have observed a lot of these errors on my AMD cards lately and none on my Nvidia cards.
I wonder whether this type of work unit is sent more frequently to AMD cards specifically? I will try to dig a little deeper next week.

For now I can say that the log files covering 15 days on my computer running Nvidia cards contain 0 cases of Bad state detected, BAD WORK UNIT(114=0x72).
On the computer with AMD R9 290 cards, the log files covering 9 days contain 8 work units that resulted in Bad state detected, BAD WORK UNIT(114=0x72).

Thank you all for your support
So maybe it is time to clean the fans of the card, and maybe reduce the clocks ;) The 290 is a VERY old card; it's possible that the VRMs are on their last legs ;)
The computer is fully water cooled with a custom loop, and the GPU and VRM temperatures on my graphics cards stay under 65 °C at all times.
You are right that these are indeed very old graphics cards, and that seems to be the problem: after reducing the GPU clock to 80% on all cards, everything works just fine now. :)

Thank you for your support muziqaz.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

The errors with the keyword "sortShortList" are unique to AMD GPUs and simply do not occur on nV hardware.

"Bad state detected, BAD WORK UNIT(114=0x72)" covers that case as well as several other possibilities across both brands of GPUs. If you eliminate the sortShortList errors, are the Bad State errors about the same?
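One way to answer that question from a saved log is to count the two keywords separately. The sketch below is a minimal example under stated assumptions: it only does substring matching on the strings quoted in this thread ("sortShortList" and "BAD WORK UNIT"), it assumes each sortShortList failure leads to exactly one bad work unit line, and the function name `count_keywords` and the sample lines are hypothetical rather than real client output.

```python
def count_keywords(log_lines):
    """Count log lines mentioning sortShortList and lines mentioning BAD WORK UNIT.

    Assumption: plain-text log lines; matching is by substring only.
    """
    counts = {"sortShortList": 0, "bad_work_unit": 0}
    for line in log_lines:
        if "sortShortList" in line:
            counts["sortShortList"] += 1
        if "BAD WORK UNIT" in line:
            counts["bad_work_unit"] += 1
    return counts

# Made-up sample lines for illustration only; real log lines will differ.
sample = [
    "error invoking kernel sortShortList",
    "Bad state detected, BAD WORK UNIT(114=0x72)",
    "Bad state detected, BAD WORK UNIT(114=0x72)",
    "work unit completed",
]
counts = count_keywords(sample)

# Rough estimate of bad work units not explained by sortShortList,
# under the one-to-one assumption stated above.
other_bad = max(counts["bad_work_unit"] - counts["sortShortList"], 0)
print(counts, other_bad)
```

For a real comparison, read each machine's log with `open(path).read().splitlines()` and compare `other_bad` between the AMD and Nvidia machines.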
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Post by JohnChodera »

Just a quick note here: We've fixed this issue in OpenMM:
https://github.com/openmm/openmm/pull/2631

We're just working on backporting the fix into core22.

Thanks for your patience!

~ John Chodera // MSKCC