16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

Re: 16600 consistently crashing on AMD Radeon VII

Postby ThWuensche » Thu Aug 20, 2020 8:02 pm

muziqaz wrote:nVidia is doing that.
FAH dev creates fahcore>nVidia rep takes that core and runs it through their hardware in their lab with all their driver profilers and tools>driver team either optimises the drivers for the fahcore, or they give suggestions/submit patches of code to fah devs to improve fahcore.
Hardware vendor does not need to have source code in order to optimise for the code.
I know how much nVidia is involved, and I just don't see the same involvement from AMD, not even close, which is a shame, as their hardware was always very strong in pure compute tasks.
Also, fah devs mainly have nVidia hardware, as far as I know. I do not believe there are any AMD GPUs in their possession. At least we can be content that AMD CPUs punched through the Intel wall when it comes to fah.


As far as nVidia support goes, that's good. AMD is also listed as a contributor on the FAH website; maybe they could be motivated to provide such support, at least for the ROCm stack, the project they advertise as an open and scalable HPC solution. It can't be in their interest to have nVidia recognized as running without problems while AMD is consistently troublesome.

If FAH developers develop and test mostly on nVidia hardware and do not have AMD GPUs, then it's no wonder that FAH has more trouble on AMD hardware in the wild. But in that case AMD should be asked why they have their logo as a supporter on the website.

As for the need for source code, there may be a difference between optimizing and debugging. For optimizing, it may be enough to profile which kernels are run and how frequently, without needing to understand the calculation flow. For debugging, however, it is very helpful to understand what is going on, what should be going on, and at what point and under which preconditions a failure occurs. All the more so because some of the failures seem to occur early, which might be caused by missing or invalid initialization. That understanding is best achieved by following the code through the logic of the source and observing the effects (following variable values ...). If FAH developers can't follow that flow for lack of appropriate (AMD) systems and instead rely on feedback from a series of published core versions, that looks to me like a rather long turnaround time for debugging. To be effective, such turnaround times should be measured in minutes, not in the weeks between released core versions. Just my thoughts; I may well be missing something important.
ThWuensche
 
Posts: 73
Joined: Fri May 29, 2020 5:10 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby n_w95482 » Thu Aug 20, 2020 9:32 pm

muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)

The change seems to have worked on my RX 580. It finished the 16600 WU it was working on when I posted earlier, has since completed 25 consecutive 13423s, and is currently working on a 13422, all with no issues. Thank you!
Folding since December 2003. In memory of my mother, who lost her battle with cancer.

n_w95482
 
Posts: 65
Joined: Tue May 01, 2012 1:46 am
Location: California

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Thu Aug 20, 2020 9:39 pm

n_w95482 wrote:
muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)

The change seems to have worked on my RX 580. It finished the 16600 WU it was working on when I posted earlier, has since completed 25 consecutive 13423s, and is currently working on a 13422, all with no issues. Thank you!


Good to know, thanks :)
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby UofM.MartinK » Fri Aug 21, 2020 2:12 am

Same here, no more 16600 for my RX580 since the last one, received 2020-08-19 02:00:51 GMT. And as if to celebrate, that last WU (project:16600 run:0 clone:1393 gen:201) completed after four "Particle coordinate is nan" checkpoint resumes :)
UofM.MartinK
 
Posts: 55
Joined: Tue Apr 07, 2020 9:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby PantherX » Fri Aug 21, 2020 12:50 pm

ThWuensche wrote:...But in that case AMD should be asked why they have their logo as a supporter on the website...

The F@H system has almost 20 years under its belt. AMD (at that time, ATI) was the first GPU vendor to support folding. They did play a decent role in developing FahCore and optimizing their GPUs/drivers for folding. However, over the years, things have changed, and currently it seems that Nvidia has sufficient presence to provide the right guidance/testing/debugging to ensure that FahCore can fully utilize their GPUs. I do hope that someone from AMD will be able to provide a similar level of support, but time will tell :)
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

PantherX
Site Moderator
 
Posts: 6765
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud

Re: 16600 consistently crashing on AMD Radeon VII

Postby ThWuensche » Fri Aug 21, 2020 3:46 pm

PantherX wrote:
ThWuensche wrote:...But in that case AMD should be asked why they have their logo as a supporter on the website...

The F@H system has almost 20 years under its belt. AMD (at that time, ATI) was the first GPU vendor to support folding. They did play a decent role in developing FahCore and optimizing their GPUs/drivers for folding. However, over the years, things have changed, and currently it seems that Nvidia has sufficient presence to provide the right guidance/testing/debugging to ensure that FahCore can fully utilize their GPUs. I do hope that someone from AMD will be able to provide a similar level of support, but time will tell :)

So let's hope somebody from the AMD ROCm team is following this thread :wink:
ThWuensche
 
Posts: 73
Joined: Fri May 29, 2020 5:10 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby bruce » Fri Aug 21, 2020 5:45 pm

There are a number of GPUs that you might be talking about. Please post the PCI codes associated with the one you're using (from lspci or GPU-Z).

I understand P16600 is not being assigned to AMD Species 5, which is a broad group of AMD devices.
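
For anyone unsure what "PCI codes" means here: on Linux, `lspci -nn` prints a `[vendor:device]` pair in hex at the end of each device line (1002 is AMD's vendor ID). Below is a minimal Python sketch that pulls those pairs out of `lspci -nn` output for GPU-type devices. The function name `gpu_pci_ids` and the `SAMPLE` text are illustrative only, not anything from FAH, and the sample lines are not taken from anyone's machine in this thread.

```python
import re

# Matches the "[vendor:device]" pair printed by `lspci -nn`, e.g. "[1002:67df]".
# The class code such as "[0300]" has no colon, so it is not matched.
PCI_ID = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")

# Device class names that lspci typically uses for GPUs.
GPU_CLASSES = ("VGA compatible controller", "Display controller", "3D controller")


def gpu_pci_ids(lspci_output):
    """Return (vendor, device) hex pairs for GPU lines in `lspci -nn` output."""
    ids = []
    for line in lspci_output.splitlines():
        if not any(cls in line for cls in GPU_CLASSES):
            continue  # skip non-GPU devices
        matches = PCI_ID.findall(line)
        if matches:
            # The vendor:device pair is the last bracketed pair on the line.
            vendor, device = matches[-1]
            ids.append((vendor.lower(), device.lower()))
    return ids


# Illustrative sample only (hypothetical bus addresses, shaped like lspci -nn output).
SAMPLE = (
    "00:1f.3 Audio device [0403]: Example audio controller [8086:a348]\n"
    "0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. "
    "[AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] "
    "[1002:67df] (rev e7)\n"
)

if __name__ == "__main__":
    print(gpu_pci_ids(SAMPLE))
```

To use it on a real system, save the output of `lspci -nn` to a text file and pass the file contents to `gpu_pci_ids`. On Windows, GPU-Z shows the same vendor and device IDs in its "Device ID" field.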
bruce
 
Posts: 20009
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Fri Aug 21, 2020 6:56 pm

bruce wrote:There are a number of GPUs that you might be talking about. Please post the PCI codes associated with the one you're using (from lspci or GPU-Z).

I understand P16600 is not being assigned to AMD Species 5, which is a broad group of AMD devices.


Bruce, they all run AMD, and they are not supposed to get any 16600 :) They are just reporting that they are no longer receiving them on AMD species 5. All is good.
User avatar
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London
