16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

Re: 16600 consistently crashing on AMD Radeon VII

Postby Neil-B » Wed Aug 19, 2020 11:00 am

@muziqaz ... so you are confirming that specific cards (AMD, I'm guessing) are running failure rates in the 80-90% range across a range of Projects? ... If so, can we change the thread title to something more relevant than the current one, which implies the discussion/issue is about a single project? ... The two recent sets of failure rates posted in this thread have very different profiles across projects: one is for the most part specific to this project, while the other appears to be failing on potentially all projects (albeit there may be some projects it is actually working on that have not been posted). This might indicate two different types of scenario.

Maybe even move the thread out of the "Issues with a specific WU" forum if this is a much wider issue, since from what you are confirming it doesn't appear to be specific to one WU.
1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent, Quadro K420 1GB, FAH 7.6.13
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro, Quadro M1000M 2GB, FAH 7.6.13
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro, GTX 750Ti 2GB, FAH 7.6.13
Neil-B
 
Posts: 1405
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: 16600 consistently crashing on AMD Radeon VII

Postby UofM.MartinK » Wed Aug 19, 2020 2:12 pm

At least for me, this thread's focus and main finding, as in the title, is p16600 - which now has been disabled for everything AMD before Navi (5000 series), yeah! Thanks, muziqaz!

The "waters" are muddied if you just look at the plain failure rates per project reported in this thread, because at least the p134XX series is also showing up as failures (I guess on many cards, but also on the pre-Navi AMD cards which are affected by the totally unrelated p16600 issue). But those p134XX failures seem to be far less critical, because they almost always happen in the first 9-17 seconds. And if they don't fail right at the beginning, those projects usually seem to complete without any further hiccup.

This thread is a very good example of how several effects can overlap and make the data hard to interpret, which made "the usually correct" explanation of overclocked hardware etc. appear very convincing.

Even I was convinced my card has a hardware, clock or driver issue, and spent almost a week fiddling with drivers & underclocking, and now even more posters come forward reporting going through the same motions... almost like a #p16600metoo :)
UofM.MartinK
 
Posts: 55
Joined: Tue Apr 07, 2020 9:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Wed Aug 19, 2020 2:21 pm

The goal of this thread was finally achieved by bringing this issue to the project owner's attention. The project is now being checked out in more detail and has been excluded for problematic hardware (for now).
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby Neil-B » Wed Aug 19, 2020 6:46 pm

Please could you get the Moonshot 134xx series Project owners to assess the impact of these rapid failures from certain types of cards? ... Whilst from the folder's perspective having the vast majority fail within the first 20 seconds may not matter much, I really can't see that it makes sense overall ... If a number of these cards are failing WUs at speed, there is a chance that WUs which could be valid and folded without issue on, say, nvidia cards get 5 failures from these cards and get labelled as bad when they aren't necessarily so. That makes little sense to me.

The speed at which these cards fail potentially ok WUs means that it would not take very many of them to raise the statistical chance of a WU being hit by 5 of these card-related failures to non-negligible levels ... Moonshot WUs are quick results, but potentially throwing away valid results due to a group of "rapid fail doesn't matter because it doesn't impact our usefulness" gpus seems madness to me :(

Perhaps someone could check all WUs that have had five failed returns and check that they are not all quick failures from these types of gpu?
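
To put rough numbers on that worry, here is a minimal back-of-the-envelope sketch. It is only an illustration under stated assumptions, not how the assignment servers actually behave: it assumes every assignment fails independently, that a fast-failing card immediately fetches a new WU, and that a WU is written off after 5 failed returns (the figure mentioned above). The function names and all input numbers are hypothetical.

```python
def bad_card_share(bad_fraction, bad_rate, good_rate):
    """Share of all WU assignments that land on fast-failing cards.

    bad_fraction: fraction of GPUs that are fast-failing (hypothetical input)
    bad_rate:     assignments per hour requested by a fast-failing card
    good_rate:    assignments per hour requested by a healthy card
    """
    bad = bad_fraction * bad_rate
    good = (1.0 - bad_fraction) * good_rate
    return bad / (bad + good)


def per_assignment_failure(share, bad_fail_rate, good_fail_rate):
    """Chance that a single assignment of a WU comes back failed."""
    return share * bad_fail_rate + (1.0 - share) * good_fail_rate


def chance_of_n_failures(p_fail, n=5):
    """Chance that n assignments in a row fail, assuming independence."""
    return p_fail ** n


if __name__ == "__main__":
    # Hypothetical inputs: 1% of cards fail fast (about every 20 seconds,
    # so roughly 180 attempts per hour), healthy cards finish 2 WUs per hour.
    share = bad_card_share(0.01, 180.0, 2.0)
    p_fail = per_assignment_failure(share, bad_fail_rate=0.85, good_fail_rate=0.03)
    print(f"share of assignments on fast-failing cards: {share:.1%}")
    print(f"chance a single assignment fails:           {p_fail:.1%}")
    print(f"chance a WU collects 5 failures in a row:   {chance_of_n_failures(p_fail):.2%}")
```

The point of the sketch is only that a small fraction of cards can account for a large share of assignments if they fail and re-request quickly. Any back-off the servers apply after repeated failures would reduce that share.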
Neil-B
 
Posts: 1405
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Wed Aug 19, 2020 7:04 pm

Moonshot project owner knows about failed WUs, and has accepted it as is. They are not failing as much as 16600 or 16448.
Moonshot has a similar failure rate on all different GPUs, not just AMD. This is due to the nature of the simulation being done. The owner of the Moonshot project is also one of the lead devs for fahcore_22, so while other project owners are researchers using fahcore_22, the Moonshot owner actually developed and updated fahcore_22 to be able to do Moonshot type simulations :) In that update process a lot of other bugs and issues have been fixed :)

P.S. Out of 77 13422s my 3 different GPUs received, only 2 of them failed. One on Navi, one on either Radeon VII or Vega64
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby Neil-B » Wed Aug 19, 2020 7:29 pm

... but on another machine we have been told of failure rates of 30 out of 37 for p13421 and 8 out of 9 WUs for p13423: 9 WUs !!!! ... and those errors are being assigned to the Project WUs, and that level of errors is seemingly being accepted as normal, with everyone happy for the machine to just keep fast-erroring WUs en masse.

Heck, if everyone is fine with machines failing at this rate, as declared earlier in this thread, and continuing to do so when others such as yours or mine have minimal failure rates, then fine. I've tried to make the point that perfectly good WUs are potentially being overlooked due to this issue ... I guess if the Project Owner is actually happy to have this level of failures from a single machine (at least 80%) then far be it from me to argue.

I'll wind my neck in and simply ignore the absurdity of this scenario.
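
For reference, the failure counts quoted in the last few posts work out as follows. This is just the arithmetic on the numbers already posted in this thread; the dictionary labels are mine.

```python
def failure_rate(failed, total):
    """Fraction of returned WUs that failed."""
    return failed / total


# (failed, total) counts as quoted in this thread
reports = {
    "p13421, reported rig": (30, 37),
    "p13423, reported rig": (8, 9),
    "p13422, muziqaz's 3 GPUs": (2, 77),
}

for label, (failed, total) in reports.items():
    print(f"{label}: {failed}/{total} = {failure_rate(failed, total):.1%}")
```

That is roughly 81% and 89% on the one rig against under 3% on muziqaz's GPUs, which is the gap being argued about.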
Neil-B
 
Posts: 1405
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Wed Aug 19, 2020 8:57 pm

A single AMD GPU is 0.00001% of the horsepower in the sea of nVidia GPUs ;) If you have one machine which fails constantly but have 10000 other machines which fold the same project stably, would you go out of your way to halt the project (which needs to be finished as soon as possible)? Now if we had AMD actively involved in developing fahcore_22 and debugging what is most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities ;)
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby ThWuensche » Wed Aug 19, 2020 9:47 pm

muziqaz wrote:Now if we had AMD actively involved to contribute in developing fahcore_22 and debugging most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are extremely limited resource with their own priorities ;)


Of course I'm repeating myself, but as FAH devs are a very valuable and limited resource, they should mostly stick to the core development (in the sense of science) and should leave at least part of the debugging to others, which implies providing the source. Could AMD even be actively involved in debugging? Would they get access to the source to find out what triggers failures, even if it is caused by weaknesses in their OpenCL stack? Who else could get the source and help in debugging? I'm aware that JohnChodera is really listening and active in getting things solved, but third party help would probably speed things up. In a closed company project you can say "We limit ourselves to this and that hardware to reduce compatibility issues", but in a project relying on the contribution of volunteers spread around the world the problems of contributors need to be taken seriously. If you start to say "We don't care, it's only a small number of contributors, so not worthwhile to deal with" it will hurt the project as a whole.
ThWuensche
 
Posts: 73
Joined: Fri May 29, 2020 5:10 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby ViTe » Wed Aug 19, 2020 10:24 pm

Neil-B wrote:... and high rates of failure on 13421 (30 of 37 failed) and 13423 (7 of 8 failed) on the same rig ... that doesn't just feel like an issue with the 16600 project as far as that rig is concerned ... yes the 34 of 38 failures on 16600 may be down to an issue with the project but with the wider failures it feels like a rig issue or possibly an incompatible core to rig issue


That might be interesting. 13423 WUs never failed on my machine. Last 2-3 days it was the only project I got and 100% of them completed successfully.
Dedicated machine, Win7 64bit, AMD RX570 4gb (not overclocked), Adrenalin 20.2.2, Client ver. 7.6.9, OpenCL 2.0 AMD-APP Driver 3004.8
ViTe
 
Posts: 19
Joined: Tue Feb 14, 2012 3:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Postby gunnarre » Thu Aug 20, 2020 8:07 am

ThWuensche wrote:Who else could get the source and help in debugging?

The folding cores (the part that runs on the GPU) are already built from open source projects, and if I understand correctly they're planning to make the whole client open source. But if AMD or Apple were interested in helping out by improving their drivers or even making a folding core for Metal, then closed source isn't a hindrance to that.
gunnarre
 
Posts: 174
Joined: Sun May 24, 2020 8:23 pm
Location: Norway

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Thu Aug 20, 2020 9:45 am

ThWuensche wrote:
muziqaz wrote:Now if we had AMD actively involved to contribute in developing fahcore_22 and debugging most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are extremely limited resource with their own priorities ;)


Of course I'm repeating myself, but as FAH devs are a very valuable and limited resource, they should mostly stick to the core development (in the sense of science) and should leave at least part of the debugging to others - which implies providing the source. Could AMD even be actively involved in debugging, would they get access to the source to find out what triggers failures, even if it is caused by weaknesses in their openCL stack? Who else could get the source and help in debugging? I'm aware that JohnChodera is really listening and active to get things solved, but probably third party help would speed up things. In a closed company project you can say "We limit us to this and that hardware to reduce compatibility issues", but in a project relying on the contribution of volunteers spread around the world the problems of contributors need to be taken serious. If you start to say "We don't care, it's only a small number of contributors, so not worthwhile to deal with" it will hurt the project as a whole.


nVidia is doing that.
FAH dev creates fahcore>nVidia rep takes that core and runs it through their hardware in their lab with all their driver profilers and tools>driver team either optimises the drivers for the fahcore, or they give suggestions/submit patches of code to fah devs to improve fahcore.
Hardware vendor does not need to have source code in order to optimise for the code.
I know how much nVidia is involved, and I just don't see the same involvement from AMD, not even close, which is a shame, as their hardware was always very strong in pure compute tasks.
Also, fah devs mainly have nVidia hardware, as far as I know. I do not believe there are any AMD GPUs in their possession. At least we can be content that AMD CPUs punched through the Intel wall when it comes to fah.
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby NormalDiffusion » Thu Aug 20, 2020 11:32 am

muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)


Yep, still getting 16600 on Radeon VII (2nd Gen Vega) as of today (all time UTC):
Machine 1:
- 19.08.2020 - 20:55
- 19.08.2020 - 22:43
- 20.08.2020 - 10:2x

Machine 2:
- 19.08.2020 - 13:06
- 19.08.2020 - 13:11
- 19.08.2020 - 13:27
NormalDiffusion
 
Posts: 104
Joined: Sat Apr 18, 2020 2:50 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Thu Aug 20, 2020 11:39 am

NormalDiffusion wrote:
muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)


Yep, still getting 16600 on Radeon VII (2nd Gen Vega) as of today (all time UTC):
Machine 1:
- 19.08.2020 - 20:55
- 19.08.2020 - 22:43
- 20.08.2020 - 10:2x

Machine 2:
- 19.08.2020 - 13:06
- 19.08.2020 - 13:11
- 19.08.2020 - 13:27


Thanks, we'll try other means
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby NormalDiffusion » Thu Aug 20, 2020 11:42 am

muziqaz wrote:
Thanks, we'll try other means


But it's a lot less than before! :D
NormalDiffusion
 
Posts: 104
Joined: Sat Apr 18, 2020 2:50 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Thu Aug 20, 2020 11:45 am

NormalDiffusion wrote:
muziqaz wrote:
Thanks, we'll try other means


But it's a lot less than before! :D


That's not good enough. It was set to exclude everything but Navi. Apparently the setting failed :D
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London
