Culprit found in NV core v 1.15 issues for certain hardware?

Postby VijayPande » Fri Nov 07, 2008 11:57 pm

Engineers at NVIDIA (notably Scott LeGrand) have come up with a theory for the EUE's seen in core 1.15 (and a few others in the 1.15 to 1.18 range) on certain hardware. They found that this core had code optimizations that drove the GPU so hard that it would draw a lot more electricity (one sign of this was running hotter). In some boxes, this was too much electricity and this led to numerical instabilities. When the same machine was given a beefier power supply, the problem went away.

We've been told that 8800's require 600W power supplies, but we're finding that even a little bigger (eg at least 650W) is important to leave some room for error. We are working to see if there is some way to detect this issue in software, but for now, if you're getting EUE's on the NV GPU client, this is something to consider.

By the way, this will be very important for us to consider in future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine w/this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization for cards which can handle it.

We're still looking into this. For now, if you're seeing issues with your card, please consider trying out a bigger power supply. We will continue to look to see if this is indeed the problem and what we can do to help the situation such that the code runs stably on all machines.

Edit: Scott posted below but it's in the middle of the thread and some people said it was hard to find, so I'll quote it below:
SLegrand wrote:
Here are the current facts:

1. Something very odd is up with some and I do mean *some* G8x/G9x chips.
2. This problem wasn't evident until recently or the NVIDIA client would never have made it out the door, but sometime recently,
like a harmonic convergence so to speak, a subset of G8x/G9x chips started having random failures. It may be a hardware
issue, but it seems to be caused by some sort of software change. I'd guess something is messing up some sort of timing on
the chip, but that's just a guess.
3. Some chips stop exhibiting this problem with a beefy enough power supply.
4. Some don't, but they all do it less often.
5. For whatever reason, it doesn't happen on GTX260/280 - I've had a GTX260 running F@H for the past 2 months straight without a single instance of this.
6. Reproing this bug takes anywhere from 40 minutes to 8 hours of computation, so fixing it is going awfully slowly: 40 minutes was the norm for an underpowered system, and 4-8 hours is the norm now that I've addressed that.

Keep in mind that GPUs currently do not have ECC memory. But, in graphics, if a memory error occurs, the write target is defined by the hardware itself as a specific pixel in the framebuffer or a render target, and all inputs are done in terms of texture coordinates. This constrains the reads and the writes to stay in reasonable areas of memory and limits the worst-case scenario to a corrupted pixel.

In contrast, in Folding@Home, naked memory pointers are used both for reads and writes. When a memory error occurs, this can lead to an invalid read or write of random memory. When this happens, a kernel for the GPU fails. This is what is happening here. Memory errors are almost guaranteed to occur if there is insufficient power for the GPU. But, as I just said, when it's in graphics, the worst you're likely to see is a corrupt pixel for a single frame (obviously one can come up with more bizarre failure scenarios, but this is the lion's share of them).

Alternatively, if an atom coordinate is misread from memory, it can cause the forces to shoot off to the moon, and that leads to a cascade of NaNs, which is the other EUE failure scenario here.
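
A minimal sketch of that cascade, in Python rather than the GPU core's actual code. Everything here is an assumption for illustration: the 1D spring chain, the `flip_bit`, `step` and `nan_counts` helpers, and the choice of one flipped exponent bit as the memory error. It only shows how a single bad coordinate can spread NaNs to neighbouring atoms.

```python
import math
import struct


def flip_bit(value, bit):
    """Flip one bit of a value's IEEE-754 float32 encoding (simulated memory error)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def step(x, k=1.0, r0=1.0, dt=0.01):
    """One overdamped update of a hypothetical 1D chain of atoms joined by springs."""
    f = [0.0] * len(x)
    for i in range(len(x) - 1):
        ext = x[i + 1] - x[i] - r0   # spring extension between neighbours
        f[i] += k * ext
        f[i + 1] -= k * ext
    return [xi + dt * fi for xi, fi in zip(x, f)]


def nan_counts(x, steps):
    """Return how many coordinates are NaN after each of `steps` updates."""
    counts = []
    for _ in range(steps):
        x = step(x)
        counts.append(sum(math.isnan(v) for v in x))
    return counts


# Chain at rest: every force is zero until one coordinate is misread.
atoms = [0.0, 1.0, 2.0, 3.0, 4.0]
atoms[1] = flip_bit(atoms[1], 30)   # top exponent bit: 1.0 becomes +inf
spread = nan_counts(atoms, 4)       # NaN count after each step
```

In this toy model the corrupted atom gets an infinite force, `inf - inf` produces the first NaN, and each later step infects the next neighbour until every coordinate is NaN.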

I'm now seeing it repro with an 800W power supply and a 9800GTX. But the frequency of reproduction is much lower than with the 460W power supply with which I initially did so.

I can force a fix of this the same way I once fixed a bug on the Atari Jaguar, by reading memory twice and then comparing the values. But that's a kludge that merely reduces the frequency of memory errors by a factor of 1e9 or so, and since the 9800s were all working just fine a month or two ago, there's a root cause, and it really points to being a software failure.
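
A rough sketch of the read-twice-and-compare idea, in Python for readability. This is not the Jaguar fix or the F@H core code: `read_twice`, `FlakyMemory`, and the retry limit are all hypothetical names and choices made up for illustration. It only catches transient errors, since two reads that are corrupted identically would still agree.

```python
class FlakyMemory:
    """Hypothetical memory that corrupts specific reads, chosen by read index."""

    def __init__(self, data, corrupt_reads):
        self.data = list(data)
        self.corrupt_reads = set(corrupt_reads)
        self.reads = 0

    def read(self, address):
        value = self.data[address]
        if self.reads in self.corrupt_reads:
            value ^= 0x40            # simulate one flipped bit on this read
        self.reads += 1
        return value


def read_twice(read, address, max_attempts=4):
    """Read an address twice and accept the value only if both reads agree."""
    for _ in range(max_attempts):
        first = read(address)
        second = read(address)
        if first == second:
            return first
    raise RuntimeError("reads never agreed at address %d" % address)


# The very first read is corrupted, so the first pair disagrees and is retried.
memory = FlakyMemory([10, 20, 30], corrupt_reads={0})
value = read_twice(memory.read, 1)
```

The cost is at least double the memory traffic per checked read, which is part of why it counts as a kludge rather than a fix for the root cause.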

So I'm going to end on a bright note - we can repro this, that means we can fix it. Getting to this stage was the hardest part.
VijayPande
Pande Group Member
 
Posts: 1975
Joined: Fri Nov 30, 2007 7:25 am
Location: Stanford

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby Futurism » Sat Nov 08, 2008 12:26 am

So I'm assuming my 1000W PSU wouldn't have been sufficient for 2 8800GTS G92 cards with that core, as I was getting loads of these errors on 1.15 and 1.18 until I disabled larger units in the config.
Futurism
 
Posts: 9
Joined: Thu Oct 23, 2008 11:06 am

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby Xilikon » Sat Nov 08, 2008 12:41 am

I believe this is bogus. A quality PSU like Antec, Seasonic or Corsair can run it well. I ran 2x8800GT with a Corsair VX550W without issues and right now, 2 of my GPU computers have just a 500W PSU without tossing an EUE. If it EUEs even with a 1000W PSU, something else is causing issues, not an insufficient PSU.
Xilikon
 
Posts: 978
Joined: Sun Dec 02, 2007 2:34 pm

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby osgorth » Sat Nov 08, 2008 12:50 am

It should be easy enough to test this. Someone with a wattmeter could measure a problematic machine running 1.15, then replace it with 1.19 and measure again. Anyone up for the task? :)

This might be part of the problem, but I'm certain it isn't the whole picture. Most of my machines have plenty of power to spare, and yet some WUs go EUE. Not many, mind you, I'm seeing a few per day or so, spread over 15 clients.
osgorth
 
Posts: 194
Joined: Fri Sep 12, 2008 11:46 am

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby VijayPande » Sat Nov 08, 2008 12:53 am

Xilikon wrote:
I believe this is bogus. A quality PSU like Antec, Seasonic or Corsair can run it well. I ran 2x8800GT with a Corsair VX550W without issues and right now, 2 of my GPU computers have just a 500W PSU without tossing an EUE. If it EUEs even with a 1000W PSU, something else is causing issues, not an insufficient PSU.


Perhaps the issue is in terms of the non-quality PSU's. Anyway, this is an easy enough hypothesis to test and has been verified in the NV labs at least for their boxes which show the problems. For now, we have a software fix (dialing back the optimizations), but this is an issue for us since we'd like to dial them back in.
VijayPande
Pande Group Member
 
Posts: 1975
Joined: Fri Nov 30, 2007 7:25 am
Location: Stanford

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby Xilikon » Sat Nov 08, 2008 12:55 am

I made a suggestion but my post got wiped: why don't you force 1.19 for everyone but also offer a manual version, 1.19b, with full optimisations for those who can run it without issues? That way, everyone is happy.
Xilikon
 
Posts: 978
Joined: Sun Dec 02, 2007 2:34 pm

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby popandbob » Sat Nov 08, 2008 12:57 am

The power is not the problem in my case. I have an 8600GTS with a 750W power supply with a single 12V line with 60 amps output...
If that GPU is pulling more than 50 amps then it would be fried by now....

Both 5.18 and 5.15 are constantly EUE's

~Bob
popandbob
 
Posts: 14
Joined: Wed Jun 18, 2008 3:47 am

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby klitetools » Sat Nov 08, 2008 1:14 am

VijayPande wrote:
By the way, this will be very important for us to consider in future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine w/this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization for cards which can handle it.


Why not let the people who can handle core 1.18 run it, while the people who have problems stick to 1.19?
klitetools
 
Posts: 28
Joined: Mon Feb 11, 2008 9:37 pm

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby Karamiekos » Sat Nov 08, 2008 1:24 am

More importantly, what kind of current rating should we look for on the 12 volt rails?
Zakk Wylde, "Then you start firing back some cocktails."
Rigs
"Wife's" 9950@3.2 Ubuntu 9.10
Quad 8356s
Karamiekos
 
Posts: 188
Joined: Tue Jul 15, 2008 1:27 am

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby Grendel » Sat Nov 08, 2008 1:39 am

Oh really? The official NVIDIA GF/X card selector tool lists all 8800 cards if you select a 450W PSU. Now they are telling us you really need 150W more??? Right. :lol:
Grendel
 
Posts: 43
Joined: Mon Sep 22, 2008 8:16 pm
Location: OR, USA

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby VijayPande » Sat Nov 08, 2008 2:06 am

I've replied to the "why not let people choose" idea in the past, but here's some more info. It's a mess to keep two code bases. That's a good way to have bugs and to really burn up a lot of programmer resources, especially since we're talking about not a huge PPD change (eg not 50%). The main goal for all PG new activities these days is stability first, fancy functionality second. I'm happy to revisit this later once this has been stable for more than just a few days. For now, I'm happy that we have a pretty stable code running on lots of different platforms.
VijayPande
Pande Group Member
 
Posts: 1975
Joined: Fri Nov 30, 2007 7:25 am
Location: Stanford

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby powerarmour » Sat Nov 08, 2008 2:54 am

I've run a PC Power & Cooling Silencer 750W with a 9800GTX+ and still had issues with 1.15 and 1.18. In any case, if it's 100% 3D stable then people should expect it to be CUDA stable also. If not, and this is verified, then Nvidia need to revamp all of their PSU requirements.
powerarmour
 
Posts: 127
Joined: Wed Oct 29, 2008 2:00 am
Location: Surrey, UK

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby Slash_2CPU » Sat Nov 08, 2008 3:19 am

There's a fair chance that the cards with marginal VRM circuitry onboard can't keep up with the GPU draw.

Even with a perfect 12.0V at the 6-pin connector, if the VRM on the card can't keep the voltage to the GPU stable, you'll still have symptoms that resemble a bad PSU. Fact is that the vast majority of users do not fold with these cards, and the power draw in the most intense 3D games is 50-70% of what folding draws.

nVidia will of course never say outright that its vendors can design imperfect products.

I think it's pretty impressive that the F@H people can produce code so efficient that it can actually push stock hardware to the failure point. If only some of the commercial software companies could be so efficient with their apps.
Slash_2CPU
 
Posts: 226
Joined: Sat Apr 19, 2008 6:15 pm

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby stevehat1 » Sat Nov 08, 2008 3:53 am

I tend to agree with the possibility of an inadequate PSU being a potential problem.

The weakest PSU in my folding farm is an OCZ 600W SLI and it drives a single overclocked 8800GT. The others are a Seasonic 625, a Silverstone 1000W and a Silverstone 1200, the last two of which are the Olympus series; the 1200W runs a 9800GTX and the others run 8800GTs. All of these machines are dual core or quad core running SMP as well as GPU2. The 1.15 core was flawless for me, no EUE's that I was aware of anyway. I also seem to have much less trouble with SMP than most, so maybe a PSU correlation there as well????

I just hate to take the points loss when my equipment was running fine with the 1.15.....
stevehat1
 
Posts: 62
Joined: Fri Jun 06, 2008 1:33 pm

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby codysluder » Sat Nov 08, 2008 3:56 am

Karamiekos wrote:
More importantly, what kind of current rating should we look for on the 12 volt rails?


Yes, it cannot be simplified to just the PSU wattage. What matters is how much current is available at the 12V connector after some of the watts go to other voltages and after subtracting whatever the rest of the computer draws from the 12V rail.
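
As a back-of-the-envelope sketch of that arithmetic (Python, with a made-up helper name `rail_headroom_amps`): the 60 A rail figure comes from an earlier post in this thread, while the GPU and rest-of-system wattages below are assumed numbers for illustration only, not measurements of any real card.

```python
def rail_headroom_amps(rail_amps, gpu_watts, other_12v_watts, volts=12.0):
    """Amps left on the 12V rail after the GPU and the rest of the system draw from it."""
    used_amps = (gpu_watts + other_12v_watts) / volts   # P = V * I, so I = P / V
    return rail_amps - used_amps


# Single 60 A rail, assumed 150 W GPU and 210 W of other 12V load.
roomy = rail_headroom_amps(60, gpu_watts=150, other_12v_watts=210)

# Hypothetical weaker 18 A rail with a lighter 120 W system load.
tight = rail_headroom_amps(18, gpu_watts=150, other_12v_watts=120)
```

A negative result means the rail is oversubscribed on paper, whatever the total wattage printed on the PSU label says.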
codysluder
 
Posts: 1592
Joined: Sun Dec 02, 2007 1:43 pm
