We've been told that 8800s require 600W power supplies, but we're finding that going even a little bigger (e.g. at least 650W) is important to leave some room for error. We are working to see if there is some way to detect this issue in software, but for now, if you're getting EUEs on the NV GPU client, this is something to consider.
By the way, this will be very important for us to consider in future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine with this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization on cards which can handle it.
We're still looking into this. For now, if you're seeing issues with your card, please consider trying out a bigger power supply. We will continue to look to see if this is indeed the problem and what we can do to help the situation such that the code runs stably on all machines.
Edit: Scott posted below but it's in the middle of the thread and some people said it was hard to find, so I'll quote it below:
SLegrand wrote:
Here are the current facts:
1. Something very odd is up with some, and I do mean *some*, G8x/G9x chips.
2. This problem wasn't evident until recently or the NVIDIA client would never have made it out the door, but sometime recently,
like a harmonic convergence so to speak, a subset of G8x/G9x chips started having random failures. It may be a hardware
issue, but it seems to be caused by some sort of software change. I'd guess something is messing up some sort of timing on
the chip, but that's just a guess.
3. Some chips stop exhibiting this problem with a beefy enough power supply.
4. Some don't, but they all do it less often.
5. For whatever reason, it doesn't happen on GTX260/280 - I've had a GTX260 running F@H for the past 2 months straight without a single instance of this.
6. Reproing this bug takes anywhere from 40 minutes to 8 hours of computation, so fixing it is going awfully slowly. 40 minutes was the norm for an underpowered system, and 4-8 hours is the norm now that I've addressed that.
Keep in mind that GPUs currently do not have ECC memory. But, in graphics, if a memory error occurs, the write target is defined by the hardware itself as a specific pixel in the framebuffer or a render target, and all inputs are done in terms of texture coordinates. This constrains the reads and the writes to stay in reasonable areas of memory and limits the worst-case scenario to a corrupted pixel.
In contrast, in Folding@Home, naked memory pointers are used both for reads and writes. When a memory error occurs, this can lead to an invalid read or write of random memory. When this happens, a kernel for the GPU fails. This is what is happening here. Memory errors are almost guaranteed to occur if there is insufficient power for the GPU. But, as I just said, when it's in graphics, the worst you're likely to see is a corrupt pixel for a single frame (obviously one can come up with more bizarre failure scenarios, but this is the lion's share of them).
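As a rough illustration of the difference described above (a sketch only, not Folding@Home or driver code), the following simulates a single flipped bit in an index. The names `flip_bit`, `clamped_read`, and `raw_read` are hypothetical, and clamping is used here as a stand-in for the way graphics hardware constrains accesses.

```python
# Hypothetical illustration only; not Folding@Home or NVIDIA driver code.

def flip_bit(value, bit):
    """Simulate a single-bit memory error in an integer index."""
    return value ^ (1 << bit)


def clamped_read(buffer, index):
    """Graphics-style access: the index is clamped into the buffer,
    so a corrupted index still lands on a valid element."""
    index = min(max(index, 0), len(buffer) - 1)
    return buffer[index]


def raw_read(buffer, index):
    """Pointer-style access: the index is used as-is. An IndexError here
    stands in for an invalid memory access that would fail a GPU kernel."""
    if index < 0 or index >= len(buffer):
        raise IndexError("invalid access at index %d" % index)
    return buffer[index]
```

With an 8-element buffer and index 3 corrupted in bit 20, `clamped_read` returns the last element (a wrong but harmless value, like a bad pixel), while `raw_read` fails outright.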
Alternatively, if an atom coordinate is misread from memory, it can cause the forces to shoot off to the moon, and that leads to a cascade of NaNs, which is the other EUE failure scenario here.
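To make the NaN scenario concrete, here is a minimal sketch of the kind of sanity check that could flag a bad step before it cascades. It is an assumption for illustration, not how the core actually detects an EUE; `find_bad_atoms` and the `max_force` threshold are hypothetical.

```python
import math

# Hypothetical illustration only; the threshold is an arbitrary placeholder.

def find_bad_atoms(forces, max_force=1.0e4):
    """Return indices of atoms whose force is NaN, infinite, or larger
    in magnitude than max_force."""
    bad = []
    for i, (fx, fy, fz) in enumerate(forces):
        magnitude = math.sqrt(fx * fx + fy * fy + fz * fz)
        # NaN and inf both fail isfinite(); a huge finite value fails the bound.
        if not math.isfinite(magnitude) or magnitude > max_force:
            bad.append(i)
    return bad
```

A single misread coordinate would typically show up here as one atom with an enormous force on the step where it happens, and as NaNs spreading to its neighbours on later steps.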
I'm now seeing it repro with an 800W power supply and a 9800GTX. But the frequency of reproduction is much lower than with the 460W power supply with which I initially did so.
I can force a fix of this in the same way that I once fixed a bug on the Atari Jaguar, by reading memory twice and then comparing the values. But that's a kludge that merely reduces the frequency of memory errors by a factor of 1e9 or so, and since the 9800s were all working just fine a month or two ago, there's a root cause, and it really points to being a software failure.
So I'm going to end on a bright note - we can repro this, and that means we can fix it. Getting to this stage was the hardest part.
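The read-twice-and-compare kludge can be sketched roughly as follows. This is an illustration under assumptions, not the actual Jaguar or GPU code: `read_verified` is a hypothetical helper, `read` is any caller-supplied function that fetches a value from an address, and the retry limit is arbitrary.

```python
# Hypothetical illustration only; not the actual workaround code.

def read_verified(read, address, max_attempts=4):
    """Read the same address twice and accept the value only if both
    reads agree; otherwise retry up to max_attempts times."""
    for _ in range(max_attempts):
        first = read(address)
        second = read(address)
        if first == second:
            return first
    raise RuntimeError("reads at address %r never agreed" % (address,))
```

This only lowers the error rate rather than removing it: two reads that are corrupted in the same way would still pass the comparison, which is why it is a kludge and not a fix for the root cause.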



