Culprit found in NV core v 1.15 issues for certain hardware?

Moderators: slegrand, Site Moderators, PandeGroup

Re: Culprit found in NV core v 1.15 issues for certain hardware?

Postby MtM » Sat Nov 08, 2008 11:55 am

I don't think it's that implausible. The PSU could be the main guilty component, but even then there are many factors beyond the PSU alone. Take the power in your house: I once lived in an apartment where the lights dimmed when I turned the vacuum cleaner on, and I couldn't leave my PC on there all day because I would get BSODs all the time (I didn't stay there long, of course ;) ). Then there is the power draw from other components, and the quality of the power circuitry on the card itself.

Remember we're talking about numerical precision here, which goes deeper than graphical stability. The client has an integrity check that gets triggered when any data seems suspicious, so there only has to be one moment where the card cannot draw its power and blam: EUE. Where other programs might be much more forgiving, Folding is not.

I'm of course pulling this from a dark place, as I don't have their testing data, just some trust in what they say. So I would like to request that the community be shown the testing data from Nvidia. Get Scott, who tested it, to show his data, and trust this community to take the data for what it's worth.
MtM
 
Posts: 2303
Joined: Fri Jun 27, 2008 3:20 pm
Location: The Netherlands

Postby jcfuller » Sat Nov 08, 2008 11:59 am

As a followup to my previous post:
I had 0% completions with 1.18 so I gave up after several tries and ran 2 standard clients.
With the 1.19 core I am also running 1 standard client.
Note that this particular 8600GT is using DDR2 (GDDR2... whatever) memory.
I am getting ~5 min per frame on a 5506.

James
jcfuller
 
Posts: 31
Joined: Thu Oct 30, 2008 12:22 pm
Location: Fort Edward, NY

Postby mikeb12 » Sat Nov 08, 2008 12:07 pm

P5-133XL wrote: The proper attitude should be:

Vijay, thank you for telling us this. Now we can understand and deal with the issue.

If this is the case then I apologize for coming down on the OP.


What I'm hearing below is a suggestion to upgrade the PSUs on my machines to fix the small EUE rate experienced on my 9600GSO, 9800GT, and two 8800GTs, when they are already running beefy PSUs. Not that I would consider doing that. My GPUs turn in 10-20 back-to-back 0-100% WUs, then kick out an unstable EUE. It's hard to imagine they could run through 10-20 consecutive successful WUs if power were an issue.
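As a rough sanity check on that reasoning, here is a minimal sketch of the arithmetic. It assumes each work unit fails independently with some fixed probability, which is purely an illustrative assumption and not something measured in this thread; the function name and the 5% and 10% rates are hypothetical.

```python
def clean_run_probability(p_fail: float, n_units: int) -> float:
    """Chance that n_units consecutive work units all complete,
    assuming each one fails independently with probability p_fail."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n_units < 0:
        raise ValueError("n_units must not be negative")
    return (1.0 - p_fail) ** n_units


if __name__ == "__main__":
    # Hypothetical per-unit failure rates, chosen only for illustration.
    for p_fail in (0.05, 0.10):
        for n_units in (10, 20):
            chance = clean_run_probability(p_fail, n_units)
            print(f"p_fail={p_fail:.2f}, run of {n_units}: {chance:.1%}")
```

Under that assumption, a 5% per-unit failure rate still leaves roughly a 60% chance of ten clean units in a row, so a long streak of successes does not by itself rule out an intermittent fault, whatever its cause.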
VijayPande wrote: Engineers at NVIDIA (notably Scott LeGrand) have come up with a theory for the EUEs seen in core 1.15 (and a few others in the 1.15 to 1.18 range) on certain hardware. They found that this core had code optimizations that drove the GPU so hard that it would draw a lot more electricity (one sign of this was running hotter). In some boxes, this was too much electricity and this led to numerical instabilities. When the same machine was given a beefier power supply, the problem went away.

We've been told that 8800's require 600W power supplies, but we're finding that even a little bigger (eg at least 650W) is important to leave some room for error. We are working to see if there is some way to detect this issue in software, but for now, if you're getting EUE's on the NV GPU client, this is something to consider.

By the way, this will be very important for us to consider future code optimizations. NV core v1.19 removed some optimizations to solve this problem, but there are many cards which would run fine w/this more optimized code. If we can find a way to detect whether the card can draw enough power, we may be able to choose different code paths to allow for greater optimization for cards which can handle it.

We're still looking into this. For now, if you're seeing issues with your card, please consider trying out a bigger power supply. We will continue to look to see if this is indeed the problem and what we can do to help the situation such that the code runs stably on all machines.
mikeb12
 
Posts: 261
Joined: Tue Feb 12, 2008 12:51 pm
Location: South Carolina USA

Postby (_KoDAk_) » Sat Nov 08, 2008 12:35 pm

I have
Q9550@3800-MHz \8GB\ asus p5k-E\wifi\ HDD 80+250 \2x (ASUS+Zotac) 9600GSO(OC to 1734)\ High power Watt 560 HCP-560-A12C
Q6600@3300-MHz \8GB \asus P5B-deluxe\ HDD 320+230+320+250\ XFX 9800GTX+(OC to 1944) \ Chieftec 600 Watt CFT-600-14CS
driver 178.24, 2008 server x64 \ FAH Core 113, 115, 118 ALL Fine.
(_KoDAk_)
 
Posts: 16
Joined: Wed Oct 08, 2008 5:52 am

Postby Xilikon » Sat Nov 08, 2008 1:52 pm

Another nail in the PSU coffin: if the PSU is really an issue, how do we explain why the GTX 2xx series can fold without issues for everyone (I hardly hear of any failures on those cards) while the 8600GT is having the most issues? Power consumption can be 3-4 times higher with the newest series than with the 8600GT, yet those smaller cards have issues. However, a very plausible theory relates to the power regulation circuit design: the GTX 2xx series is a high-end card, so it's designed with very beefy power regulation, while the 8600GT is just a budget card whose power regulation was designed under the same budgetary constraints.

Another possibility is spotty QA testing on those cards, which is supported by the fact that a teammate has 2 identical 8800GS cards. One of them folds perfectly without issues and the other crashes left and right. He tried different motherboards, swapping slots, swapping power connections, etc., without fixing the problem. This means that the failing card might have had poor QA testing. It's also possible that the heatsink is not installed properly, so some MOSFETs aren't in contact with it and overheat while folding.

I think it's ludicrous to blame just the PSU unless it's a crappy one like a Rosewill (shudders) and the like; consider also the possibility of a faulty VR circuit or even bad heatsink installation.
Xilikon
 
Posts: 978
Joined: Sun Dec 02, 2007 2:34 pm

Postby grumpydaddy » Sat Nov 08, 2008 1:53 pm

Taking this a stage further with power usage... with no changes to my OC, moving from core 1.18 to core 1.19 and folding 3x 5748, my Kill-A-Watt reading went from circa 440 watts to 510 watts.
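For scale, the arithmetic on those two readings can be sketched as below. The 440 W and 510 W figures are the ones from this post; the 80% PSU efficiency is purely an assumed value for illustration (wall draw includes conversion losses, and the real efficiency of this PSU is not given), and the function names are hypothetical.

```python
def percent_increase(before_watts: float, after_watts: float) -> float:
    """Relative change from before_watts to after_watts, in percent."""
    return (after_watts - before_watts) / before_watts * 100.0


def estimated_dc_load(wall_watts: float, efficiency: float = 0.80) -> float:
    """Rough DC-side load for a given wall reading.
    The efficiency default is an assumption, not a measured value."""
    return wall_watts * efficiency


if __name__ == "__main__":
    before, after = 440.0, 510.0  # wall readings quoted in the post
    print(f"Increase at the wall: {percent_increase(before, after):.1f}%")
    print(f"Assumed DC load before: {estimated_dc_load(before):.0f} W")
    print(f"Assumed DC load after:  {estimated_dc_load(after):.0f} W")
```

That works out to roughly a 16% rise at the wall between the two cores on the same hardware and overclock.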
grumpydaddy
 
Posts: 111
Joined: Sun Jun 01, 2008 9:31 pm

Postby John Naylor » Sat Nov 08, 2008 2:14 pm

Xilikon wrote: Another nail in the PSU coffin: if the PSU is really an issue, how do we explain why the GTX 2xx series can fold without issues for everyone (I hardly hear of any failures on those cards) while the 8600GT is having the most issues? Power consumption can be 3-4 times higher with the newest series than with the 8600GT, yet those smaller cards have issues. However, a very plausible theory relates to the power regulation circuit design: the GTX 2xx series is a high-end card, so it's designed with very beefy power regulation, while the 8600GT is just a budget card whose power regulation was designed under the same budgetary constraints.


The other points you made that I haven't quoted are within reason, but consider this: 8600GTs (I don't know about GTSs) draw all of their power from the PCI-e slot, so there is a low upper limit on available power, and when they are being pushed hard I suppose it could easily be reached...

And FWIW my 8600GT did not complete a unit in about 20 attempts with the optimised cores until I gave up, but it's crunching fine now.

EDIT: on second thoughts this was a daft point (see the post two below me)... ignore it...
Last edited by John Naylor on Sat Nov 08, 2008 3:16 pm, edited 2 times in total.
Folding whatever they send me since March 2006 :) Beta testing since October 2006. www.FAH-Addict.net Administrator since August 2009.
John Naylor
 
Posts: 1268
Joined: Mon Dec 03, 2007 5:36 pm
Location: University of Birmingham, UK

Postby SKeptical_Thinker » Sat Nov 08, 2008 2:46 pm

Just a data point. I was experiencing a low EUE rate (more than none, but far from all) with 1.15 using an 8800GS overclocked to the edge. I'm using a 380 W Antec power supply.

I tried backing off the overclock but I didn't see a difference in the EUE rate.
SKeptical_Thinker
 
Posts: 266
Joined: Wed Apr 30, 2008 12:02 am

Postby shatteredsilicon » Sat Nov 08, 2008 3:06 pm

John Naylor wrote:
Xilikon wrote: Another nail in the PSU coffin: if the PSU is really an issue, how do we explain why the GTX 2xx series can fold without issues for everyone (I hardly hear of any failures on those cards) while the 8600GT is having the most issues? Power consumption can be 3-4 times higher with the newest series than with the 8600GT, yet those smaller cards have issues. However, a very plausible theory relates to the power regulation circuit design: the GTX 2xx series is a high-end card, so it's designed with very beefy power regulation, while the 8600GT is just a budget card whose power regulation was designed under the same budgetary constraints.


The other points you made that I haven't quoted are within reason, but consider this: 8600GTs (I don't know about GTSs) draw all of their power from the PCI-e slot, so there is a low upper limit on available power, and when they are being pushed hard I suppose it could easily be reached...

And FWIW my 8600GT did not complete a unit in about 20 attempts with the optimised cores until I gave up, but it's crunching fine now.


So in conclusion - the problem isn't the PSU, the problem is that nVidia have designed GPUs that exceed the power rating of the PCIe slot? Nice one.

IMO, the underspecified PSU idea is totally bogus. How do we then explain the problems with cards like the GX2 that draw 2/3 of the power that a GTX280 does and yet occasionally get problems, even though they have the 3-pin + 4-pin external power?

Oh, and I had tried with 2x GX2 on my 1000W PSU (550W peak plug drain) and 3x GX2 (778W wall drain), and it didn't seem to make any difference.
1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers
shatteredsilicon
 
Posts: 717
Joined: Tue Jul 08, 2008 3:27 pm

Postby Ivoshiee » Sat Nov 08, 2008 3:15 pm

Still, how hard is it to accept that nVIDIA engineers have overcome their EUE problem by using beefier PSUs?
It does not say that all EUEs are caused by a weak PSU. Maybe it is just another card/GPU design shortcoming. There is no need to go overly ballistic over it and start firing in all directions; it is just one thing to consider if you have the EUE issue on your nVIDIA card.
Ivoshiee
Site Moderator
 
Posts: 1359
Joined: Sun Dec 02, 2007 1:05 am
Location: Estonia

Postby Xilikon » Sat Nov 08, 2008 3:25 pm

Ivoshiee wrote: Still, how hard is it to accept that nVIDIA engineers have overcome their EUE problem by using beefier PSUs?
It does not say that all EUEs are caused by a weak PSU. Maybe it is just another card/GPU design shortcoming. There is no need to go overly ballistic over it and start firing in all directions; it is just one thing to consider if you have the EUE issue on your nVIDIA card.


Where in this thread did we go ballistic over this? We just said it's bogus and not an end-all explanation. While I do agree with you that the PSU can be the cause (especially a cheap one), we also need to accept that it's probably a small factor and that we need to think of other possibilities, like the VR design and the power draw path of the cards. It's a strong possibility that, with the 8600GT, drawing power just from the PCI-E slot might explain why it's having power issues.

The positive thing is that we are tossing ideas around about what might be the cause, and if enough ideas are tossed around, we might hit on the cause.
Xilikon
 
Posts: 978
Joined: Sun Dec 02, 2007 2:34 pm

Postby Flathead74 » Sat Nov 08, 2008 3:44 pm

Xilikon wrote: The positive thing is that we are tossing ideas around about what might be the cause, and if enough ideas are tossed around, we might hit on the cause.

Good point there.
Stopping short at a recommendation of a "beefier" power supply that completely overlooks the amperage of the 12v line is no solution either.
I have seen 600 watt power supplies with low amp 12v lines.
I have seen none that say that they are beefier.

From EVGA's web site:
Requirements for 8800 GTS 512
Minimum of a 400 Watt power supply.
(Minimum recommended power supply with +12 Volt current rating of 26 Amps.)
Minimum 450 Watt for SLI mode system.
(Minimum recommended power supply with +12 Volt current rating of 30 Amps.)
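To make the 12 V point concrete, here is a small sketch of the arithmetic (power = volts x amps). The 26 A / 400 W and 30 A / 450 W figures are the EVGA numbers quoted above; the function name and the table layout are just illustrative choices.

```python
def rail_watts(amps: float, volts: float = 12.0) -> float:
    """Power available on a rail: volts multiplied by amps."""
    return amps * volts


# (description, minimum PSU rating in watts, minimum +12 V current in amps)
EVGA_REQUIREMENTS = [
    ("8800 GTS 512, single card", 400.0, 26.0),
    ("8800 GTS 512, SLI", 450.0, 30.0),
]

if __name__ == "__main__":
    for description, psu_watts, amps_12v in EVGA_REQUIREMENTS:
        on_12v = rail_watts(amps_12v)
        share = on_12v / psu_watts
        print(f"{description}: {on_12v:.0f} W on +12 V "
              f"({share:.0%} of a {psu_watts:.0f} W unit)")
```

So the single-card minimum implies about 312 W on the +12 V line out of 400 W total, and the SLI minimum about 360 W out of 450 W. A higher total wattage rating alone says nothing about whether the 12 V line can deliver that.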
Flathead74
 
Posts: 730
Joined: Sun Dec 02, 2007 7:08 pm
Location: Central New York

Postby MtM » Sat Nov 08, 2008 3:54 pm

Guys, I want to repeat something I said before that I feel might have been overlooked.

The PSU is one part; you also have the 'quality' of your wall outlets and the quality of the onboard power circuitry. A lot of you overclock, so you should know there is great variance in the quality of components on the PCB. Some batches of caps are better than others but still have the same rating, which is why some cards OC better than others (besides the core itself playing a huge role). The EUEs happened more on lower-end boards, which have fewer power stages; they also happened on top-end boards, though a lot less. That can all be explained by power draw, not just the PSU. The PSU is just the easiest way to either dismiss or confirm power fluctuations as the issue, which I think is the meaning of the OP.

Not to put the blame somewhere, but to point out something which is likely an important factor. A board with lesser-quality components will be more influenced by the PSU than a board with better-quality components, but that doesn't mean it's the board itself; it just points to a combination of things possibly causing the problem.

And you can't blame Nvidia's QA, because they test for graphical stability, not numerical stability, and while I'm not an engineer I do think there is a difference in how much tolerance one has in relation to the other. And then there are the numerous boards by vendors with non-reference designs, which might be better or worse. A good way of testing whether this is an issue is to find a vendor with a custom PCB with increased or higher-quality power circuitry and see if one of those boards has the same EUE rate in a system as a reference board.

I'm still hoping Scott will post the data from the internal testing that went on at Nvidia so everyone can draw their own conclusions based on real, tangible data points instead of discussing this without the data he has. I don't think it will lead to much without any data. People can test their own configurations, but mostly don't realise they are only looking at a very small percentage of possible results. They might have good stable outlets in their house/office and good-quality power regulation, and conclude the PSU is not at fault, while others have less stable outlets but also bad power circuitry, so they will conclude the upgrade didn't have the wanted effect.

None of us alone has enough comparison points to draw a definitive conclusion, I think, and that's why this discussion should be accompanied by the data gathered from Nvidia's own testing.

.002 cents guys?
MtM
 
Posts: 2303
Joined: Fri Jun 27, 2008 3:20 pm
Location: The Netherlands

Postby qyx » Sat Nov 08, 2008 3:59 pm

Reading all of the above, I think everyone can agree that in some configurations the PSU can be A "Culprit" but it is not THE "Culprit" for all of the problems people are having. I would think that most folders would be aware of the need for a high-quality PSU with an adequate, clean 12 Volt rail and good airflow. That being said, for most users, this is not the answer and the search continues... in the core itself, in mobo voltage regulation, in video card quality control and in CUDA. Why should a 9800GTX work and a 9800GTX+ not work with the same setup and software? I would like to see Nvidia focus on things like that too, which Nvidia should be eminently qualified to do.
Last edited by qyx on Sat Nov 08, 2008 4:04 pm, edited 1 time in total.
qyx
 
Posts: 4
Joined: Wed Jun 18, 2008 7:06 pm

Postby grumpydaddy » Sat Nov 08, 2008 4:04 pm

If quality of supply is a possible issue, can anyone experiencing these issues reconnect using a UPS? I don't have one, but if it were the permanently-on type it could iron out the ripples?
grumpydaddy
 
Posts: 111
Joined: Sun Jun 01, 2008 9:31 pm
