Announcing release: standalone memory tester for NVIDIA GPUs

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby hackman2007 » Tue May 25, 2010 4:52 am

You guys ever figure out anything about the "Modulo-20" errors?

I ran it on my 9800GT and got about 4,000,000 errors in about 90000 ms.

I checked the temperature of the card and it was running at 100C (it has run hotter, 107C previously).

Time for RMA? Or ignore?
hackman2007
 
Posts: 76
Joined: Wed Feb 13, 2008 12:54 am

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby ihaque » Tue May 25, 2010 6:13 am

As far as we can tell, mod-20 errors are endemic to that generation of board. 4 million errors in 90 seconds is very high, though. What were the parameters you ran the test with, and is your board overclocked at all?
ihaque
Pande Group Member
 
Posts: 234
Joined: Mon Dec 03, 2007 5:20 am
Location: Stanford

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby hackman2007 » Tue May 25, 2010 6:24 am

ihaque wrote:As far as we can tell, mod-20 errors are endemic to that generation of board. 4 million errors in 90 seconds is very high, though. What were the parameters you ran the test with, and is your board overclocked at all?


I ran with the following....

MemtestG80.exe 400 10000 > logfile.txt

I did not let it finish all 10,000 iterations; I exited at 1,000 after I noticed the errors.

I did remote in to the machine at one point, though I'm not sure whether that could have caused this.

It is a stock clocked 9800GT from EVGA.
hackman2007
 
Posts: 76
Joined: Wed Feb 13, 2008 12:54 am

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby ihaque » Tue May 25, 2010 6:57 am

The first thing I'd check is that the number of errors is around 4 million (4,000,000) and not 4 billion (4,000,000,000). If it's the latter, that probably means that the mod-20 test is timing out (I really need to fix the reporting of that particular type of error).

If it's not a timeout, then that is definitely an excessively high number of errors (the baseline error rate is around a couple per week, not a few million in a minute and a half). Your card is also running very hot. You might want to try to clear out the heatsink, in case it's clogged with dust. If you can't get it to cool down, or if you do get it cool and it's still throwing that many errors, then the card is bad and should be replaced.
ihaque
Pande Group Member
 
Posts: 234
Joined: Mon Dec 03, 2007 5:20 am
Location: Stanford

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby hackman2007 » Tue May 25, 2010 5:24 pm

ihaque wrote:The first thing I'd check is that the number of errors is around 4 million (4,000,000) and not 4 billion (4,000,000,000). If it's the latter, that probably means that the mod-20 test is timing out (I really need to fix the reporting of that particular type of error).

If it's not a timeout, then that is definitely an excessively high number of errors (the baseline error rate is around a couple per week, not a few million in a minute and a half). Your card is also running very hot. You might want to try to clear out the heatsink, in case it's clogged with dust. If you can't get it to cool down, or if you do get it cool and it's still throwing that many errors, then the card is bad and should be replaced.


Re-checked the number and yes it is 4 billion (this is why I'm not in a math-related field :wink: ).

Exact number of errors is: 4,294,967,284

Regarding the temperatures, actually this is normal. The 9800GT is a single slot card and just runs hot when GPU folding or doing anything excessively intensive. My 9600GSO, 8800GTX and 9800GX2 were the same way (8800GTX died, not from temperature issues though). The reason I mentioned the temperature is because I didn't know how the program worked. It appears to stress the graphics card almost as much as GPU folding.
hackman2007
 
Posts: 76
Joined: Wed Feb 13, 2008 12:54 am

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby ihaque » Tue May 25, 2010 5:49 pm

hackman2007 wrote:Exact number of errors is: 4,294,967,284


Yup, that's a time out. Try re-running the test with a smaller memory size - maybe 256 or 200 MB.
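
That exact number is a giveaway, by the way: 4,294,967,284 is only 12 below 2^32 = 4,294,967,296, which is what you'd expect from a 32-bit counter pegged near its maximum rather than a genuine tally of bad memory words. For the re-run, keeping the same redirect as before, something like:

MemtestG80.exe 256 1000 > logfile.txt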

hackman2007 wrote:Regarding the temperatures, actually this is normal. The 9800GT is a single slot card and just runs hot when GPU folding or doing anything excessively intensive. My 9600GSO, 8800GTX and 9800GX2 were the same way (8800GTX died, not from temperature issues though). The reason I mentioned the temperature is because I didn't know how the program worked. It appears to stress the graphics card almost as much as GPU folding.


I'll take your word on the temps (I don't have any single-slot cards that powerful). The memory test tends to be less intensive than folding at the default settings, though if you're running a really big memory test region the workload will clearly increase. It's still not working the logic as hard as FAH (but is working the memory harder, of course).
ihaque
Pande Group Member
 
Posts: 234
Joined: Mon Dec 03, 2007 5:20 am
Location: Stanford

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby Sidicas » Tue May 25, 2010 8:20 pm

ihaque wrote:Yup, that's a time out. Try re-running the test with a smaller memory size - maybe 256 or 200 MB.

Question...
Is there any way with CUDA to force it to NOT use system RAM? I know with DX9, I was able to do that kind of thing and it just threw an error if you ran out of video memory that you could catch and then just abort gracefully..

If I understand CUDA correctly (correct me if I'm wrong), what's happening is that it's overflowing to system RAM and then it's not getting back in time before the shader gets automatically axed (by the driver?) for taking too long to execute.. I'd imagine such errors might be very difficult to sort out from genuine compute failures of the shader..

It's either that.. Or perhaps the shader is just sitting around waiting for GPU memory to become available that simply won't because you're trying to allocate beyond the range of memory.. And the shader gets axed that way...

Either way, I'd imagine there's got to be a better way to detect such situations than reporting 4 billion errors.. I might dig into the memtest code myself, it can't be too complicated, right? It's just a memtester.... Famous last words :D
Sidicas
 
Posts: 232
Joined: Sun Feb 17, 2008 5:46 pm

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby Nathan_P » Tue May 25, 2010 9:25 pm

hackman2007 wrote:
ihaque wrote:The first thing I'd check is that the number of errors is around 4 million (4,000,000) and not 4 billion (4,000,000,000). If it's the latter, that probably means that the mod-20 test is timing out (I really need to fix the reporting of that particular type of error).

If it's not a timeout, then that is definitely an excessively high number of errors (the baseline error rate is around a couple per week, not a few million in a minute and a half). Your card is also running very hot. You might want to try to clear out the heatsink, in case it's clogged with dust. If you can't get it to cool down, or if you do get it cool and it's still throwing that many errors, then the card is bad and should be replaced.


Re-checked the number and yes it is 4 billion (this is why I'm not in a math-related field :wink: ).

Exact number of errors is: 4,294,967,284

Regarding the temperatures, actually this is normal. The 9800GT is a single slot card and just runs hot when GPU folding or doing anything excessively intensive. My 9600GSO, 8800GTX and 9800GX2 were the same way (8800GTX died, not from temperature issues though). The reason I mentioned the temperature is because I didn't know how the program worked. It appears to stress the graphics card almost as much as GPU folding.


No, it's not normal. I had problems with my old single-slot 8800GTs last year; mine were running at 101/102 deg C and kept EUEing. I took the shrouds off the heatsinks, cleaned out all the dust, and left the shrouds off. Temps never went over 85 after that - even on the old dreaded 511 pointers. If your cards are going to 107, then you do have a problem with heat.
Nathan_P
 
Posts: 1564
Joined: Wed Apr 01, 2009 10:22 pm
Location: Jersey, Channel islands

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby ihaque » Tue May 25, 2010 10:01 pm

Sidicas wrote:If I understand CUDA correctly (correct me if I'm wrong), what's happening is that it's overflowing to system RAM and then it's not getting back in time before the shader gets automatically axed (by the driver?) for taking too long to execute.. I'd imagine such errors might be very difficult to sort out from genuine compute failures of the shader..


No, CUDA does not automatically spill to system memory (unless you're using an IGP, in which case "system memory" and "video memory" are the same thing). You have to explicitly issue copies to and from device memory, or explicitly map page-locked system memory to be accessible by the device. The problem is that certain kernels are just very slow (in particular, mod-20 is slow because it uses an integer modulo, which is exceedingly slow on these GPUs). The graphics driver will only allow kernels to run for a limited amount of time, to keep the GPU from hanging (iirc, 5 sec on XP, 2 sec on Vista/7, and infinite on Linux if not driving an X display). What's happening is just that the kernel is taking too long and getting killed by the driver.
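
To make that concrete, here's a minimal sketch (not MemtestG80's actual code, just the CUDA runtime calls involved) of the only two ways system RAM ever gets involved: you copy to and from device memory explicitly, or you pin host memory yourself and ask for a device pointer to it. There is no automatic spill.

// Sketch only; error checking omitted for brevity.
#include <cuda_runtime.h>

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must come before the context is created
    const size_t bytes = 64 * 1024 * 1024;

    // (1) The normal path: explicit device allocation plus explicit copies.
    void *dev = 0;
    char *host = new char[bytes];
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    // (2) The zero-copy path: system RAM is device-visible only if you pin it
    //     and explicitly request a device pointer to it.
    void *pinned = 0, *devView = 0;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devView, pinned, 0);

    cudaFreeHost(pinned);
    cudaFree(dev);
    delete[] host;
    return 0;
}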

Sidicas wrote:It's either that.. Or perhaps the shader is just sitting around waiting for GPU memory to become available that simply won't because you're trying to allocate beyond the range of memory.. And the shader gets axed that way...


Nope, CUDA will tell you if you can't allocate that much memory. That's why the tester can report such failures right at the start. Try allocating 512 MB on a 256MB card - even if you have gobs of free system memory, it won't work :).
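
For what it's worth, the check is nothing fancy - something along these lines (a sketch, not the tester's exact code) is enough, since cudaMalloc reports the failure immediately rather than waiting around for memory to free up:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *buf = 0;
    size_t want = (size_t)512 * 1024 * 1024;     // e.g. asking for 512 MB on a 256 MB card
    cudaError_t err = cudaMalloc(&buf, want);
    if (err != cudaSuccess) {                    // typically cudaErrorMemoryAllocation
        printf("Allocating 512 MB failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(buf);
    return 0;
}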

Sidicas wrote:Either way, I'd imagine there's got to be a better way to detect such situations than reporting 4 billion errors.. I might dig into the memtest code myself, it can't be too complicated, right? It's just a memtester.... Famous last words :D


Yes, there are certainly better ways; it's just a weakness of the API as currently written that I don't pass back better status information. Like I said, it's something I want to fix but haven't had time to. If you do want to take a look and contribute a patch, you're certainly more than welcome to! The code really isn't all that bad - especially if you ignore the GPU code and only look at the host-side code.
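
The shape of the fix is roughly this (illustrative names only, not a patch against MemtestG80): check the launch status on the host after each kernel, so a watchdog kill comes back as a timeout message instead of being folded into the error count.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_test_kernel(unsigned int *errors) {
    // stand-in for a real test pattern; a real kernel would record the
    // number of mismatches it found in *errors
    if (threadIdx.x == 0 && blockIdx.x == 0) *errors = 0;
}

int main() {
    unsigned int *devErrors = 0, hostErrors = 0;
    cudaMalloc((void **)&devErrors, sizeof(unsigned int));

    dummy_test_kernel<<<1, 32>>>(devErrors);
    cudaError_t err = cudaThreadSynchronize();   // wait for the kernel (CUDA 3.x-era call)

    if (err == cudaErrorLaunchTimeout) {
        printf("Kernel killed by the display watchdog - rerun with a smaller test region.\n");
    } else if (err != cudaSuccess) {
        printf("Kernel failed: %s\n", cudaGetErrorString(err));
    } else {
        cudaMemcpy(&hostErrors, devErrors, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        printf("Errors this iteration: %u\n", hostErrors);   // now a meaningful count
    }
    cudaFree(devErrors);
    return 0;
}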
ihaque
Pande Group Member
 
Posts: 234
Joined: Mon Dec 03, 2007 5:20 am
Location: Stanford

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby a_fool » Sat Aug 28, 2010 7:32 pm

I've noticed that v1.1 is available at simtk (it has large-memory bug fixes and support for Fermi). The FaH utilities page also lists v1.1, but the downloaded program and its associated readme report v1.00. Is this a typo, or was the wrong version uploaded to the FaH page?

I don't really want to register for an account on simtk in order to verify.
a_fool
 
Posts: 12
Joined: Fri Feb 12, 2010 5:35 am

Re: Announcing release: standalone memory tester for NVIDIA GPUs

Postby ihaque » Sat Aug 28, 2010 8:14 pm

a_fool wrote:I've noticed that v1.1 is available at simtk (it has large-memory bug fixes and support for Fermi). The FaH utilities page also lists v1.1, but the downloaded program and its associated readme report v1.00. Is this a typo, or was the wrong version uploaded to the FaH page?

I don't really want to register for an account on simtk in order to verify.


Unless you downloaded the OSX version (which I never updated), it's a typo.
ihaque
Pande Group Member
 
Posts: 234
Joined: Mon Dec 03, 2007 5:20 am
Location: Stanford
