[Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Moderators: slegrand, Site Moderators, PandeGroup

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby Qinsp » Wed Nov 10, 2010 12:57 am

If you have control over the memory speed and/or voltage, raise the voltage, and drop the speed.
Quality Inspection - Corona, CA, USA
Dimensional Inspection Laboratory
Pat McSwain, President
Qinsp
 
Posts: 596
Joined: Sun Oct 17, 2010 2:34 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby eastrider » Wed Nov 10, 2010 1:25 am

Qinsp wrote:If you have control over the memory speed and/or voltage, raise the voltage, and drop the speed.


I may try dropping the speed on my second GPU. It's better to run the memory downclocked for FAH (it's not that important after all) than to have to RMA the card.
eastrider
 
Posts: 7
Joined: Tue Nov 17, 2009 8:31 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby eastrider » Wed Nov 10, 2010 1:47 am

Underclocked as low as I could (900 -> 675 MHz).

Still having problems, with the same time separation between them. I guess it's unfixable.
eastrider
 
Posts: 7
Joined: Tue Nov 17, 2009 8:31 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby toTOW » Thu Nov 11, 2010 8:28 pm

This card is dead, sorry ... time for an RMA (but I doubt it's still under warranty) or to get a new one ...
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

toTOW
Site Moderator
 
Posts: 8917
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby Golden Dragoon » Wed Dec 15, 2010 4:34 am

I keep getting NaNs. I haven't got the details on the other units, but this is the one it is failing on right now:
P5770 R13 C98 G343
353 points, failing at 1%
GPU: nVidia GT240 (96 shaders, stock speeds; downclocking or overclocking makes no difference). I also have a GTX460 768MB in the machine as the main GPU, overclocked to 1750 MHz on the shaders. It hasn't had any problems with the GPU3 client and is returning all work units fine.
CPU: Q6600 (stock or 408 MHz x 9 OC, no difference)
Driver: 260.99
OS: Windows 7 64bit

Code:
[01:42:54] + Processing work unit
[01:42:54] Core required: FahCore_11.exe
[01:42:54] Core found.
[01:42:54] Working on queue slot 05 [December 15 01:42:54 UTC]
[01:42:54] + Working ...
[01:42:54] - Calling '.\FahCore_11.exe -dir work/ -suffix 05 -checkpoint 9 -verbose -lifeline 3240 -version 620'

[01:42:54]
[01:42:54] *------------------------------*
[01:42:54] Folding@Home GPU Core
[01:42:54] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[01:42:54]
[01:42:54] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[01:42:54] Build host: amoeba
[01:42:54] Board Type: Nvidia
[01:42:54] Core      :
[01:42:54] Preparing to commence simulation
[01:42:54] - Looking at optimizations...
[01:42:54] DeleteFrameFiles: successfully deleted file=work/wudata_05.ckp
[01:42:54] - Created dyn
[01:42:54] - Files status OK
[01:42:54] - Expanded 45442 -> 251112 (decompressed 552.5 percent)
[01:42:54] Called DecompressByteArray: compressed_data_size=45442 data_size=251112, decompressed_data_size=251112 diff=0
[01:42:54] - Digital signature verified
[01:42:54]
[01:42:54] Project: 5770 (Run 13, Clone 98, Gen 343)
[01:42:54]
[01:42:54] Assembly optimizations on if available.
[01:42:54] Entering M.D.
[01:43:00] Tpr hash work/wudata_05.tpr:  1558581946 3877418189 4090186750 618695897 145251632
[01:43:00]
[01:43:00] Calling fah_main args: 14 usage=100
[01:43:00]
[01:43:01] Working on Protein
[01:43:02] Client config found, loading data.
[01:43:02] Starting GUI Server
[01:44:13] Completed 1%
[01:44:13] mdrun_gpu returned
[01:44:13] NANs detected on GPU
[01:44:13]
[01:44:13] Folding@home Core Shutdown: UNSTABLE_MACHINE
[01:44:16] CoreStatus = 7A (122)
[01:44:16] Sending work to server
[01:44:16] Project: 5770 (Run 13, Clone 98, Gen 343)
[01:44:16] - Error: Could not get length of results file work/wuresults_05.dat
[01:44:16] - Error: Could not read unit 05 file. Removing from queue.
[01:44:16] EUE limit exceeded. Pausing 24 hours.


It's getting to the point that only 1 in 5 work units is being completed, which isn't very good for the cause really.
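
For anyone who wants to check a long log for this signature rather than reading it by eye, here is a minimal Python sketch. It assumes the line shapes seen in the excerpt above ("Project: ...", "UNSTABLE_MACHINE", "CoreStatus = 7A (122)"); the function name scan_log is made up for illustration and is not part of any FAH tool.

```python
import re

# Line shapes assumed from the log excerpt above; other client versions may differ.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def scan_log(lines):
    """Return (project, run, clone, gen, core_status) for each unit
    that shut down with UNSTABLE_MACHINE."""
    failures = []
    current = None      # most recent Project/Run/Clone/Gen seen
    unstable = False    # True between the shutdown line and its CoreStatus line
    for line in lines:
        m = PROJECT_RE.search(line)
        if m:
            current = tuple(int(g) for g in m.groups())
        if "UNSTABLE_MACHINE" in line:
            unstable = True
        s = STATUS_RE.search(line)
        if s and unstable and current is not None:
            # The status code is printed in hex first, e.g. 7A == 122.
            failures.append(current + (int(s.group(1), 16),))
            unstable = False
    return failures


sample = [
    "[01:42:54] Project: 5770 (Run 13, Clone 98, Gen 343)",
    "[01:44:13] NANs detected on GPU",
    "[01:44:13] Folding@home Core Shutdown: UNSTABLE_MACHINE",
    "[01:44:16] CoreStatus = 7A (122)",
]
print(scan_log(sample))  # [(5770, 13, 98, 343, 122)]
```

Running this over a whole log file (for example with open(path) as the lines argument) gives a list you can count to see how many units failed and which ones.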
Golden Dragoon
 
Posts: 8
Joined: Sat Mar 14, 2009 7:41 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby drbricks » Tue Mar 22, 2011 8:15 pm

  • Failing projects (please add a list of exact project numbers if you have them)
    Project: 6801 (Run 4394, Clone 1, Gen 2)
  • Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
    EVGA GTX480
    [20:37:18] Gpu type=3 species=20.
  • Failing OS
    Windows 7 Professional SP1 64-bit
  • Failing drivers (enter here the version number of the driver you use)
    Nvidia 266.58

  • Comments (add below any detail you might find useful to the report)
    I deleted the work queue and queue.dat and reconfigured the client by deleting the client.cfg and starting over; I keep getting this error.

[08:36:09] Project: 11247 (Run 2, Clone 55, Gen 32) Prior work unit completed fine
[08:37:35] Project: 6801 (Run 4394, Clone 1, Gen 2)
[08:38:59] Project: 6801 (Run 4394, Clone 1, Gen 2) failed
[08:39:05] Project: 6801 (Run 4394, Clone 1, Gen 2) failed
[08:41:47] Project: 6801 (Run 4394, Clone 1, Gen 2) failed
[08:41:53] Project: 11218 (Run 1, Clone 60, Gen 15) completed fine

Edit by Mod: Re-posted information here:
viewtopic.php?f=19&t=17974
drbricks
 
Posts: 8
Joined: Thu May 13, 2010 2:18 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby HendricksSA » Tue Mar 22, 2011 8:27 pm

Golden Dragoon, from your log it seems you are running client version 6.20. This is pretty old. Suggest you finish your current work unit, remove the work directory, queue file, and core 11 file (it is current but I would still remove it). Then install the latest 6.41r2 client and it will download the core 11 file again. Let us know how it goes. Good luck!
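
If you want to check this on your own log, a rough Python sketch follows. It assumes the client version appears as the "-version 620" argument on the "Calling" line, as in the log above, and that 620 stands for 6.20 and 641 for 6.41. That encoding is inferred from this thread, not from documentation, and the function names are made up for illustration.

```python
import re


def client_version(log_line):
    """Return the integer after '-version' on a 'Calling ...' line, or None."""
    m = re.search(r"-version (\d+)", log_line)
    return int(m.group(1)) if m else None


def is_outdated(version, latest=641):
    """Assumes versions compare as plain integers (620 < 641)."""
    return version is not None and version < latest


line = ("[01:42:54] - Calling '.\\FahCore_11.exe -dir work/ -suffix 05 "
        "-checkpoint 9 -verbose -lifeline 3240 -version 620'")
print(client_version(line))               # 620
print(is_outdated(client_version(line)))  # True
```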
HendricksSA
 
Posts: 557
Joined: Fri Jun 26, 2009 4:34 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby HendricksSA » Tue Mar 22, 2011 8:40 pm

Drbricks, one of the seniors will have to check Project: 6801 (Run 4394, Clone 1, Gen 2) for you. If this is an infrequent failure on this GPU, it might be better to post this in the "Issues with a Specific Work Unit" thread. You may not get the visibility you need in this topic, so one of the mods may move this. I know the seniors will want to see your log if the work unit isn't bad. Please ensure you have the -verbosity 9 flag in your client as it adds useful troubleshooting info to the log. See the instructions at: http://fahwiki.net/index.php/How_do_I_a ... _client%3F Thanks.
HendricksSA
 
Posts: 557
Joined: Fri Jun 26, 2009 4:34 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE er

Postby bruce » Tue Mar 22, 2011 9:30 pm

I'm closing this topic. It has become a catch-all for several DIFFERENT types of problems, and each one has its own signature even though all may give you the NaNs detected message.

1) It may be a bad WU. The post above by "drbricks" is an excellent example of that problem. The same WU failed three times but there's no indication of a problem with other WUs. "HendricksSA" answered that one accurately. I'll copy the information from drbricks' post into a new post in that forum, though an extract from a log would be helpful. (It's not clear if he had an UNSTABLE_MACHINE error or not.)

2) Be sure your software has been upgraded to the latest version. Golden Dragoon's log shows an outdated client, and again, HendricksSA suggested the proper next step.

3) There's also the possibility of a hardware problem. GPUs do sometimes fail. They can overheat (particularly the single slot variety which leaves all the heat inside the case). They can be installed in systems which do not provide enough power. All of these options should be considered if you're still seeing UNSTABLE_MACHINE after eliminating (1) and (2).
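
The three cases above can be sketched as a rough decision order in Python. This is only an illustration of the reasoning, not an official procedure: the function name triage, the rule "exactly one WU failing while others complete", and the version number 641 (taken to mean 6.41) are all assumptions.

```python
from collections import Counter


def triage(results, client_version, latest_version=641):
    """results: list of (work_unit, completed) pairs in log order.
    Returns a hint about which of the three cases to look at first."""
    failed = Counter(wu for wu, ok in results if not ok)
    completed = {wu for wu, ok in results if ok}
    if not failed:
        return "no failures"
    # (1) One WU keeps failing while other WUs finish: suspect the WU itself.
    if len(failed) == 1 and completed:
        return "suspect a bad WU"
    # (2) Rule out old software before blaming the card.
    if client_version < latest_version:
        return "update the client"
    # (3) Otherwise look at heat, power and the GPU itself.
    return "check hardware"


drbricks = [
    ("P11247 R2 C55 G32", True),
    ("P6801 R4394 C1 G2", False),
    ("P6801 R4394 C1 G2", False),
    ("P6801 R4394 C1 G2", False),
    ("P11218 R1 C60 G15", True),
]
print(triage(drbricks, 641))  # suspect a bad WU
```

The drbricks client version is not stated in his post, so 641 is passed here only to make the example run.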

EDIT By PantherX-> A good place to start troubleshooting is by reading this post -> viewtopic.php?f=59&t=15196&start=15#p160068
bruce
 
Posts: 21341
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.
