[Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby toTOW » Thu Jan 15, 2009 5:27 pm

Before posting a report: if you're only seeing the issue on an individual WU (it can fail up to 6 times in a row before the client moves on to another one), please check whether it has already been reported as a bad WU in this forum: viewforum.php?f=19. If you're having the issue with multiple WUs (different Project/Run/Clone/Gen numbers), please do what is described below.

I've recently seen many reports of errors on NVIDIA GPUs, so I decided to create this thread. It is intended to centralize reports and to keep them separate from discussion of the issue. The goal is to help find a pattern that could trigger the issue. To make your report, please quote my post and remove any line in each section that doesn't apply to your case.

  • Failing projects (please add a list of exact project numbers if you have them)
    • 480 points WUs (project ranges : 5013-5016, 5504-5507, 5801)
    • 384 points WUs (project range : 5757-5764)
    • 511 points WUs (project range : 5749-5756)
    • 353 points WUs (project range : 5765-5772)
    • core 14 WUs (project range : 590x)
  • Failing hardware (please add the exact GPU designation if you know it, e.g. 9800GTX+)
    • GTX 2xx series
    • 9xxx series
    • 8xxx series
  • Failing OS
    • Windows XP 32 bits
    • Windows XP 64 bits
    • Windows Vista 32 bits
    • Windows Vista 64 bits
  • Failing drivers (enter here the version number of the driver you use)
    ...
  • Comments (add below any detail you might find useful to the report)
    ...

If you want a place for general discussion of this issue, you can start with this thread: viewtopic.php?f=19&t=7953, or find any other you like.
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

toTOW
Site Moderator
 
Posts: 7999
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby pmsfh2008 » Thu Jan 15, 2009 7:44 pm

* 353 points Wus (project range : 5765-5772)
* 9xxx series (edit: 3 x EVGA 9600 gso 384MB)
* Windows Vista 32 bits
* 180.60
Comments -
Projects > 5765 got NANS on GPU 2 & 3
Reinstalled 178.08 drivers, found 7/1/08 amdcalc.dll and amdcalt.dll in 2 & 3, replaced with 8/6/8 versions.
Restarted
[edit more, edit project more info]
* 353 points Wus (project range : 5765-5772)
5765 (R5, C182, G18)
5767 (R12, C231, G9)
5768
* 9xxx series
3 x EVGA 9600 gso 384MB
* Windows Vista 32 bits
* 178.08 (from replaced 180.60)
Comments -
GPU 1 processed project 5765, 5761,5749,5757 ... 5758, 5759, 5760, 5762, 5764, 5771 and continues for now
GPU 2 processed 5764, 5760, 5768 then NANS /EUE (slot 3-7); restarted NANS on 5765
GPU 3 processed 5764, 5759, 5765 then NANS /EUE (slot 3-7); restarted NANS on 5765
edit: 1/17/08 added 5765, 5768 problem WU's; Note: 5765, 5771 OK on GPU 1
Last edited by pmsfh2008 on Sun Jan 18, 2009 3:13 am, edited 6 times in total.
pmsfh2008
 
Posts: 75
Joined: Sat Jul 26, 2008 12:17 am
Location: Bradenton, FL

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Tobit » Thu Jan 15, 2009 7:48 pm

pmsfh2008: Which 9xxx series card do you have?
Tobit
 
Posts: 743
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby two00lbwaster » Thu Jan 15, 2009 8:12 pm

[*]Failing projects (please add a list of exact project numbers if you have them)

* 384 points WUs (project range : 5757-5764)
Project: 5757 (Run 10, Clone 89, Gen 141) 81%
Project: 5757 (Run 10, Clone 89, Gen 141) 55%
Project: 5757 (Run 10, Clone 89, Gen 141) 34%
Project: 5757 (Run 10, Clone 89, Gen 141) 100% :!:

Project: 5763 (Run 2, Clone 81, Gen 116) 2%
Project: 5763 (Run 2, Clone 81, Gen 116) 1%
Project: 5763 (Run 2, Clone 81, Gen 116) 1%
Project: 5763 (Run 2, Clone 81, Gen 116) 1%
Project: 5763 (Run 2, Clone 81, Gen 116) 1%

Project: 5763 (Run 6, Clone 69, Gen 102) 1%
Project: 5763 (Run 6, Clone 69, Gen 102) 2%
Project: 5763 (Run 6, Clone 69, Gen 102) 2%
Project: 5763 (Run 6, Clone 69, Gen 102) 1%

Project: 5759 (Run 4, Clone 51, Gen 92) 1%

Project: 5759 (Run 8, Clone 103, Gen 106) 1%
Project: 5759 (Run 8, Clone 103, Gen 106) 1%
Project: 5759 (Run 8, Clone 103, Gen 106) 1%
Project: 5759 (Run 8, Clone 103, Gen 106) 2%
Project: 5759 (Run 8, Clone 103, Gen 106) 1%
Project: 5759 (Run 8, Clone 103, Gen 106) 1%

* 511 points WUs (project range : 5749-5756)
Project: 5753 (Run 0, Clone 263, Gen 1) 52%
Project: 5753 (Run 0, Clone 263, Gen 1) 100% :!:

[*]Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
* 8800GT series PNY 256MB

[*]Failing OS
* Windows XP 64 bits

[*]Failing drivers (enter here the version number of the driver you use)
180.60 - No successful units completed
181.20 - Has started to return units successfully after acting like the 180.60 drivers for some time

[*]Comments (add below any detail you might find useful to the report)
This card ran for several weeks without EUEing on its own in a WinXP 64bit machine with 180.60 drivers on Intel Core2 hardware.

After being transplanted into a new machine it near constantly EUEed with the 180.60 drivers and after trying the 181.20 drivers with the same results I turned off the client and left it for a day. The following day I started the client up again which resulted in the strange units that progressively failed and then suddenly completed. Since then I've had this occur again. AMD 790FX 64x2 hardware this time.

This machine also has three other identical cards folding, and this card is GPU0. This is the only card with issues.
two00lbwaster
 
Posts: 53
Joined: Sat May 24, 2008 9:48 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby HenryW » Thu Jan 15, 2009 8:29 pm

I made a post in another thread about this but I thought I should do it here also.

* 384 points WUs (project range : 5757-5764)
* 9600 GT
* Windows XP 32 bits

Only this WU is a problem; all other WUs are OK so far: Project: 5763 (Run 1, Clone 1, Gen 85)
Prior to that (from Jan 2, the beginning of that particular log file, until now) I had: Project: 5763 (Run 12, Clone 6, Gen 80), Project: 5763 (Run 6, Clone 2, Gen 89), Project: 5763 (Run 14, Clone 407, Gen 24), Project: 5763 (Run 12, Clone 58, Gen 79), Project: 5763 (Run 4, Clone 38, Gen 54). No problem with any of these.

Folding 1 SMP & GPU on: XP 32-bit, Q6600 (OC 2.8), MSI P35 Neo-F, PNY 9600GT-180.48 drivers
Last edited by HenryW on Sat Jan 17, 2009 8:21 pm, edited 1 time in total.
HenryW
 
Posts: 19
Joined: Tue Oct 28, 2008 4:12 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby statesidecoma » Thu Jan 15, 2009 10:28 pm

  • Failing projects (please add a list of exact project numbers if you have them)
    * 353 points Wus (project range : 5765-5772) 5766 projects
  • Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
    8800 GT 512 MB
  • Failing drivers (enter here the version number of the driver you use)
    ...
  • Comments
    It is the 5766 project on one of my computers, and it doesn't matter whether you use 8 or 9 series video cards. It doesn't happen on all computers either, just my Dell XPS 400, no overclocks.
statesidecoma
 
Posts: 71
Joined: Thu Jan 15, 2009 3:51 am
Location: Grove, Oklahoma

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Ytterbium » Fri Jan 16, 2009 12:34 am

•Failing projects (please add a list of exact project numbers if you have them)

5765 (Run 3, Clone 251, Gen 8)
5765 (Run 3, Clone 372, Gen 7)
5765 (Run 4, Clone 148, Gen 14)
5765 (Run 6, Clone 387, Gen 22)
5765 (Run 7, Clone 371, Gen 9)
5765 (Run 13, Clone 176, Gen 43)
5765 (Run 13, Clone 301, Gen 4)
5765 (Run 2, Clone 458, Gen 6)
5766 (Run 0, Clone 350, Gen 3)
5766 (Run 6, Clone 239, Gen 12)
5766 (Run 6, Clone 325, Gen 9)
5766 (Run 8, Clone 398, Gen 3)
5766 (Run 13, Clone 484, Gen 16)
5768 (Run 10, Clone 122, Gen 9)
5768 (Run 12, Clone 85, Gen 56)
5771 (Run 12, Clone 48, Gen 6)

•Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
* 8800GTS 320Mb series

•Failing OS
* Windows Vista 64 bits

•Failing drivers (enter here the version number of the driver you use)
181.20

•Comments (add below any detail you might find useful to the report)

No overclocks, runs at 70 °C at full load.


Bought a new GFX card, I hope it goes away :) Card Installed, We'll see what happens
Last edited by Ytterbium on Tue Jan 27, 2009 11:42 pm, edited 14 times in total.
Ytterbium
 
Posts: 23
Joined: Sat Nov 15, 2008 11:24 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby T_Flight » Fri Jan 16, 2009 5:36 am

Failing projects (please add a list of exact project numbers if you have them)
* 384 points WUs (project range : 5757-5764)
I had five 5763 units EUE due to NaNs on GPU and UNSTABLE_MACHINE errors.

Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
* GTX2 xxx series
EVGA GTX 280 SSC 670/1451/1180 via EVGA Precision. Runs at 56C. GPU tested stable on numerous... over 400 units were successful before this incident, with not a single EUE until the losses posted above.

Failing OS
* Windows Vista 64 bit
Note: OS is not actually failing. System is extremely stable, but this is the OS I'm using. UAC is ON.

Failing drivers (enter here the version number of the driver you use)
nVidia 185.20 PhysX v8.11.15

Comments (add below any detail you might find useful to the report)
This system has folded these units absolutely flawlessly since Christmas. I've done two driver changes since that time, and each has performed with a 100% success rate up until those 5 WUs above failed, which filled up the Queue.dat and shut the client down. I attempted to restart and the client failed again. I then shut down the client to investigate. At that point I was stuck with the question "Should I uninstall, clean the folders and files, reinstall and attempt to get work done overnight, or wait?" I decided the best course of action was to uninstall, clean the registry and the FAH folders (which effectively got rid of the Queue.dat and all the associated files), and get the client up and running again with a clean install. I felt it was better for the project to get the work done and have those WUs recycled, which I had learned happens fairly quickly, than to let my system idle overnight. I felt like I had to choose the lesser of two evils. I hope I chose correctly.

The Client since that time has been folding flawlessly with not a single WU lost. I don't know if this was a client issue, or a Bad WU, but it seems to be related to those 5763's. I had 5 out of 5 fail 100%. I have since folded 9 out of 9 WU's total with a 100% success rate, and it's currently 91% of the way through a 5763 and looks like it's gonna finish.

Edited to add: The 5763 noted in the above paragraph did complete since clicking send on this post. I do NOT understand this now.

Since the wipe and reinstall it does seem like the client is a little quicker, but I will have to watch it a couple more days before I can say that for sure.

This machine has some of the best hardware that money can buy. It has an i7 965 CPU on a Rampage II Extreme motherboard. RAM is Corsair Dominator 1600 C8. The GPU has been tested stable with numerous programs including Furmark, LightsMark and the 3DMark apps, and was literally pounded with Furmark for hours before I started folding with it just before Christmas. Up until the incident above it never gave me a hint that anything was amiss or unstable.
T_Flight
 
Posts: 23
Joined: Tue Apr 29, 2008 4:34 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby OldChap » Fri Jan 16, 2009 6:10 pm

5751 (2,253,7) immediate
9800 gx2 (evga ssc)
xp 32
180.60

First failure in 2-3 weeks, in fact since re-installing all folding on 2 x GX2s. Temps on this card are typically <75 °C.
OldChap
 
Posts: 68
Joined: Thu Jan 01, 2009 10:27 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby fractal » Fri Jan 16, 2009 9:36 pm

System: 2 x ASUS 9600gso/TOP @ 600/900/1700, msi k9a2 platinum, ea380 psu, ForceWare 177.84, 6.20r1, 1.19, XP 32 SP3, X2-3800, everything as it came out of the box. Nothing has changed on my end for many months. This machine originally got few errors, then 20% errors with shutting down, now back to this.

3 UNSTABLE_MACHINE out of 405 WU's in the current logs (0.7%). All on the first card. No failures on second card. No display or terminator on second card.

5757 (6,87,84) @ 38%
5764 (6,86,98) @ 35%
5759 (1,112,68) @ 51%
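
For anyone who wants to quote a comparable figure, the percentage above is simply failed units divided by total units in the logs. A minimal sketch in Python; the function name `failure_rate` is only illustrative and is not part of any Folding@home tool:

```python
def failure_rate(failed, total):
    """Return the percentage of work units that failed.

    Hypothetical helper for illustration only.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of work units")
    return 100.0 * failed / total


# 3 UNSTABLE_MACHINE errors out of 405 WUs, as counted above
print(f"{failure_rate(3, 405):.1f}%")
```

Counting per card rather than per machine is probably more useful here, since all three failures were on the first card.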
fractal
 
Posts: 118
Joined: Mon Dec 03, 2007 12:42 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby am3k » Fri Jan 16, 2009 10:41 pm

# Failing projects (please add a list of exact project numbers if you have them)
* 384 points WUs (project range : 5757-5764)
Especially noted: 5760 coming back to haunt me again and again
# Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
* 8800 GTX
# Failing OS
* Windows SBS 2008 (64 bits)
# Failing drivers (enter here the version number of the driver you use)
Newest atm 181.20
I also had the issue with the 180.48 drivers

# Comments (add below any detail you might find useful to the report)
am3k
 
Posts: 4
Joined: Tue Oct 21, 2008 3:41 pm
Location: Norway

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby VijayPande » Sat Jan 17, 2009 12:31 am

Thank you for posting this information. We are looking into it. If you could post specific Project, Run, Clone, Gen/Frame info, that would be particularly useful, since we would then go to look at that particular WU and run it in house and give it a strong scrubbing.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
 
Posts: 2728
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby P5-133XL » Sat Jan 17, 2009 1:18 am

My experience is that when I have repeatedly failing folding issues, a reboot often fixes the problem. This seems to be true for both Nvidia and ATI. It doesn't fix all problems, just the vast majority of them. The GPU clients just plain don't seem to be stable 24x7; rather, after a few weeks of continuous folding they inevitably start EUE'ing or reporting UNSTABLE_MACHINE, and then after a reboot they start working again.
P5-133XL
Site Moderator
 
Posts: 4001
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby tomc001 » Sat Jan 17, 2009 2:48 am

[*]Failing projects (please add a list of exact project numbers if you have them)
No longer have log file

[*]Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
Quadro FX570m

[*]Failing OS
* Windows Vista 64 bits

[*]Failing drivers (enter here the version number of the driver you use)
Lenovo ThinkPad driver 7.15.11.7597


[*]Comments (add below any detail you might find useful to the report)
Tried GPU on my ThinkPad T61p. It worked much of the time but would sometimes fail with Unstable error and NAN error. I figured the GPU cooling in the laptop wasn't good enough for sustained folding and gave up and now just run FAH smp on the ThinkPad.
tomc001
 
Posts: 26
Joined: Mon Jan 05, 2009 12:58 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby two00lbwaster » Sat Jan 17, 2009 10:36 am

[*]Failing projects (please add a list of exact project numbers if you have them)

Project: 5757 (Run 10, Clone 89, Gen 141) 81%
Project: 5757 (Run 10, Clone 89, Gen 141) 55%
Project: 5757 (Run 10, Clone 89, Gen 141) 34%
Project: 5757 (Run 10, Clone 89, Gen 141) 100%

Project: 5753 (Run 0, Clone 263, Gen 1) 52%
Project: 5753 (Run 0, Clone 263, Gen 1) 100%

Project: 5762 (Run 1, Clone 15, Gen 159) 55%
Project: 5762 (Run 1, Clone 15, Gen 159) 90%

Project: 5771 (Run 1, Clone 256, Gen 1) 29%
Project: 5771 (Run 1, Clone 256, Gen 1) 100%

Project: 5759 (Run 14, Clone 74, Gen 156) 25%
Project: 5759 (Run 14, Clone 74, Gen 156) 61%
Project: 5759 (Run 14, Clone 74, Gen 156) 85%
Separate work Unit completed - Project: 5750 (Run 5, Clone 265, Gen 5)
Project: 5759 (Run 14, Clone 74, Gen 156) 9%
Project: 5759 (Run 14, Clone 74, Gen 156) 100%

Project: 5760 (Run 12, Clone 34, Gen 71) 65%
Project: 5760 (Run 12, Clone 34, Gen 71) 38%
Project: 5760 (Run 12, Clone 34, Gen 71) 100%

Project: 5770 (Run 9, Clone 250, Gen 3) 9%
Project: 5770 (Run 9, Clone 250, Gen 3) 18%

Project: 5760 (Run 8, Clone 9, Gen 118) 59%
Project: 5760 (Run 8, Clone 9, Gen 118) 44%
Project: 5760 (Run 8, Clone 9, Gen 118) 38%

Computer shuts down folding for 24hrs due to the last five mdrun errors listed above

[*]Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
* 8800GT series PNY 256MB

[*]Failing OS
* Windows XP 64 bits

[*]Failing drivers (enter here the version number of the driver you use)
180.60 - No successful units completed
181.20 - Has started to return units successfully after acting like the 180.60 drivers for some time

[*]Comments (add below any detail you might find useful to the report)
This card ran for several weeks without EUEing on its own in a WinXP 64bit machine with 180.60 drivers on Intel Core2 hardware.

After being transplanted into a new machine it near constantly EUEed with the 180.60 drivers and after trying the 181.20 drivers with the same results I turned off the client and left it for a day. The following day I started the client up again which resulted in the strange units that progressively failed and then suddenly completed. Since then I've had this occur again. AMD 790FX 64x2 hardware this time.

This machine also has three other identical cards folding, and this card is GPU0. This is the only card with issues.

In addition, temps are ~38degC at load (my attic is cold in the winter) and stock clocks on the problem card (lowering the clocks doesn't help)

After every error I get the following:

Code:
Error: Could not get length of results file work/wuresults_0*.dat
Error: Could not read unit 0* file. Removing from queue.


I take it that this is normal? The units are not getting deleted though as my work folders are full of old units. I just cleared out a folder and it had 70MB of WU files in it

Update:
I just got the following errors when trying to restart the GPU
Code:
[10:19:11] Project: 5760 (Run 8, Clone 9, Gen 118)
[10:19:11]
[10:19:11] Assembly optimizations on if available.
[10:19:11] Entering M.D.
[10:19:20] Working on Protein
[10:19:20] mdrun_gpu returned
[10:19:20] Self-test failure
[10:19:20]
[10:19:20] Folding@home Core Shutdown: UNSTABLE_MACHINE
[10:19:24] CoreStatus = 7A (122)
[10:19:24] Sending work to server
[10:19:24] Project: 5760 (Run 8, Clone 9, Gen 118)

five times then shutdown.

I have deleted the work folder contents and the queue and will see what I get.
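
Since exact Project/Run/Clone/Gen numbers were asked for earlier in the thread, a log excerpt like the one above can be scanned with a short script instead of by hand. This is only a rough sketch in Python: it assumes the log uses the same "Project: ... (Run ..., Clone ..., Gen ...)" and "UNSTABLE_MACHINE" lines shown in my excerpt, and the function name `failed_units` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Project: 5760 (Run 8, Clone 9, Gen 118)".
# Assumption: the log format is the one shown in the excerpt above.
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def failed_units(log_text):
    """Count UNSTABLE_MACHINE shutdowns per (project, run, clone, gen)."""
    failures = Counter()
    current = None  # most recent work unit seen in the log
    for line in log_text.splitlines():
        match = PRCG.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
        elif "UNSTABLE_MACHINE" in line and current is not None:
            failures[current] += 1
            current = None  # avoid counting the same shutdown twice
    return failures


SAMPLE = """\
[10:19:11] Project: 5760 (Run 8, Clone 9, Gen 118)
[10:19:20] mdrun_gpu returned
[10:19:20] Self-test failure
[10:19:20] Folding@home Core Shutdown: UNSTABLE_MACHINE
[10:19:24] Sending work to server
[10:19:24] Project: 5760 (Run 8, Clone 9, Gen 118)
"""

for (project, run, clone, gen), count in failed_units(SAMPLE).items():
    print(f"Project: {project} (Run {run}, Clone {clone}, Gen {gen}) failed {count} time(s)")
```

The Project line printed again after "Sending work to server" is not counted as a second failure, because a failure is only recorded when an UNSTABLE_MACHINE line follows a Project line.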
two00lbwaster
 
Posts: 53
Joined: Sat May 24, 2008 9:48 pm
