[Please read] NaNs detected on GPU - UNSTABLE_MACHINE error


Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby bruce » Tue Aug 25, 2009 5:10 pm

I can't say. It might be helpful if somebody could locate a FAHlog from Gen (N-1), but finding one is going to be a bit of a problem: you have to wait for someone to report a problem with Gen (N) before you know your FAHlog might be useful.
bruce
 
Posts: 23737
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Archangelboy » Tue Aug 25, 2009 5:14 pm

Point. I don't know enough about it. If I've got a log from Gen N-1 around, but say I've reset my client or restarted my computer since I've run it, can I pull the log, and if so, how?
Archangelboy
 
Posts: 19
Joined: Thu Jul 09, 2009 6:52 pm
Location: Bozeman, MT

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby bruce » Tue Aug 25, 2009 5:20 pm

FAHlog continues to grow until you restart the client. If, at that time, it's larger than 50k, it is renamed to FAHlog-Prev and FAHlog starts from an empty file. (If there was an older FAHlog-Prev, it is discarded.)

Any text editor will allow you to extract information from either file and paste it here.
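The rotation rule described above can be sketched roughly as follows. This is a minimal sketch based only on bruce's description: the exact filenames (FAHlog.txt, FAHlog-Prev.txt) and the precise 50k threshold are assumptions, not taken from the client's actual source.

```python
import os

ROTATE_THRESHOLD = 50 * 1024  # "larger than 50k" per the description above

def rotate_fahlog(workdir):
    """On client restart: if FAHlog exceeds the threshold, it becomes
    FAHlog-Prev (any older FAHlog-Prev is discarded) and FAHlog restarts
    empty; otherwise nothing happens."""
    log = os.path.join(workdir, "FAHlog.txt")         # assumed filename
    prev = os.path.join(workdir, "FAHlog-Prev.txt")   # assumed filename
    if os.path.exists(log) and os.path.getsize(log) > ROTATE_THRESHOLD:
        if os.path.exists(prev):
            os.remove(prev)        # the older FAHlog-Prev is discarded
        os.rename(log, prev)       # current log becomes the -Prev copy
        open(log, "w").close()     # FAHlog starts from an empty file
```

So a small log survives a restart untouched, and at most one previous log is ever kept.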
bruce
 
Posts: 23737
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Archangelboy » Tue Aug 25, 2009 5:26 pm

Thanks, Bruce. How about in the case of multiple restarts, a log less than 50k, or more than one 50k+ previous file? Do they overwrite, create multiple sequential copies, or append to the FAHlog-Prev file? I try to give the clients long runs, but sometimes that just doesn't happen, so it's good to know what I can and can't do.

To clarify, what I'd be hypothetically looking for under a problem with, say WU 5xxx(Rxx, Cxx, Gxxx) is a completed (or not) log for 5xxx (Rxx, Cxx, (Gxxx-1)), correct?
Archangelboy
 
Posts: 19
Joined: Thu Jul 09, 2009 6:52 pm
Location: Bozeman, MT

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby toTOW » Tue Aug 25, 2009 5:34 pm

We're not talking about the right log here ... when something goes wrong in the simulation, the useful details for developers are in the wudata_XX.log file ... which is included with the result file, and erased from the local drive each time a WU completes.
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

toTOW
Site Moderator
 
Posts: 9177
Joined: Sun Dec 02, 2007 11:38 am
Location: Bordeaux, France

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Archangelboy » Tue Aug 25, 2009 5:35 pm

Hmm, could that be changed to assist with error fixing? Or is that file sent with the WU?
Archangelboy
 
Posts: 19
Joined: Thu Jul 09, 2009 6:52 pm
Location: Bozeman, MT

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby bruce » Tue Aug 25, 2009 9:24 pm

The file is sent with the WU -- if the WU is uploaded -- and apparently it was or the system wouldn't have generated Gen (N), so I guess the developers already have whatever information they need.
bruce
 
Posts: 23737
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Archangelboy » Tue Aug 25, 2009 9:28 pm

That would be the reasonable assumption. Why then the errors we've been seeing? I'd actually feel better if it were limited to me and my machines. Seems to be somewhat widespread.

Since we're on the topic, what about error checking? When a WU is successfully completed, is that an indicator that the work was done accurately? Or is that how NaNs/EUEs work, based on some perceived error in calculations? Where is some good reading on error and accuracy related to DC folding?
Archangelboy
 
Posts: 19
Joined: Thu Jul 09, 2009 6:52 pm
Location: Bozeman, MT

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Kougar » Thu Aug 27, 2009 9:43 am

bruce wrote:I can't say. It might be helpful if somebody could locate a FAHlog from Gen (N-1) but finding one is going to be a bit of a problem since you have to wait for someone to report a problem with Gen (N) before you know your FAHlog might be useful.


I posted a completed and a failed log for the exact same work unit using the exact same GPU / system config, would that count?

bruce wrote:The file is sent with the WU -- if the WU is uploaded -- and apparently it was or the system wouldn't have generated Gen (N), so I guess the developers already have whatever information they need.


If those aren't the right logs in question, both results were submitted to the F@H servers, best I recall... it might be an interesting case study? The logs are at: this link

To add to my previous post in this thread, the GPU isn't the cause. I replaced the GTX 260 in this system with a borrowed GTX 285 (which folded 100% error-free in another computer), and it gave the same errors at about the same frequency. Every driver crash and failed work unit yields the same SEH code, "3221225477", if that helps anything. Given this issue occurred under the Windows 7 Beta install, I'm almost wondering if it's some problem with the motherboard hardware; out of all the folding rigs I've worked on or troubleshot for folding teams, this problem has been a unique case for me.
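For reference, the SEH code quoted above is the unsigned 32-bit decimal rendering of the Windows NTSTATUS value 0xC0000005, STATUS_ACCESS_VIOLATION, i.e. a generic access-violation crash rather than anything GPU-specific. A quick check:

```python
# 3221225477 is the unsigned 32-bit decimal form of NTSTATUS 0xC0000005,
# STATUS_ACCESS_VIOLATION (the generic Windows access-violation code).
seh_code = 3221225477
print(hex(seh_code))  # 0xc0000005
```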
Kougar
 
Posts: 172
Joined: Fri Apr 11, 2008 3:39 am
Location: Texas

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby RobertAllanHinton » Tue Sep 01, 2009 12:41 pm

I have recently installed a RADEON 4650 video card into my COMPAQ SR1910NX computer system. This card has functioned flawlessly for about two weeks, earning about 750 PPD folding proteins.

Just yesterday, I received my first error message of "mdrun_gpu returned NaNs detected on GPU". After three successive aborted runs, I began to study the log and learned that all three runs were identical, to wit:

Project 5742
Run 3
Clone 64
Gen 373

The first abort occurred at 50 percent completion, the second at 56 percent, the third at 38 percent.

After the third abort, I interrupted the F@H processing (both CPU and GPU clients), updated my virus software, and restarted the computer system (twice). Then I restarted the CPU client, and then the GPU client. At first, the GPU protein folding appeared to resume normally, at about 8 percent of project completion. However, at 20 percent, the error message was recorded, and the project was terminated (UNSTABLE_MACHINE).

Why can't the server be programmed to "pass over" any project which has been abnormally terminated, and download some other project, rather than banging its head against the wall on the same project for a multiplicity of attempts? It seems that moving on to another project, and receiving more error messages there too, would help to identify the local machine and/or software as the cause of the difficulty, rather than a 'badly behaving' project.
RobertAllanHinton
 
Posts: 20
Joined: Sat Apr 25, 2009 7:14 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby bruce » Wed Sep 02, 2009 4:25 am

There are two distinct scenarios. The server is programmed to "pass over" any WU for which your client uploads a partial (or a complete) result. Abnormally terminated WUs often do not upload anything, hence the WU is reassigned.

The real question is why you got errors at 50%, 56%, and 38%. It's very likely that you've got stability problems, perhaps from overclocking, perhaps from overheating, or perhaps from a marginal power supply. The client is trying to tell you to fix your hardware, and assigning a different WU to you won't do any good if your hardware is marginal.
bruce
 
Posts: 23737
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Nathan V » Tue Sep 22, 2009 6:07 am

  • Failing projects (please add a list of exact project numbers if you have them)
    • See Below
  • Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
    • EVGA 8800GT x2, stock clocking
  • Failing OS
    • Windows 7 RC x64
  • Failing drivers (enter here the version number of the driver you use)
    • 190.56
  • Comments (add below any detail you might find useful to the report)
    • This machine folds 24x7 lately with no errors I noticed until recently. I'd love to get this resolved. :(

Code: Select all
[00:06:58] Project: 5770 (Run 0, Clone 152, Gen 209)
[00:06:58]
[00:06:58] Assembly optimizations on if available.
[00:06:58] Entering M.D.
[00:07:04] mdrun_gpu returned
[00:07:04] Going to send back what have done -- stepsTotalG=0
[00:07:04] Work fraction=0.0000 steps=0.
[00:07:08] logfile size=0 infoLength=0 edr=0 trr=25
[00:07:08] - Writing 637 bytes of core data to disk...
[00:07:08] Done: 125 -> 123 (compressed to 98.4 percent)
[00:07:08]   ... Done.
[00:07:08]
[00:07:08] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[00:07:22] Project: 5770 (Run 12, Clone 161, Gen 204)
[00:07:22]
[00:07:22] Assembly optimizations on if available.
[00:07:22] Entering M.D.
[00:07:28] mdrun_gpu returned
[00:07:28] Going to send back what have done -- stepsTotalG=0
[00:07:28] Work fraction=0.0000 steps=0.
[00:07:32] logfile size=0 infoLength=0 edr=0 trr=25
[00:07:32] - Writing 637 bytes of core data to disk...
[00:07:32] Done: 125 -> 124 (compressed to 99.2 percent)
[00:07:32]   ... Done.
[00:07:33]
[00:07:33] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[00:07:47] Project: 5767 (Run 11, Clone 132, Gen 898)
[00:07:47]
[00:07:47] Assembly optimizations on if available.
[00:07:47] Entering M.D.
[00:07:53] mdrun_gpu returned
[00:07:53] Going to send back what have done -- stepsTotalG=0
[00:07:53] Work fraction=0.0000 steps=0.
[00:07:57] logfile size=0 infoLength=0 edr=0 trr=25
[00:07:57] - Writing 637 bytes of core data to disk...
[00:07:57] Done: 125 -> 124 (compressed to 99.2 percent)
[00:07:57]   ... Done.
[00:07:57]
[00:07:57] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[00:08:11] Project: 5769 (Run 6, Clone 271, Gen 1117)
[00:08:11]
[00:08:11] Assembly optimizations on if available.
[00:08:11] Entering M.D.
[00:08:18] mdrun_gpu returned
[00:08:18] Going to send back what have done -- stepsTotalG=0
[00:08:18] Work fraction=0.0000 steps=0.
[00:08:22] logfile size=0 infoLength=0 edr=0 trr=25
[00:08:22] - Writing 637 bytes of core data to disk...
[00:08:22] Done: 125 -> 124 (compressed to 99.2 percent)
[00:08:22]   ... Done.
[00:08:22]
[00:08:22] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[00:08:36] Project: 5765 (Run 5, Clone 294, Gen 867)
[00:08:36]
[00:08:36] Assembly optimizations on if available.
[00:08:36] Entering M.D.
[00:08:42] mdrun_gpu returned
[00:08:42] Going to send back what have done -- stepsTotalG=0
[00:08:42] Work fraction=0.0000 steps=0.
[00:08:46] logfile size=0 infoLength=0 edr=0 trr=25
[00:08:46] - Writing 637 bytes of core data to disk...
[00:08:46] Done: 125 -> 124 (compressed to 99.2 percent)
[00:08:46]   ... Done.
[00:08:46]
[00:08:46] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[21:44:50] Project: 5770 (Run 11, Clone 229, Gen 430)
[21:44:50]
[21:44:50] Assembly optimizations on if available.
[21:44:50] Entering M.D.
[21:44:56] mdrun_gpu returned
[21:44:56] Going to send back what have done -- stepsTotalG=0
[21:44:56] Work fraction=0.0000 steps=0.
[21:45:01] logfile size=0 infoLength=0 edr=0 trr=25
[21:45:01] - Writing 637 bytes of core data to disk...
[21:45:01] Done: 125 -> 124 (compressed to 99.2 percent)
[21:45:01]   ... Done.
[21:45:01]
[21:45:01] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[21:45:15] Project: 5769 (Run 0, Clone 44, Gen 177)
[21:45:15]
[21:45:15] Assembly optimizations on if available.
[21:45:15] Entering M.D.
[21:45:21] mdrun_gpu returned
[21:45:21] Going to send back what have done -- stepsTotalG=0
[21:45:21] Work fraction=0.0000 steps=0.
[21:45:25] logfile size=0 infoLength=0 edr=0 trr=25
[21:45:25] - Writing 637 bytes of core data to disk...
[21:45:25] Done: 125 -> 124 (compressed to 99.2 percent)
[21:45:25]   ... Done.
[21:45:25]
[21:45:25] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[21:45:39] Project: 5769 (Run 13, Clone 187, Gen 155)
[21:45:39]
[21:45:39] Assembly optimizations on if available.
[21:45:39] Entering M.D.
[21:45:45] mdrun_gpu returned
[21:45:45] Going to send back what have done -- stepsTotalG=0
[21:45:45] Work fraction=0.0000 steps=0.
[21:45:49] logfile size=0 infoLength=0 edr=0 trr=25
[21:45:49] - Writing 637 bytes of core data to disk...
[21:45:49] Done: 125 -> 124 (compressed to 99.2 percent)
[21:45:49]   ... Done.
[21:45:50]
[21:45:50] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[21:46:04] Project: 5766 (Run 12, Clone 412, Gen 98)
[21:46:04]
[21:46:04] Assembly optimizations on if available.
[21:46:04] Entering M.D.
[21:46:10] mdrun_gpu returned
[21:46:10] Going to send back what have done -- stepsTotalG=0
[21:46:10] Work fraction=0.0000 steps=0.
[21:46:14] logfile size=0 infoLength=0 edr=0 trr=25
[21:46:14] - Writing 637 bytes of core data to disk...
[21:46:14] Done: 125 -> 124 (compressed to 99.2 percent)
[21:46:14]   ... Done.
[21:46:14]
[21:46:14] Folding@home Core Shutdown: UNSTABLE_MACHINE


Code: Select all
[21:46:28] Project: 5767 (Run 12, Clone 231, Gen 132)
[21:46:28]
[21:46:28] Assembly optimizations on if available.
[21:46:28] Entering M.D.
[21:46:34] mdrun_gpu returned
[21:46:34] Going to send back what have done -- stepsTotalG=0
[21:46:34] Work fraction=0.0000 steps=0.
[21:46:38] logfile size=0 infoLength=0 edr=0 trr=25
[21:46:38] - Writing 637 bytes of core data to disk...
[21:46:38] Done: 125 -> 124 (compressed to 99.2 percent)
[21:46:38]   ... Done.
[21:46:38]
[21:46:38] Folding@home Core Shutdown: UNSTABLE_MACHINE
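When collecting a batch of failures like the ones above, a short script can pull the (Project, Run, Clone, Gen) tuples out of a FAHlog so they can be pasted into a report. A minimal sketch; the line format is taken from the excerpts above, and the sample text here is a stand-in for a real log file:

```python
import re

# Sample lines in the format shown in the excerpts above.
LOG = """\
[00:06:58] Project: 5770 (Run 0, Clone 152, Gen 209)
[00:07:08] Folding@home Core Shutdown: UNSTABLE_MACHINE
[00:07:22] Project: 5770 (Run 12, Clone 161, Gen 204)
[00:07:33] Folding@home Core Shutdown: UNSTABLE_MACHINE
"""

# Matches the "Project: P (Run R, Clone C, Gen G)" line the client logs.
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

failures = [tuple(map(int, m.groups())) for m in PRCG.finditer(LOG)]
print(failures)  # [(5770, 0, 152, 209), (5770, 12, 161, 204)]
```

Reading a real file instead is just a matter of replacing LOG with open("FAHlog.txt").read().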






Memtestg80 --gpu 0 256 50

Code: Select all
Test iteration 50 (GPU 0, 256 MiB): 0 errors so far
        Moving Inversions (ones and zeros): 0 errors (47 ms)
        Memtest86 Walking 8-bit: 0 errors (327 ms)
        True Walking zeros (8-bit): 0 errors (172 ms)
        True Walking ones (8-bit): 0 errors (172 ms)
        Moving Inversions (random): 0 errors (46 ms)
        Memtest86 Walking zeros (32-bit): 0 errors (656 ms)
        Memtest86 Walking ones (32-bit): 0 errors (639 ms)
        Random blocks: 0 errors (281 ms)
        Memtest86 Modulo-20: 0 errors (2949 ms)
        Logic (one iteration): 0 errors (62 ms)
        Logic (4 iterations): 0 errors (140 ms)
        Logic (shared memory, one iteration): 0 errors (78 ms)
        Logic (shared-memory, 4 iterations): 0 errors (234 ms)

Final error count after 50 iterations over 256 MiB of GPU memory: 0 errors


Memtestg80 --gpu 1 256 50

Code: Select all
Test iteration 50 (GPU 1, 256 MiB): 0 errors so far
        Moving Inversions (ones and zeros): 0 errors (46 ms)
        Memtest86 Walking 8-bit: 0 errors (328 ms)
        True Walking zeros (8-bit): 0 errors (156 ms)
        True Walking ones (8-bit): 0 errors (172 ms)
        Moving Inversions (random): 0 errors (31 ms)
        Memtest86 Walking zeros (32-bit): 0 errors (639 ms)
        Memtest86 Walking ones (32-bit): 0 errors (640 ms)
        Random blocks: 0 errors (296 ms)
        Memtest86 Modulo-20: 0 errors (2980 ms)
        Logic (one iteration): 0 errors (47 ms)
        Logic (4 iterations): 0 errors (156 ms)
        Logic (shared memory, one iteration): 0 errors (62 ms)
        Logic (shared-memory, 4 iterations): 0 errors (234 ms)

Final error count after 50 iterations over 256 MiB of GPU memory: 0 errors
Nathan V
 
Posts: 11
Joined: Thu Jul 09, 2009 3:07 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby moontube » Sat Sep 26, 2009 3:58 am

For the past few days, I have been having the NaNs / UNSTABLE_MACHINE error followed by a 24-hour pause. This appears to be the same thing discussed in the posts in this thread in January, and it appears to occur on 353-point work units only. Has there been any resolution on this? Productivity suffers if I cannot count on automatic launching of the next work unit after one has completed.
moontube
 
Posts: 7
Joined: Sat Jan 10, 2009 8:09 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby CrustyCat » Sat Sep 26, 2009 5:02 pm

moontube wrote:For the past few days, I have been having the NANS UNSTABLE MACHINE error followed by a 24 hour pause. This appears to be the same thing discussed in the posts in this thread in January. This appears to occur on 353 point work units only. Has there been any resolution on this? Productivity suffers if I cannot count on automatic launching of the next work unit after one has completed.

I've had a lot of these recently also. It's a pain and, as far as I can tell, nothing can be done about it. My 9800GX2 is not overclocked and does the other WUs fine, even some of the 353s. But the 353s are the ones I get the NaNs on.
CrustyCat
 
Posts: 43
Joined: Sun Jun 28, 2009 11:10 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby shdbcamping » Sun Sep 27, 2009 9:34 am

I have had EUEs on everything other than 353 WUs on 2 cores of my GTX295s (2 of 6 clients). Just a post to say that you're not alone in the "unexplained" realm.
Sean
shdbcamping
 
Posts: 519
Joined: Mon Nov 10, 2008 8:57 am
