[Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Moderators: slegrand, Site Moderators, PandeGroup

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby cristipurdel » Wed Dec 02, 2009 10:38 am

# Failing projects (please add a list of exact project numbers if you have them)
Project: 5765 (Run 5, Clone 341, Gen 1344)
Project: 5772 (Run 5, Clone 19, Gen 652)

# Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
Quadro FX 580

# Failing OS
Windows 7 64-bit

# Failing drivers (enter here the version number of the driver you use)
191.00 & 191.66

# Comments (add below any detail you might find useful to the report)
I don't remember whether the NVIDIA client ever worked on the FX 580. It was at 70% when I stopped folding, and now it gives constant NaNs.
cristipurdel
 
Posts: 73
Joined: Wed Mar 12, 2008 11:40 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby wkhudson » Fri Dec 04, 2009 4:52 am

I'm not sure if anyone else has reported this yet, but if I turn off hyper-threading on my CPU, the nVidia GPU client works without detecting NaNs. With hyper-threading on, I get constant NaNs.
I'm running the 6.23 GPUv2 client on a Win7 Ultimate 64-bit machine (Intel i7 940) with a GeForce GTX 280 using the 191.07 nVidia driver. Nothing is over- or underclocked.

Cheers
Last edited by wkhudson on Mon Dec 28, 2009 4:55 pm, edited 1 time in total.
wkhudson
 
Posts: 10
Joined: Fri Dec 04, 2009 4:39 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby DavidMudkips » Fri Dec 18, 2009 12:29 am

# Failing projects (please add a list of exact project numbers if you have them)
Project: 5795 (Run 0, Clone 940, Gen 0)

# Failing hardware (please add the exact GPU designation if you know it. ie 9800GTX+)
Geforce 8600M GT

# Failing OS
Windows Vista-32

# Failing drivers (enter here the version number of the driver you use)
186.81

# Comments (add below any detail you might find useful to the report)
This one just NaN'd on me at 54% and gave me the dreaded file read error (see below), which from experience means I won't get the points for the partial WU.
mdrun_gpu returned
NANs detected on GPU
Error: Could not get length of results file work/wuresults_07.dat
Error: Could not read unit 07 file. Removing from queue.


I get NaNs and EUEs on other work units too with some regularity (~30%), but I usually upload the partial WU and get credit. Anyone know why some NaNs don't upload? I checked the work folder and I still have the .dat file, so what's going on here?

I can post the other WUs that fail if you want.

Oh and this was on core 1.19. I just saw 1.31 was released - would upgrading to that do anything?
DavidMudkips
 
Posts: 16
Joined: Wed Sep 23, 2009 6:04 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby MtM » Fri Dec 18, 2009 12:34 am

The NaN most likely hit while the client was accessing the partial results file, which left the file corrupted (note: its length does not match what is expected). You can get credit for EUEs if the results file is still valid and readable, because it then contains scientific data that tells the researchers why it EUE'd. Some WUs will EUE on all systems because the starting conditions (atom positions, temperature, etc.) can only lead to an early unit end, not because of hardware problems.

You cannot get partial credit for something that cannot be checked; it's a failsafe. If you could, people would trash WUs for partial credit (in some cases, partial credit for a given number of steps is more beneficial than completing the unit, if the unit gives less PPD than others).
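
If you want to look at what is actually left in the work folder after one of these errors, something like the following Python sketch may help. The `wuresults_NN.dat` naming is taken from the log message; the helper name and the three status labels are made up for illustration, and this only checks for presence and size. It says nothing about whether the client or server would accept the file.

```python
import os

def results_file_status(work_dir, slot):
    """Classify wuresults_NN.dat for a queue slot (hypothetical helper)."""
    path = os.path.join(work_dir, "wuresults_%02d.dat" % slot)
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    # A non-empty file can still be corrupt; this sketch cannot tell.
    return "present"

if __name__ == "__main__":
    # Assumes the script is run from the client's launch directory.
    print(results_file_status("work", 7))
```

A "present" result would be consistent with a .dat file that is still on disk but unreadable to the client.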
MtM
 
Posts: 3054
Joined: Fri Jun 27, 2008 2:20 pm
Location: The Netherlands

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby Ripshod » Fri Dec 18, 2009 9:46 pm

I've been watching this topic for quite a while now, having struggled to complete more than 20% of units and never completing any of the units 5765-5768.
While reading other forums I got it into my head that maybe this problem has something to do with the infamous nvlddmkm.sys driver problem. In a post someone suggested disabling UAC (user account control).
So I did. The first WU I pulled after was a 5765 which completed 100%, then the next, and the next. 24hrs now without a single hiccup. Fingers crossed I've fixed my problem. I'm not saying that this is a guaranteed fix for every situation, I'm just saying it worked for me.

AMD 64 X2 3800+
PNY 8600GTS
XTreme-G 191.07 Drivers
Works with nVidia 195.62 too
1090T / HD5770 / HD7950
Ripshod
 
Posts: 32
Joined: Fri Nov 27, 2009 2:27 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby powerarmour » Thu Dec 24, 2009 10:32 am

With the new FahCore 11 v1.31 I've just started getting NaNs on my GeForce GT 240:

Code:
31:10] - Error: Could not get length of results file work/wuresults_06.dat
[08:31:10] - Error: Could not read unit 06 file. Removing from queue.
[08:31:10] - Preparing to get new work unit...
[08:31:10] + Attempting to get work packet
[08:31:10] - Connecting to assignment server
[08:31:12] - Successful: assigned to (171.67.108.11).
[08:31:12] + News From Folding@Home: Welcome to Folding@Home
[08:31:12] Loaded queue successfully.
[08:31:14] + Closed connections
[08:31:19]
[08:31:19] + Processing work unit
[08:31:19] Core required: FahCore_11.exe
[08:31:19] Core found.
[08:31:19] Working on queue slot 07 [December 24 08:31:19 UTC]
[08:31:19] + Working ...
[08:31:19]
[08:31:19] *------------------------------*
[08:31:19] Folding@Home GPU Core
[08:31:19] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[08:31:19]
[08:31:19] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[08:31:19] Build host: amoeba
[08:31:19] Board Type: Nvidia
[08:31:19] Core      :
[08:31:19] Preparing to commence simulation
[08:31:19] - Looking at optimizations...
[08:31:19] DeleteFrameFiles: successfully deleted file=work/wudata_07.ckp
[08:31:19] - Created dyn
[08:31:19] - Files status OK
[08:31:19] - Expanded 46719 -> 252912 (decompressed 541.3 percent)
[08:31:19] Called DecompressByteArray: compressed_data_size=46719 data_size=252912, decompressed_data_size=252912 diff=0
[08:31:19] - Digital signature verified
[08:31:19]
[08:31:19] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:31:19]
[08:31:19] Assembly optimizations on if available.
[08:31:19] Entering M.D.
[08:31:25] Tpr hash work/wudata_07.tpr:  85820465 1062214972 2158366696 3621451030 280862858
[08:31:25]
[08:31:25] Calling fah_main args: 14 usage=100
[08:31:25]
[08:31:26] Working on Protein
[08:31:27] mdrun_gpu returned
[08:31:27] Self-test failure
[08:31:27]
[08:31:27] Folding@home Core Shutdown: UNSTABLE_MACHINE
[08:31:29] CoreStatus = 7A (122)
[08:31:29] Sending work to server
[08:31:29] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:31:29] - Read packet limit of 540015616... Set to 524286976.
[08:31:29] - Error: Could not get length of results file work/wuresults_07.dat
[08:31:29] - Error: Could not read unit 07 file. Removing from queue.
[08:31:29] - Preparing to get new work unit...
[08:31:29] + Attempting to get work packet
[08:31:29] - Connecting to assignment server
[08:31:30] - Successful: assigned to (171.67.108.11).
[08:31:30] + News From Folding@Home: Welcome to Folding@Home
[08:31:30] Loaded queue successfully.
[08:31:32] + Closed connections
[08:31:37]
[08:31:37] + Processing work unit
[08:31:37] Core required: FahCore_11.exe
[08:31:37] Core found.
[08:31:37] Working on queue slot 08 [December 24 08:31:37 UTC]
[08:31:37] + Working ...
[08:31:37]
[08:31:37] *------------------------------*
[08:31:37] Folding@Home GPU Core
[08:31:37] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[08:31:37]
[08:31:37] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[08:31:37] Build host: amoeba
[08:31:37] Board Type: Nvidia
[08:31:37] Core      :
[08:31:37] Preparing to commence simulation
[08:31:37] - Looking at optimizations...
[08:31:37] DeleteFrameFiles: successfully deleted file=work/wudata_08.ckp
[08:31:37] - Created dyn
[08:31:37] - Files status OK
[08:31:37] - Expanded 46719 -> 252912 (decompressed 541.3 percent)
[08:31:37] Called DecompressByteArray: compressed_data_size=46719 data_size=252912, decompressed_data_size=252912 diff=0
[08:31:37] - Digital signature verified
[08:31:37]
[08:31:37] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:31:37]
[08:31:37] Assembly optimizations on if available.
[08:31:37] Entering M.D.
[08:31:43] Tpr hash work/wudata_08.tpr:  85820465 1062214972 2158366696 3621451030 280862858
[08:31:43]
[08:31:43] Calling fah_main args: 14 usage=100
[08:31:43]
[08:31:44] Working on Protein
[08:31:45] mdrun_gpu returned
[08:31:45] Self-test failure
[08:31:45]
[08:31:45] Folding@home Core Shutdown: UNSTABLE_MACHINE
[08:31:47] CoreStatus = 7A (122)
[08:31:47] Sending work to server
[08:31:47] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:31:47] - Read packet limit of 540015616... Set to 524286976.
[08:31:47] - Error: Could not get length of results file work/wuresults_08.dat
[08:31:47] - Error: Could not read unit 08 file. Removing from queue.
[08:31:47] - Preparing to get new work unit...
[08:31:47] + Attempting to get work packet
[08:31:47] - Connecting to assignment server
[08:31:48] - Successful: assigned to (171.67.108.11).
[08:31:48] + News From Folding@Home: Welcome to Folding@Home
[08:31:48] Loaded queue successfully.
[08:31:50] + Closed connections
[08:31:55]
[08:31:55] + Processing work unit
[08:31:55] Core required: FahCore_11.exe
[08:31:55] Core found.
[08:31:55] Working on queue slot 09 [December 24 08:31:55 UTC]
[08:31:55] + Working ...
[08:31:55]
[08:31:55] *------------------------------*
[08:31:55] Folding@Home GPU Core
[08:31:55] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[08:31:55]
[08:31:55] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[08:31:55] Build host: amoeba
[08:31:55] Board Type: Nvidia
[08:31:55] Core      :
[08:31:55] Preparing to commence simulation
[08:31:55] - Looking at optimizations...
[08:31:55] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[08:31:55] - Created dyn
[08:31:55] - Files status OK
[08:31:55] - Expanded 46719 -> 252912 (decompressed 541.3 percent)
[08:31:55] Called DecompressByteArray: compressed_data_size=46719 data_size=252912, decompressed_data_size=252912 diff=0
[08:31:55] - Digital signature verified
[08:31:55]
[08:31:55] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:31:55]
[08:31:55] Assembly optimizations on if available.
[08:31:55] Entering M.D.
[08:32:01] Tpr hash work/wudata_09.tpr:  85820465 1062214972 2158366696 3621451030 280862858
[08:32:01]
[08:32:01] Calling fah_main args: 14 usage=100
[08:32:01]
[08:32:01] Working on Protein
[08:32:02] mdrun_gpu returned
[08:32:02] Self-test failure
[08:32:02]
[08:32:02] Folding@home Core Shutdown: UNSTABLE_MACHINE
[08:32:05] CoreStatus = 7A (122)
[08:32:05] Sending work to server
[08:32:05] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:32:05] - Read packet limit of 540015616... Set to 524286976.
[08:32:05] - Error: Could not get length of results file work/wuresults_09.dat
[08:32:05] - Error: Could not read unit 09 file. Removing from queue.
[08:32:05] - Preparing to get new work unit...
[08:32:05] + Attempting to get work packet
[08:32:05] - Connecting to assignment server
[08:32:06] - Successful: assigned to (171.67.108.11).
[08:32:06] + News From Folding@Home: Welcome to Folding@Home
[08:32:06] Loaded queue successfully.
[08:32:08] + Closed connections
[08:32:13]
[08:32:13] + Processing work unit
[08:32:13] Core required: FahCore_11.exe
[08:32:13] Core found.
[08:32:13] Working on queue slot 00 [December 24 08:32:13 UTC]
[08:32:13] + Working ...
[08:32:13]
[08:32:13] *------------------------------*
[08:32:13] Folding@Home GPU Core
[08:32:13] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[08:32:13]
[08:32:13] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[08:32:13] Build host: amoeba
[08:32:13] Board Type: Nvidia
[08:32:13] Core      :
[08:32:13] Preparing to commence simulation
[08:32:13] - Looking at optimizations...
[08:32:13] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[08:32:13] - Created dyn
[08:32:13] - Files status OK
[08:32:13] - Expanded 46719 -> 252912 (decompressed 541.3 percent)
[08:32:13] Called DecompressByteArray: compressed_data_size=46719 data_size=252912, decompressed_data_size=252912 diff=0
[08:32:13] - Digital signature verified
[08:32:13]
[08:32:13] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:32:13]
[08:32:13] Assembly optimizations on if available.
[08:32:13] Entering M.D.
[08:32:19] Tpr hash work/wudata_00.tpr:  85820465 1062214972 2158366696 3621451030 280862858
[08:32:19]
[08:32:19] Calling fah_main args: 14 usage=100
[08:32:19]
[08:32:19] Working on Protein
[08:32:20] mdrun_gpu returned
[08:32:20] Self-test failure
[08:32:20]
[08:32:20] Folding@home Core Shutdown: UNSTABLE_MACHINE
[08:32:23] CoreStatus = 7A (122)
[08:32:23] Sending work to server
[08:32:23] Project: 5766 (Run 12, Clone 363, Gen 571)
[08:32:23] - Read packet limit of 540015616... Set to 524286976.
[08:32:23] - Error: Could not get length of results file work/wuresults_00.dat
[08:32:23] - Error: Could not read unit 00 file. Removing from queue.
[08:32:23] EUE limit exceeded. Pausing 24 hours.

Folding@Home Client Shutdown.


--- Opening Log file [December 24 10:26:08 UTC]


# Windows GPU Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\jona\AppData\Roaming\Folding@home-gpu


[10:26:08] - Ask before connecting: No
[10:26:08] - User name: powerarmour (Team 33432)
[10:26:08] - User ID: 5E336FA52B23ED2C
[10:26:08] - Machine ID: 2
[10:26:08]
[10:26:09] Loaded queue successfully.
[10:26:09] Initialization complete
[10:26:09] - Preparing to get new work unit...
[10:26:09] + Attempting to get work packet
[10:26:09] - Connecting to assignment server
[10:26:09] - Successful: assigned to (171.67.108.11).
[10:26:09] + News From Folding@Home: Welcome to Folding@Home
[10:26:10] Loaded queue successfully.
[10:26:11] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[10:26:20] + Attempting to get work packet
[10:26:20] - Connecting to assignment server
[10:26:21] - Successful: assigned to (171.67.108.11).
[10:26:21] + News From Folding@Home: Welcome to Folding@Home
[10:26:21] Loaded queue successfully.
[10:26:23] + Closed connections
[10:26:23]
[10:26:23] + Processing work unit
[10:26:23] Core required: FahCore_11.exe
[10:26:23] Core found.
[10:26:23] Working on queue slot 01 [December 24 10:26:23 UTC]
[10:26:23] + Working ...
[10:26:23]
[10:26:23] *------------------------------*
[10:26:23] Folding@Home GPU Core
[10:26:23] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[10:26:23]
[10:26:23] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[10:26:23] Build host: amoeba
[10:26:23] Board Type: Nvidia
[10:26:23] Core      :
[10:26:23] Preparing to commence simulation
[10:26:23] - Looking at optimizations...
[10:26:23] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[10:26:23] - Created dyn
[10:26:23] - Files status OK
[10:26:23] - Expanded 46626 -> 252912 (decompressed 542.4 percent)
[10:26:23] Called DecompressByteArray: compressed_data_size=46626 data_size=252912, decompressed_data_size=252912 diff=0
[10:26:23] - Digital signature verified
[10:26:23]
[10:26:23] Project: 5768 (Run 10, Clone 57, Gen 879)
[10:26:23]
[10:26:23] Assembly optimizations on if available.
[10:26:23] Entering M.D.
[10:26:29] Tpr hash work/wudata_01.tpr:  3154682942 3086700535 360642085 3202961865 2871004127
[10:26:29]
[10:26:29] Calling fah_main args: 14 usage=100
[10:26:29]
[10:26:29] Working on Protein
[10:26:30] Client config found, loading data.
[10:26:30] mdrun_gpu returned
[10:26:30] NANs detected on GPU
[10:26:30]
[10:26:30] Folding@home Core Shutdown: UNSTABLE_MACHINE
[10:26:30] Starting GUI Server
[10:26:33] CoreStatus = 7A (122)
[10:26:33] Sending work to server
[10:26:33] Project: 5768 (Run 10, Clone 57, Gen 879)
[10:26:33] - Read packet limit of 540015616... Set to 524286976.
[10:26:33] - Error: Could not get length of results file work/wuresults_01.dat
[10:26:33] - Error: Could not read unit 01 file. Removing from queue.
[10:26:33] - Preparing to get new work unit...
[10:26:33] + Attempting to get work packet
[10:26:33] - Connecting to assignment server
[10:26:34] - Successful: assigned to (171.67.108.11).
[10:26:34] + News From Folding@Home: Welcome to Folding@Home
[10:26:34] Loaded queue successfully.
[10:26:36] + Closed connections
[10:26:41]
[10:26:41] + Processing work unit
[10:26:41] Core required: FahCore_11.exe
[10:26:41] Core found.
[10:26:41] Working on queue slot 02 [December 24 10:26:41 UTC]
[10:26:41] + Working ...
[10:26:41]
[10:26:41] *------------------------------*
[10:26:41] Folding@Home GPU Core
[10:26:41] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[10:26:41]
[10:26:41] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[10:26:41] Build host: amoeba
[10:26:41] Board Type: Nvidia
[10:26:41] Core      :
[10:26:41] Preparing to commence simulation
[10:26:41] - Looking at optimizations...
[10:26:41] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[10:26:41] - Created dyn
[10:26:41] - Files status OK
[10:26:41] - Expanded 46626 -> 252912 (decompressed 542.4 percent)
[10:26:41] Called DecompressByteArray: compressed_data_size=46626 data_size=252912, decompressed_data_size=252912 diff=0
[10:26:41] - Digital signature verified
[10:26:41]
[10:26:41] Project: 5768 (Run 10, Clone 57, Gen 879)
[10:26:41]
[10:26:41] Assembly optimizations on if available.
[10:26:41] Entering M.D.
[10:26:47] Tpr hash work/wudata_02.tpr:  3154682942 3086700535 360642085 3202961865 2871004127
[10:26:47]
[10:26:47] Calling fah_main args: 14 usage=100
[10:26:47]
[10:26:48] Working on Protein
[10:26:49] Client config found, loading data.
[10:26:49] mdrun_gpu returned
[10:26:49] NANs detected on GPU
[10:26:49]
[10:26:49] Folding@home Core Shutdown: UNSTABLE_MACHINE
[10:26:51] CoreStatus = 7A (122)
[10:26:51] Sending work to server
[10:26:51] Project: 5768 (Run 10, Clone 57, Gen 879)
[10:26:51] - Read packet limit of 540015616... Set to 524286976.
[10:26:51] - Error: Could not get length of results file work/wuresults_02.dat
[10:26:51] - Error: Could not read unit 02 file. Removing from queue.
[10:26:51] - Preparing to get new work unit...
[10:26:51] + Attempting to get work packet
[10:26:51] - Connecting to assignment server
[10:26:53] - Successful: assigned to (171.67.108.11).
[10:26:53] + News From Folding@Home: Welcome to Folding@Home
[10:26:53] Loaded queue successfully.
[10:26:55] + Closed connections
[10:27:00]
[10:27:00] + Processing work unit
[10:27:00] Core required: FahCore_11.exe
[10:27:00] Core found.
[10:27:00] Working on queue slot 03 [December 24 10:27:00 UTC]
[10:27:00] + Working ...
[10:27:00]
[10:27:00] *------------------------------*
[10:27:00] Folding@Home GPU Core
[10:27:00] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[10:27:00]
[10:27:00] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[10:27:00] Build host: amoeba
[10:27:00] Board Type: Nvidia
[10:27:00] Core      :
[10:27:00] Preparing to commence simulation
[10:27:00] - Looking at optimizations...
[10:27:00] DeleteFrameFiles: successfully deleted file=work/wudata_03.ckp
[10:27:00] - Created dyn
[10:27:00] - Files status OK
[10:27:00] - Expanded 46626 -> 252912 (decompressed 542.4 percent)
[10:27:00] Called DecompressByteArray: compressed_data_size=46626 data_size=252912, decompressed_data_size=252912 diff=0
[10:27:00] - Digital signature verified
[10:27:00]
[10:27:00] Project: 5768 (Run 10, Clone 57, Gen 879)
[10:27:00]
[10:27:00] Assembly optimizations on if available.
[10:27:00] Entering M.D.
[10:27:06] Tpr hash work/wudata_03.tpr:  3154682942 3086700535 360642085 3202961865 2871004127
[10:27:06]
[10:27:06] Calling fah_main args: 14 usage=100
[10:27:06]
[10:27:07] Working on Protein
[10:27:08] Client config found, loading data.
[10:27:08] mdrun_gpu returned
[10:27:08] NANs detected on GPU
[10:27:08]
[10:27:08] Folding@home Core Shutdown: UNSTABLE_MACHINE


It was fine previously with the older version (v1.19, I think), and anything that uses FahCore 13 also runs perfectly fine; it's just these FahCore 11 projects.
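
For anyone wading through a long log like the one above, here is a rough Python sketch that tallies core shutdowns per project and per reported reason. It relies only on the line formats visible in this log (`Project: ...`, `Self-test failure`, `NANs detected on GPU`, `Folding@home Core Shutdown: ...`); the function name is made up, and other core versions may word these lines differently.

```python
import re
from collections import Counter

PROJECT_RE = re.compile(r"Project: (\d+) \(Run \d+, Clone \d+, Gen \d+\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")
# Reason strings as they appear in the log above; others are counted as "unknown".
REASONS = ("Self-test failure", "NANs detected on GPU")

def summarize_shutdowns(log_text):
    """Count (project, reason, shutdown status) triples in a client log."""
    counts = Counter()
    project = None
    reason = "unknown"
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            project = int(m.group(1))
            continue
        for text in REASONS:
            if text in line:
                reason = text
        m = SHUTDOWN_RE.search(line)
        if m:
            # Every shutdown is counted, including successful ones.
            counts[(project, reason, m.group(1))] += 1
            reason = "unknown"
    return counts

if __name__ == "__main__":
    import sys
    # Usage (file name is an example): python summarize.py FAHlog.txt
    with open(sys.argv[1]) as f:
        for key, n in sorted(summarize_shutdowns(f.read()).items(), key=str):
            print(n, key)
```

Separating "Self-test failure" from "NANs detected on GPU" per project makes it easier to see whether the failures follow the project or the machine.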
powerarmour
 
Posts: 169
Joined: Wed Oct 29, 2008 1:00 am
Location: Surrey, UK

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby toTOW » Thu Dec 24, 2009 11:36 am

Is your card overclocked and sufficiently powered?

On my 9800 GTX+, which has now gone for RMA, self-test failures and NaN errors were the sign of a dying board :(
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

toTOW
Site Moderator
 
Posts: 8782
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby powerarmour » Thu Dec 24, 2009 12:57 pm

toTOW wrote: Is your card overclocked and sufficiently powered?

On my 9800 GTX+ that's now gone for RMA, Self test failure and NaNs errors were the sign of a dying board :(


It's brand new: only about three weeks old, factory stock, nothing overclocked.

I only started getting the problem with the new v1.31 FahCore 11, after manually deleting my older FahCore 11 the other day. Prior to that it was running fine 24/7... ;)

The FahCore 13 projects, like the 1888-pointers, still work OK too; I just haven't had many of those recently.


It would be annoying if these had killed the card already, though I suppose it can't be ruled out... :(
powerarmour
 
Posts: 169
Joined: Wed Oct 29, 2008 1:00 am
Location: Surrey, UK

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby powerarmour » Thu Dec 24, 2009 1:10 pm

Quick update: the p5794 787-pointers actually work OK on FahCore 11 v1.31, so it seems it's just the lower-point WUs that won't run... hmm.

powerarmour
 
Posts: 169
Joined: Wed Oct 29, 2008 1:00 am
Location: Surrey, UK

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby road-runner » Thu Dec 24, 2009 4:37 pm

I keep getting this error on an 8800GT...

Code:
[16:32:29] Project: 5765 (Run 3, Clone 100, Gen 477)
[16:32:29]
[16:32:29] Assembly optimizations on if available.
[16:32:29] Entering M.D.
[16:32:35] Tpr hash work/wudata_02.tpr:  385980687 2289837096 488673022 1335760962 410090359
[16:32:35]
[16:32:35] Calling fah_main args: 14 usage=100
[16:32:35]
[16:32:35] Working on Protein
[16:32:36] Run: exception thrown during GuardedRun
[16:32:36] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[16:32:36] Going to send back what have done -- stepsTotalG=0
[16:32:36] Work fraction=0.0000 steps=0.
[16:32:39] logfile size=0 infoLength=0 edr=0 trr=23
[16:32:39] + Opened results file
[16:32:39] - Writing 635 bytes of core data to disk...
[16:32:39] Done: 123 -> 124 (compressed to 100.8 percent)
[16:32:39]   ... Done.
[16:32:39] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[16:32:39]
[16:32:39] Folding@home Core Shutdown: UNSTABLE_MACHINE
road-runner
 
Posts: 466
Joined: Sun Dec 02, 2007 4:01 am
Location: Houston, Texas

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby fomoco » Sun Jan 10, 2010 11:06 pm

Glad it's not just me! I've been pulling my hair out over this issue.

PROJECTS: (these are only the most recent ones since I don't back up log files)
Project: 5765 (Run 10, Clone 218, Gen 1270) 0%
Project: 5769 (Run 8, Clone 145, Gen 1235) 36%
Project: 5772 (Run 0, Clone 274, Gen 709) 3%
Project: 5769 (Run 8, Clone 333, Gen 842) 0%
Project: 5767 (Run 14, Clone 169, Gen 835) 0%
Project: 5765 (Run 5, Clone 320, Gen 684) 0%
Project: 5769 (Run 3, Clone 245, Gen 2072) 1%
Project: 5765 (Run 4, Clone 251, Gen 1484) 7%
Project: 5768 (Run 5, Clone 99, Gen 1683) 0%
Project: 5768 (Run 9, Clone 102, Gen 823) 0%
Project: 5769 (Run 4, Clone 208, Gen 2147) 0%
Project: 5767 (Run 0, Clone 133, Gen 1013) 8%
Project: 5914 (Run 10, Clone 861, Gen 8) 0%
Project: 5768 (Run 7, Clone 190, Gen 1772) 2%
Project: 5771 (Run 6, Clone 54, Gen 917) 4%
Project: 5914 (Run 7, Clone 704, Gen 62) 16%
Project: 5767 (Run 8, Clone 37, Gen 742) 6%
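
A list in this form can be grouped by project with a few lines of Python, for example to see how many units died right at 0%. This is just a sketch: the function name is made up, and it assumes each line ends with the percentage at which the unit failed, as in the list above.

```python
import re
from collections import defaultdict

# Matches lines such as "Project: 5765 (Run 10, Clone 218, Gen 1270) 0%"
LINE_RE = re.compile(r"Project: (\d+) \(Run \d+, Clone \d+, Gen \d+\) (\d+)%")

def tally_failures(report_text):
    """Return {project: [percent_at_failure, ...]} for each reported WU."""
    by_project = defaultdict(list)
    for line in report_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            by_project[int(m.group(1))].append(int(m.group(2)))
    return dict(by_project)

if __name__ == "__main__":
    sample = """\
Project: 5765 (Run 10, Clone 218, Gen 1270) 0%
Project: 5769 (Run 8, Clone 145, Gen 1235) 36%
Project: 5765 (Run 5, Clone 320, Gen 684) 0%
"""
    for project, percents in sorted(tally_failures(sample).items()):
        at_start = sum(1 for p in percents if p == 0)
        print(project, len(percents), "failures,", at_start, "at 0%")
```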


GPU: 8800 GT 512 MB clocked at 500/800/1620 (core/mem/sp) for folding; I later lowered it to 500/800/1512 to try to stop the errors

OS: Vista Ultimate x64

DRIVER: 191.07

EDIT: Come to think of it, I didn't have any problems before upgrading to the 191 driver. I had been running 178.xx (or something close to that) for a long time, then upgraded to the newest drivers when I last reinstalled Windows. Dropping the shader clock also seems to have solved it, so I guess the newer drivers can't handle overclocking and folding like the old ones could :?
kickin' cancer's *** since 2006
x2 4000+ @ 2.84ghz, biostar TA770 A2+, 2gb crucial ballistix 667 @ 804 4-4-4-12-1t 2.1v,
250gb 7200.10, 8800gt @ 625/1732/900, corsair 620w
fomoco
 
Posts: 7
Joined: Wed Jan 16, 2008 6:08 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby efok » Tue Jan 12, 2010 9:32 pm

The new 10101 (Run 337, Clone 9, Gen 30) keeps crashing my 8600GT with 256 MB of video memory.

It leads to a BSOD with nVidia driver version 195.62 on Windows 7 x64.
efok
 
Posts: 2
Joined: Wed Nov 18, 2009 8:04 pm

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby BuddhaChu » Mon Jan 18, 2010 5:27 pm

Failed today:

Project: 5772 (Run 6, Clone 182, Gen 671)
OS: Windows 7 Pro
Card: 8800 GT
Drivers: 195.55

Notes: Core 14 worked (works) great. Only now getting failed WUs on the new core 11. Restarted the client and it's now working on a core 11 WU (5768 (Run 3, Clone 150, Gen 631)) and has completed 12%.

Log (starts with a WU that worked fine and ends with the core 11 WU mentioned above that's now working with the 5772 failures in the middle)

Code:
[06:17:08] Loaded queue successfully.
[06:17:08] Initialization complete
[06:17:08]
[06:17:08] + Processing work unit
[06:17:08] Core required: FahCore_14.exe
[06:17:08] Core found.
[06:17:08] Working on queue slot 07 [January 18 06:17:08 UTC]
[06:17:08] + Working ...
[06:17:08]
[06:17:08] *------------------------------*
[06:17:08] Folding@Home GPU Core - Beta
[06:17:08] Version 1.26 (Wed Oct 14 13:09:26 PDT 2009)
[06:17:08]
[06:17:08] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[06:17:08] Build host: vspm46
[06:17:08] Board Type: Nvidia
[06:17:08] Core      :
[06:17:08] Preparing to commence simulation
[06:17:08] - Looking at optimizations...
[06:17:08] - Files status OK
[06:17:08] - Expanded 68692 -> 357580 (decompressed 520.5 percent)
[06:17:08] Called DecompressByteArray: compressed_data_size=68692 data_size=357580, decompressed_data_size=357580 diff=0
[06:17:08] - Digital signature verified
[06:17:08]
[06:17:08] Project: 5913 (Run 9, Clone 711, Gen 54)
[06:17:08]
[06:17:08] Assembly optimizations on if available.
[06:17:08] Entering M.D.
[06:17:14] Will resume from checkpoint file
[06:17:14] Tpr hash work/wudata_07.tpr:  1993501836 1752422515 4101791051 2710726491 416600783
[06:17:15] Working on Protein
[06:17:16] Client config found, loading data.
[06:17:16] Resuming from checkpoint
[06:17:16] fcCheckPointResume: retrieved and current tpr file hash:
[06:17:16]    0   1993501836   1993501836
[06:17:16]    1   1752422515   1752422515
[06:17:16]    2   4101791051   4101791051
[06:17:16]    3   2710726491   2710726491
[06:17:16]    4    416600783    416600783
[06:17:16] Verified work/wudata_07.log
[06:17:16] Verified work/wudata_07.edr
[06:17:16] Verified work/wudata_07.xtc
[06:17:16] Completed 3%
[06:17:16] Starting GUI Server
[06:22:55] Completed 4%
[06:29:00] Completed 5%
[06:35:36] Completed 6%
[06:42:03] Completed 7%
[06:48:58] Completed 8%
[06:55:19] Completed 9%
[07:01:45] Completed 10%
[07:08:03] Completed 11%
[07:14:17] Completed 12%
[07:20:38] Completed 13%
[07:26:51] Completed 14%
[07:33:52] Completed 15%
[07:40:07] Completed 16%
[07:46:01] Completed 17%
[07:52:16] Completed 18%
[07:58:23] Completed 19%
[08:04:37] Completed 20%
[08:10:45] Completed 21%
[08:16:59] Completed 22%
[08:23:20] Completed 23%
[08:29:34] Completed 24%
[08:35:36] Completed 25%
[08:41:43] Completed 26%
[08:47:51] Completed 27%
[08:54:23] Completed 28%
[09:00:31] Completed 29%
[09:06:38] Completed 30%
[09:12:46] Completed 31%
[09:18:48] Completed 32%
[09:25:03] Completed 33%
[09:31:12] Completed 34%
[09:37:32] Completed 35%
[09:43:51] Completed 36%
[09:49:59] Completed 37%
[09:56:08] Completed 38%
[10:02:16] Completed 39%
[10:08:25] Completed 40%
[10:14:40] Completed 41%
[10:20:42] Completed 42%
[10:27:01] Completed 43%
[10:33:01] Completed 44%
[10:39:08] Completed 45%
[10:45:26] Completed 46%
[10:51:27] Completed 47%
[10:57:27] Completed 48%
[11:03:26] Completed 49%
[11:09:47] Completed 50%
[11:16:04] Completed 51%
[11:22:16] Completed 52%
[11:28:28] Completed 53%
[11:34:16] Completed 54%
[11:40:14] Completed 55%
[11:46:39] Completed 56%
[11:52:45] Completed 57%
[11:59:01] Completed 58%
[12:05:17] Completed 59%
[12:11:13] Completed 60%
[12:17:08] + Working...
[12:17:21] Completed 61%
[12:23:24] Completed 62%
[12:29:36] Completed 63%
[12:35:56] Completed 64%
[12:42:05] Completed 65%
[12:48:05] Completed 66%
[12:54:13] Completed 67%
[13:00:34] Completed 68%
[13:06:43] Completed 69%
[13:12:44] Completed 70%
[13:18:41] Completed 71%
[13:24:53] Completed 72%
[13:31:02] Completed 73%
[13:37:17] Completed 74%
[13:43:40] Completed 75%
[13:49:52] Completed 76%
[13:56:12] Completed 77%
[14:02:24] Completed 78%
[14:08:21] Completed 79%
[14:14:30] Completed 80%
[14:20:56] Completed 81%
[14:27:19] Completed 82%
[14:33:46] Completed 83%
[14:40:01] Completed 84%
[14:46:13] Completed 85%
[14:52:22] Completed 86%
[14:58:26] Completed 87%
[15:04:46] Completed 88%
[15:11:01] Completed 89%
[15:17:16] Completed 90%
[15:22:19] Completed 91%
[15:28:29] Completed 92%
[15:34:32] Completed 93%
[15:41:00] Completed 94%
[15:46:59] Completed 95%
[15:52:43] Completed 96%
[15:58:14] Completed 97%
[16:04:53] Completed 98%
[16:11:30] Completed 99%
[16:18:03] Completed 100%
[16:18:04] Successful run
[16:18:04] DynamicWrapper: Finished Work Unit: sleep=10000
[16:18:14] Reserved 34164 bytes for xtc file; Cosm status=0
[16:18:14] Allocated 34164 bytes for xtc file
[16:18:14] - Reading up to 34164 from "work/wudata_07.xtc": Read 34164
[16:18:14] Read 34164 bytes from xtc file; available packet space=786396300
[16:18:14] xtc file hash check passed.
[16:18:14] Reserved 23472 23472 786396300 bytes for arc file=<work/wudata_07.trr> Cosm status=0
[16:18:14] Allocated 23472 bytes for arc file
[16:18:14] - Reading up to 23472 from "work/wudata_07.trr": Read 23472
[16:18:14] Read 23472 bytes from arc file; available packet space=786372828
[16:18:14] trr file hash check passed.
[16:18:14] Allocated 560 bytes for edr file
[16:18:14] Read bedfile
[16:18:14] edr file hash check passed.
[16:18:14] Allocated 52075 bytes for logfile
[16:18:14] Read logfile
[16:18:14] GuardedRun: success in DynamicWrapper
[16:18:14] GuardedRun: done
[16:18:14] Run: GuardedRun completed.
[16:18:18] - Writing 110783 bytes of core data to disk...
[16:18:18] Done: 110271 -> 65498 (compressed to 59.3 percent)
[16:18:18]   ... Done.
[16:18:19] - Shutting down core
[16:18:19]
[16:18:19] Folding@home Core Shutdown: FINISHED_UNIT
[16:18:21] CoreStatus = 64 (100)
[16:18:21] Sending work to server
[16:18:21] Project: 5913 (Run 9, Clone 711, Gen 54)


[16:18:21] + Attempting to send results [January 18 16:18:21 UTC]
[16:18:22] + Results successfully sent
[16:18:22] Thank you for your contribution to Folding@Home.
[16:18:22] + Starting local stats count at 1
[16:18:26] - Preparing to get new work unit...
[16:18:26] + Attempting to get work packet
[16:18:26] - Connecting to assignment server
[16:18:27] - Successful: assigned to (171.67.108.11).
[16:18:27] + News From Folding@Home: Welcome to Folding@Home
[16:18:27] Loaded queue successfully.
[16:18:28] + Closed connections
[16:18:28]
[16:18:28] + Processing work unit
[16:18:28] Core required: FahCore_11.exe
[16:18:28] Core found.
[16:18:28] Working on queue slot 08 [January 18 16:18:28 UTC]
[16:18:28] + Working ...
[16:18:28]
[16:18:28] *------------------------------*
[16:18:28] Folding@Home GPU Core
[16:18:28] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[16:18:28]
[16:18:28] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[16:18:28] Build host: amoeba
[16:18:28] Board Type: Nvidia
[16:18:28] Core      :
[16:18:28] Preparing to commence simulation
[16:18:28] - Looking at optimizations...
[16:18:28] DeleteFrameFiles: successfully deleted file=work/wudata_08.ckp
[16:18:28] - Created dyn
[16:18:28] - Files status OK
[16:18:28] - Expanded 45694 -> 251112 (decompressed 549.5 percent)
[16:18:28] Called DecompressByteArray: compressed_data_size=45694 data_size=251112, decompressed_data_size=251112 diff=0
[16:18:28] - Digital signature verified
[16:18:28]
[16:18:28] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:28]
[16:18:28] Assembly optimizations on if available.
[16:18:28] Entering M.D.
[16:18:34] Tpr hash work/wudata_08.tpr:  3566180424 1664061760 566461962 121813205 101343748
[16:18:34]
[16:18:34] Calling fah_main args: 14 usage=100
[16:18:34]
[16:18:34] Working on Protein
[16:18:34] mdrun_gpu returned
[16:18:34] Self-test failure
[16:18:34]
[16:18:34] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:18:38] CoreStatus = 7A (122)
[16:18:38] Sending work to server
[16:18:38] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:38] - Error: Could not get length of results file work/wuresults_08.dat
[16:18:38] - Error: Could not read unit 08 file. Removing from queue.
[16:18:38] - Preparing to get new work unit...
[16:18:38] + Attempting to get work packet
[16:18:38] - Connecting to assignment server
[16:18:38] - Successful: assigned to (171.67.108.11).
[16:18:38] + News From Folding@Home: Welcome to Folding@Home
[16:18:39] Loaded queue successfully.
[16:18:39] + Closed connections
[16:18:44]
[16:18:44] + Processing work unit
[16:18:44] Core required: FahCore_11.exe
[16:18:44] Core found.
[16:18:44] Working on queue slot 09 [January 18 16:18:44 UTC]
[16:18:44] + Working ...
[16:18:44]
[16:18:44] *------------------------------*
[16:18:44] Folding@Home GPU Core
[16:18:44] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[16:18:44]
[16:18:44] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[16:18:44] Build host: amoeba
[16:18:44] Board Type: Nvidia
[16:18:44] Core      :
[16:18:44] Preparing to commence simulation
[16:18:44] - Looking at optimizations...
[16:18:44] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[16:18:44] - Created dyn
[16:18:44] - Files status OK
[16:18:44] - Expanded 45694 -> 251112 (decompressed 549.5 percent)
[16:18:44] Called DecompressByteArray: compressed_data_size=45694 data_size=251112, decompressed_data_size=251112 diff=0
[16:18:44] - Digital signature verified
[16:18:44]
[16:18:44] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:44]
[16:18:44] Assembly optimizations on if available.
[16:18:44] Entering M.D.
[16:18:50] Tpr hash work/wudata_09.tpr:  3566180424 1664061760 566461962 121813205 101343748
[16:18:50]
[16:18:50] Calling fah_main args: 14 usage=100
[16:18:50]
[16:18:51] Working on Protein
[16:18:51] mdrun_gpu returned
[16:18:51] Self-test failure
[16:18:51]
[16:18:51] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:18:54] CoreStatus = 7A (122)
[16:18:54] Sending work to server
[16:18:54] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:54] - Error: Could not get length of results file work/wuresults_09.dat
[16:18:54] - Error: Could not read unit 09 file. Removing from queue.
[16:18:54] - Preparing to get new work unit...
[16:18:54] + Attempting to get work packet
[16:18:54] - Connecting to assignment server
[16:18:55] - Successful: assigned to (171.67.108.11).
[16:18:55] + News From Folding@Home: Welcome to Folding@Home
[16:18:55] Loaded queue successfully.
[16:18:56] + Closed connections
[16:19:01]
[16:19:01] + Processing work unit
[16:19:01] Core required: FahCore_11.exe
[16:19:01] Core found.
[16:19:01] Working on queue slot 00 [January 18 16:19:01 UTC]
[16:19:01] + Working ...
[16:19:01]
[16:19:01] *------------------------------*
[16:19:01] Folding@Home GPU Core
[16:19:01] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[16:19:01]
[16:19:01] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[16:19:01] Build host: amoeba
[16:19:01] Board Type: Nvidia
[16:19:01] Core      :
[16:19:01] Preparing to commence simulation
[16:19:01] - Looking at optimizations...
[16:19:01] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[16:19:01] - Created dyn
[16:19:01] - Files status OK
[16:19:01] - Expanded 45694 -> 251112 (decompressed 549.5 percent)
[16:19:01] Called DecompressByteArray: compressed_data_size=45694 data_size=251112, decompressed_data_size=251112 diff=0
[16:19:01] - Digital signature verified
[16:19:01]
[16:19:01] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:19:01]
[16:19:01] Assembly optimizations on if available.
[16:19:01] Entering M.D.
[16:19:07] Tpr hash work/wudata_00.tpr:  3566180424 1664061760 566461962 121813205 101343748
[16:19:07]
[16:19:07] Calling fah_main args: 14 usage=100
[16:19:07]
[16:19:07] Working on Protein
[16:19:07] mdrun_gpu returned
[16:19:07] Self-test failure
[16:19:07]
[16:19:07] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:19:11] CoreStatus = 7A (122)
[16:19:11] Sending work to server
[16:19:11] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:19:11] - Error: Could not get length of results file work/wuresults_00.dat
[16:19:11] - Error: Could not read unit 00 file. Removing from queue.
[16:19:11] - Preparing to get new work unit...
[16:19:11] + Attempting to get work packet
[16:19:11] - Connecting to assignment server
[16:19:11] - Successful: assigned to (171.67.108.11).
[16:19:11] + News From Folding@Home: Welcome to Folding@Home
[16:19:11] Loaded queue successfully.
[16:19:13] + Closed connections
[16:19:18]
[16:19:18] + Processing work unit
[16:19:18] Core required: FahCore_11.exe
[16:19:18] Core found.
[16:19:18] Working on queue slot 01 [January 18 16:19:18 UTC]
[16:19:18] + Working ...
[16:19:18]
[16:19:18] *------------------------------*
[16:19:18] Folding@Home GPU Core
[16:19:18] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[16:19:18]
[16:19:18] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[16:19:18] Build host: amoeba
[16:19:18] Board Type: Nvidia
[16:19:18] Core      :
[16:19:18] Preparing to commence simulation
[16:19:18] - Looking at optimizations...
[16:19:18] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[16:19:18] - Created dyn
[16:19:18] - Files status OK
[16:19:18] - Expanded 45694 -> 251112 (decompressed 549.5 percent)
[16:19:18] Called DecompressByteArray: compressed_data_size=45694 data_size=251112, decompressed_data_size=251112 diff=0
[16:19:18] - Digital signature verified
[16:19:18]
[16:19:18] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:19:18]
[16:19:18] Assembly optimizations on if available.
[16:19:18] Entering M.D.
[16:19:24] Tpr hash work/wudata_01.tpr:  3566180424 1664061760 566461962 121813205 101343748
[16:19:24]
[16:19:24] Calling fah_main args: 14 usage=100
[16:19:24]
[16:19:24] Working on Protein
[16:19:24] mdrun_gpu returned
[16:19:24] Self-test failure
[16:19:24]
[16:19:24] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:19:28] CoreStatus = 7A (122)
[16:19:28] Sending work to server
[16:19:28] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:19:28] - Error: Could not get length of results file work/wuresults_01.dat
[16:19:28] - Error: Could not read unit 01 file. Removing from queue.
[16:19:28] - Preparing to get new work unit...
[16:19:28] + Attempting to get work packet
[16:19:28] - Connecting to assignment server
[16:19:28] - Successful: assigned to (171.67.108.11).
[16:19:28] + News From Folding@Home: Welcome to Folding@Home
[16:19:28] Loaded queue successfully.
[16:19:29] + Closed connections
[16:19:34]
[16:19:34] + Processing work unit
[16:19:34] Core required: FahCore_11.exe
[16:19:34] Core found.
[16:19:34] Working on queue slot 02 [January 18 16:19:34 UTC]
[16:19:34] + Working ...
[16:19:34]
[16:19:34] *------------------------------*
[16:19:34] Folding@Home GPU Core
[16:19:34] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[16:19:34]
[16:19:34] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[16:19:34] Build host: amoeba
[16:19:34] Board Type: Nvidia
[16:19:34] Core      :
[16:19:34] Preparing to commence simulation
[16:19:34] - Looking at optimizations...
[16:19:34] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[16:19:34] - Created dyn
[16:19:34] - Files status OK
[16:19:34] - Expanded 45694 -> 251112 (decompressed 549.5 percent)
[16:19:34] Called DecompressByteArray: compressed_data_size=45694 data_size=251112, decompressed_data_size=251112 diff=0
[16:19:34] - Digital signature verified
[16:19:34]
[16:19:34] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:19:34]
[16:19:34] Assembly optimizations on if available.
[16:19:34] Entering M.D.
[16:19:40] Tpr hash work/wudata_02.tpr:  3566180424 1664061760 566461962 121813205 101343748
[16:19:40]
[16:19:40] Calling fah_main args: 14 usage=100
[16:19:40]
[16:19:41] Working on Protein
[16:19:41] mdrun_gpu returned
[16:19:41] Self-test failure
[16:19:41]
[16:19:41] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:19:44] CoreStatus = 7A (122)
[16:19:44] Sending work to server
[16:19:44] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:19:44] - Error: Could not get length of results file work/wuresults_02.dat
[16:19:44] - Error: Could not read unit 02 file. Removing from queue.
[16:19:44] EUE limit exceeded. Pausing 24 hours.

Folding@Home Client Shutdown.


--- Opening Log file [January 18 17:06:18 UTC]


# Windows GPU Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\F@H-GPU


[17:06:18] - Ask before connecting: No
[17:06:18] - User name: BuddhaMan (Team 14)
[17:06:18] - User ID: 7C074F461B0D4453
[17:06:18] - Machine ID: 2
[17:06:18]
[17:06:18] Loaded queue successfully.
[17:06:18] Initialization complete
[17:06:18] - Preparing to get new work unit...
[17:06:18] + Attempting to get work packet
[17:06:18] - Connecting to assignment server
[17:06:19] - Successful: assigned to (171.67.108.11).
[17:06:19] + News From Folding@Home: Welcome to Folding@Home
[17:06:19] Loaded queue successfully.
[17:06:20] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[17:06:37] + Attempting to get work packet
[17:06:37] - Connecting to assignment server
[17:06:38] - Successful: assigned to (171.67.108.11).
[17:06:38] + News From Folding@Home: Welcome to Folding@Home
[17:06:38] Loaded queue successfully.
[17:06:39] + Closed connections
[17:06:39]
[17:06:39] + Processing work unit
[17:06:39] Core required: FahCore_11.exe
[17:06:39] Core found.
[17:06:39] Working on queue slot 03 [January 18 17:06:39 UTC]
[17:06:39] + Working ...
[17:06:39]
[17:06:39] *------------------------------*
[17:06:39] Folding@Home GPU Core
[17:06:39] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[17:06:39]
[17:06:39] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[17:06:39] Build host: amoeba
[17:06:39] Board Type: Nvidia
[17:06:39] Core      :
[17:06:39] Preparing to commence simulation
[17:06:39] - Looking at optimizations...
[17:06:39] DeleteFrameFiles: successfully deleted file=work/wudata_03.ckp
[17:06:39] - Created dyn
[17:06:39] - Files status OK
[17:06:39] - Expanded 46642 -> 252912 (decompressed 542.2 percent)
[17:06:39] Called DecompressByteArray: compressed_data_size=46642 data_size=252912, decompressed_data_size=252912 diff=0
[17:06:39] - Digital signature verified
[17:06:39]
[17:06:39] Project: 5768 (Run 3, Clone 150, Gen 631)
[17:06:39]
[17:06:39] Assembly optimizations on if available.
[17:06:39] Entering M.D.
[17:06:45] Tpr hash work/wudata_03.tpr:  3031373490 126100368 2143326572 1716792703 1450597082
[17:06:45]
[17:06:45] Calling fah_main args: 14 usage=100
[17:06:45]
[17:06:45] Working on Protein
[17:06:46] Client config found, loading data.
[17:06:46] Starting GUI Server
[17:07:40] Completed 1%
[17:08:34] Completed 2%
[17:09:28] Completed 3%
[17:10:22] Completed 4%
[17:11:16] Completed 5%
[17:12:10] Completed 6%
[17:13:03] Completed 7%
[17:13:59] Completed 8%
[17:14:53] Completed 9%
[17:15:47] Completed 10%
[17:16:40] Completed 11%
[17:17:34] Completed 12%
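For anyone wading through a long log like the one above, here is a minimal sketch of how the repeated failures could be tallied per work unit. It is not part of the F@H client. The function name `tally_shutdowns` and the two regular expressions are my own assumptions, based only on the line formats visible in this log ("Project: ... (Run, Clone, Gen)" and "Folding@home Core Shutdown: ..."). The sample text is a few lines copied from the log.

```python
import re
from collections import Counter

# Assumed line formats, taken from the log excerpt in this post.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def tally_shutdowns(log_text):
    """Count core shutdown statuses per (project, run, clone, gen).

    A shutdown line is attributed to the most recent "Project:" line
    seen before it; shutdowns with no preceding project line are skipped.
    """
    counts = Counter()
    current = None
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            current = tuple(int(g) for g in m.groups())
            continue
        m = SHUTDOWN_RE.search(line)
        if m and current is not None:
            counts[(current, m.group(1))] += 1
    return counts


# A few lines copied from the log above.
SAMPLE_LOG = """\
[16:18:28] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:34] Self-test failure
[16:18:34] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:18:38] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:44] Project: 5772 (Run 6, Clone 182, Gen 671)
[16:18:51] Self-test failure
[16:18:51] Folding@home Core Shutdown: UNSTABLE_MACHINE
"""

if __name__ == "__main__":
    for (wu, status), n in tally_shutdowns(SAMPLE_LOG).items():
        project, run, clone, gen = wu
        print(f"Project {project} (Run {run}, Clone {clone}, Gen {gen}): {status} x{n}")
```

To run it on a real log, read the file into a string and pass it to `tally_shutdowns`. The same work unit failing repeatedly, as it does here, shows up as a single entry with a high count.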
BuddhaChu
 
Posts: 149
Joined: Wed Apr 16, 2008 2:38 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby HaloJones » Mon Jan 25, 2010 1:45 pm

My 8800GT was failing on everything. I deleted Core 11 and Core 14, and since then it has been working flawlessly. It could just be that I was getting a bad batch, but the card passed memtestg80 without issue. I really hate how software flaws appear to shut down a folding card when the card may not be at fault.
1x Titan X, 1x 1070ti, 4x 1070
HaloJones
 
Posts: 350
Joined: Thu Jul 24, 2008 10:16 am

Re: [Please read] NaNs detected on GPU - UNSTABLE_MACHINE error

Postby PaJaSoft » Wed Jan 27, 2010 11:57 am

My NVIDIA 8600GT crashes on every core11 work unit (self-test failure), while on core14 everything has been fine the whole time... strange... :-(

PS: Last year everything was also OK on core11. It is the latest core11 update (announced this year) that is problematic.
PaJaSoft
 
Posts: 3
Joined: Thu Oct 08, 2009 1:30 pm
