Unstable Machine

Post by jsanthara » Wed May 09, 2012 1:15 pm

I have been getting an error with the 80xx WUs. Here is the log:

[06:16:27] - Machine ID: 2
[06:16:27]
[06:16:27] Gpu type=3 species=21.
[06:16:28] Loaded queue successfully.
[06:16:28] - Preparing to get new work unit...
[06:16:28] Cleaning up work directory
[06:16:28] + Attempting to get work packet
[06:16:28] Passkey found
[06:16:28] Gpu type=3 species=21.
[06:16:28] - Connecting to assignment server
[06:16:28] - Successful: assigned to (171.67.108.143).
[06:16:28] + News From Folding@Home: Welcome to Folding@Home
[06:16:28] Loaded queue successfully.
[06:16:28] Gpu type=3 species=21.
[06:16:30] + Closed connections
[06:16:30]
[06:16:30] + Processing work unit
[06:16:30] Core required: FahCore_15.exe
[06:16:30] Core found.
[06:16:30] Working on queue slot 06 [May 9 06:16:30 UTC]
[06:16:30] + Working ...
[06:16:30]
[06:16:30] *------------------------------*
[06:16:30] Folding@Home GPU Core
[06:16:30] Version                2.22 (Thu Dec 8 17:08:05 PST 2011)
[06:16:30] Build host             SimbiosNvdWin7
[06:16:30] Board Type             NVIDIA/CUDA
[06:16:30] Core                   15
[06:16:30]
[06:16:30] Window's signal control handler registered.
[06:16:30] Preparing to commence simulation
[06:16:30] - Looking at optimizations...
[06:16:30] DeleteFrameFiles: successfully deleted file=work/wudata_06.ckp
[06:16:30] - Created dyn
[06:16:30] - Files status OK
[06:16:30] sizeof(CORE_PACKET_HDR) = 512 file=<>
[06:16:30] - Expanded 146324 -> 660994 (decompressed 451.7 percent)
[06:16:30] Called DecompressByteArray: compressed_data_size=146324 data_size=660994, decompressed_data_size=660994 diff=0
[06:16:30] - Digital signature verified
[06:16:30]
[06:16:30] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:16:30]
[06:16:30] Assembly optimizations on if available.
[06:16:30] Entering M.D.
[06:16:32] Tpr hash work/wudata_06.tpr:  2513114052 2874948355 3171489514 4072215001 1299962209
[06:16:32] GPU device info: vendor=0 device=0 name=<NA> match=0
[06:16:32] Working on Gromacs Runs On Most of All Computer Systems
[06:16:32] Client config found, loading data.
[06:16:32] Starting GUI Server
[06:17:34] Setting checkpoint frequency: 250000
[06:17:34] Completed         3 out of 25000000 steps (0%).
[06:24:06] Completed    250000 out of 25000000 steps (1%).
[06:30:39] Completed    500000 out of 25000000 steps (2%).
[06:30:41] mdrun_gpu returned 52
[06:30:41] NANs detected on GPU
[06:30:41]
[06:30:41] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:30:44] CoreStatus = 7A (122)
[06:30:44] Sending work to server
[06:30:44] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:30:44] - Read packet limit of 540015616... Set to 524286976.
[06:30:44] - Error: Could not get length of results file work/wuresults_06.dat
[06:30:44] - Error: Could not read unit 06 file. Removing from queue.
[06:30:44] - Preparing to get new work unit...
[06:30:44] Cleaning up work directory
[06:30:44] + Attempting to get work packet
[06:30:44] Passkey found
[06:30:44] Gpu type=3 species=21.
[06:30:44] - Connecting to assignment server
[06:30:45] - Successful: assigned to (171.67.108.143).
[06:30:45] + News From Folding@Home: Welcome to Folding@Home
[06:30:45] Loaded queue successfully.
[06:30:45] Gpu type=3 species=21.
[06:30:46] + Closed connections
[06:30:51]
[06:30:51] + Processing work unit
[06:30:51] Core required: FahCore_15.exe
[06:30:51] Core found.
[06:30:51] Working on queue slot 07 [May 9 06:30:51 UTC]
[06:30:51] + Working ...
[06:30:51]
[06:30:51] *------------------------------*
[06:30:51] Folding@Home GPU Core
[06:30:51] Version                2.22 (Thu Dec 8 17:08:05 PST 2011)
[06:30:51] Build host             SimbiosNvdWin7
[06:30:51] Board Type             NVIDIA/CUDA
[06:30:51] Core                   15
[06:30:51]
[06:30:51] Window's signal control handler registered.
[06:30:51] Preparing to commence simulation
[06:30:51] - Looking at optimizations...
[06:30:51] DeleteFrameFiles: successfully deleted file=work/wudata_07.ckp
[06:30:51] - Created dyn
[06:30:51] - Files status OK
[06:30:51] sizeof(CORE_PACKET_HDR) = 512 file=<>
[06:30:51] - Expanded 146324 -> 660994 (decompressed 451.7 percent)
[06:30:51] Called DecompressByteArray: compressed_data_size=146324 data_size=660994, decompressed_data_size=660994 diff=0
[06:30:51] - Digital signature verified
[06:30:51]
[06:30:51] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:30:51]
[06:30:52] Assembly optimizations on if available.
[06:30:52] Entering M.D.
[06:30:54] Tpr hash work/wudata_07.tpr:  2513114052 2874948355 3171489514 4072215001 1299962209
[06:30:54] GPU device info: vendor=0 device=0 name=<NA> match=0
[06:30:54] Working on Gromacs Runs On Most of All Computer Systems
[06:30:54] Client config found, loading data.
[06:30:54] Starting GUI Server
[06:31:56] Setting checkpoint frequency: 250000
[06:31:56] Completed         3 out of 25000000 steps (0%).
[06:38:28] Completed    250000 out of 25000000 steps (1%).
[06:45:02] Completed    500000 out of 25000000 steps (2%).
[06:45:03] mdrun_gpu returned 52
[06:45:03] NANs detected on GPU
[06:45:03]
[06:45:03] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:45:05] CoreStatus = 7A (122)
[06:45:05] Sending work to server
[06:45:05] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:45:05] - Read packet limit of 540015616... Set to 524286976.
[06:45:05] - Error: Could not get length of results file work/wuresults_07.dat
[06:45:05] - Error: Could not read unit 07 file. Removing from queue.
[06:45:05] - Preparing to get new work unit...
[06:45:05] Cleaning up work directory
[06:45:05] + Attempting to get work packet
[06:45:05] Passkey found
[06:45:05] Gpu type=3 species=21.
[06:45:05] - Connecting to assignment server
[06:45:06] - Successful: assigned to (171.67.108.143).
[06:45:06] + News From Folding@Home: Welcome to Folding@Home
[06:45:06] Loaded queue successfully.
[06:45:06] Gpu type=3 species=21.
[06:45:07] + Closed connections
[06:45:12]
[06:45:12] + Processing work unit
[06:45:12] Core required: FahCore_15.exe
[06:45:12] Core found.
[06:45:12] Working on queue slot 08 [May 9 06:45:12 UTC]
[06:45:12] + Working ...
[06:45:12]
[06:45:12] *------------------------------*
[06:45:12] Folding@Home GPU Core
[06:45:12] Version                2.22 (Thu Dec 8 17:08:05 PST 2011)
[06:45:12] Build host             SimbiosNvdWin7
[06:45:12] Board Type             NVIDIA/CUDA
[06:45:12] Core                   15
[06:45:12]
[06:45:12] Window's signal control handler registered.
[06:45:12] Preparing to commence simulation
[06:45:12] - Looking at optimizations...
[06:45:12] DeleteFrameFiles: successfully deleted file=work/wudata_08.ckp
[06:45:12] - Created dyn
[06:45:12] - Files status OK
[06:45:12] sizeof(CORE_PACKET_HDR) = 512 file=<>
[06:45:12] - Expanded 146324 -> 660994 (decompressed 451.7 percent)
[06:45:12] Called DecompressByteArray: compressed_data_size=146324 data_size=660994, decompressed_data_size=660994 diff=0
[06:45:12] - Digital signature verified
[06:45:12]
[06:45:12] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:45:12]
[06:45:13] Assembly optimizations on if available.
[06:45:13] Entering M.D.
[06:45:15] Tpr hash work/wudata_08.tpr:  2513114052 2874948355 3171489514 4072215001 1299962209
[06:45:15] GPU device info: vendor=0 device=0 name=<NA> match=0
[06:45:15] Working on Gromacs Runs On Most of All Computer Systems
[06:45:15] Client config found, loading data.
[06:45:15] Starting GUI Server
[06:46:16] Setting checkpoint frequency: 250000
[06:46:16] Completed         3 out of 25000000 steps (0%).
[06:52:49] Completed    250000 out of 25000000 steps (1%).
[06:59:23] Completed    500000 out of 25000000 steps (2%).
[06:59:24] mdrun_gpu returned 52
[06:59:24] NANs detected on GPU
[06:59:24]
[06:59:24] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:59:27] CoreStatus = 7A (122)
[06:59:27] Sending work to server
[06:59:27] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:59:27] - Read packet limit of 540015616... Set to 524286976.
[06:59:27] - Error: Could not get length of results file work/wuresults_08.dat
[06:59:27] - Error: Could not read unit 08 file. Removing from queue.
[06:59:27] - Preparing to get new work unit...
[06:59:27] Cleaning up work directory
[06:59:27] + Attempting to get work packet
[06:59:27] Passkey found
[06:59:27] Gpu type=3 species=21.
[06:59:27] - Connecting to assignment server
[06:59:27] - Successful: assigned to (171.67.108.143).
[06:59:27] + News From Folding@Home: Welcome to Folding@Home
[06:59:28] Loaded queue successfully.
[06:59:28] Gpu type=3 species=21.
[06:59:29] + Closed connections
[06:59:34]
[06:59:34] + Processing work unit
[06:59:34] Core required: FahCore_15.exe
[06:59:34] Core found.
[06:59:34] Working on queue slot 09 [May 9 06:59:34 UTC]
[06:59:34] + Working ...
[06:59:34]
[06:59:34] *------------------------------*
[06:59:34] Folding@Home GPU Core
[06:59:34] Version                2.22 (Thu Dec 8 17:08:05 PST 2011)
[06:59:34] Build host             SimbiosNvdWin7
[06:59:34] Board Type             NVIDIA/CUDA
[06:59:34] Core                   15
[06:59:34]
[06:59:34] Window's signal control handler registered.
[06:59:34] Preparing to commence simulation
[06:59:34] - Looking at optimizations...
[06:59:34] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[06:59:34] - Created dyn
[06:59:34] - Files status OK
[06:59:34] sizeof(CORE_PACKET_HDR) = 512 file=<>
[06:59:34] - Expanded 146324 -> 660994 (decompressed 451.7 percent)
[06:59:34] Called DecompressByteArray: compressed_data_size=146324 data_size=660994, decompressed_data_size=660994 diff=0
[06:59:34] - Digital signature verified
[06:59:34]
[06:59:34] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:59:34]
[06:59:34] Assembly optimizations on if available.
[06:59:34] Entering M.D.
[06:59:36] Tpr hash work/wudata_09.tpr:  2513114052 2874948355 3171489514 4072215001 1299962209
[06:59:36] GPU device info: vendor=0 device=0 name=<NA> match=0
[06:59:36] Working on Gromacs Runs On Most of All Computer Systems
[06:59:36] Client config found, loading data.
[06:59:36] Starting GUI Server
[07:00:38] Setting checkpoint frequency: 250000
[07:00:38] Completed         3 out of 25000000 steps (0%).
[07:07:10] Completed    250000 out of 25000000 steps (1%).
[07:07:11] mdrun_gpu returned 52
[07:07:11] NANs detected on GPU
[07:07:11]
[07:07:11] Folding@home Core Shutdown: UNSTABLE_MACHINE
[07:07:14] CoreStatus = 7A (122)
[07:07:14] Sending work to server
[07:07:14] Project: 8020 (Run 1, Clone 343, Gen 7)
[07:07:14] - Read packet limit of 540015616... Set to 524286976.
[07:07:14] - Error: Could not get length of results file work/wuresults_09.dat
[07:07:14] - Error: Could not read unit 09 file. Removing from queue.
[07:07:14] - Preparing to get new work unit...
[07:07:14] Cleaning up work directory
[07:07:14] + Attempting to get work packet
[07:07:14] Passkey found
[07:07:14] Gpu type=3 species=21.
[07:07:14] - Connecting to assignment server
[07:07:15] - Successful: assigned to (171.67.108.143).
[07:07:15] + News From Folding@Home: Welcome to Folding@Home
[07:07:15] Loaded queue successfully.
[07:07:15] Gpu type=3 species=21.
[07:07:16] + Closed connections
[07:07:21]
[07:07:21] + Processing work unit
[07:07:21] Core required: FahCore_15.exe
[07:07:21] Core found.
[07:07:21] Working on queue slot 00 [May 9 07:07:21 UTC]
[07:07:21] + Working ...
[07:07:21]
[07:07:21] *------------------------------*
[07:07:21] Folding@Home GPU Core
[07:07:21] Version                2.22 (Thu Dec 8 17:08:05 PST 2011)
[07:07:21] Build host             SimbiosNvdWin7
[07:07:21] Board Type             NVIDIA/CUDA
[07:07:21] Core                   15
[07:07:21]
[07:07:21] Window's signal control handler registered.
[07:07:21] Preparing to commence simulation
[07:07:21] - Looking at optimizations...
[07:07:21] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[07:07:21] - Created dyn
[07:07:21] - Files status OK
[07:07:21] sizeof(CORE_PACKET_HDR) = 512 file=<>
[07:07:21] - Expanded 146324 -> 660994 (decompressed 451.7 percent)
[07:07:21] Called DecompressByteArray: compressed_data_size=146324 data_size=660994, decompressed_data_size=660994 diff=0
[07:07:21] - Digital signature verified
[07:07:21]
[07:07:21] Project: 8020 (Run 1, Clone 343, Gen 7)
[07:07:21]
[07:07:22] Assembly optimizations on if available.
[07:07:22] Entering M.D.
[07:07:24] Tpr hash work/wudata_00.tpr:  2513114052 2874948355 3171489514 4072215001 1299962209
[07:07:24] GPU device info: vendor=0 device=0 name=<NA> match=0
[07:07:24] Working on Gromacs Runs On Most of All Computer Systems
[07:07:24] Client config found, loading data.
[07:07:24] Starting GUI Server
[07:08:25] Setting checkpoint frequency: 250000
[07:08:25] Completed         3 out of 25000000 steps (0%).
[07:14:58] Completed    250000 out of 25000000 steps (1%).
[07:14:59] mdrun_gpu returned 52
[07:14:59] NANs detected on GPU
[07:14:59]
[07:14:59] Folding@home Core Shutdown: UNSTABLE_MACHINE
[07:15:01] CoreStatus = 7A (122)
[07:15:01] Sending work to server
[07:15:01] Project: 8020 (Run 1, Clone 343, Gen 7)
[07:15:01] - Read packet limit of 540015616... Set to 524286976.
[07:15:01] - Error: Could not get length of results file work/wuresults_00.dat
[07:15:01] - Error: Could not read unit 00 file. Removing from queue.
[07:15:01] EUE limit exceeded. Pausing 24 hours.


I recently RMA'd a crashed EVGA GTX 460 SE. I received the replacement last week and started folding with the console client. All was good for 2 days, and then I got "UNSTABLE_MACHINE". A little searching on the forums showed that this sometimes happens, even with factory-overclocked GPUs.

I ran FurMark for 2 1/2 hours with no problems.

I updated to the 301.24 drivers (from 296.10) and it seemed to work again... for another 2 days. Now I get the log you see above (same error as before). I looked around on the forums, but I haven't been able to find a clear solution.
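
To make the pattern in a long log like the one above easier to see, here is a rough Python sketch that pulls out each failed attempt: which work unit it was, the last step count reported, and the shutdown reason. The regular expressions are based only on the line shapes in this log, so other client or core versions may need changes. The function name summarize_failures and the FINISHED_UNIT status used to skip clean shutdowns are assumptions for illustration, not something taken from this log.

```python
import re

# Patterns modelled on the log lines posted in this thread (assumption:
# other client/core versions may format these lines differently).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STEP_RE = re.compile(r"Completed\s+(\d+) out of (\d+) steps")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")

# Short excerpt of two attempts, copied from the log above.
SAMPLE = """\
[06:16:30] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:17:34] Completed         3 out of 25000000 steps (0%).
[06:24:06] Completed    250000 out of 25000000 steps (1%).
[06:30:39] Completed    500000 out of 25000000 steps (2%).
[06:30:41] NANs detected on GPU
[06:30:41] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:30:44] Project: 8020 (Run 1, Clone 343, Gen 7)
[06:59:34] Project: 8020 (Run 1, Clone 343, Gen 7)
[07:00:38] Completed         3 out of 25000000 steps (0%).
[07:07:10] Completed    250000 out of 25000000 steps (1%).
[07:07:11] Folding@home Core Shutdown: UNSTABLE_MACHINE
"""


def summarize_failures(log_text):
    """Return one (prcg, last_step, reason) tuple per failed core shutdown.

    prcg is (project, run, clone, gen); last_step is the last step count
    reported before the shutdown, or None if no progress line was seen.
    """
    failures = []
    prcg = None
    last_step = None
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            prcg = tuple(int(group) for group in match.groups())
            continue
        match = STEP_RE.search(line)
        if match:
            last_step = int(match.group(1))
            continue
        match = SHUTDOWN_RE.search(line)
        if match:
            # Assumption: a clean finish is reported as FINISHED_UNIT.
            if match.group(1) != "FINISHED_UNIT":
                failures.append((prcg, last_step, match.group(1)))
            last_step = None  # the next attempt starts counting again
    return failures
```

Calling summarize_failures on the full log text should give five entries, all for Project 8020 (Run 1, Clone 343, Gen 7): the client was handed the same work unit each time, and each attempt died within a second or two of reporting a 250000-step boundary.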

I would prefer not to, but if I have to, I will downclock my GPU a bit, which is sad because I am not even overclocking. Also, my GTX 560 Ti is getting these 80xx units as well and has not shown any problems with them. Could someone please help me figure this out?

Further info:

I read about a lot of theories for why this particular error occurs (drivers, lack of memory, etc.), so just in case it is relevant:

Gigabyte Z68-UD3H-B3
i5-2500k
8GB RAM
EVGA GTX 460 SE (301.24 beta drivers)

Re: Unstable Machine

Post by 7im » Wed May 09, 2012 1:31 pm

Down-clocking is a quick and easy test...

Also be sure to disable any power saving modes.

Re: Unstable Machine

Post by P5-133XL » Wed May 09, 2012 2:57 pm

There are specific drivers that seem to cause problems, as well as drivers that are relatively problem-free (for the GTX 460, 260.99 seems to be a good one).

Re: Unstable Machine

Post by jsanthara » Wed May 09, 2012 9:59 pm

I downclocked a bit. Seems stable thus far (running for 5+ hours), but it also worked for 2 days after switching the drivers, so we will see what happens.

Re: Unstable Machine

Post by derrickmcc » Wed May 09, 2012 10:13 pm

It may also be related to overheating.

I use SpeedFan to monitor the temperatures of my two GTX 460s.

I also recently had some errors folding 80xx WUs on one card only. I have had no further problems since I dropped the OC on that GPU to 1600 MHz; the other GPU is still happily folding at 1650 MHz.

Note that the position of the GPU in the case will affect the cooling.
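
For anyone who would rather log temperatures from a script than watch a monitoring tool, here is a minimal Python sketch. It assumes the nvidia-smi utility that ships with the NVIDIA driver is on the PATH and that the card and driver actually report a temperature through it, which is not guaranteed for every GeForce card or driver version. The function names and the 80 C limit in the usage note are hypothetical, chosen only for illustration.

```python
import subprocess


def parse_temperatures(csv_text):
    """Parse nvidia-smi CSV output with one temperature per line.

    Returns one entry per GPU, in order; lines that are not a plain
    integer (for example "[Not Supported]") become None so that list
    positions still match GPU indices.
    """
    temps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        temps.append(int(line) if line.isdigit() else None)
    return temps


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of each GPU, in Celsius."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def over_limit(temps, limit_c):
    """Return the indices of GPUs whose reported temperature exceeds limit_c."""
    return [i for i, t in enumerate(temps) if t is not None and t > limit_c]
```

A typical use would be to call over_limit(read_gpu_temperatures(), 80) every minute or so and write the result to a file next to the client log, so that any UNSTABLE_MACHINE entry can be checked against the temperature at that moment.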

Re: Unstable Machine

Post by jsanthara » Sat May 12, 2012 12:18 am

Update: I downclocked a bit and have not had any issues since (2+ days). I will let you all know if the error returns.

I sure hope they stop sending me these 80xx units, so I can return to normal clock speeds.

derrickmcc wrote: It may also be related to overheating...


Temps were, and still are, solid at around 64C to 72C (closer to 64C since the downclock). Thank you for your input.
