7970: Getting failure "Unstable_Machine" on stock clocks

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

stickg1
Posts: 19
Joined: Tue Jun 12, 2012 10:47 pm

7970: Getting failure "Unstable_Machine" on stock clocks

Post by stickg1 »

I noticed that my progress was always below 10% completion, then realized it was because I kept starting new projects, so I checked the logs. I reset my GPU to stock settings and still get this error:
01:34:16:WU00:FS00:0x16:Assembly optimizations on if available.
01:34:16:WU00:FS00:0x16:Entering M.D.
01:34:18:WU00:FS00:0x16:Tpr hash 00/wudata_01.tpr: 1840862616 1960897209 313171957 151549198 1593571991
01:34:18:WU00:FS00:0x16:Working on ALZHEIMER DISEASE AMYLOID
01:34:18:WU00:FS00:0x16:Client config unavailable.
01:34:18:WU00:FS00:0x16:Starting GUI Server
01:34:19:WU00:FS00:0x16:Setting checkpoint frequency: 599999
01:34:19:WU00:FS00:0x16:Completed 0 out of 59999936 steps (0%).
01:35:28:Saving configuration to config.xml
01:35:28:<config>
01:35:28: <!-- Folding Slot Configuration -->
01:35:28: <gpu v='true'/>
01:35:28: <smp v='false'/>
01:35:28:
01:35:28: <!-- Network -->
01:35:28: <proxy v=':8080'/>
01:35:28:
01:35:28: <!-- User Information -->
01:35:28: <passkey v='********************************'/>
01:35:28: <team v='37726'/>
01:35:28: <user v='stickg1'/>
01:35:28:
01:35:28: <!-- Folding Slots -->
01:35:28: <slot id='0' type='GPU'/>
01:35:28:</config>
01:38:17:WU00:FS00:0x16:Completed 600000 out of 59999936 steps (1%).
01:41:47:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
01:42:28:WU00:FS00:0x16:Completed 1199999 out of 59999936 steps (2%).
01:49:48:WU00:FS00:0x16:mdrun_gpu returned 52
01:49:48:WU00:FS00:0x16:NANs detected on GPU
01:49:48:WU00:FS00:0x16:
01:49:48:WU00:FS00:0x16:Folding@home Core Shutdown: UNSTABLE_MACHINE
01:49:48:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by P5-133XL »

Drivers? Try drivers no more recent than 12.8

http://www.overclock.net/t/1323729/amd- ... ta-drivers
stickg1
Posts: 19
Joined: Tue Jun 12, 2012 10:47 pm

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by stickg1 »

I'm using 12.11b. I thought it would be fine as long as I had the APP SDK 2.7 or 2.8 installed separately. I guess I could try the 12.8 drivers, but it will kill my gaming performance :(
stickg1
Posts: 19
Joined: Tue Jun 12, 2012 10:47 pm

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by stickg1 »

It might have actually been GPU stability. I bumped the voltage up slightly on the stock clocks and so far I'm 30% through a work unit. We shall see.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by bruce »

Unfortunately the message NANs detected on GPU indicates that a calculation error was detected but it doesn't give a clue about why it happened. Errors like that frequently mean the hardware wasn't operating correctly -- perhaps voltage, perhaps overclocking, perhaps defective hardware, etc. -- but it can also be caused by software or driver problems. I'm glad you found it fairly quickly.
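Since these failures are easy to miss, here is a minimal Python sketch for pulling them out of a client log. It only assumes the line layout seen in the excerpt at the top of this thread (time, optional WARNING, work unit, slot, message); the function name and the marker list are made up for illustration, and other client versions may format lines differently.

```python
import re

# Failure markers taken from the log excerpt earlier in this thread.
FAILURE_MARKERS = ("NANs detected on GPU", "UNSTABLE_MACHINE")

# Matches lines such as
#   01:49:48:WU00:FS00:0x16:NANs detected on GPU
#   01:49:48:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):(?:WARNING:)?(WU\d+):(FS\d+):(.*)$")


def find_failures(log_text):
    """Return (time, work_unit, slot, message) for every failure line."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group(4) for marker in FAILURE_MARKERS):
            hits.append(match.groups())
    return hits


# Sample lines copied from the log posted above.
sample = """\
01:42:28:WU00:FS00:0x16:Completed 1199999 out of 59999936 steps (2%).
01:49:48:WU00:FS00:0x16:mdrun_gpu returned 52
01:49:48:WU00:FS00:0x16:NANs detected on GPU
01:49:48:WU00:FS00:0x16:Folding@home Core Shutdown: UNSTABLE_MACHINE
01:49:48:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
"""

for time_stamp, work_unit, slot, message in find_failures(sample):
    print(time_stamp, work_unit, slot, message)
```

To check a real log, read the file into a string and pass it to find_failures. This only tells you that a WU failed and when, not why.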
stickg1
Posts: 19
Joined: Tue Jun 12, 2012 10:47 pm

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by stickg1 »

Yeah I successfully cranked out two WU's since adding some voltage and I am watching it like a hawk. I took apart my GPU and added some Cool Labs Liquid Pro, at full usage during folding I'm at 55C which is great for a reference card IMO. So I could add more voltage if need be. I just wanted to make sure there wasn't an issue with drivers too. But many people on Overclock.net are using 12.11 drivers with the APP SDK 2.7. I just joined a competition folding team over there so I didn't want to be dead weight failing work units left and right!!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by bruce »

NOBODY wants unnecessary failed WUs, whether they're competitive or not.
Humanoid1
Posts: 17
Joined: Mon Jun 04, 2012 1:33 pm
Hardware configuration: CPU: Xeon X5650 @ 4GHz
Gfx: Asus 7950 Direct CUII TOP @ 1.05GHz
RAM: Tri Channel 12GB Kingston LoVo 1,600MHz 8-9-8-21 1T
MB: Gigabyte X58A-OC
OS: Win 7 Pro 64bit

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by Humanoid1 »

I have something to add to this - been meaning to pop by and start a Thread anyways...
Been folding for years, and a few months back I finally upgraded my trusty old 5850 (had no problems on this card + I "fairly" often check logs) to an Asus 7950 Direct CUII TOP v1 (I added shims to the screws to fix the reputed loose cooler issue later remedied in the v2)
- of course I completely cleaned out the old drivers and added the 12.10 driver.

These cards have Great potential and in raw horsepower terms are considered superior to NVidia's offerings I understand. Hoping for updates to FAH to take advantage of these cards, though I know that Could be far off?
- besides its wise to help the underdog and maintain a healthy price war ;)

I did not immediately notice, but I was getting NANs detected on GPU. I only found it when looking through the log, having seen fewer completed WUs than expected on the ExtremeOverclocking stats pages (this error is not immediately obvious to most people, I imagine)

Will bullet the story into a short version:
- I was OC'd at the time to 1.1GHz @ stock volts, running Cool at 62C when folding, and getting NaNs seemingly sometimes and not others
- Reduced speed + increased volts in stages which at 1st seemed to do the trick several times --- but still getting NaNs intermittently & often
- ended up under clocked to 800MHz at 1.25v with no fix. (stock volts = 1.090)

- In other testing / gaming I had no problem
(obviously my sole task at the time)

-- BUT... I did eventually figure out part of the issue:

- I normally have Many FireFox tabs open, browsing sites and watching some youtube videos now and then.
- Watching movies using VLC also causes NaN's
(it appears that a WU can Seem to be ok, stop playing videos etc and will Later fail, like at 75% I noted twice, hours after cause of NaNs )
- Doing this has a 90%+ chance of causing NaNs while GPU folding. (SomeTimes I can complete a whole WU, but is Very Rare)


Currently I Pause GPU during the day and set it running, after closing FireFox, only when I am not using the PC for a while + overnight.

Am now GPU folding : 1.05GHz @ stock volts and Never get NaNs (incidentally I get between 8,088 and 10k (10k at a glance on that newer WU) PPD )
I do Almost Nothing! when folding GPU now as I want to avoid more NaNs, but I have not tested if the Windows 7 Aero theme can cause them... by this I mean, I do not move windows or pretty much do anything on the desktop while GPU folding.

Obviously testing this NaN problem can take a Long Time due to 5 hours per WU and not always failing until 30+++ % when watching VLC or browsing with FireFox (I imagine a mix of some Flash ads and YouTube cause the error)

Would be Great to be able to fold GPU 24/7 again worry free. Good Luck finding the issue at its core!

I am using :
F@H Client v7.1.52
Win 7 Pro 64bit
FireFox 18
Adobe Flash Player 11
VLC media player 2.0.1 Twoflower
Catalyst 12.10 Cleaned previous driver Thoroughly + manually checked after app cleaning
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by bruce »

NaN errors can come from many very different things including overheating, bad WUs, etc. and there's rarely enough information to know what to try first.

I've been thinking about your story of delayed errors and although I don't have a definitive answer I do have one guess.

Video memory is a bit of a mystery to me, but I do know it has some similarities to the old methods of using CPU memory before virtual memory was invented. When you start an application that uses video memory, that application is responsible for allocating whatever VRAM it needs and then it's responsible for freeing it when it is finished with it. All memory probably comes from a common pool, so any uncooperative application that fails to free up the memory it allocated can cause problems for other applications. I suspect that almost every GPU-based program is tested by the software designer when it has exclusive use of the GPU, so detecting, isolating, and fixing a memory leak (if there is one) is going to be a hit-and-miss proposition.

I don't know if there's any way to see how much VRAM is allocated and/or which program owns it but somewhere there should be a way to check and decide what to do about it. Rebooting (and resetting the GPU) is a poor way to diagnose that kind of an issue.

This comment may or may not be related to your problem, especially the last sentence:
Humanoid1 wrote:But bear in mind that things like playing youtube videos (flash player) pulls GFx core speed down by about the % (not GPU usage) some of the "speeds" you guys are experiencing.
To get speed back up, the browser window that had a video played in it must be closed.
Humanoid1
Posts: 17
Joined: Mon Jun 04, 2012 1:33 pm
Hardware configuration: CPU: Xeon X5650 @ 4GHz
Gfx: Asus 7950 Direct CUII TOP @ 1.05GHz
RAM: Tri Channel 12GB Kingston LoVo 1,600MHz 8-9-8-21 1T
MB: Gigabyte X58A-OC
OS: Win 7 Pro 64bit

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by Humanoid1 »

Thanks for your reply bruce.

I had been spending Some time testing things.

Long testing desc. made Short while still on Catalyst 12.10 :
Found that almost any PC interaction could make GPU WU's fail
Sometimes would fail even from almost Nothing it seemed
Got to point of Pausing (killing current progressing %) to prevent failure of WU on doing Anything with PC
Quite Quite boring.

So based on that + last post as well as what I wrote Below - I would conclude that the OpenCL APP SDK packaged with the 12.10 Catalyst is not fully compatible with FAH 7.1.52 and leads to NaNs failing WUs while folding on a 7950

Solution Found + no more NaN at all for me:

Due to another application requiring Exactly Catalyst 13.1, I cleaned out Completely the 12.10 for this.
As expected from others descriptions 13.1 gives Terrible PPD. approx just over 2,000 instead of my usual 8,000
(did not test long enough to check for NaNs)

So what I did was rip out the 13.1 OpenCL APP (Accelerated Parallel Processing)
Then Install the 12.8 OpenCL APP

Now I almost have my normal GPU folding PPD (getting 7,015 now, used to get 8,300) and can do Anything I want like browsing YouTube videos etc etc with no failed GPU WU with NaNs
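For a rough sense of scale, the PPD figures quoted in this post can be turned into percentages. A tiny Python sketch (the function name is made up here, and the numbers are just the approximate ones from this post):

```python
def ppd_retained(current_ppd, baseline_ppd):
    """Percentage of the baseline PPD that the current setup still delivers."""
    return 100.0 * current_ppd / baseline_ppd


# Catalyst 13.1 with the 12.8 OpenCL APP, compared with the earlier 8,300
print(f"{ppd_retained(7015, 8300):.1f}%")  # 84.5%

# Plain Catalyst 13.1 (roughly 2,000), compared with the usual 8,000
print(f"{ppd_retained(2000, 8000):.1f}%")  # 25.0%
```

So the mixed install gives up roughly 15% of the old PPD, against roughly 75% lost on plain 13.1.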

So Very Happy again :D
- especially as despite my testing earlier in this thread saying the GFx card was ok, I was "starting" to wonder if my 7950 was bad :(



For any interested:

For either of the following obviously don't forget to stop GPU folding 1st.

Obviously You can completely uninstall the Current 13.1 Catalyst (and use a http://sites.amd.com/us/game/downloads/ ... ility.aspx) and install the 12.8 OpenCL APP followed by Catalyst 13.1 (making sure not to overwrite the 12.8 OpenCL APP)

OR
Remove the OpenCL APP from 13.1 and install 12.8 like this (or the equivalent for your own Catalyst version numbers):
(an easy guide for Win7)

Unpack the 13.1 Catalyst if needed by running it and letting it extract, then Cancel the installation.
Now run:
C:\AMD\Support\13-1_vista_win7_win8_64_dd_ccc_whql\Packages\Apps\OpenCL64\OpenCL.msi
(Need to use the same OpenCL.msi that installed it to remove it)

Remove the OpenCL APP SDK Runtime.

Then go to

C:\Windows\System32
- and Delete:
amdocl64.dll
OpenCL.dll
OpenVideo64.dll
OVDecode64.dll
SlotMaximizerAg.dll
SlotMaximizerBe.dll

and

C:\Windows\SysWOW64
- and Delete:
amdocl.dll
OpenCL.dll
OpenVideo.dll
OVDecode.dll
SlotMaximizerAg.dll
SlotMaximizerBe.dll
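Before deleting anything by hand, it may help to see which of the listed files are actually still there. The following is a minimal Python sketch that only reports and never deletes. The file names and folders are copied from the two lists above; the function name is made up for illustration. One caveat: a 32-bit Python on 64-bit Windows has its System32 accesses redirected to SysWOW64, so this assumes a 64-bit Python.

```python
from pathlib import Path

# File names copied from the two lists above. Nothing is deleted here.
RUNTIME_FILES = {
    r"C:\Windows\System32": [
        "amdocl64.dll", "OpenCL.dll", "OpenVideo64.dll",
        "OVDecode64.dll", "SlotMaximizerAg.dll", "SlotMaximizerBe.dll",
    ],
    r"C:\Windows\SysWOW64": [
        "amdocl.dll", "OpenCL.dll", "OpenVideo.dll",
        "OVDecode.dll", "SlotMaximizerAg.dll", "SlotMaximizerBe.dll",
    ],
}


def find_leftovers(files_by_dir):
    """Return the listed files that still exist on disk, as Path objects."""
    found = []
    for directory, names in files_by_dir.items():
        for name in names:
            candidate = Path(directory) / name
            if candidate.is_file():
                found.append(candidate)
    return found


if __name__ == "__main__":
    for path in find_leftovers(RUNTIME_FILES):
        print(path)
```

An empty result after the uninstall step would suggest there is nothing left to remove manually.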

Restart PC

Then run:
C:\AMD\Support\12-8_vista_win7_win8_64_dd_ccc_whql\Packages\Apps\OpenCL64\OpenCL.msi
and install its OpenCL APP

Restart PC
And that should be it: back to folding while keeping the latest 13.1 drivers for games etc :)

This all confirms stickg1's experience above, really.
But I wanted to get to the bottom of what was causing the NaNs.
DarkFoss
Posts: 103
Joined: Fri Apr 16, 2010 11:43 pm
Hardware configuration: AMD 5800X3D Asus ROG Strix X570-E Gaming WiFi II bios 5003 G-Skill TridentZ Neo 3600mhz Asrock Tachi RX 7900XTX Corsair rm850x psu Asus PG32UQXR EK Elite 360 D-rgb aio Win 11pro/Kubuntu 22.04.4 LTS UPS BX1500G
Location: Galifrey

Re: 7970: Getting failure "Unstable_Machine" on stock clocks

Post by DarkFoss »

Oldish thread, I know, but you can use MSI Afterburner to monitor the memory usage of the first GPU.
Edit: I rebooted, then disabled CrossFire before opening Afterburner, then started FAH. GPU1 was using 134MB of RAM with Firefox open for Web Control. GPU2 was only running FAH and was using a whole 5MB of VRAM.
I just noticed that after disabling CrossFire, GPU-Z can also monitor both cards, and the memory allocation is in 2 parts, dedicated and dynamic. GPU2 was 5MB dedicated and 11MB dynamic.