Rash of bad WU all 13000/13001

Moderators: Site Moderators, FAHC Science Team

pinetor
Posts: 13
Joined: Fri Jun 06, 2014 12:10 am

Re: Bad WU: 13001 (Run 532, Clone 3, Gen 4)

Post by pinetor »

As to the CPU threads: there is very little return from CPU folding versus GPU folding, plus the fan noise, etc.

As to the GPU OC: none. I can't see overclocking a GPU and then running it 24/7. This is my one and only rig, so no super-duper juice.
pinetor
Posts: 13
Joined: Fri Jun 06, 2014 12:10 am

Re: Rash of bad WU all 13000/13001

Post by pinetor »

The first time (I can't recall which WU), the WU stuck at 99.9% and I had to reboot.
I then deleted the work folder and it picked up another, very short WU (9xxx), which ran to completion.
It then picked up another 13000 WU, which locked up the GPU.
I rebooted and let the WU reload... again today the GPU locked up.
I had to hard boot yet again and the WU reloaded at 80%.
I have yet to complete any 13000/13001 WU.
Currently I have all work paused, as there is no use in just running to less than 100%. Plus, I am not here when the GPU locks up, so I am not sure what may be happening. I do run GPU-Z. The CPU is water cooled, but as noted it is not doing anything.

I will be glad to DL the 14.4 if you think that will get me back to folding.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: Bad WU: 13000 (Run 952, Clone 0, Gen 22)

Post by PantherX »

Welcome to the F@H Forum pinetor,

Please note that if you don't want to fold on the CPU, simply remove the CPU Slot by following these instructions:
1) Open up Advanced Control (AKA FAHControl)
2) Click Configure
3) Select the Slots Tab
4) Select the appropriate Slot
5) Click Remove
6) Click Save

Thus, you will only have a single GPU Slot which you can fold on.

Moreover, please refrain from manually deleting the work folder since it delays the progress of science.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Rash of bad WU all 13000/13001

Post by PantherX »

pinetor wrote:The first time ..I cant recall which WU. the WU stuck at 99.9% I had to reboot...
In the majority of cases, it is caused by the OS reloading the driver. Can you search the Windows Event Log for messages related to driver reloading? Also, while you have stated that the GPU isn't overclocked, is it factory overclocked by chance? If so, it is possible that the factory overclock is unstable, so try down-clocking to the AMD stock frequencies.
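As a rough sketch of the Event Log search suggested above: on Windows Vista and later, a display driver reset is normally recorded in the System log as Event ID 4101 from the source "Display", with a message saying the driver "stopped responding and has successfully recovered". The Python below assumes the System log has already been exported to CSV. The column names, the sample rows, and the helper name find_driver_resets are illustrative assumptions, not anything taken from this thread.

```python
import csv
import io

# Event ID normally logged when Windows resets a hung display driver (TDR).
TDR_EVENT_ID = "4101"
TDR_PHRASE = "stopped responding"

# Made-up sample rows; the column names are an assumption about the export.
SAMPLE_CSV = """Level,Date and Time,Source,Event ID,Message
Warning,6/5/2014 9:14:02 PM,Display,4101,Display driver amdkmdap stopped responding and has successfully recovered.
Information,6/5/2014 9:20:11 PM,Service Control Manager,7036,The Windows Update service entered the running state.
"""


def find_driver_resets(rows):
    """Return the rows that look like display driver resets."""
    hits = []
    for row in rows:
        event_id = (row.get("Event ID") or "").strip()
        source = (row.get("Source") or "").strip().lower()
        message = (row.get("Message") or "").lower()
        # Match on the ID/source pair, or fall back to the message text.
        if (event_id == TDR_EVENT_ID and source == "display") or TDR_PHRASE in message:
            hits.append(row)
    return hits


if __name__ == "__main__":
    for hit in find_driver_resets(csv.DictReader(io.StringIO(SAMPLE_CSV))):
        print(hit["Date and Time"], "|", hit["Message"])
```

If the timestamps of any hits line up with the times the WU stalled, that points toward a driver reset rather than a bad WU.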
pinetor wrote:...It then picked up another 13000 WU. this locked up the GPU
I re-booted and let the WU reload... again today the GPU locked up
I had to hard boot yet again and the WU re=loaded at 80%
I have yet to complete any 13000/13001 WU
Currently I have all work paused, as there is no use in just running to less than 100%. Plus I am not here when the GPU locks up, so I am not sure what may be happening. I do run gpuz. The CPU is water cooled, but as noted not doing anything...
Could you please explain what you mean by "locked up the GPU"? If you mean that the cursor moves very slowly across the screen, this is called screen lag. Screen lag can be caused by a combination of drivers and GPU model (among other factors). The reason is that the GPU works in a First In First Out (FIFO) manner and doesn't have any kind of priority/scheduling system like the CPU does to manage tasks. If you encounter screen lag, the best solution is to configure the GPU to fold only when the system is idle (http://folding.stanford.edu/home/faq/fa ... ion/#ntoc3).
pinetor wrote:...I will be glad to DL the 14.4 if you think that will get me back to folding.
Donors have reported that the 14.4 WHQL driver improves folding performance over the 13.X WHQL driver series, so you can try it out.
pinetor
Posts: 13
Joined: Fri Jun 06, 2014 12:10 am

Re: Rash of bad WU all 13000/13001

Post by pinetor »

Thanks for all the assistance.
I have updated Catalyst (complete package) to the latest version.
I have also manually overridden the GPU fan speed to 55%. Generally it never gets above 40%, even after long hours at 98% load.

What I mean by GPU lock-up is that the screen shows a cyan "checker-board pattern" and the entire system is non-responsive. I have seen a GPU go bad (during mining) and the situation seems similar, thus I conclude the GPU is locked up. Given that the CPU is doing very little (13% load) and the memory use is below 3GB (2.75) out of 8GB, I don't think those subsystems are to blame. I was able to fold about 560k worth of points before any problems popped up (I know that's not impressive, but it represents at least two weeks' worth of error-free folding).
pinetor
Posts: 13
Joined: Fri Jun 06, 2014 12:10 am

Re: Bad WU: 13000 (Run 952, Clone 0, Gen 22)

Post by pinetor »

Thank you for the welcome!

It certainly would not be my first thought (to delete the work folder). But at the time, the WU was stuck at 99% and my GPU (the slot assigned to it) was at 0% load. I let it sit this way for several hours (while browsing the forums). After a few more reboots, I gave up and deleted the folder. This DID get me a non-13000 WU, which then did run to completion. However, the next WU and all since then (on the GPU) have been 13000/13001 WUs.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad WU: 13001 (Run 486, Clone 5, Gen 15)

Post by bruce »

P5-133XL wrote:Hi pinetor (team 224497),
Your WU (P13001 R486 C5 G15) was added to the stats database on 2014-06-04 22:03:33 for 13869.6 points of credit.
Partial credit, in spite of the error. Presumably the WU has been reassigned to see if somebody else can complete it but we'll have to wait until they have time to process it before we can report the final status.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad WU: 13000 (Run 952, Clone 0, Gen 22)

Post by bruce »

Having the GPU at 0% while the WU appears to be stuck at 99% has been reported many times. There are several possible causes, especially overclocking or overheating or MSRemoteDesktop or a Sleep state, all of which can reset the GPU, thereby stopping all progress. Unfortunately, the estimated progress continues to increase from whatever point the reset happened ... until it reaches 99% in the GUI ... although the log stops reporting progress.

If you discover that the progress indications in the log stop and become unsynchronized with the GUI, you can manually recover by doing a Pause, followed by a Fold.
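One way to notice that the log has stopped reporting progress is to compare the timestamp of the newest "Completed ... steps" line with the current time. The sketch below assumes log lines of the form HH:MM:SS:WUxx:FSxx:...:Completed N out of M steps; the sample lines, the 30-minute threshold, and the function names are hypothetical and only illustrate the idea.

```python
import re
from datetime import timedelta

# Assumed shape of a progress line; real logs may differ in detail.
PROGRESS_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):WU(\d+):FS(\d+):.*Completed (\d+) out of (\d+) steps"
)

# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    "10:00:00:WU01:FS01:0x17:Completed 1950000 out of 2500000 steps (78%)",
    "10:12:30:WU01:FS01:0x17:Completed 1975000 out of 2500000 steps (79%)",
]


def last_progress(lines, slot):
    """Return (time_of_day, percent) of the newest progress line for a slot, or None."""
    latest = None
    for line in lines:
        m = PROGRESS_RE.match(line)
        if not m or int(m.group(5)) != slot:
            continue
        stamp = timedelta(hours=int(m.group(1)), minutes=int(m.group(2)),
                          seconds=int(m.group(3)))
        done, total = int(m.group(6)), int(m.group(7))
        if total:
            latest = (stamp, 100.0 * done / total)
    return latest


def is_stalled(lines, slot, now, max_gap=timedelta(minutes=30)):
    """True if the slot's newest progress line is older than max_gap.

    `now` is a time of day in the same clock as the log timestamps.
    The modulo handles a log that runs past midnight.
    """
    latest = last_progress(lines, slot)
    if latest is None:
        return False  # no progress lines at all, so nothing to compare
    gap = (now - latest[0]) % timedelta(days=1)
    return gap > max_gap


if __name__ == "__main__":
    print(is_stalled(SAMPLE_LOG, slot=1, now=timedelta(hours=13)))
```

If this reports a stall while the GUI percentage is still climbing, that is the unsynchronized state described above, and a Pause followed by a Fold is the manual recovery.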

You'll need to rule out the possible causes one at a time until you find the one that is making the OS reset the GPU, and then prevent it from happening again.

I see you have reported several WUs with the message "Bad State detected... attempting to resume from last good checkpoint". Other people are not having that problem, although we'll have to wait to confirm that others have successfully completed the same WUs. I have a strong hunch that those messages are an indication of the same problem as what I've called an OS-initiated GPU reset. There's a good chance that FAH puts heavier computational demands on the GPU than SHA64, so GPUs which appeared to be stable are now demonstrably unstable.
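To see how often that message shows up, and on which slot, a log can be scanned for it. This is only a sketch: the message text is the one quoted above, but the WUxx/FSxx tag layout, the sample lines, and the function name count_bad_states are assumptions made for illustration.

```python
import re
from collections import Counter

BAD_STATE = "Bad State detected"
# Assumed tag layout, e.g. ":WU01:FS01:"; lines without it are still counted.
TAG_RE = re.compile(r":(WU\d+):(FS\d+):")

# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    "08:01:10:WU01:FS01:0x17:Completed 500000 out of 2500000 steps (20%)",
    "08:09:42:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint",
    "08:31:05:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint",
]


def count_bad_states(lines):
    """Count "Bad State detected" messages per (work unit, slot) tag."""
    counts = Counter()
    for line in lines:
        if BAD_STATE not in line:
            continue
        m = TAG_RE.search(line)
        key = (m.group(1), m.group(2)) if m else ("?", "?")
        counts[key] += 1
    return counts


if __name__ == "__main__":
    for (wu, slot), n in count_bad_states(SAMPLE_LOG).items():
        print(wu, slot, n)
```

A count that keeps growing across different WUs on the same slot would fit the hardware-instability theory better than a single bad WU would.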
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Rash of bad WU all 13000/13001

Post by bruce »

Project 13000/13001 puts heavier demands on your system than many other applications. FAH is very good at uncovering systems which have appeared to be stable until now but which, in fact, are marginal under high load.

See also the answer I provided in one of your other topics.
viewtopic.php?f=19&t=26436&p=265705#p265705
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Rash of bad WU all 13000/13001

Post by PantherX »

pinetor wrote:...What I mean by GPU lock-up is that the screen has a cyan "checker-board pattern" and the entire system is non-responsive. I have seen a GPU go bad ( during mining) and the situation seems similar thus i conclude the GPU is locked up. Given the CPU is doing very little ( 13% load) and the memory use is below 3GB ( 2.75) out of 8GB. I don't think of those sub-systems are to blame. I was able to fold about 560k worth of points before any problems popped up ( I know thats not impressive, but it represent at least two weeks worth of error free folding).
It sounds like a VRAM issue. Maybe your GPU is failing or encountering some serious hardware issue. As a test, can you run some GPU benchmarks (http://www.techpowerup.com/downloads/Benchmarking/) and see if you spot any visual artifacts (http://www.playtool.com/pages/artifacts/artifacts.html) or if the system locks-up/crashes?
pinetor
Posts: 13
Joined: Fri Jun 06, 2014 12:10 am

Re: Bad WU: 13000 (Run 952, Clone 0, Gen 22)

Post by pinetor »

Bruce (bruce)
Again, thanks for your patience with me. I suspect you are correct in that 13000 projects are taxing the GPU more than any other task set to it. I actually have this WU back and have been running it since I posted all this last night... and we have made it through the day!!! I am at 78% completion. The GPU is loaded at 98 to 99% and holding at 50 to 51C (according to GPU-Z). Either the new drivers or the fan at a constant 55% seems to be doing the trick.
I can rule out:
any OC ( CPU, RAM, or GPU)
Remote desktop
Sleep

However, I do have lots of the typical processes that might decide to break things: Synapse updates, Java, AMD, QNAP.

PS: I never did SHA64, too late to get into BTC. I did Scrypt, running up to 3 GPUs, but I sold them all except the lowly 7850.

Still, I think the load is the issue: either the driver could not handle it or there was just not enough cooling. I can bump the GPU fans up to 60% but... it's just too noisy.

Thanks again!
pinetor
Posts: 13
Joined: Fri Jun 06, 2014 12:10 am

Re: Rash of bad WU all 13000/13001

Post by pinetor »

and stop my folding???? (grins)

so far so good today... fingers crossed.(new drivers/manual fan control)
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona
Contact:

Re: Bad WU: 13001 (Run 532, Clone 3, Gen 4)

Post by 7im »

Depending on the model of your HD 7800, a lot of GPUs come factory overclocked these days. It's easy to miss unless looking for it.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad WU: 13000 (Run 952, Clone 0, Gen 22)

Post by bruce »

You've reported several WUs that crashed in separate topics (which makes sense if the problem is associated with a specific WU). You have also opened a general topic about multiple p13000/13001 WUs. I'm going to merge them into a single topic, on the theory that they're all caused by the same sort of problem with your GPU hardware. The posts will be in chronological order, so they may appear to be intermixed, but at least all the answers that might be applicable will be in one place.
bollix47
Posts: 2941
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Rash of bad WU all 13000/13001

Post by bollix47 »

Another folder was able to complete the WU successfully:

Hi xxxx (team xxxx),
Your WU (P13001 R486 C5 G15) was added to the stats database on 2014-06-24 04:03:47 for 17123 points of credit.