FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Moderators: Site Moderators, FAHC Science Team

billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

I think it probably isn't a bad WU; I've been trying to find the temperature/overclock limits of my GPU. It's (unseasonably) warm here this evening and I think I've just found one of them :(

Error was:

ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)

and I had to reboot to get it folding again!
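
For reference, the number in parentheses is an OpenCL status code, and in the Khronos cl.h header -36 is CL_INVALID_COMMAND_QUEUE. The code alone does not say why the queue became invalid. Below is a minimal sketch for decoding such a line, assuming log lines follow the format quoted above; the function name is hypothetical and the table is only a small subset of the codes.

```python
import re

# Small subset of OpenCL status codes as defined in the Khronos cl.h header.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}


def decode_opencl_error(log_line):
    """Return (api_call, code, symbolic_name) or None if no code is found.

    Hypothetical helper: assumes the line ends with 'clSomeCall (-NN)'.
    """
    match = re.search(r"(cl\w+) \((-?\d+)\)", log_line)
    if match is None:
        return None
    call, code = match.group(1), int(match.group(2))
    return call, code, OPENCL_ERRORS.get(code, "unknown code")


line = "ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)"
decoded = decode_opencl_error(line)
print(decoded)
```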

A question- if (like this one) I get a "bad WU" that is probably my own fault, should I still report it?
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460s for approx. 70K PPD
Location: Salem. OR USA

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by P5-133XL »

For general release projects, the servers keep track of and deal with bad WUs automatically, so you do not need to report each and every one. A general release WU typically needs to be reported only if there is a real question, like points not being credited, that needs to be dealt with by a human.

If you are beta or alpha testing, the standard is much looser and the info supplied needs to be much greater so that patterns in the cause can be detected. There the researcher wants to know of problems with the project quickly rather than relying upon the servers to automatically detect a problem. However, if you know that it was your fault, then even here it need not be reported unless you were trying to test a failure mode where the info might be helpful.
billford

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

That all sounds sensible, thanks :)
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by bruce »

Depending on the type of error, the FAH client almost always uploads an error report so that the WU can be reassigned promptly and, after a designated number of failures, removed from circulation. In rare instances, no report is uploaded and the WU remains idle until it times out. In either case, there's not a lot you can do. If you do report it, we can check to see if others have reported problems with the same WU.
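
To make the reassign-then-retire behaviour concrete, here is a minimal sketch. It is not the actual FAH server code: the class name, the method name, and the failure threshold are all assumptions for illustration, since the real "designated number" is not stated here.

```python
from collections import defaultdict

# Hypothetical threshold; the real "designated number of failures" is not given.
MAX_FAILURES = 5


class WorkUnitTracker:
    """Toy model of a server counting error reports per work unit."""

    def __init__(self, max_failures=MAX_FAILURES):
        self.max_failures = max_failures
        self.failures = defaultdict(int)  # WU id -> number of error reports
        self.retired = set()              # WUs removed from circulation

    def report_failure(self, wu_id):
        """Record one error report and return the next action for this WU."""
        self.failures[wu_id] += 1
        if self.failures[wu_id] >= self.max_failures:
            self.retired.add(wu_id)
            return "retire"
        return "reassign"


tracker = WorkUnitTracker(max_failures=3)
wu = (9101, 136, 1, 31)  # (project, run, clone, gen)
actions = [tracker.report_failure(wu) for _ in range(3)]
print(actions)
```

A WU that never produces an error report would simply never reach `report_failure` in this model, which matches the "idle until it times out" case.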
billford

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued. That's something to bear in mind if there's no indication of an upload. I assume the "waiting for time-out" status can be manually over-ridden if desired?
bruce

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by bruce »

Unfortunately there's no tool to override "waiting for timeout", but if a WU has had a number of failures and the server has not yet removed it from circulation, we can submit a request to have it removed. In this case, there's only one error report and it's from "lbford".
billford

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

bruce wrote: In this case, there's only one error report and it's from "lbford"
Yup, that's me- I don't fold using the same name as I post. No ulterior motive, it just happened that way.
bruce

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by bruce »

NP.

FAH always retries faulty WUs. Some fail repeatedly (called "bad WUs") and some are successfully completed by another system (assumed to be a local stability problem, but the software doesn't have a good way to confirm that).

Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100% so they do their best to deal with their need for getting solid scientific results from whatever equipment donors provide. Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
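
The redundancy approach used by those other DC projects can be sketched roughly as follows. This is only an illustration: the function name and the idea of comparing result fingerprints as plain strings are assumptions, and real projects typically compare numerical results within a tolerance.

```python
from collections import Counter


def validate_by_quorum(results, quorum=2):
    """Return the result that at least `quorum` donors agree on, else None.

    `results` is a list of hashable result fingerprints (hypothetical),
    one per donor who completed the same work unit.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


# Two donors agree, one machine produced a different (presumably bad) result.
accepted = validate_by_quorum(["a1f3", "a1f3", "zz99"])
# Only two results so far and they disagree, so the WU must be sent out again.
pending = validate_by_quorum(["a1f3", "zz99"])
print(accepted, pending)
```

The trade-off is visible in the signature: every accepted result costs at least `quorum` completions, which only makes sense when there are more donors than work.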
billford

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

bruce wrote: Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100%
So I'm discovering :wink:

It's a pity there isn't a way to find the stability limits of a GPU without exceeding one of them… I half-expected a failure of some sort, the room temperature was nearly 30°C and I wasn't comfortable in there!

I was hoping any error would come up as "Bad state detected", then I could turn the clock down a bit and it would (hopefully) complete the WU. But it went for the nuclear option instead :(
bruce wrote: Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
I didn't know that. And, presumably, why the client keeps "sanity checking" the results as it goes so it can (mostly!) back up to the last known good state… it's a more efficient paradigm in my opinion.
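
A rough sketch of that "sanity check and back up to the last known good state" idea, as I understand it. This is not how the FAH cores are actually implemented: the function names, the check interval, and the use of a simple "is the value finite" test are all assumptions for illustration.

```python
import math


def run_with_checkpoints(step, state, n_steps, check_every=10, max_rollbacks=3):
    """Advance `state` by `n_steps`, rolling back when a sanity check fails.

    `step(state, i)` is a hypothetical simulation step. Returns a tuple
    (final_state, number_of_rollbacks).
    """
    checkpoint = (0, state)  # last known good (step index, state)
    rollbacks = 0
    i = 0
    while i < n_steps:
        state = step(state, i)
        i += 1
        if i % check_every == 0 or i == n_steps:
            if math.isfinite(state):
                checkpoint = (i, state)  # state looks sane, remember it
            else:
                rollbacks += 1
                if rollbacks > max_rollbacks:
                    raise RuntimeError("bad state persisted, giving up on this WU")
                i, state = checkpoint    # back up to the last good state
    return state, rollbacks


glitches = {13}  # step 13 misbehaves once, like a transient GPU error


def flaky_step(state, i):
    if i in glitches:
        glitches.discard(i)
        return float("nan")
    return state + 1.0


final, rollbacks = run_with_checkpoints(flaky_step, 0.0, 30)
print(final, rollbacks)
```

Here only the ten steps since the last checkpoint are redone, rather than the whole WU being computed twice.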
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by 7im »

Please tell us you are not stability testing on live data!
billford

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

7im wrote: Please tell us you are not stability testing on live data!
If I were testing I'd be running a lot more over-clock to find out where it broke at normal room temperatures… now I know to keep an eye on the room thermometer (and the weather forecast!) and back off a bit at ambients above about 26°C.

I don't cheerfully junk WUs, but I don't see the point in unnecessarily wide safety margins either.
bruce

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by bruce »

It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy. Tomorrow is supposed to be cooler.
P5-133XL

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by P5-133XL »

I have 4 computers with 6 folding GPUs in a 150 sq ft room. They draw around 2,200 W continuously, as measured by a Kill A Watt meter. The room temperature is always 20-30F higher than the outside temperature. The past couple of days, outside has peaked in the low 90s F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPUs below 80C, which is my temp goal.
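
Since this thread mixes Fahrenheit and Celsius, a quick conversion sketch may help when comparing numbers. The function names are just illustrative. Note that a temperature difference (such as "20-30F higher") converts without the 32 degree offset.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def fahrenheit_delta_to_celsius(delta_f):
    """Convert a temperature difference; no 32 degree offset applies."""
    return delta_f * 5.0 / 9.0


room_c = fahrenheit_to_celsius(95)           # the 95F room mentioned earlier
rise_low_c = fahrenheit_delta_to_celsius(20)   # lower end of a 20-30F rise
rise_high_c = fahrenheit_delta_to_celsius(30)  # upper end of a 20-30F rise
print(room_c, round(rise_low_c, 1), round(rise_high_c, 1))
```

So a 20-30F rise over outside is roughly an 11-17C rise.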
billford

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by billford »

bruce wrote: It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy.
P5-133XL wrote: The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPU's less than 80C which is my temp goal.
I'd guess that a) those conditions aren't particularly unusual where you live and b) you have other priorities than pandering to your GPUs (like earning a living!), so they're set up to cope with anything but the most extreme conditions they're likely to encounter.

For a), where I live they're not too common, though not rare, and b) I'm retired, thus able (and prepared) to run them closer to the limit under normal conditions and adjust when non-normal conditions occur (or look likely).

It helps that I've only got one GPU that needs such attention, and possibly the comparative costs of such kit between the US and UK provide an incentive to squeeze the most out of them :wink:

Everyone has their own way to run their own kit depending on their own particular circumstances… I think we're wandering off topic.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Post by PantherX »

billford wrote: I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued...
AFAIK, only a single file is uploaded upon WU completion/failure. IIRC, it would be WUresults.dat (or something similar).

The log file that is displayed in Advanced Control (AKA FAHControl) is only uploaded (with your permission) when you submit a bug report from the Web Control.