
FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 7:01 pm
by billford
I think it probably isn't a bad WU. I've been trying to find the temperature/overclock limits of my GPU, it's (unseasonably) warm here this evening, and I think I've just found one of them :(

Error was:

ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)

and I had to reboot to get it folding again!

A question- if (like this one) I get a "bad WU" that is probably my own fault, should I still report it?
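
For readers decoding the error line above: in the Khronos OpenCL headers, status -36 is CL_INVALID_COMMAND_QUEUE. The sketch below only translates the number into its name; it does not prove what caused the failure. The helper names (`describe_cl_error`, `parse_core_error`) are made up for illustration and are not part of the FAH client or core.

```python
import re

# A small subset of the OpenCL status codes defined in the Khronos cl.h header.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}


def describe_cl_error(code):
    """Return the symbolic name for an OpenCL status code, if it is in the table."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL status ({code})")


def parse_core_error(line):
    """Pull the OpenCL call name and status code out of a log line (hypothetical helper).

    Returns (call_name, code, symbolic_name), or None if the line does not match.
    """
    match = re.search(r"\b(cl\w+) \((-?\d+)\)", line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, describe_cl_error(code)


log_line = "ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)"
print(parse_core_error(log_line))
```

An invalid command queue reported by a read that had been working up to that point is consistent with the GPU or driver falling over mid-run, which fits the overclock explanation, but the code alone cannot confirm it.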

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 7:41 pm
by P5-133XL
For general release projects, the servers keep track of bad WUs and deal with them automatically, so you do not need to report each and every one. A general release WU typically needs to be reported only when there is a real question that a human has to deal with, such as points not being credited.

If you are beta or alpha testing, the standard is much looser and the info supplied needs to be much greater, so that patterns in the cause can be detected. There the researcher wants to know of problems with the project quickly rather than relying upon the servers to automatically detect a problem. However, if you know that it was your fault, then even here it need not be reported unless you were trying to test a failure mode where the info might be helpful.

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 7:47 pm
by billford
That all sounds sensible, thanks :)

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 8:43 pm
by bruce
Depending on the type of error, the FAH client almost always uploads an error report so that the WU can be reassigned promptly and, after a designated number of failures, removed from circulation. In rare instances, no report is uploaded and the WU remains idle until it times out. In either case, there's not a lot you can do. If you do report it, we can check to see if others have reported problems with the same WU.
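
The reassign-then-retire behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual FAH server logic: the `WorkUnit` class, the `record_failure` function, and the threshold of 5 failures are all assumptions, since the real "designated number" isn't stated in this thread.

```python
from dataclasses import dataclass, field
from typing import List

MAX_FAILURES = 5  # hypothetical threshold; the real value is not given in the thread


@dataclass
class WorkUnit:
    project: int
    run: int
    clone: int
    gen: int
    failed_donors: List[str] = field(default_factory=list)
    retired: bool = False


def record_failure(wu, donor, max_failures=MAX_FAILURES):
    """Log one error report and decide what happens to the WU next.

    Returns "reassign" while the WU is still under the failure limit,
    and "retire" once the limit is reached.
    """
    wu.failed_donors.append(donor)
    if len(wu.failed_donors) >= max_failures:
        wu.retired = True
        return "retire"
    return "reassign"


wu = WorkUnit(project=9101, run=136, clone=1, gen=31)
print(record_failure(wu, "lbford"))  # first report, so the WU goes back out
```

With a single error report on file, the WU is simply handed to another donor; only a run of failures from different systems would mark it as a bad WU.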

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 9:03 pm
by billford
I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued. That's something to bear in mind if there's no indication of an upload. I assume the "waiting for time-out" status can be manually overridden if desired?

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 9:08 pm
by bruce
Unfortunately, there's no tool to override "waiting for timeout", but if a WU has had a number of failures and the server has not yet removed it from circulation, we can submit a request to have it removed. In this case, there's only one error report and it's from "lbford".

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 9:12 pm
by billford
bruce wrote:In this case, there's only one error report and it's from "lbford"
Yup, that's me. I don't fold under the same name I post with. No ulterior motive, it just happened that way.

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 9:34 pm
by bruce
NP.

FAH always retries faulty WUs. Some fail repeatedly (called a "bad WU") and some are successfully completed by another system (assumed to be a local stability problem, but the software doesn't have a good way to confirm that).

Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100% so they do their best to deal with their need for getting solid scientific results from whatever equipment donors provide. Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
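
For contrast, the "multiple completions must match" approach used by other DC projects can be sketched like this. It is a generic quorum check written for illustration, not code from any particular project; the function name `validate_by_quorum` and the default quorum of 2 are assumptions.

```python
from collections import Counter


def validate_by_quorum(results, quorum=2):
    """Accept a result only if at least `quorum` donors returned the same value.

    `results` is a list of hashable result summaries (for example, checksums).
    Returns the agreed value, or None if no value has enough matching copies yet.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


print(validate_by_quorum(["abc123", "abc123"]))            # two copies agree
print(validate_by_quorum(["abc123", "ffff00"]))            # disagreement, needs another copy
print(validate_by_quorum(["abc123", "ffff00", "abc123"]))  # third copy settles it
```

The cost is plain from the sketch: every WU has to be computed at least `quorum` times, which only makes sense when there are more donors than work.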

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 10:00 pm
by billford
bruce wrote:Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100%
So I'm discovering :wink:

It's a pity there isn't a way to find the stability limits of a GPU without exceeding one of them… I half-expected a failure of some sort, the room temperature was nearly 30ºC and I wasn't comfortable in there!

I was hoping any error would come up as "Bad state detected"; then I could turn the clock down a bit and it would (hopefully) complete the WU. But it went for the nuclear option instead :(
bruce wrote:Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
I didn't know that. And that's presumably why the client keeps "sanity checking" the results as it goes, so it can (mostly!) back up to the last known good state… it's a more efficient paradigm in my opinion.

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 10:05 pm
by 7im
Please tell us you are not stability testing on live data!

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Fri May 16, 2014 10:19 pm
by billford
7im wrote:Please tell us you are not stability testing on live data!
If I were testing I'd be running a lot more over-clock to find out where it broke at normal room temperatures… now I know to keep an eye on the room thermometer (and the weather forecast!) and back off a bit at ambients above about 26ºC.

I don't cheerfully junk WUs, but I don't see the point in unnecessarily wide safety margins either.
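
The "back off a bit above about 26ºC" rule of thumb could be written down as a tiny helper like the one below. Everything in it except the 26ºC threshold is an assumption for illustration: the function name, the idea of expressing the overclock as an MHz offset, and the 25 MHz back-off step are not from this thread, and it does not talk to any real overclocking tool.

```python
def clock_offset_for_ambient(ambient_c, normal_offset_mhz,
                             threshold_c=26.0, backoff_mhz=25):
    """Suggest an overclock offset given the room temperature.

    Below the threshold the usual offset is kept; above it the offset is
    reduced by a fixed step (hypothetical value), never going below zero.
    """
    if ambient_c > threshold_c:
        return max(0, normal_offset_mhz - backoff_mhz)
    return normal_offset_mhz


print(clock_offset_for_ambient(22.0, 100))  # normal evening: keep the usual offset
print(clock_offset_for_ambient(30.0, 100))  # warm evening: back off one step
```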

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Sat May 17, 2014 1:35 am
by bruce
It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy. Tomorrow is supposed to be cooler.

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Sat May 17, 2014 7:11 am
by P5-133XL
I have 4 computers with 6 folding GPU's in a 150 sq ft room. They draw around 2,200 Watts continuously, as measured by a Kill A Watt meter. The room temp is always 20-30F higher than the outside temp. The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPU's less than 80C which is my temp goal.
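
Since this thread mixes Fahrenheit and Celsius, here is a quick conversion sketch. The one thing to watch is that a temperature difference (like "20-30F higher") converts with the 5/9 factor only, without the 32-degree offset. The function names are just illustrative.

```python
def f_to_c(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def delta_f_to_c(delta_f):
    """Convert a temperature difference from Fahrenheit degrees to Celsius degrees."""
    return delta_f * 5 / 9


def daily_kwh(watts):
    """Energy used in 24 hours at a constant power draw."""
    return watts * 24 / 1000


print(f_to_c(95))                                             # the 95F room quoted earlier
print(round(delta_f_to_c(20), 1), round(delta_f_to_c(30), 1))  # the 20-30F rise above outside
print(daily_kwh(2200))                                         # 2,200 W running all day
```

So a 20-30F rise over outside is roughly 11-17 Celsius degrees, and 2,200 W continuous works out to about 52.8 kWh per day.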

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Sat May 17, 2014 7:34 am
by billford
bruce wrote:It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy.
P5-133XL wrote:The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPU's less than 80C which is my temp goal.
I'd guess that a) those conditions aren't particularly unusual where you live and b) you have other priorities than pandering to your GPUs (like earning a living!), so they're set up to cope with anything but the most extreme conditions they're likely to encounter.

For a), where I live they're not too common, though not rare, and b) I'm retired, thus able (and prepared) to run them closer to the limit under normal conditions and adjust when non-normal conditions occur (or look likely).

It helps that I've only got one GPU that needs such attention, and possibly the comparative costs of such kit between the US and UK provide an incentive to squeeze the most out of it :wink:

Everyone has their own way to run their own kit depending on their own particular circumstances… I think we're wandering off topic.

Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Posted: Sat May 17, 2014 12:08 pm
by PantherX
billford wrote:I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued...
AFAIK, only a single file is uploaded upon WU completion/failure. IIRC, it would be WUresults.dat (or something similar).

The log file that is displayed in Advanced Control (AKA FAHControl) is only uploaded (with your permission) when you submit a bug report from the Web Control.