FAULTY project:9101 run:136 clone:1 gen:31 (maybe)

Postby billford » Fri May 16, 2014 7:01 pm

I think it probably isn't a bad WU; I've been trying to find the temperature/overclock limits of my GPU, it's (unseasonably) warm here this evening and I think I've just found one of them :(

Error was:

ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)

and I had to reboot to get it folding again!

A question- if (like this one) I get a "bad WU" that is probably my own fault, should I still report it?
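
For anyone decoding the number in that log line: `-36` is the OpenCL status code `CL_INVALID_COMMAND_QUEUE`, as defined in the Khronos `cl.h` header. Below is a minimal sketch of a lookup. The `describe_cl_error` helper and its regular expression are made up for illustration and are not part of the FAH client, and only a handful of codes are listed.

```python
import re

# A few OpenCL status codes, with values as defined in the Khronos cl.h header.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}


def describe_cl_error(log_line):
    """Hypothetical helper: read the trailing '(-NN)' code from a log line."""
    match = re.search(r"\((-?\d+)\)\s*$", log_line)
    if match is None:
        return None  # no status code at the end of the line
    code = int(match.group(1))
    return CL_ERRORS.get(code, f"unknown OpenCL code {code}")


line = "ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)"
print(describe_cl_error(line))  # CL_INVALID_COMMAND_QUEUE
```

An invalid command queue on a read-back would be consistent with a GPU that had already faulted, but the code alone cannot confirm the cause.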
billford
 
Posts: 1006
Joined: Thu May 02, 2013 8:46 pm
Location: Near Oxford, United Kingdom

Postby P5-133XL » Fri May 16, 2014 7:41 pm

For general release projects, the servers keep track of bad WUs and deal with them automatically, so you do not need to report each and every one. A general release WU typically needs to be reported only when there is a real question that has to be dealt with by a human, such as points not being credited.

If you are beta or alpha testing, the standard for reporting is much looser and the info supplied needs to be much greater, so that patterns in the cause can be detected. There the researcher wants to know about problems with the project quickly rather than relying on the servers to detect a problem automatically. However, if you know that it was your fault, then even here it need not be reported unless you were trying to test a failure mode where the info might be helpful.
P5-133XL
 
Posts: 4034
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Postby billford » Fri May 16, 2014 7:47 pm

That all sounds sensible, thanks :)

Postby bruce » Fri May 16, 2014 8:43 pm

Depending on the type of error, the FAH client almost always uploads an error report so that the WU can be reassigned promptly and, after a designated number of failures, removed from circulation. In rare instances, no report is uploaded and the WU remains idle until it times out. In either case, there's not a lot you can do. If you do report it, we can check to see if others have reported problems with the same WU.
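
As a rough illustration of the bookkeeping described above, here is a minimal sketch. It is not the actual FAH server code: the `WorkUnitTracker` class, the threshold of 3, the status strings, and the choice to count distinct donors are all assumptions made for the example.

```python
from collections import defaultdict


class WorkUnitTracker:
    """Hypothetical sketch: count error reports per WU and retire repeat offenders."""

    def __init__(self, max_failures=3):
        # Assumed threshold; the real "designated number" is not stated in this thread.
        self.max_failures = max_failures
        self.failures = defaultdict(set)  # WU id -> donors who reported an error
        self.retired = set()

    def report_failure(self, wu, donor):
        """Record one error report and return what happens to the WU next."""
        if wu in self.retired:
            return "retired"
        self.failures[wu].add(donor)  # assumption: repeats from one donor count once
        if len(self.failures[wu]) >= self.max_failures:
            self.retired.add(wu)
            return "retired"
        return "reassign"


tracker = WorkUnitTracker(max_failures=3)
wu = (9101, 136, 1, 31)  # project, run, clone, gen
for donor in ("donor_a", "donor_b", "donor_c"):
    print(donor, tracker.report_failure(wu, donor))
```

With a threshold of 3, the first two reports lead to reassignment and the third takes the WU out of circulation.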
bruce
 
Posts: 22853
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Postby billford » Fri May 16, 2014 9:03 pm

I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued. Something to bear in mind if there's no indication of an upload. I assume the "waiting for time-out" status can be manually overridden if desired?

Postby bruce » Fri May 16, 2014 9:08 pm

Unfortunately there's no tool to override "waiting for timeout". However, if a WU has had a number of failures and the Server has not yet removed it from circulation, we can submit a request to have it removed. In this case, there's only one error report and it's from "lbford".

Postby billford » Fri May 16, 2014 9:12 pm

bruce wrote:In this case, there's only one error report and it's from "lbford"

Yup, that's me- I don't fold using the same name as I post. No ulterior motive, it just happened that way.

Postby bruce » Fri May 16, 2014 9:34 pm

NP.

FAH always retries faulty WUs. Some fail repeatedly (called "bad WU") and some are successfully completed by another system (assumed to be a local stability problem, but the software doesn't have a good way to confirm that).

Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100%, so they do their best to get solid scientific results from whatever equipment donors provide. Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
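
To make the contrast concrete, here is a minimal sketch of the redundancy approach, where a result only counts once enough independent completions agree. The `validate_by_quorum` function, the quorum of 2, the tolerance, and the sample numbers are illustrative assumptions and not any particular project's validator.

```python
import math


def validate_by_quorum(results, quorum=2, rel_tol=1e-6):
    """Return a value that at least `quorum` completions agree on, else None.

    `results` holds one float per completed copy of the same WU.
    """
    for candidate in results:
        matches = [r for r in results if math.isclose(r, candidate, rel_tol=rel_tol)]
        if len(matches) >= quorum:
            return candidate
    return None  # not enough agreement yet; the WU would go out again


# Made-up values purely for illustration.
print(validate_by_quorum([-1523.4071]))                      # None
print(validate_by_quorum([-1523.4071, -1498.0]))             # None
print(validate_by_quorum([-1523.4071, -1498.0, -1523.4071])) # -1523.4071
```

The cost is visible in the last call: three completions were needed to accept one result, which only makes sense if donors outnumber the work.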

Postby billford » Fri May 16, 2014 10:00 pm

bruce wrote:Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100%

So I'm discovering :wink:

It's a pity there isn't a way to find the stability limits of a GPU without exceeding one of them… I half-expected a failure of some sort, the room temperature was nearly 30°C and I wasn't comfortable in there!

I was hoping any error would come up as "Bad state detected", then I could turn the clock down a bit and it would (hopefully) complete the WU. But it went for the nuclear option instead :(

bruce wrote:Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).


I didn't know that. And that is presumably why the client keeps "sanity checking" the results as it goes, so it can (mostly!) back up to the last known good state… it's a more efficient paradigm in my opinion.
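
A minimal sketch of that check-and-roll-back idea, under loud assumptions: the `step` callback, the finite/magnitude sanity test, the checkpoint interval, and the retry limit are all invented for illustration and are not how the FAH core is actually written.

```python
import math


def is_sane(energy, limit=1e6):
    """Assumed sanity test: energy must be finite and not absurdly large."""
    return math.isfinite(energy) and abs(energy) < limit


def run_with_checkpoints(step, n_steps, checkpoint_every=10, max_retries=3):
    """Advance a toy simulation, backing up to the last good state on a bad one.

    `step(state, i)` is a hypothetical callback returning (new_state, energy).
    """
    state, i = 0.0, 0
    saved_state, saved_i = state, i
    retries = 0
    while i < n_steps:
        new_state, energy = step(state, i)
        if not is_sane(energy):
            retries += 1
            if retries > max_retries:
                raise RuntimeError("too many bad states; giving up on this WU")
            state, i = saved_state, saved_i  # back up to last known good state
            continue
        state, i = new_state, i + 1
        if i % checkpoint_every == 0:
            saved_state, saved_i = state, i  # record a new known good state
            retries = 0
    return state


glitches = {15}  # one transient fault at step 15


def flaky_step(state, i):
    if i in glitches:
        glitches.discard(i)
        return state, float("nan")
    return state + 1.0, -100.0


final = run_with_checkpoints(flaky_step, 30)
print(final)  # 30.0
```

A transient glitch costs only the steps since the last checkpoint, while a fault that keeps recurring exhausts the retries and the whole WU is abandoned.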

Postby 7im » Fri May 16, 2014 10:05 pm

Please tell us you are not stability testing on live data!
7im
 
Posts: 14648
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Postby billford » Fri May 16, 2014 10:19 pm

7im wrote:Please tell us you are not stability testing on live data!

If I were testing I'd be running a lot more over-clock to find out where it broke at normal room temperatures… now I know to keep an eye on the room thermometer (and the weather forecast!) and back off a bit at ambients above about 26°C.

I don't cheerfully junk WUs, but I don't see the point in unnecessarily wide safety margins either.

Postby bruce » Sat May 17, 2014 1:35 am

It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy. Tomorrow is supposed to be cooler.

Postby P5-133XL » Sat May 17, 2014 7:11 am

I have 4 computers with 6 folding GPU's in a 150 sq ft room. They use around 2,200 Watts continuously, as measured by a Kill A Watt. The room temp is always 20-30F higher than outside temps. The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPU's less than 80C which is my temp goal.

Postby billford » Sat May 17, 2014 7:34 am

bruce wrote:It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy.

P5-133XL wrote:The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPU's less than 80C which is my temp goal.

I'd guess that a) those conditions aren't particularly unusual where you live and b) you have other priorities than pandering to your GPUs (like earning a living!), so they're set up to cope with anything but the most extreme conditions they're likely to encounter.

For a), where I live they're not too common, though not rare, and b) I'm retired, thus able (and prepared) to run them closer to the limit under normal conditions and adjust when non-normal conditions occur (or look likely).

It helps that I've only got one GPU that needs such attention, and possibly the comparative costs of such kit between the US and UK provide an incentive to squeeze the most out of them :wink:

Everyone has their own way to run their own kit depending on their own particular circumstances… I think we're wandering off topic.

Postby PantherX » Sat May 17, 2014 12:08 pm

billford wrote:I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued...

AFAIK, only a single file is uploaded upon WU completion/failure. IIRC, it would be WUresults.dat (or something similar).

The log file that is displayed in Advanced Control (AKA FAHControl) is only uploaded (with your permission) when you submit a bug report from the Web Control.
PantherX
Site Moderator
 
Posts: 6321
Joined: Wed Dec 23, 2009 9:33 am
