FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 1005
- Joined: Thu May 02, 2013 8:46 pm
- Hardware configuration: Full Time:
2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)
Retired:
3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
- Location: Near Oxford, United Kingdom
- Contact:
FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
I think it probably isn't a bad WU. I've been trying to find the temperature/overclock limits of my GPU, it's (unseasonably) warm here this evening, and I think I've just found one of them.
Error was:
ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)
and I had to reboot to get it folding again!
A question: if (like this one) I get a "bad WU" that is probably my own fault, should I still report it?
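For readers trying to interpret the error line above: in the Khronos OpenCL header, status code -36 is CL_INVALID_COMMAND_QUEUE, which is at least consistent with the GPU dropping out under the driver (though that is an inference, not something the log proves). A minimal sketch of pulling the call name and status out of such a line follows; the helper name `decode_opencl_error` and the small lookup table are illustrative assumptions, not part of the FAH client.

```python
import re

# Small subset of OpenCL status codes, as defined in the Khronos CL/cl.h header.
OPENCL_ERRORS = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}

# Matches "<OpenCL call> (<negative code>)" at the end of a log line.
_PATTERN = re.compile(r"(cl[A-Za-z]+) \((-\d+)\)\s*$")


def decode_opencl_error(log_line):
    """Hypothetical helper: return (call, code, name), or None if no status is found."""
    match = _PATTERN.search(log_line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, OPENCL_ERRORS.get(code, "unknown")


line = "ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)"
print(decode_opencl_error(line))
```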
-
- Posts: 2948
- Joined: Sun Dec 02, 2007 4:36 am
- Hardware configuration: Machine #1:
Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).
Machine #2:
Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.
Machine 3:
Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32
I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
- Location: Salem. OR USA
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
For general release projects, the servers keep track of bad WUs and deal with them automatically, so you do not need to report each and every one. A general release WU typically needs to be reported only if there is a real question, like points not being credited, that has to be dealt with by a human.
If you are beta or alpha testing, the threshold for reporting is much lower and the info supplied needs to be much more detailed, so that patterns in the cause can be detected. There the researcher wants to know of problems with the project quickly rather than relying upon the servers to automatically detect a problem. However, if you know that it was your fault, then even here it need not be reported unless you were trying to test a failure mode where the info might be helpful.
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
That all sounds sensible, thanks
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
Depending on the type of error, the FAH client almost always uploads an error report so that the WU can be reassigned promptly and, after a designated number of failures, removed from circulation. In rare instances, no report is uploaded and the WU remains idle until it times out. In either case, there's not a lot you can do. If you do report it, we can check to see if others have reported problems with the same WU.
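The reassign-then-retire behaviour described above can be sketched roughly as follows. This is only an illustration: the class name `WorkUnitTracker`, the threshold of 3 failures, and the choice to count distinct donors are all assumptions, since the thread does not state how the server actually counts failures.

```python
from collections import defaultdict

MAX_FAILURES = 3  # assumed threshold; the real value is not given in this thread


class WorkUnitTracker:
    """Hypothetical server-side bookkeeping for failed work units."""

    def __init__(self, max_failures=MAX_FAILURES):
        self.max_failures = max_failures
        self.failures = defaultdict(set)  # WU id -> donors that reported an error
        self.retired = set()              # WUs removed from circulation

    def report_failure(self, wu, donor):
        """Record an error report and decide what happens to the WU next."""
        self.failures[wu].add(donor)
        if len(self.failures[wu]) >= self.max_failures:
            self.retired.add(wu)
            return "retired"
        return "reassign"

    def report_success(self, wu):
        """A later donor finished the WU, so earlier failures were probably local."""
        self.failures.pop(wu, None)
        return "completed"


tracker = WorkUnitTracker()
wu = (9101, 136, 1, 31)  # (project, run, clone, gen)
print(tracker.report_failure(wu, "lbford"))  # a single report only triggers reassignment
```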
Posting FAH's log:
How to provide enough info to get helpful support.
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued. That's something to bear in mind if there's no indication of an upload. I assume the "waiting for time-out" status can be manually overridden if desired?
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
Unfortunately there's no tool to override "waiting for timeout". However, if a WU has had a number of failures and the server has not yet removed it from circulation, we can submit a request to have it removed. In this case, there's only one error report and it's from "lbford".
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
bruce wrote:In this case, there's only one error report and it's from "lbford"
Yup, that's me. I don't fold using the same name as I post. No ulterior motive, it just happened that way.
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
NP.
FAH always retries faulty WUs. Some fail repeatedly (called "bad WU") and some are successfully completed by another system (assumed to be a local stability problem, but the software doesn't have a good way to confirm that).
Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100% so they do their best to deal with their need for getting solid scientific results from whatever equipment donors provide. Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
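For contrast, the redundancy approach attributed above to other DC projects (issue the same WU several times and require matching results) might look something like this. It is a generic quorum check written for illustration; the function name `validate_by_quorum` and the default quorum of 2 are assumptions and do not describe any particular project's validator.

```python
from collections import Counter


def validate_by_quorum(results, quorum=2):
    """Return the result reported by at least `quorum` hosts, else None.

    `results` is a list of comparable result digests, one per host that
    returned the same work unit.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


# Two hosts agree and one (perhaps an unstable overclock) disagrees.
print(validate_by_quorum(["a1b2", "a1b2", "ffff"]))
# A single completion is never enough under this scheme.
print(validate_by_quorum(["a1b2"]))
```

The trade-off is the one raised in the post: every WU costs at least `quorum` completions, which only makes sense when there are more donors than work.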
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
bruce wrote:Stanford has done stability testing on consumer grade GPUs and has determined they're pretty good but not 100%
So I'm discovering
It's a pity there isn't a way to find the stability limits of a GPU without exceeding one of them… I half-expected a failure of some sort, the room temperature was nearly 30ºC and I wasn't comfortable in there!
I was hoping any error would come up as "Bad state detected", then I could turn the clock down a bit and it would (hopefully) complete the WU. But it went for the nuclear option instead
bruce wrote:Other DC projects routinely reassign every WU and require multiple completions of the same analysis to match. I guess part of that depends on whether you have more work to be assigned than Donors or you have more Donors than it takes to do all the work twice (plus).
I didn't know that. And, presumably, that's why the client keeps "sanity checking" the results as it goes so it can (mostly!) back up to the last known good state… it's a more efficient paradigm in my opinion.
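The "sanity check and back up to the last known good state" idea can be sketched as a generic checkpoint-and-rollback loop. This is a toy illustration under stated assumptions: the state is a single float, a "bad state" means a non-finite value, and the names `run_with_checkpoints`, `checkpoint_every` and `max_retries` are invented here. It is not how the FAH core is actually implemented.

```python
import math


def run_with_checkpoints(step, state, n_steps, checkpoint_every=10, max_retries=3):
    """Advance `state` by `n_steps`, rolling back to the last checkpoint on a bad state."""
    checkpoint = (0, state)  # (step index, last known good state)
    retries = 0
    i = 0
    while i < n_steps:
        state = step(state, i)
        if not math.isfinite(state):
            # Bad state detected: discard work since the checkpoint and retry.
            retries += 1
            if retries > max_retries:
                raise RuntimeError("too many bad states; giving up on this WU")
            i, state = checkpoint
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, state)
            retries = 0  # progress was made, so reset the retry budget
    return state


# A step function that glitches once at step 15, like a transient GPU error.
glitches = {15}


def flaky_step(x, i):
    if i in glitches:
        glitches.discard(i)
        return float("nan")
    return x + 1.0


print(run_with_checkpoints(flaky_step, 0.0, 30))
```

A transient fault costs only the steps since the last checkpoint, whereas a fault that takes down the driver (as in this thread) never reaches the rollback path at all.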
-
- Posts: 10189
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
Please tell us you are not stability testing on live data!
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
7im wrote:Please tell us you are not stability testing on live data!
If I were testing I'd be running a lot more over-clock to find out where it broke at normal room temperatures… now I know to keep an eye on the room thermometer (and the weather forecast!) and back off a bit at ambients above about 26ºC.
I don't cheerfully junk WUs, but I don't see the point in unnecessarily wide safety margins either.
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy. Tomorrow is supposed to be cooler.
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
I have 4 computers with 6 folding GPUs in a 150 sq ft room. They use around 2,200 continuous Watts as measured by a kill-a-watt. The room temp is always 20-30F higher than outside temps. The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPUs less than 80C, which is my temp goal.
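Since the thread mixes Fahrenheit and Celsius, here is a small worked conversion of the figures quoted in these posts. Note that a temperature difference (the "20-30F higher") converts without the 32-degree offset. The function names are just illustrative.

```python
def f_to_c(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def f_delta_to_c_delta(delta_f):
    """Convert a temperature difference; no 32-degree offset applies."""
    return delta_f * 5 / 9


def kwh_per_day(watts):
    """Energy used in 24 hours at a constant power draw."""
    return watts * 24 / 1000


print(f_to_c(95))                        # the 95F room quoted earlier in the thread
print(round(f_delta_to_c_delta(20), 1))  # lower end of the "20-30F higher" range
print(round(f_delta_to_c_delta(30), 1))  # upper end of the same range
print(kwh_per_day(2200))                 # 2,200 W running continuously
```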
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
bruce wrote:It's currently 95F / 35C in this room and I'm not comfortable either, but my GPUs are happy.
P5-133XL wrote:The past couple of days, outside has peaked in the low 90's F. It is extremely uncomfortable inside. I still do not have any problems keeping the GPU's less than 80C which is my temp goal.
I'd guess that a) those conditions aren't particularly unusual where you live and b) you have other priorities than pandering to your GPUs (like earning a living!), so they're set up to cope with anything but the most extreme conditions they're likely to encounter.
For a), where I live they're not too common, though not rare, and b) I'm retired, thus able (and prepared) to run them closer to the limit under normal conditions and adjust when non-normal conditions occur (or look likely).
It helps that I've only got one GPU that needs such attention, and possibly the comparative costs of such kit between the US and UK provide an incentive to squeeze the most out of them.
Everyone has their own way to run their own kit depending on their own particular circumstances… I think we're wandering off topic.
-
- Site Moderator
- Posts: 7020
- Joined: Wed Dec 23, 2009 9:33 am
- Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB
Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
- Location: Land Of The Long White Cloud
- Contact:
Re: FAULTY project:9101 run:136 clone:1 gen:31 (maybe)
billford wrote:I noticed that an assortment of files were uploaded, including a log file, so I assumed an error report would be in there somewhere so the WU could be re-issued...
AFAIK, only a single file is uploaded upon WU completion/failure. IIRC, it would be WUresults.dat (or something similar).
The log file that is displayed in Advanced Control (AKA FAHControl) is only uploaded (with your permission) when you submit a bug report from the Web Control.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues