New Assignment Server feedback/problem

Moderators: Site Moderators, FAHC Science Team

Breach
Posts: 205
Joined: Sat Mar 09, 2013 8:07 pm
Location: Brussels, Belgium

Re: New Assignment Server feedback/problem

Post by Breach »

As the cores themselves didn't change overnight, I am guessing that the new AS server is now giving out projects/WUs to Maxwell cards which it shouldn't. Your GPUs are not trying to complete anything - the process has hung. You either get a bad unit failure or a core hang.
Windows 11 x64 / 5800X@5Ghz / 32GB DDR4 3800 CL14 / 4090 FE / Creative Titanium HD / Sennheiser 650 / PSU Corsair AX1200i
kookykrazee
Posts: 47
Joined: Sun May 16, 2010 1:44 am

Re: New Assignment Server feedback/problem

Post by kookykrazee »

I have had the 2nd pair fail/not complete, and been assigned 2 more, unfortunately. Is there any way to stop these from showing up until this is resolved (probably Monday at this point, right)?
widsss
Posts: 16
Joined: Sun Oct 28, 2012 3:00 pm

Re: New Assignment Server feedback/problem

Post by widsss »

I've had nothing but Bad Work Units for a few hours from projects 9406, 13000 and 13001.
Kjetil
Posts: 178
Joined: Sat Apr 14, 2012 5:56 pm
Location: Stavanger Norway

Re: New Assignment Server feedback/problem

Post by Kjetil »

Yes, you are not the only one. 6 Maxwell cards doing nada: 3x 750 Ti and 3x 980.
PS3EdOlkkola
Posts: 184
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: New Assignment Server feedback/problem

Post by PS3EdOlkkola »

Similar issues here: I got a series of 9406 WUs on a new 980 Maxwell GPU that immediately failed. I've disabled the slot until a fix is apparent. The same 980 completed about two dozen Core 15 work units before the slot started failing on Core 17 WUs. Same failure mode every time:

22:46:33:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 617.919 with threshold of 5
22:46:33:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)

980 GPU is under-clocked by 5% from stock speeds.
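
For anyone who wants to count these failures across a longer log, here is a minimal Python sketch. It assumes every log line follows the same layout as the two lines quoted above; the function name `summarize` and the regular expressions are illustrative only and are not part of the FAH client.

```python
import re
from collections import Counter

# Matches lines such as:
# 22:46:33:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
# The layout is assumed from the two log lines quoted in this post.
RETURN_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = (?P<hex>0x[0-9a-fA-F]+)\)"
)

# Matches the core's error message, e.g.
# "Force RMSE error of 617.919 with threshold of 5"
RMSE_RE = re.compile(
    r"Force RMSE error of (?P<rmse>[\d.]+) with threshold of (?P<threshold>[\d.]+)"
)


def summarize(lines):
    """Count BAD_WORK_UNIT returns per folding slot and collect RMSE readings."""
    failures = Counter()
    rmse_readings = []
    for line in lines:
        returned = RETURN_RE.search(line)
        if returned and returned.group("status") == "BAD_WORK_UNIT":
            failures[returned.group("slot")] += 1
        rmse = RMSE_RE.search(line)
        if rmse:
            rmse_readings.append(
                (float(rmse.group("rmse")), float(rmse.group("threshold")))
            )
    return failures, rmse_readings


sample = [
    "22:46:33:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 617.919 with threshold of 5",
    "22:46:33:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]
failures, rmse_readings = summarize(sample)
print(dict(failures))
print(rmse_readings)
```

To run it against a real log, pass the open file object instead of `sample`; the path to the log depends on your install, so none is assumed here.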
Hardware config viewtopic.php?f=66&t=17997&p=277235#p277235
kookykrazee
Posts: 47
Joined: Sun May 16, 2010 1:44 am

Re: New Assignment Server feedback/problem

Post by kookykrazee »

Curious, why are you underclocked?
PS3EdOlkkola
Posts: 184
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: New Assignment Server feedback/problem

Post by PS3EdOlkkola »

I under-clock all my GPUs.

AMD-based GPUs run more stably when under-clocked, and they process and upload WUs to the collection server more reliably. The under-clock also improves their longevity by reducing heat. I pack a lot of AMD GPUs together (6 GPUs in a highly modified 4U server case, with two systems configured this way), so managing heat is an important issue at that density. They all run under 80 deg C in those enclosures when slightly under-clocked. Also, even though the power supply is spec'd at 1,500 watts, the under-clock keeps power draw at 1,200 watts, which helps preserve the life of the power supply.

For Nvidia, the primary reason is heat and longevity, since they tend to process and upload more reliably than the AMD GPUs (at least for me).

I have a planned life-cycle of 3 years for each GPU, at which point they are retired. Everyone has different priorities, but for me that balances a decent life-cycle investment with the time and energy it takes to maintain older hardware. After 3 years, it's time to give the GPU away or trash it and move on to newer hardware. Under-clocking virtually guarantees they make the 3 year time horizon for replacement.

That policy is also in effect for motherboards. I just decommissioned 2 AMD FX8350-based motherboards. A PCIe slot failed in one, and neither could support the PCIe 3.0 spec that's needed for optimal performance (on a PPD basis) with the highest-end Nvidia and AMD GPUs (both MBs were replaced with i7-4790K-based MBs). Next weekend, I'm decommissioning a 3rd AMD FX8350 motherboard and replacing it with an i7-5960X-based motherboard, retiring two AMD 7970s and replacing them with two Maxwell GTX 980s. With the EVGA Classified X99 MB (socket 2011-v3), I will be able to add a third 980 to that configuration at a later date.

I've wasted way too many hours trying to keep older hardware running reliably. Admittedly, it's a challenge to see if you can extract the last bit of life from a piece of ancient hardware, and even more interesting is how much I've learned about troubleshooting hardware issues, but I have only so much time to spend on my "FAH hobby", so I have to optimize around known-good, contemporary hardware.
kookykrazee
Posts: 47
Joined: Sun May 16, 2010 1:44 am

Re: New Assignment Server feedback/problem

Post by kookykrazee »

That makes sense, I was curious. I still wish the 9406s would stop coming...until they are fixed or such...
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: New Assignment Server feedback/problem

Post by VijayPande »

We've been asked by donors to give Maxwell GPUs Core17 & Core18 work, and we've done so with the adv setting (please see the previous blog post). You can opt out by removing the adv setting.

Sounds like Maxwell GPUs, even with the latest drivers, aren't ready, and/or we need to see what we can do on the core side.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
kookykrazee
Posts: 47
Joined: Sun May 16, 2010 1:44 am

Re: New Assignment Server feedback/problem

Post by kookykrazee »

Will it help at all to downgrade the drivers?
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: New Assignment Server feedback/problem

Post by VijayPande »

We had reports that the newest drivers worked (and they were looking good in our testing as well). We're going to do two things:
1) Revert to not giving Maxwell GPUs the latest cores.
2) Yutong (aka Proteineer) has a plan for upgrading the cores to work around the driver issue and will get on that on Monday, if not sooner.
kookykrazee
Posts: 47
Joined: Sun May 16, 2010 1:44 am

Re: New Assignment Server feedback/problem

Post by kookykrazee »

Sounds great, thanks for the update.
Mstenholm
Posts: 84
Joined: Fri Oct 22, 2010 10:17 pm
Hardware configuration: 4 x GTX 970. Win 7.

Re: New Assignment Server feedback/problem

Post by Mstenholm »

I had ten 9406 WUs fail on my brand-new 970, all with the Force RMSE error. No OC, 344.16 driver, latest client (7.4.4). I removed FAH and did a fresh install, but the first 9406 failed as well.
Breach
Posts: 205
Joined: Sat Mar 09, 2013 8:07 pm
Location: Brussels, Belgium

Re: New Assignment Server feedback/problem

Post by Breach »

@VijayPande, thanks, waiting for the revert. My experience: at the moment, fah, advanced and beta all get 0x17 WUs, which all fail, and from what I saw last week, some (but not all) 0x18 WUs insta-fail too.

What I don't understand is how a few months ago Maxwell was apparently folding Core 17 WUs and now it's not:
viewtopic.php?f=80&t=25887&start=120
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: New Assignment Server feedback/problem

Post by JimF »

I don't understand the need for the "advanced settings". I was getting Core 17 fine on all six of my GTX 750 Ti's (without the advanced tag) until a couple of days ago, when we were asked to use it. So I did, and have been getting only failures ever since.

My latest log is in this post.
viewtopic.php?f=80&t=25887&start=135