New Assignment Server feedback/problem

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: New Assignment Server feedback/problem

Post by bruce »

heikosch wrote:When the new AS was activated (just overnight for me!) my GTX 750Ti began to throw errors with P1300x. The Core 0x17 version didn't change, and I didn't change the nVidia driver or install other software or updates.
Maybe they changed not only the AS but, independently, the content of the P1300x WUs. Shortly after that they stopped assigning P1300x to Maxwell GPUs.
I'd question whether all of that is true or not. A given project cannot be changed without disrupting the validity of the science. If drivers and the FahCore version did not change, then either the percentage of WUs failing was higher than you think and you were lucky to not see the failures or maybe you're thinking of some other project that was assigned to use Core_18.

As has already been said, the Assignment Server is responsible for assigning your client to a specific Work Server that can supply work for your system. During the first exposure of the Assignment Server to the world, some projects and/or cores were assigned to systems that could not process them. Those bugs seem to have been eradicated, so Core_17, Core_18 and Core_b0 are now assigned only to systems that can process them. Either your system could process a WU before and (if it's still assigned) still can, or your system had problems with WUs from specific projects before and those projects are no longer being assigned to it.
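
To illustrate the kind of filtering described above, here is a minimal sketch. It is not the actual AS code: the class names, the fields and the idea of a per-project list of excluded GPU families are all assumptions made up for this example. Only the core names and the project labels come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str                                   # label only, e.g. "P9201"
    core: str                                   # e.g. "Core_17"
    excluded_gpu_families: frozenset = frozenset()  # hypothetical exclusion list


@dataclass(frozen=True)
class Client:
    gpu_family: str                             # e.g. "Maxwell" (hypothetical field)
    supported_cores: frozenset                  # cores this system can run


def assignable_projects(client, projects):
    """Keep projects whose core the client supports and whose
    exclusion list does not name the client's GPU family."""
    return [
        p for p in projects
        if p.core in client.supported_cores
        and client.gpu_family not in p.excluded_gpu_families
    ]


# Hypothetical configuration: P1300x is withheld from Maxwell, P9201 is not.
projects = [
    Project("P1300x", "Core_17", frozenset({"Maxwell"})),
    Project("P9201", "Core_17"),
]
maxwell = Client("Maxwell", frozenset({"Core_17"}))
kepler = Client("Kepler", frozenset({"Core_17"}))

maxwell_names = [p.name for p in assignable_projects(maxwell, projects)]
kepler_names = [p.name for p in assignable_projects(kepler, projects)]
```

With this made-up configuration the Maxwell client is only offered P9201, while the Kepler client is offered both projects.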
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: New Assignment Server feedback/problem

Post by JimF »

It certainly happened to me at the same time. I was surprised that PG did not catch it themselves then, and even more surprised that it is questioned now. It may be a coincidence, but that is another question.
viewtopic.php?f=18&t=26807&p=269497&hilit=GTX#p269497
viewtopic.php?f=80&t=25887&start=135
heikosch
Posts: 110
Joined: Thu Apr 30, 2009 7:31 pm
Hardware configuration: i7-3930K@4.1GHz
GTX680@1.275GHz

Q9300@2.4GHz
GTX460@800MHz
Location: Essen, Germany

Re: New Assignment Server feedback/problem

Post by heikosch »

OK, I looked it up in my log files. I am referring to the "Force RMSE error", which occurred in a lot of different projects, not only P1300x.
The first occurrence in my logs is at 2014-10-02T23:38:09Z, which is in fact after the new AS was activated. But Prof. Pande announced "Upgraded Maxwell support for Core17" that day. Maybe the changes for Maxwell support and the assignment configuration errors got mixed up.

Heiko
Breach
Posts: 205
Joined: Sat Mar 09, 2013 8:07 pm
Location: Brussels, Belgium

Re: New Assignment Server feedback/problem

Post by Breach »

heikosch wrote:OK, I looked it up in my log files. I am referring to the "Force RMSE error", which occurred in a lot of different projects, not only P1300x.
The first occurrence in my logs is at 2014-10-02T23:38:09Z, which is in fact after the new AS was activated. But Prof. Pande announced "Upgraded Maxwell support for Core17" that day. Maybe the changes for Maxwell support and the assignment configuration errors got mixed up.

Heiko
AFAIK, Maxwell is currently still "partially supported", in the sense that both current core 17 versions, 0.0.52 and 0.0.55, work just fine when folding WUs from some projects, e.g. P9201/9202/7814 (non-exhaustive list). With other projects (e.g. 1300x) the core apparently calls a function which is "buggy" (I guess at the level of the nvidia drivers/hardware?), hence the RMSE problems. Although this was reported as a problem months ago, it only became more apparent when the new AS came online, as it was initially giving out far more "problematic" projects to Maxwells than before. This issue has been discussed at length already, and the "solution" was indeed not to assign WUs from known problematic projects to Maxwells. AFAIK this is still the case today, so you shouldn't be seeing these errors now (and if you do, please report it). This will hopefully change soon, because all "newer" core 17 projects apparently use the problematic function, and as much as I have learned to enjoy my fan at 100% when folding Core 15 WUs, it's getting a tad old and 9201 will end at some point. I guess we're already running low on 9201 WUs, and no, I have no idea why we're not getting WUs from P9202 and P7814: either the projects are over or they have been rightly/mistakenly blacklisted for Maxwells(?). By the way, this problem has been worked around in core 18 on beta at a (huge) PPD expense; hopefully a more permanent solution will be found soon.
Windows 11 x64 / 5800X@5Ghz / 32GB DDR4 3800 CL14 / 4090 FE / Creative Titanium HD / Sennheiser 650 / PSU Corsair AX1200i
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: New Assignment Server feedback/problem

Post by bruce »

JimF wrote:It certainly happened to me at the same time. I was surprised that PG did not catch it themselves then, and even more surprised that it is questioned now. It may be a coincidence, but that is another question.
viewtopic.php?f=18&t=26807&p=269497&hilit=GTX#p269497
viewtopic.php?f=80&t=25887&start=135
I think what happened was that when the AS code wasn't working right, everything got an increased degree of scrutiny. The "Force RMSE error" problem was already a low-frequency problem, but once they recognized it, they set out to fix it, and by not assigning those projects to certain GPUs, the success rate went up.

I'm certain that some developers, somewhere, are examining the trade-offs and in time, there will be a permanent fix for the problem.
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: New Assignment Server feedback/problem

Post by JimF »

Great, that is all I, and probably a lot of others, need to know.
billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

Re: New Assignment Server feedback/problem

Post by billford »

bruce wrote:The "Force RMSE error" problem was already a low-frequency problem, but once they recognized it, they set out to fix it, and by not assigning those projects to certain GPUs, the success rate went up.
That certainly makes sense, but I wonder: did the overall rate of completed WUs (i.e. the amount of science getting done in a given time) also go up? It's quite possible to get a higher overall return (not quite the same thing as success rate) by accepting a certain failure rate in some areas, especially when the "unreliable" platform is much faster than the others.
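
To make the arithmetic concrete, here is a minimal sketch with made-up numbers. None of these figures come from PG; the function name is hypothetical, and it assumes that a failed WU costs as much time as a completed one.

```python
def completed_per_day(attempts_per_day: float, failure_rate: float) -> float:
    """Expected number of WUs completed per day, assuming every attempt
    takes the same time whether it succeeds or fails."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return attempts_per_day * (1.0 - failure_rate)


# Illustrative figures only: a fast platform that fails 20% of its WUs
# versus a slower platform that never fails.
fast_but_flaky = completed_per_day(attempts_per_day=10.0, failure_rate=0.20)
slow_but_reliable = completed_per_day(attempts_per_day=6.0, failure_rate=0.0)

print(f"fast but flaky:    {fast_but_flaky:.1f} WUs/day")
print(f"slow but reliable: {slow_but_reliable:.1f} WUs/day")
```

With these invented numbers the flaky platform has the lower success rate (80% against 100%) but still returns more completed WUs per day, which is the distinction between success rate and overall return.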

In the matter under discussion this will certainly be true for Linux clients when P9201 goes EOL, and may even still be true when the slower Core_18 comes online.

Certainly some donors won't like it, but if the "marginal" WUs were set to advanced for that platform (or even beta, with the acceptance that no support would be given) they would have the choice, and setting the advanced flag implies acceptance of a certain degree of risk anyway.

(And setting the beta flag virtually guarantees it!)

It may well be that PG went through these calculations and the results pointed to the same decision, I'm just curious. And also fairly certain that my curiosity won't be satisfied :wink:
bruce wrote:I'm certain that some developers, somewhere, are examining the trade-offs and in time, there will be a permanent fix for the problem.
The operative words there are "in time" :(
Kjetil
Posts: 178
Joined: Sat Apr 14, 2012 5:56 pm
Location: Stavanger Norway

Re: New Assignment Server feedback/problem

Post by Kjetil »

My 5 970s and 6 980s are off, sorry.
heikosch
Posts: 110
Joined: Thu Apr 30, 2009 7:31 pm
Hardware configuration: i7-3930K@4.1GHz
GTX680@1.275GHz

Q9300@2.4GHz
GTX460@800MHz
Location: Essen, Germany

Re: New Assignment Server feedback/problem

Post by heikosch »

Breach wrote:According to that: 171.67.108.52 is 'full' (in full operation, should be giving out WUs), but is then marked as 'Blue' ("Blue - if the AS has decided not to assign to that machine, eg. the AS thinks it is down or out of jobs (blue means iced)"). The WU stats for this WS are null - guess it's either out of work or there's another reason the AS considers it not available...
Over the last few days I watched the server status of 171.67.108.52, and my computer got WUs from 171.67.108.52 regardless of the blue color in the "% Ass" column and the numbers in the 3 columns in front of it.

=> I've no idea how to interpret the server status correctly.

Maybe the server status changes so often that the log is always outdated.

Heiko