Server Assignment Philosophy Improvement Suggestions

Posted: Sun May 28, 2017 6:07 pm
by ifolder
I put this as a separate topic to be sure developers see it...
Aurum wrote:
ifolder wrote:
ifolder wrote:Hi,

FAHClient connects to 171.67.108.45 that keeps on assigning the same work server 171.67.108.105 which doesn't assign any work unit.

Shouldn't 171.67.108.45 detect that same client or IP is asking again for WS and assign another work server instead of dumbly assigning the same one again and again and again??
An even better option would be for the next FAHClient version to report, on its next WS request, that the WS did not assign a WU, along with the IP of the faulty WS, so that a different WS is assigned. And several reports like this should raise an alarm about the faulty WS.
We recommended that about 6 months ago along with other things and still no 7.4.17.
There is also a dumb thing with the Collection Server. The CS seems to be assigned at the same time as the WS, at the beginning of the WU folding. And if the CS is down by the time the WU is finished (besides the WS being down like today), FAHClient will just endlessly loop on the assigned CS.

So besides having FAHClient tell the AS that a WS assignment failed with the server having IP x.x.x.x (so that the AS immediately assigns another WS, as suggested above), FAHClient should request the CS from the AS when the WU is finished (if the WS is down). And of course if the assigned CS is down, FAHClient should also tell the AS that the CS with IP x.x.x.x is down so that the AS assigns another CS.
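To make the suggestion concrete, here is a minimal sketch of the feedback loop in Python. It is a toy model only: the real AS/WS protocol is not public in this thread, so `AssignmentServer`, `report_failure`, `request_work` and the alarm threshold are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AssignmentServer:
    """Toy model of an AS; NOT the real Folding@home protocol."""
    servers: list[str]
    # Failure reports per server IP, as sent back by clients (hypothetical).
    failure_counts: dict[str, int] = field(default_factory=dict)
    alarm_threshold: int = 3

    def assign(self, exclude: set[str]) -> str | None:
        """Return the first server the client has not reported as failed."""
        for ip in self.servers:
            if ip not in exclude:
                return ip
        return None

    def report_failure(self, ip: str) -> None:
        """Record that a client could not get work from (or upload to) ip."""
        self.failure_counts[ip] = self.failure_counts.get(ip, 0) + 1

    def alarms(self) -> list[str]:
        """Servers with enough failure reports to deserve attention."""
        return [ip for ip, n in self.failure_counts.items()
                if n >= self.alarm_threshold]


def request_work(assigner: AssignmentServer,
                 try_server: Callable[[str], bool],
                 max_attempts: int = 5) -> str | None:
    """Client side: ask again, naming the servers that already failed."""
    failed: set[str] = set()
    for _ in range(max_attempts):
        ip = assigner.assign(exclude=failed)
        if ip is None:
            return None  # nothing left to try
        if try_server(ip):
            return ip
        failed.add(ip)
        assigner.report_failure(ip)
    return None
```

The same loop would apply to a CS requested at upload time: the client passes the set of servers it could not reach instead of retrying the one it was given when the WU started.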

And, finally as I suggested in another topic, Credit computation shouldn't take into account the Upload time because there is no reason folders who invested a lot of money in high-end GPUs be penalized by server-side problems or lack of performance.

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Sun May 28, 2017 6:35 pm
by bruce
ifolder wrote:And, finally as I suggested in another topic, Credit computation shouldn't take into account the Upload time because there is no reason folders who invested a lot of money in high-end GPUs be penalized by server-side problems or lack of performance.
The WU is not finished if you haven't returned it yet. Science cannot proceed until the results have actually been received by the server. Some connection problems are on the user side, others are on the server side, and others are the responsibility of the ISPs involved in the connection.

I've been around FAH long enough to remember when we used dial-up modems and if you were talking on your phone, the computer could not get an internet connection ... and when it did, it might be running at 24kb or it might have been a "fast" one at, say, 56kb.

Yes, files were smaller then, but even today, you can spend more money on a faster GPU, but does anybody consider upgrading their internet connection to improve their PPD?

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Sun May 28, 2017 7:37 pm
by Nathan_P
bruce wrote:
Yes, files were smaller then, but even today, you can spend more money on a faster GPU but does anybody consider upgrading their internet connection to improve their PPD.
It's not as relevant now, but back in the days of bigadv I did look at what PPD gain I could get from upgrading my net connection. It took 40 minutes to upload an 8103 WU; upgrading to a connection that halved the upload time was worth 5k PPD per system. I don't think that the current projects generate a big enough results file to warrant the extra cost. Unfortunately the end of BA WUs meant I never got to prove my calculations, and it's now doubly irrelevant as our telco is upgrading everyone to fibre at a minimum of 50/10 mbits.

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Sun May 28, 2017 8:42 pm
by foldy
There may still be a limit in bandwidth on the Stanford server side. If many users - say 100 - try to upload at the same time at a fast 10 mbits speed, then the server upload bandwidth needs to be 100 x 10 mbits = 1000 mbits = 1 gigabit. But if the upload bandwidth of Stanford is only 100 mbits, then each user's upload takes 10 times longer and the user's 10 mbits internet connection does not help, because it is limited to 1 mbits by the server's parallel upload bandwidth limit.

If an average work unit upload package has 25 MB = 200 mbits then it takes 20 seconds on a 10 mbits user upload internet connection.
But if the server parallel upload bandwidth limit is reached then it may take 5 min or more even with a 10 mbits user upload internet connection.
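The arithmetic above can be sketched in a few lines of Python. This assumes the server splits its bandwidth evenly between simultaneous uploads, which is a simplification of mine and not something known about the Stanford servers; the function names are made up for the example.

```python
def effective_upload_mbits(user_mbits: float, server_mbits: float,
                           concurrent_users: int) -> float:
    """Rate one user actually gets: own link or fair share, whichever is lower."""
    fair_share = server_mbits / concurrent_users  # assumes an even split
    return min(user_mbits, fair_share)


def upload_seconds(size_megabytes: float, rate_mbits: float) -> float:
    """Transfer time for a package, using 1 MB = 8 mbits."""
    return size_megabytes * 8 / rate_mbits


# Unconstrained: 25 MB on a 10 mbits link.
print(upload_seconds(25, 10))

# 100 users sharing a 100 mbits server link, each on a 10 mbits connection.
rate = effective_upload_mbits(10, 100, 100)
print(rate, upload_seconds(25, rate))
```

With these numbers the unconstrained upload takes 20 seconds, while the shared case drops to 1 mbits per user and 200 seconds for the same package.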

I searched my FAH log files for upload speeds and found these
It took 1:15 Uploading 4.68MiB to 171.64.65.84
It took 1:22 Uploading 5.04MiB to 171.64.65.84
It took 3:25 Uploading 13.09MiB to 171.67.108.105
It took 3:25 Uploading 13.79MiB to 171.67.108.159

When I calculate the upload mbits per second this is 0.5 mbits/s in all 4 cases while my ISP supports up to 10 mbits/s upload.
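For anyone who wants to repeat the calculation on their own logs, here is a small Python sketch. The sizes and durations are the four entries listed above; `throughput_mbits` is just a helper name for this example, and it treats 1 MiB as 1024 x 1024 bytes and 1 mbit as 1,000,000 bits.

```python
BITS_PER_MIB = 1024 * 1024 * 8


def throughput_mbits(size_mib: float, duration: str) -> float:
    """Average upload rate in mbits/s for a size in MiB and a 'M:SS' duration."""
    minutes, seconds = duration.split(":")
    total_seconds = int(minutes) * 60 + int(seconds)
    return size_mib * BITS_PER_MIB / total_seconds / 1_000_000


# (size in MiB, duration, server) taken from the log lines above.
uploads = [
    (4.68, "1:15", "171.64.65.84"),
    (5.04, "1:22", "171.64.65.84"),
    (13.09, "3:25", "171.67.108.105"),
    (13.79, "3:25", "171.67.108.159"),
]

for size_mib, duration, server in uploads:
    rate = throughput_mbits(size_mib, duration)
    print(f"{server}: {rate:.2f} mbits/s")
```

All four entries come out between roughly 0.5 and 0.6 mbits/s, which matches the figure quoted above.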

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Mon May 29, 2017 1:21 am
by bruce
I'm responsible for my machines up to my router as well as whatever I buy from my ISP (and in my case, that includes lots of advertisements directed at my browsers). At my ISP level, FAH has to share bandwidth with their other customers. At the backbone level, my ISP has to share bandwidth with other ISPs. On Stanford's ISP, I doubt there's enough congestion to matter (Stanford is one of DARPAnet's original founders, so they certainly have a good connection at that level and probably within the campus, too). At the (virtual?) server level, the campus networking folks are responsible for any kind of load-balancing as well as for any reboots. Within a single (virtual?) server, it becomes the responsibility of the PI.

While any one server can have problems, it's not too likely that an individual server's connection will be saturated. You can always look at traceroute (tracert) and draw your own conclusions about where any congestion might be. A Collection Server is almost always provided, giving redundancy for those cases where there is something wrong with uploads on the primary Work Server.

The first several items may change as "net neutrality" evolves, but let's not start a political discussion here. (There are plenty of other places where you can air your opinions on that subject.)

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Mon May 29, 2017 2:12 pm
by foldy
I just wanted to point out that if your ISP upload bandwidth is below 0.5 mbits/s, then upgrading will speed up the FAH uploads; otherwise it will not. This may change in the future.

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Thu Jun 01, 2017 3:09 pm
by ifolder
bruce wrote: A Collection Server is almost always provided giving redundancy for those cases where there is something wrong with uploads on the primary Work Server.
Yes, but this last week, besides massive WS failures, Collection Server failures also happened a lot, as you can see below.

Thus my suggestion for the AS to dynamically assign a Collection Server once the WU is finished and to take into account feedback provided by FAHClient to AS ("Failed connection to WS x.x.x.x / Failed connection to CS x.x.x.x, please assign me ANOTHER one").

11:42:11:WU02:FS00:Uploading 21.88MiB to 140.163.4.245
11:42:11:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
11:42:11:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to create socket

11:42:11:WU02:FS00:Trying to send results to collection server

11:42:11:WU02:FS00:Uploading 21.88MiB to 128.252.203.2
11:42:11:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
11:42:11:ERROR:WU02:FS00:Exception: Failed to create socket
11:42:11:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:10496 run:7 clone:75 gen:66 core:0x21 unit:0x000000578ca304f558895887e566aaf4
11:42:11:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to open './work/02/wuresults_01.dat': Failed to open './work/02/wuresults_01.dat': Too many open files: Too many open files

11:42:11:WU02:FS00:Trying to send results to collection server

11:42:11:ERROR:WU02:FS00:Exception: Failed to open './work/02/wuresults_01.dat': Failed to open './work/02/wuresults_01.dat': Too many open files: Too many open files
11:42:11:WARNING:WU03:FS00:Exception: Could not get IP address for assign-GPU.stanford.edu: System error
11:42:11:ERROR:WU03:FS00:Exception: Could not get an assignment
11:42:11:WARNING:WU03:FS00:Exception: Could not get IP address for assign-GPU.stanford.edu: System error

and so on for 3 hours until I killed FAHClient...

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Thu Jun 01, 2017 3:25 pm
by bruce
I understand your suggestion, but the example you posted doesn't fit with the suggestion. Enabling another CS would not solve that particular problem.
1) Failed to create socket indicates that something inside your own computer's operating system is blocking its ability to connect to the internet. Note that it simply does not say that CS x.x.x.x is refusing the connection because the process never got far enough to communicate with x.x.x.x.
2) Failed to open './work/02/wuresults_01.dat' indicates that the results of the analysis are not on your disk so even with a good path to a server, there's nothing left to upload.

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Fri Jun 02, 2017 7:54 am
by ifolder
OK, as I don't have the sources of FAHClient, I don't know exactly what the error messages stand for, and since a socket is a quadruple (src ip, src port, dest ip, dest port), failing to create a socket could also have been because of the dest part...

But if it's not a CS problem, could it be a FAHClient bug?
Because after successfully processing and sending WUs, FAHClient suddenly can't connect to the WS, can't connect to the CS (maybe because of Linux) and also apparently loses track of the work results? And this happened several times this week...

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Fri Jun 02, 2017 9:16 am
by foldy
You did not use any firewall rules or security software that would block outgoing connection creation?

"Too many open files" and "Failed to create socket" both look like FahClient is out of handles?

If you are on linux what does "ulimit -n" show? I guess 1024 and you could increase that to "ulimit -n 16384" as workaround.

Still looks like a FahClient bug why it needs so many handles and does not close them again.
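The same limit that "ulimit -n" reports can also be read, and raised up to the hard limit, from inside a process. A minimal Python sketch using the standard `resource` module (Unix only) is below; the helper names are mine, and nothing here is known to be how FAHClient handles its limits.

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target: int) -> int:
    """Raise the soft limit toward target, never above the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at least as high as requested
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("soft limit now:", raise_soft_nofile(16384))
```

Raising the limit only buys time if handles really are leaking; it does not fix the leak.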

Re: Server Assignment Philosophy Improvement Suggestions

Posted: Sat Jun 03, 2017 12:34 pm
by ifolder
No I have no firewall blocking outgoing connection creation. Everything worked well and suddenly I got this problem.

ulimit -n shows indeed 1024

It might indeed be that FAHClient doesn't close the handles when the WS doesn't assign a WU. And since last week there were lots of non-assignments of WUs due to the big server problems, the 1024 available handles all got consumed.

Quitting FAHClient and restarting it solved the problem (maybe because the handles were freed by quitting).
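To illustrate the kind of leak suspected here, a short Python sketch of a connection attempt that always releases its handle, even when the connection fails. This is only an illustration of the general pattern; FAHClient's source is not available in this thread, so whether it actually leaks this way is a guess. `try_connect` and `open_fd_count` are hypothetical helper names, and the descriptor count relies on /proc, so it is Linux only.

```python
import os
import socket


def try_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Attempt a TCP connection; the socket is closed on every code path."""
    # The 'with' block closes the socket even if connect() raises.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return True
        except OSError:
            return False


def open_fd_count() -> int:
    """Number of file descriptors open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))
```

If the socket were created without the `with` block and never closed after a failed connect, each retry would keep one more descriptor open, and a process retrying for hours would eventually hit the 1024 limit and report "Too many open files".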