Failed to connect to 171.64.65.99

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed to connect to 171.64.65.99

Post by bruce »

Setting the number of CPUs to 8 (or anything else except -1) overrides the automatic selection process. It should still run with, say, 3 GPUs, but it will be inefficient, since the system cannot provide the resources for 11 concurrent processes that each want (almost) 100% of a CPU.

Forgetting the GPUs for a moment, you could configure two CPU slots, each using 8 CPUs, but all 16 tasks will be competing for the same 8 CPUs, causing each process to run less than 50% of the time. FAH thrives on CPUs that are available close to 100% of the time. It does slow down whenever something else needs the CPU. Browsing the web (for example) when all CPUs are busy with FAH will interrupt a task or two, but it quickly relinquishes the CPU so the total amount of time lost from processing FahCore_* won't add up to enough to matter.
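To put rough numbers on the two cases above, here is a minimal sketch of the arithmetic. It assumes an idealized fair scheduler and that every task wants a full CPU; `cpu_share` is a hypothetical helper for illustration, not anything in the FAH client.

```python
def cpu_share(runnable_tasks: int, cpus: int) -> float:
    """Average fraction of a CPU each task can get.

    Assumes an idealized fair scheduler and tasks that each want 100%
    of one CPU. Real schedulers add overhead, so treat the result as
    an upper bound.
    """
    if runnable_tasks <= 0 or cpus <= 0:
        raise ValueError("task and CPU counts must be positive")
    return min(1.0, cpus / runnable_tasks)


# Two CPU slots of 8 threads each on an 8-CPU machine: 16 tasks, 8 CPUs.
print(cpu_share(16, 8))            # 0.5

# One CPU slot of 8 threads plus 3 GPU slots: 11 processes, 8 CPUs.
print(round(cpu_share(11, 8), 2))  # 0.73
```

Since 0.5 is the best case, context switching pushes the real figure below 50%, which is the point made above.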

The code associated with the CPU slot factors the number of CPUs configured into the x, y, and z directions.

Fourteen is not prime, but 7x2x1 contains the factor 7, which means that in one direction the analysis with 14 is as unreliable as with 7. Nine is not prime either, and 3x3x1 contains no factor greater than 3, so it's much more reliable than 7.
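The factoring idea can be sketched in a few lines. This is only an illustration of the arithmetic, not the FahCore's actual decomposition code; `prime_factors` and `split_xyz` are hypothetical names, and the greedy split is an assumption chosen because it reproduces the 7x2x1 and 3x3x1 examples.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order (with repeats)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def split_xyz(n: int) -> tuple[int, int, int]:
    """Spread n's prime factors over three directions (illustrative only).

    Greedy assumption: hand each prime, largest first, to whichever
    direction currently has the smallest count.
    """
    dims = [1, 1, 1]
    for p in sorted(prime_factors(n), reverse=True):
        i = dims.index(min(dims))
        dims[i] *= p
    return tuple(sorted(dims, reverse=True))


print(split_xyz(14), max(prime_factors(14)))  # (7, 2, 1) 7
print(split_xyz(9), max(prime_factors(9)))    # (3, 3, 1) 3
```

The largest prime factor is the number to watch: 14 inherits the 7, while 9 never needs more than 3 in any direction.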
RABishop
Posts: 73
Joined: Thu May 07, 2015 2:42 am

Re: Failed to connect to 171.64.65.99

Post by RABishop »

Well, I just checked all 4 systems a while ago, and they're all back up and working. Two that ought to have 5 threads on their CPUs only have 4. Two others that ought to have 9 threads are split between 9 and 8 threads. I don't get it. But since I live in the So. Cal. desert, I'm only running 4 right now. When things cool off, I'll add the other two systems back in. I just can't run 6 systems at this time. With all the heat outside and all the heat inside, my A/C is active nearly all day, and a good portion of the night too. All seems well for the moment.
davidcoton
Posts: 1102
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: Failed to connect to 171.64.65.99

Post by davidcoton »

CPU:5 used to work, but there has been a shortage of suitable WUs. CPU:3 (or multiples thereof) has not been reported as a problem. So at the moment the CPU count must have factors of 2 and 3 only (unless someone knows otherwise...). I don't understand how you got systems running at a lower count than your setting -- I didn't think that was possible. Maybe it's a client feature activated by new server code. If that is what's happened, it would be nice to have an announcement so we could watch for any issues. As it is, my rig, which usually runs CPU:5, is now set to CPU:4.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failed to connect to 171.64.65.99

Post by bruce »

davidcoton wrote:I don't understand how you got systems running at a lower count than your setting -- I didn't think that was possible.
If you ask for an assignment with, say, 7 CPUs, and all active projects are prohibited from being assigned to CPU:7, there have been discussions of assigning one that's available for CPU:6. To the best of my knowledge, nothing like that has been done to the servers. The server can't change your CPU setting, so you'd receive a project that's allowed for CPU:6 but you'd still be running it at CPU:7, which is a bad proposition. Whether or not something like that is a good idea, and whether or not it's ever implemented, it would take changes to both the servers and the client for the server to be able to adjust your setting for you.

Code is already built into the client so that if you ask for, say, CPU:13, the FAHCore will override that and adjust it to CPU:12. That rule applies to ALL projects, even in the unlikely case that somebody devised a project that could be completed successfully with 13.
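For anyone who wants to experiment with that kind of override, here is a hypothetical sketch. The real rule inside the client/FAHCore isn't spelled out in this thread, so the threshold `max_prime=7` is purely an assumption, picked so that 13 drops to 12 while 7 and 5 are left alone as described in this post; `adjust_cpu_count` and `largest_prime_factor` are made-up names.

```python
def largest_prime_factor(n: int) -> int:
    """Largest prime factor of n (returns 1 for n <= 1)."""
    largest = 1
    d = 2
    while d * d <= n:
        while n % d == 0:
            largest = d
            n //= d
        d += 1
    return n if n > 1 else largest


def adjust_cpu_count(requested: int, max_prime: int = 7) -> int:
    """Step the thread count down until its largest prime factor is small.

    ASSUMPTION: max_prime=7 is a guess that matches the 13 -> 12 example;
    it is not taken from FAH source code.
    """
    n = requested
    while n > 1 and largest_prime_factor(n) > max_prime:
        n -= 1
    return n


print(adjust_cpu_count(13))  # 12
print(adjust_cpu_count(7))   # 7
```

Lowering `max_prime` to 3 would model the "factors of 2 and 3 only" situation mentioned earlier in the thread, turning 7 into 6 and 5 into 4.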

CPU:7 seemed to be a borderline case. Some projects are almost always successful while others almost always fail. The PI can decide to include or exclude assignments to those machines running with CPU:7 and no local adjustments are made.

This is the first case that I've heard of where CPU:5 is a problem.