SMP RAM ?


Re: SMP RAM ?

Postby bruce » Fri Mar 16, 2012 8:35 pm

prjindigo wrote:I think that means that most of the SMP core loads are running internal to the chip? 30k a day on SMP alone at stock speed is impressive.


Possibly.

Obviously this will depend on the size of the CPU cache compared to the working set of the WU you're assigned. The working set varies mostly with the number of atoms and, to a lesser degree, with the number of threads SMP is configured to use, so it can easily depend on which project you're assigned. If the project resides entirely in cache, it will run faster than if it has to go to main RAM. Every time the GPU task interrupts, it will flush the cache and replace its contents with the newly dispatched program.

Since SMP runs at the speed of the slowest thread, any kind of thread imbalance has a more profound effect than just the time spent servicing the GPU, and the change in the quick-return bonus (QRB) can make that look pretty big.

Is SMP configured to use all of your CPU threads, or have you reduced that number to leave HT resources for the GPU? When there's a GPU, SMP sometimes runs better with fewer threads, within the constraints of the "large prime" considerations, and that has nothing to do with cache.

Re: SMP RAM ?

Postby lightminer » Thu Mar 22, 2012 8:23 pm

We had another thread about configuring the 3930k a while ago, back when I was getting started.

viewtopic.php?f=38&t=20034

You can set the GPU slot to 'low' priority (which is actually 'high' relative to the SMP slot) - that was the advice back then - so it is able to interrupt when it needs to, and I was told it doesn't do that very often. Running this way, I'm not seeing a big difference between GPU on or off in terms of SMP PPD.


A comment on the discussion above: it is very possible that, if you have two good processors, running 2 SMP slots with 'lock cores' will be more efficient than running one big SMP slot. SMP gets much, much slower as you go from 2 to 4, 4 to 8, 8 to 16, etc. cores. I do this stuff at work... So, if I had more time, I would do an 'SMP Scaling Performance Profile', which any of you are free to do; it is easy, it just takes time.

Run a single project on 1 thread and note the TPF, then on 2, then 4, then 8, then the maximum for a single processor. In fact, to make 100% sure you are not crossing QPI or anything, remove your second processor if you have one.

Then add it back in and continue up the ladder, which might be only one more step, since we typically don't build this graph in steps of 4 or something; we do it by doubling.

What you will often see is not a straight line, but one that starts off at 45 degrees (meaning most processes are about twice as fast going from 1 to 2 cores, and I've seen over-unity, i.e. super-linear, speedups quite a few times) and then levels off. This is an area where, 15 years ago, Solaris was great and Windows stunk so badly you can't even imagine it, but the gap has since closed. Windows, 15 years ago, would drop to a 10-degree line between 4 and 8 processors, meaning that if you paid an extra $30k for a 'massive' 8-processor server over a 4-processor one, you only went a tiny bit faster, let's say 20%.

These days it's much better, and with QPI and the like it is great. But remember that when you extend a single SMP process across QPI you take a big performance hit, and it almost always shows up in the graph.

So, what you want to see is so-called 'linear scaling'. A 'true' SMP program (where the threads need information from the other threads between fork-join steps) will not be anywhere near linear. An embarrassingly parallel program (like Photoshop on a GPU, where each pixel needs to be altered but usually doesn't care about the pixels around it, or if it does, it cares about the ones assigned to that same core) will be linear even with a crappy motherboard implementation of socket-to-socket communication.
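
(The textbook way to describe that leveling-off is Amdahl's law; this is a generic model, not anything measured on this software. If a fraction p of the work parallelizes and 1-p stays serial, then in LaTeX notation:

    S(n) = \frac{1}{(1 - p) + p/n}

With p = 0.95, for example, 2 cores give about a 1.9x speedup but 16 cores only about 9.1x, so the curve starts near 45 degrees and flattens exactly as described.)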


So, if we really want to understand how this thing scales, someone needs to do an SMP Scaling Profile: see how close it comes to linear scaling (a 2X jump in performance for each doubling of cores) *and* check for a single inflection point marking QPI efficiency for this program. (The latter is really only measurable if you first remove a processor, because unless the program is socket-aware, an SMP-4 run, for example, could dance all around the place; even if you lock it manually to cores, you have to be 100% sure where those cores are - it could be 3 on one socket and 1 on another.)

Anyway, if someone were really interested, I would love to see the graph. Good ones stay at 45 degrees; bad ones start at 45 degrees and flatten to almost completely horizontal at the last jump in cores. Even if that is the case, don't despair: you would simply find the inflection point on the graph and then run the nearest factor as separate SMP slots, for example 2 slots or maybe 3. I would guess going to 2 would be enough.
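
If anyone does run the profile, turning the TPF numbers into a scaling curve is only a few lines of Python (the TPF values below are made up, just to show the shape):

Code:
    # Turn measured TPF (seconds per frame) at each thread count into
    # speedup and efficiency; efficiency near 100% is the 45-degree line,
    # and the thread count where it drops off is your inflection point.
    tpf = {1: 240.0, 2: 122.0, 4: 63.0, 8: 35.0, 16: 24.0}  # hypothetical data

    base = tpf[1]
    for n in sorted(tpf):
        speedup = base / tpf[n]
        efficiency = speedup / n
        print(f"{n:2d} threads: speedup {speedup:5.2f}x, efficiency {efficiency:.0%}")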

I've mentioned this before, and it was pointed out that you would hurt your quick-return bonus - well, that depends on the flatness of the curve. People double the core count when they need a single response back quicker, and getting it back 20% quicker for a doubling of the cores is often worth it. But when you have a constant process of fulfilling a never-ending bag of problems to solve - basically a steady-state processing environment - you don't want to take that hit. You will get much more done by running 2 slots if the line gets close to horizontal. PPD may say otherwise, but in terms of real work, you will be way ahead of the game.

I tried to look into the PPD calculations to see whether the bonus is linear, because if the quick-return bonus were linear in time, running multiple SMP slots would make more sense; however, there is a square root in it. I tried to model the formula in Excel, but they made it really tricky to figure out, and there is a special 'k factor' that you have to back-calculate from your runs. So I'll just say this: if PPD awarded bonus points linearly for quick returns, then what I'm saying is definitely the way to go, unless the graph is 45 degrees the whole way, or even 35 or above (we are looking for large swings in inefficiency). Since the bonus is not linear, you'd have to model how flat the SMP scaling curve is against that non-linearity in Excel. And remember, that is for maximizing points; in terms of helping the FAH project, much more work would get done by running 2 slots if the curve goes pretty flat, regardless of the PPD calculations.
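
For anyone who wants to play with it outside of Excel, here is roughly what I believe the published bonus formula looks like in Python; the k factor, base points, and deadline below are placeholders, not real project values:

Code:
    # Quick-return bonus as I understand the published formula:
    # credit = base * max(1, sqrt(k * deadline / elapsed)), times in days.
    from math import sqrt

    def wu_credit(base_points, k, deadline_days, elapsed_days):
        return base_points * max(1.0, sqrt(k * deadline_days / elapsed_days))

    def ppd(base_points, k, deadline_days, elapsed_days):
        # points per day = credit per WU times WUs completed per day
        return wu_credit(base_points, k, deadline_days, elapsed_days) / elapsed_days

    # One 8-core slot returning in 0.5 days vs two 4-core slots returning
    # in 0.8 days each (i.e. 8 cores only 1.6x as fast as 4):
    print(ppd(500, 2.0, 6.0, 0.5))      # ~4899 PPD, single fast slot
    print(2 * ppd(500, 2.0, 6.0, 0.8))  # ~4841 PPD, two slower slots combined

With these particular numbers the single fast slot comes out slightly ahead even at only 1.6x scaling; whether it does in general depends on how far the scaling curve flattens.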

Re: SMP RAM ?

Postby 7im » Thu Mar 22, 2012 9:00 pm

This has all been done before. We understand it very well. Forum Search could be your friend, too. ;)

The line is close to 45 degrees and linear up to 32 cores, and close to that up to 64 cores.

And since the bonus points grow faster than linearly as return time shrinks, running double the number of clients with half the core count is NEVER more efficient. And while you assume correctly that this is (close to) a steady-state processing environment, you seem to forget that work units are time-sensitive, and serial in nature. It's not one big giant gumball machine with a random selection of balls going to your client. WU (ball) priority is set by the project, and one generation of completed WUs is needed before the next generation can be sent out, each generation building upon the previous one.

Re: SMP RAM ?

Postby lightminer » Thu Mar 22, 2012 9:06 pm

FWIW, I just turned the GPU off. Previous TPF on SMP-12 (3930k) was around 22.3 seconds, and now it is 20.5 seconds. I am using 'low' on the GPU as indicated above. Turning it back on now...

Re: SMP RAM ?

Postby lightminer » Thu Mar 22, 2012 9:18 pm

Great, thanks for the info, that is interesting/good to hear.

Re: SMP RAM ?

Postby bruce » Fri Mar 23, 2012 3:17 am

lightminer wrote:You can set the GPU slot to 'low' priority (which is actually 'high' relative to the SMP slot) - that was the advice back then - so it is able to interrupt when it needs to, and I was told it doesn't do that very often. Running this way, I'm not seeing a big difference between GPU on or off in terms of SMP PPD.

Obviously you must have an NVidia GPU.

There's a big difference between AMD/ATI and NVidia. The NVidia drivers use almost no CPU time, so what you say is true. The AMD/ATI driver uses about 100% of one CPU. If you run SMP with the default setting of "use all CPUs" together with an AMD GPU, you'll be one CPU short, which will have a very serious impact on SMP performance. You need to leave at least one CPU free for the AMD GPU.
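
From memory, in the v7 client that means giving the SMP slot an explicit CPU count in config.xml instead of letting it claim everything; something like the following (treat the exact tags as an approximation and check your own config.xml for the real syntax):

Code:
    <config>
      <!-- 12-thread machine: leave one thread free for the AMD GPU driver -->
      <slot id='0' type='SMP'>
        <cpus v='11'/>
      </slot>
      <slot id='1' type='GPU'/>
    </config>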

lightminer wrote:A comment on the discussion above: it is very possible that, if you have two good processors, running 2 SMP slots with 'lock cores' will be more efficient than running one big SMP slot. SMP gets much, much slower as you go from 2 to 4, 4 to 8, 8 to 16, etc. cores.

If you have 8 free cores, using all 8 is significantly FASTER than using only 4 (or 2).

It's true that 8 cores will be less than twice as fast as 4 cores, and 4 cores will be less than twice as fast as 2 cores. The bonus plan takes the words "less than twice as fast" in those statements and turns them into "more than twice as many points."
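
You can see why from the shape of the bonus (assuming the square-root formula lightminer mentioned above). Per-WU credit scales like 1/sqrt(t) and WUs per day like 1/t, so in LaTeX notation:

    \mathrm{PPD} \;\propto\; \frac{1}{t}\sqrt{\frac{k\,d}{t}} \;\propto\; t^{-3/2}

A doubling of cores that speeds a WU up by a factor s multiplies PPD by s^{3/2}, so one big slot beats two half-size slots whenever s^{3/2} > 2, i.e. s > 2^{2/3} \approx 1.59, a threshold that near-linear SMP scaling comfortably exceeds.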

Using all of your free cores is the optimum setting, both in terms of points and in terms of minimizing the total time to process a WU. The only thing to watch out for is the word FREE (see the previous discussion). If your machine is idle most of the time, then allocating all cores to SMP + GPU is the best setting. If you run other "heavy" applications continuously, SMP performance will be severely reduced by trying to run N threads on fewer than N independent CPUs.

Re: SMP RAM ?

Postby lightminer » Fri Mar 23, 2012 11:02 pm

Bruce - Yes, NVIDIA.

Bruce - Just for clarity: I was never referring to 'not using all cores'; I was discussing different ways of splitting up the work that would still always use all cores in the end.


Oh, and per 7im's comments, I poked around on the internet for GROMACS scaling figures, and as he suggests (not taking into account anything specific this particular project has done), GROMACS is pretty linear to 32 cores and then starts to bend from 45 degrees toward horizontal after that. So if people have quad-socket machines with, let's say, 64 cores, they would be better off running 2 slots at 32 cores each. That may not be reflected in the points system, and I understand that 7im and Bruce are against this, but nevertheless more work will get done per week than if they run a single slot.

One of the comments I've heard against something like this is 'they need the info quickly', and this confuses me. If a WU gets done in 8 hrs vs 4, are you saying that impacts the group? I see SMP runs of 45 min to 2 hrs, so 1.5 hrs to 4 hrs if done at half speed; surely that doesn't have an impact? If I had a less powerful computer they would come back in those timeframes anyway. I can't imagine 1.5 hrs vs 45 minutes matters, but I'm open to it if it can be explained. On the other hand, bigadv WUs take 3 or 4 days, don't they? If those took 6 - 8 days, I can see how that might be less advantageous, depending of course on how the group uses the information they get back. So maybe you are just talking about bigadv. Depending on the work environment, sometimes someone is literally waiting for results and sometimes they don't look at them for a day or a week. Also, I don't know what gets done with these results; they might be further processed on the project's servers before those results are given to the people doing the science, and those intermediate servers might need the quick response in order to put things back together from so many disparate sources. Maybe that is where the time importance comes in?

Lastly, on those graphs it wasn't clear from the web pages how many socket-to-socket links were involved. I believe the degradation mostly comes from inter-socket communication, not from threading per se (OpenMP vs. more MPI ranks), so it could be that, independent of thread count, a single socket link is acceptable but 2 links degrade, or that 2 are fine and 4 start to degrade significantly. This might interest people with quad-socket systems. Also, QPI is an extremely advanced inter-socket interconnect; AMD folks might see different behavior (the QPI technology was acquired by Intel from one of the supercomputer manufacturers and incorporated only in the last several years, and I think I remember AMD was way ahead on this issue until QPI), and people with pre-QPI Intel systems will certainly see much, much less efficient inter-socket communication.

Re: SMP RAM ?

Postby Jesse_V » Fri Mar 23, 2012 11:17 pm

lightminer wrote:One of the comments I've heard against something like this is 'they need the info quickly', and this confuses me. If a WU gets done in 8 hrs vs 4, are you saying that impacts the group? I see SMP runs of 45 min to 2 hrs, so 1.5 hrs to 4 hrs if done at half speed; surely that doesn't have an impact? If I had a less powerful computer they would come back in those timeframes anyway. I can't imagine 1.5 hrs vs 45 minutes matters, but I'm open to it if it can be explained. On the other hand, bigadv WUs take 3 or 4 days, don't they? If those took 6 - 8 days, I can see how that might be less advantageous, depending of course on how the group uses the information they get back. So maybe you are just talking about bigadv. Depending on the work environment, sometimes someone is literally waiting for results and sometimes they don't look at them for a day or a week. Also, I don't know what gets done with these results; they might be further processed on the project's servers before those results are given to the people doing the science, and those intermediate servers might need the quick response in order to put things back together from so many disparate sources. Maybe that is where the time importance comes in?


Here's the situation as I understand it: priority is key, because most of the time the results don't mean much until everything has come in. This is a DRASTIC difference from most DC projects, say SETI@home for example, where it's just number crunching and there's no interdependence. In many projects on Folding@home, the current Generation (the last number in the PRCG identifier) is built from the previous Generation, so it's linear. And yes, there is a lot of post-collection processing. The Pande lab takes a protein folding simulation and divides it up into a bunch of tiny sub-simulations, which are then all combined together afterwards into things called Markov State Models. Protein folding is parallelized because all these sub-simulations can be run in parallel, but each one has serial elements, and that matters. 1.5 hours vs 45 minutes probably does matter: if there are 900 generations and everyone takes the slower path, that's an extra 45 minutes per generation, or about 675 hours, which is roughly 28 days added to the trajectory. Yeah. See the last couple of paragraphs in this section.

Re: SMP RAM ?

Postby bruce » Sat Mar 24, 2012 12:25 am

To add to what Jesse_V has said: with most OTHER DC projects, the total amount of work completed is an accurate measure of results, and there is no need to spend any effort developing an SMP client. With projects like SETI@home, all WUs have been prerecorded, and the only objective is to process them all, in any order. If all are essentially equivalent in complexity, downloading 64 WUs and completing them whenever they can be completed is just as good as downloading one WU, completing it in ~1/64 as much time, and then proceeding to another one. Total quantity matters, but speed really doesn't.

FAH WUs are not parallel, they're serial. (There are enough parallel trajectories that everybody can simultaneously be working on a different WU.) You can't start Gen 2 until the results of Gen 1 have been completed, because Gen 2 is generated from the end conditions of Gen 1. Suppose a single trajectory takes 10 years of continuous processing on a single core. It can be broken up into 3650 Gens, but only one person can be working on today's assignment, because tomorrow's assignment doesn't exist yet; it's waiting for today's assignment to be completed. If the same trajectory is processed exclusively with 32 SMP threads, today's assignment might take a bit more than 24/32 hours, so maybe ~30 serial WUs will be completed today, and the 10 years of study can be completed in ~4 months. Speed is critical, not quantity.
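
Working those numbers through, in LaTeX notation:

    3650\ \text{Gens} \times \frac{24\ \text{h}}{32} \approx 2738\ \text{h} \approx 114\ \text{days} \approx 4\ \text{months}

(ignoring SMP overhead, which is why each Gen takes "a bit more" than 24/32 hours).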

Re: SMP RAM ?

Postby lightminer » Sun Mar 25, 2012 5:58 am

So each user is basically a node in the cluster, fork-join processes are distributed across all of these nodes, and there may be 3000 to 4000 fork-joins before all is done. That is indeed different from some of the other projects.

Good to know!

Re: SMP RAM ?

Postby bruce » Sun Mar 25, 2012 9:12 am

Not quite.

One SMP WU can be fork-joined into N processes WITHIN YOUR NODE, one for every CPU you provide, but that's not 3000 to 4000; I'm not sure where you got those numbers. If the protein consists of somewhere between 200 and 2,500,000 atoms, it can be divided up into 2 to ~48 separate processes and one time step computed. Then results must be exchanged with the other processes before the next time step can be dispatched. Computationally speaking, the steps are relatively small, but you have to do a lot of them and synchronize the data between every step. That data synchronization moves A LOT of data, so the threads are limited to a single SMP node. You can't fork-join across multiple donors.
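
Here is a toy Python sketch of that per-step fork-join pattern (not the actual FahCore, just the shape of it):

Code:
    # Toy sketch of the per-step fork-join described above: partition the
    # atoms across workers, compute one time step in parallel, then block
    # until every worker reports back before dispatching the next step.
    from multiprocessing import Pool

    def compute_step(chunk):
        # stand-in for the real force/integration work on one partition
        return [x * 0.99 for x in chunk]

    if __name__ == "__main__":
        n_workers = 4
        atoms = [float(i) for i in range(200)]  # toy 200-"atom" system
        chunks = [atoms[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            for step in range(1000):            # the real core runs ~10**6 steps
                # map() forks one task per worker and joins (blocks) until all
                # return -- that join is the per-step data synchronization that
                # keeps the threads confined to a single node
                chunks = pool.map(compute_step, chunks)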

After maybe 10**6 steps (one core-day, in my previous example), the results are returned to the server, and the next Gen is assigned to another node. (That provides a server-based checkpoint and distributes the work to a series of nodes with varying capabilities.) If the node has 24 CPUs, it may take only one node-hour, though it's easy to increase the amount of work in a single WU, so it's unlikely to be something that can be completed in such a short time. The next node proceeds from step 10**6+1 to step 2*10**6, and similarly up to K*10**6.

More powerful nodes are generally assigned work that has more atoms, may get 10**9 steps instead of 10**6, and are rewarded appropriately. A protein with only 200 atoms would never be assigned to as powerful a machine as one with 2,500,000 atoms. Both numbers determine the complexity of a given WU.

Re: SMP RAM ?

Postby lightminer » Tue Mar 27, 2012 6:24 pm

3000 - 4000 -> from your note above :) (3650 Gens). I just meant that the iterative process from start to finish encompasses many serial WUs, per your comments. Clearly there will be many fork-joins at the sub-second level within a WU.

(There can be many levels to spreading and joining the work; within a fork you can fork again, e.g. fork to send data to 10 servers, then fork again for threading (or use OpenMP). I was trying to refer to that higher-level splitting of the work.)

Anyhow, I understand what I set out to understand, and I appreciate you taking the time to explain how the system works!

Thanks
