CPU Architecture and FAH

Post by **toTOW** » Fri Oct 17, 2008 10:30 am

I originally posted this detailed information in a specific thread, but some people asked me to create a dedicated thread to discuss CPU architecture and how FAH benefits from the different parts of the CPU. This thread can also be used to discuss how stability and heat generation are affected by the use of the different parts of the CPU.

A CPU can be divided into different subparts that don't perform the same operations ... here are the most common ones for a modern CPU (for example a Core 2, an Athlon X2 or a Phenom X4) ... keep in mind that there are usually several units of each type in a single CPU :

- ALU (Arithmetic and Logic Unit) : the main job of this unit is to do basic arithmetic and logic operations, like addition, comparison, shifting, boolean operators, ... this is the oldest unit that exists in a CPU and it can only work on integers. This unit is not used a lot in FAH, but it is used in cryptography or compression algorithms.
- FPU (Floating Point Unit) : this unit was first available as a separate coprocessor (for example the 287 for the 286 and the 387 for the 386). It has been integrated into the CPU since the 486DX. This unit does more advanced math operations like floating point multiplication, division, powers, ... It works with floating point numbers (the usual way computers represent real numbers), and can use single or double precision. This unit is used in many applications, like games (3D rendering) or multimedia applications for example. In FAH, this unit was used by the Tinker core, and is now used in the Amber core, or in Gromacs when the message (using standard loops) is printed.

The following units have been added as improvements to the two basic units :
- MMX (MultiMedia eXtensions) : this unit was added by Intel to the Pentium core to speed up multimedia applications. This is an extension to the ALU, and it can only work on integers too. This extension is rarely useful, since multimedia applications mostly use floating point operations ... This unit can speed up file compression or cryptography operations, but it's usually not used by FAH.
- SSE (Streaming SIMD Extensions) : this unit was first implemented in the Pentium 3 CPU. This is one of the most interesting units in a CPU : it is the first unit that is able to apply one instruction to several floating point values at the same time (SIMD : Single Instruction, Multiple Data). This unit works with floating point numbers (extension to the FPU), so it is used by many applications (multimedia, games, computing, ...). In FAH, this unit is massively used by the Gromacs core and its variants (Gromacs, Gromacs33, GroST, GroSMP), and is signaled with the message "Extra SSE boost OK". SSE instructions can only be used in single precision calculations. AMD offers 3DNow! in its Athlon CPUs as its own counterpart to the SSE instructions (all current AMD processors support both SSE and 3DNow!).
- SSE2 : these are additional instructions added to the original SSE (in Pentium 4 CPUs) to speed up calculations in double precision. They apply to the same type of jobs (games, multimedia, computing, ...) as SSE. These instructions are used by the Double Gromacs core in FAH, and its variants (Dgromacs, Dgromacs B and Dgromacs C).

In addition to the processing units, you need to understand how a CPU gets its data from memory. There are usually 3 "levels" of memory :

- Level 1 cache
- Level 2 cache
- System memory

When the CPU is working on small data that fits in the L1 cache, there are no accesses to the other "memories". As the data size grows, it will start using the L2 cache, and then system memory.

Now we can talk about power consumption and stability issues. The worst case is of course when a lot of processing units are used, with a lot of data to move between CPU and memory. Here are some examples, with FAH cores and WU :

- Amber core : it's the lightest load we can find in FAH as it only uses the ALU and FPU. These units are usually small, and it won't stress the caches much either.
- Gromacs (Dgromacs) core : it's the hardest thing to do for the CPU. It uses the ALU, FPU and SSE (SSE2). If you have opted for BigWU, it will also stress the caches and memory.
- Gromacs33 is like Gromacs, but with newer code; it tends to be more optimized and more stressful.
- GroSMP is a bit different : it doesn't stress the CPU as hard as regular Gromacs because processing power is limited by data transfers between CPU cores, but as you can guess, it will stress the caches and memory subsystem a lot. The A2 SMP core is progressively changing the rule as it makes better use of the CPU cores ... So we can say this is one of the "worst" cases, using ALU, FPU, SSE, caches and system memory.

Tell me if there is something wrong or anything that you don't understand.

Post by **Simmol** » Tue Oct 21, 2008 11:19 am

Hello there,

I have noticed that the performance and power consumption of my dual system setup depend heavily on the core that is being used by FAH. For example:

project 2605 (core a1) 6260 ppd @ 182 Watt
project 2668 (core a2) 10770 ppd @ 208 Watt

Does this mean that the a2-core uses the Quad-core CPU more efficiently than the a1-core?
Is this statement also true for a dual-core CPU?

System specs:
- 2x Q6600 @ 2.88 GHz (320x9)
- mem @ 800 MHz
- Notfred's diskless folding suite on USB-drive.

Thank you for the response.

Post by **toTOW** » Tue Oct 21, 2008 11:41 am

Simmol wrote:Does this mean that the a2-core uses the Quad-core CPU more efficiently than the a1-core?

This is exactly what happens : if you look at your CPU usage, you'll see that the A2 core uses almost 95% of all available cores, whereas the A1 core only uses 70% (the numbers might vary depending on the project).

The A2 core also scales much better than the A1 core when you add more CPU cores.
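
As a rough check, the figures Simmol quoted can be turned into points per day per watt. This is only a sketch in Python: the PPD and wattage values come from the post above, while the function and variable names are made up for illustration.

```python
# Figures quoted in Simmol's post (dual Q6600 system): (PPD, watts).
measurements = {
    "project 2605 (core a1)": (6260, 182),
    "project 2668 (core a2)": (10770, 208),
}


def ppd_per_watt(ppd, watts):
    """Points per day produced for each watt of power drawn."""
    return ppd / watts


efficiency = {name: ppd_per_watt(*m) for name, m in measurements.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.1f} PPD/W")

a1 = efficiency["project 2605 (core a1)"]
a2 = efficiency["project 2668 (core a2)"]
print(f"a2 delivers {a2 / a1:.2f}x the PPD per watt of a1 on this machine")
```

On these two data points the a2 project comes out at about 51.8 PPD/W against about 34.4 PPD/W for a1, so roughly 1.5x the work per watt for about 14% more power. Two projects on one machine are of course not a benchmark.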

Post by **alpha754293** » Wed Apr 15, 2009 5:42 am

Good page. Lots of good info.

A small, minor correction:
The use of the L1 and L2 caches isn't determined by the data size. It actually has more to do with "preparing" the CPU to do the calculation or task (be it ALU or FPU) than with the actual "size" of the task.

System memory is where data size matters. (You get into things like direct vs. iterative solving, stuff like that.) I won't get into that, but suffice it to say that system memory (RAM) is the cake, L2 is a slice, and L1 is the piece on your fork ready/waiting to be eaten. (only that you can play the tape in reverse as the process goes backwards and forwards.)

I think that this should be made into a general page.

(Optional):
Also note that ALU performance is measured in (millions of) instructions per second ((M)IPS).
FPU performance is measured in millions (mega) of floating point operations per second (MFLOPS) or billions (giga) of floating point operations per second (GFLOPS). This is the super crucial, super critical number to watch out for.

Post by **MtM** » Wed Apr 15, 2009 7:40 am

toTOW wrote:- Level 1 cache
- Level 2 cache
- System memory

When the CPU is working on small data that fits in the L1 cache, there are no accesses to the other "memories". As the data size grows, it will start using the L2 cache, and then system memory.

As hinted by alpha, this is incorrect (but so is his explanation, imo, so I'll give it a shot).

It is the size of the cache that matters a lot (and its throughput!). The reason is that when the ALUs and FPUs request data which is in the L1 cache, it saves a lot of time otherwise spent waiting on the sluggish L2 and the horribly slow system RAM roundtrips. I say sluggish and horribly with envy btw, I wish my memory had the latency of even the slowest PC memory right now. This is also the main reason why you will notice F@H running when you have just started another program (talking about the CPU client): the cache is filled with data relevant to the executing threads, and more cache misses will occur in new threads (until the cache is synchronised with the current load).
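
To make the hit/miss idea concrete, here is a toy simulation in Python. It is only a sketch under loud assumptions: a fully associative cache with least-recently-used replacement and a program that loops over its data in order. Real CPU caches work on cache lines, are set-associative and prefetch, so the numbers are illustrative only. The name `hit_rate` and its parameters are made up for this example.

```python
from collections import OrderedDict


def hit_rate(cache_lines, working_set, passes=10):
    """Fraction of accesses served by a toy LRU cache.

    The simulated program walks over `working_set` distinct addresses,
    in order, `passes` times. `cache_lines` is how many addresses fit.
    """
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for addr in range(working_set):
            accesses += 1
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)        # mark as most recently used
            else:
                cache[addr] = True             # miss: fetch from the next level
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / accesses


small = hit_rate(cache_lines=64, working_set=48)  # data fits in the cache
large = hit_rate(cache_lines=64, working_set=96)  # data is bigger than the cache
print(f"working set fits:  {small:.0%} hits")
print(f"working set spills: {large:.0%} hits")
```

When the data fits, only the first pass misses (90% hits over ten passes). When it doesn't, this particular access pattern evicts every address just before it is needed again, so every access misses. That is the extreme case, but it shows why a second program competing for the cache slows both down.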

Post by **FaaR** » Sun Jun 14, 2009 10:56 pm

Typically/generally speaking:
Level 1 cache is optimized for latency (ie, fast access time.)
Level 2 cache is optimized for bandwidth (ie, data transfer rate.)
Level 3 cache (if applicable) is optimized for storage capacity.
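
One way to see why latency matters most at the first level is the usual "average memory access time" estimate. The sketch below is Python, and every latency and hit rate in it is a made-up illustrative figure, not a measurement of any real CPU. The function name `average_access_time` is also hypothetical.

```python
def average_access_time(levels, memory_cycles):
    """Average cycles per access for a cache hierarchy.

    levels: list of (latency_cycles, hit_rate) tuples, from L1 outward.
    memory_cycles: latency of system memory, which always "hits".
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, rate in levels:
        total += reach * latency
        reach *= 1.0 - rate  # only the misses go on to the next level
    return total + reach * memory_cycles


# Hypothetical figures: L1 = 3 cycles, L2 = 15 cycles, RAM = 200 cycles.
good_l1 = average_access_time([(3, 0.95), (15, 0.90)], memory_cycles=200)
poor_l1 = average_access_time([(3, 0.80), (15, 0.90)], memory_cycles=200)
print(f"95% L1 hits: {good_l1:.2f} cycles per access")
print(f"80% L1 hits: {poor_l1:.2f} cycles per access")
```

With these assumed numbers, dropping the L1 hit rate from 95% to 80% roughly doubles the average cost of a memory access, even though the L2 and RAM figures are unchanged.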

Also, maybe nitpicking, but MMX and SSE aren't actual processor UNITS.

They're instruction set extensions, and curiously MMX actually uses FPU registers, even though it can only work on integers. That is because it was easier (and cheaper) for Intel to recycle the big FPU registers for SIMD use than to introduce all-new registers, which would also have broken software compatibility with existing OSes I might add. Existing task schedulers would not know of those new registers, and that would have introduced issues.

I do believe, though, that Intel did create new registers for SSE (the 128-bit XMM registers). Don't quote me on that; it's been ages since I last read up on the subject.

Post by **shatteredsilicon** » Sun Jun 14, 2009 11:41 pm

toTOW wrote:- GroSMP is a bit different : it doesn't stress the CPU as hard as regular Gromacs because processing power is limited by data transfers between CPU cores

This affects quad Core2s particularly badly, among other things because the two pairs of cores are completely separate, with no shared caches between them (separate silicon dies). Dual Core2s, Phenoms and Nehalems are not as badly affected because their inter-core latencies are much lower: all their cores are on the same die and have access to each other's caches with negligible penalty.

Post by **bruce** » Mon Jun 15, 2009 7:54 pm

FaaR wrote:Also, maybe nitpicking, but MMX and SSE aren't actual processor UNITS. They're instruction set extensions, and curiously MMX actually uses FPU registers, even though it can only work on integers. That is because it was easier (and cheaper) for Intel to recycle the big FPU registers for SIMD use than to introduce all-new registers, which would also have broken software compatibility with existing OSes I might add. Existing task schedulers would not know of those new registers, and that would have introduced issues.

True, and you're right, it's nitpicking, but let's not let that distract us from the original question.

FAH does not use MMX so you can ignore it, for the purposes of this discussion.

FAH does make very heavy use of SSE. SSE is designed to process four floating point operations simultaneously, and in real code (considering that there are still some non-SSE instructions) it generally averages above 3.5. This is probably why FAH loads the FPU more heavily than any other software that I'm familiar with. That's also why the typical overclocker who relies on traditional benchmarks to establish system stability often finds that their system isn't stable when running FAH. (They need to use stresscpu2, which is based on the optimized code in GROMACS.)

A small fraction of the FAH WUs use SSE2 which is similar to SSE except that it's used for double precision operations. The code is optimized for two simultaneous double precision floats compared to the four single precision floats for SSE.
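
The "four singles versus two doubles" point follows from the register width: an SSE register is 128 bits wide, a single precision float takes 32 bits and a double takes 64. The Python sketch below only emulates the idea with lists (Python does not issue SSE instructions for this), and the names `lanes` and `packed_add` are made up for illustration.

```python
import struct

REGISTER_BITS = 128  # width of one SSE register


def lanes(element_bits):
    """How many values of the given width fit in one register."""
    return REGISTER_BITS // element_bits


def packed_add(a, b):
    """Emulate one packed add: a single 'instruction' sums every lane."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


singles = lanes(32)  # single precision floats per register
doubles = lanes(64)  # double precision floats per register

# Both layouts occupy the same 16 bytes.
assert struct.calcsize("<4f") == struct.calcsize("<2d") == REGISTER_BITS // 8

print(singles, "singles per add:", packed_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
print(doubles, "doubles per add:", packed_add([1.0, 2.0], [10.0, 20.0]))
```

Each call to `packed_add` stands for one instruction, so the single precision version gets through twice as many values per instruction as the double precision one.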

Post by **cheechi** » Thu Jun 18, 2009 12:05 pm

bruce wrote:A small fraction of the FAH WUs use SSE2 which is similar to SSE except that it's used for double precision operations. The code is optimized for two simultaneous double precision floats compared to the four single precision floats for SSE.

So would this be more efficient use of a dual core CPU than a wu using SSE? Or compared to a quad core running the SSE2 wu?

Post by **shatteredsilicon** » Thu Jun 18, 2009 7:51 pm

cheechi wrote:So would this be more efficient use of a dual core CPU than a wu using SSE? Or compared to a quad core running the SSE2 wu?

Typically, if your algorithm is well adjusted to the CPU, SSE1 will give you approximately 4x speed-up on floats. SSE2 will give you a 2x speed-up on doubles. Multiple cores stack on top of that. Scalability of both will vary - it all depends on the algorithm used, how well it lends itself to different scaling approaches and how well it was implemented.
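
A back-of-the-envelope version of "multiple cores stack on top of that" might look like the Python sketch below. The function `ideal_speedup` and its efficiency parameters are hypothetical. The 0.875 SIMD efficiency is just bruce's "3.5 out of 4" figure from earlier in the thread, and the 0.9 core efficiency is an arbitrary placeholder, not a measurement.

```python
def ideal_speedup(simd_lanes, cores, simd_efficiency=1.0, core_efficiency=1.0):
    """Upper-bound speed-up over scalar code on one core.

    The efficiencies (0..1) stand in for everything that keeps real code
    from scaling perfectly: scalar instructions, memory stalls, inter-core
    communication, and so on.
    """
    return (simd_lanes * simd_efficiency) * (cores * core_efficiency)


print("SSE  (4 floats)  on a quad core:", ideal_speedup(4, 4))
print("SSE2 (2 doubles) on a dual core:", ideal_speedup(2, 2))
print("SSE on a quad core, less than perfect scaling:",
      round(ideal_speedup(4, 4, simd_efficiency=0.875, core_efficiency=0.9), 1))
```

The two factors simply multiply, which is the sense in which SIMD width and core count are independent: the first is fixed by the precision the project needs, the second by the hardware.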

Post by **bruce** » Thu Jun 18, 2009 9:08 pm

cheechi wrote:
bruce wrote:A small fraction of the FAH WUs use SSE2 which is similar to SSE except that it's used for double precision operations. The code is optimized for two simultaneous double precision floats compared to the four single precision floats for SSE.
So would this be more efficient use of a dual core CPU than a wu using SSE? Or compared to a quad core running the SSE2 wu?

To add to what shatteredsilicon has said, you don't get to choose the projects that will be assigned to you. You can choose broad categories such as SMP or GPU or neither (aka uniprocessor).

If it's a uniprocessor project, it might be one using Double Precision and SSE2 and if so, it will run using SSE2 if your hardware has that instruction set. If it's one of the more common projects that use Single Precision and SSE, it'll run nearly 4x as fast if your hardware supports SSE than if your hardware does not. Either way, it will use a single core unless you're running the SMP core.

Post by **cheechi** » Fri Jun 19, 2009 8:22 pm

My question was specific to SMP, unless you're saying that only Uniprocessor cores utilize SSE/SSE2. I thought all SMP units utilized SSE at least. My thinking is that basically (theoretically) an SMP core or wu that uses SSE2 heavily would be a better use of a dual core CPU than running a 4-task SSE SMP unit on that same dual core. Would a quad core running a 4-task SMP SSE-heavy unit be faster than this? I am just thinking of other possibilities since we have a much richer selection of platforms than, say, 3 years ago. Like I said, it's all theoretical. I know there are no 2-task SMP units so it is entirely speculative.

bruce wrote:To add to what shatteredsilicon has said, you don't get to choose the projects that will be assigned to you. You can choose broad categories such as SMP or GPU or neither (aka uniprocessor).

I have been folding long enough to know this very well, I was not suggesting to do so.

Post by **bruce** » Fri Jun 19, 2009 9:17 pm

The issue of SSE vs. SSE2 is a bit more complicated than that.

First, you have to consider the differences between single precision floats and double precision floats. Double precision does provide more accurate calculations than single, but in most hardware, doubles run slower. Even if you happen to have hardware that can run Doubles as fast as Singles, you need to consider that SSE2 can only do two Double floats in the same amount of time as four Singles. Thus using single precision is twice as fast, or better.
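
The accuracy side of that trade-off is easy to demonstrate. Python floats are double precision, so the sketch below emulates single precision by round-tripping each intermediate result through a 32-bit float with `struct`. The helper names `to_single` and `accumulate` are made up for this example, and the loop is only an illustration of rounding error building up, not of anything the FAH cores do.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step, count, single):
    """Add `step` to a running total `count` times, optionally in single precision."""
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)  # keep only what a 32-bit float can hold
    return total


COUNT = 100_000
exact = COUNT / 10  # the mathematically exact sum of 100,000 steps of 0.1

double_error = abs(accumulate(0.1, COUNT, single=False) - exact)
single_error = abs(accumulate(0.1, COUNT, single=True) - exact)
print(f"double precision error: {double_error:.3g}")
print(f"single precision error: {single_error:.3g}")
```

The single precision total drifts much further from the exact answer than the double precision one. Whether that matters depends on the calculation, which is why only some projects ask for doubles.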

If the calculation requires doubles, then they will be used. Otherwise the project will be assigned to one of the versions of Gromacs that uses SSE. So the bottom line is: Very few projects need SSE2.

Also, as far as I know there has been no need for Double Precision SMP code yet, but if there were, how would you reframe your question?

Post by **shatteredsilicon** » Fri Jun 19, 2009 9:30 pm

I'm not sure I follow what you're saying, cheechi. SMP and SSE/SSE2 are totally orthogonal in the logical sense.

Post by **cheechi** » Fri Jun 19, 2009 10:31 pm

Let's make an assumption. We assume that there is a protein or project that will benefit from or require a Double run. Now we take a look at the hardware available: single, dual & quad x86 & x86-64, PS3 Cell, & GPU. Users may choose to run several uniprocessor clients on multicore systems, but I doubt there is much argument as to this being the most efficient (by PPD/(watt + $), or any scaling metric that gives overall 'efficiency' to the project) and generating more results (in either variety or quantity, whichever may be necessary) for the team to examine. This kind of 'efficient' use of the power and hardware donated to the project seems like a goal of the donors at least, and it is the angle I am looking at when I purchase parts for folding. The rest of the hardware available is already being utilized very well and doesn't necessarily need any more efficiency in terms of the results it is producing, but I have felt for a while that dual core CPUs & dual CPU systems are being underutilized. Clearly quad core is the way to go if you are buying a CPU solely for folding.

So my question is, still assuming that a run can a) be done simultaneously (SMP) and b) benefit from or require double precision: would doing this work (2 double precision tasks) on a dual core CPU that supports SSE2 be more efficient than doing the same work on a quad core, or than having the current SMP model (4 tasks) run on that same dual core? Would the project benefit more by utilizing the dual core CPUs out there for something (potentially) leading to faster turnaround? I am suggesting that even though Doubles are slower, having 2 tasks on 2 cores would be more efficient than 4 tasks on 2 cores, and I am asking whether, should Doubles become necessary, this is a more worthwhile consideration than making a 4 task Double run.

I'm not trying to be difficult, honestly.

I'm not a molecular biologist or a computer scientist, so I apologize for all the confusion.
