CPU Architecture and FAH

7im · Post by **7im** » Fri Jun 19, 2009 11:17 pm

There isn't enough need for double precision work (SSE2) to warrant a multicore client. The SMP client is SSE only as I recall.

shatteredsilicon · Post by **shatteredsilicon** » Sat Jun 20, 2009 12:12 am

The two are unrelated. It doesn't matter whether double or single precision numbers are used - that part is WU/core specific. There is no issue of efficiency here. Doubles are only used where they are needed - it is not about efficiency, it's about necessity.

The other point, about utilization, is a debatable one. The problem is that the thread scheduler in the current implementation is very naive, and because task switching always comes with a penalty (when there are more tasks than cores), the throughput of a whole unit ends up being bound by the performance of the slowest thread. These overheads, made worse by the unshared caches on the core pairs of a quad Core2, are what lead to inefficiency. This is why running 4 SMP clients, each affinity-bound to one core, ends up yielding more PPD than running a single SMP client. The effective bandwidth increases because the caches suffer fewer misses (the processes don't end up migrating between the core pairs), and you overbook the CPU time more, so the time that would normally be spent idle due to the slowest-thread scaling limitation ends up being utilized. Running multiple clients, each bound to one core, will yield higher throughput, but it will also yield almost equally worse latency. This is a problem because when a particularly interesting protein folding case needs to be analyzed, it is more important to get each time frame (WU) back as quickly as possible so that the results can be analyzed and researched further, which guides the creation of new WUs.

The switching overheads and cache sharing issues are significantly reduced on CPUs that don't consist of multiple physical dies, as I said in an earlier post. Core2 Duo, Nehalem and Phenom CPUs don't get as hammered by the overheads because although the WU is still bound by the speed of the slowest thread, switching between threads is faster.
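The throughput-versus-latency trade-off described above can be sketched with a toy model. This is only an illustration, not how the client or cores actually schedule work: the function names and the timing numbers (10 s of work per core share, one thread delayed to 13 s) are hypothetical.

```python
def single_smp_client(thread_times):
    """One WU split across all cores: it finishes when the slowest thread does."""
    latency = max(thread_times)       # seconds per WU
    throughput = 1.0 / latency        # WUs per second
    return latency, throughput


def pinned_clients(work_shares):
    """One client per core, each processing a whole WU on its own."""
    cores = len(work_shares)
    latency = sum(work_shares)        # one core does all the shares serially
    throughput = cores / latency      # all cores finish a WU every `latency` s
    return latency, throughput


# Hypothetical numbers: each quarter of a WU is 10 s of work, but in the
# SMP case one thread is slowed to 13 s by switching and cache misses.
work_shares = [10.0, 10.0, 10.0, 10.0]
thread_times = [10.0, 10.0, 10.0, 13.0]

smp_latency, smp_throughput = single_smp_client(thread_times)
pin_latency, pin_throughput = pinned_clients(work_shares)

print(f"single SMP client: {smp_latency:.0f} s per WU, {smp_throughput:.4f} WU/s")
print(f"four pinned clients: {pin_latency:.0f} s per WU, {pin_throughput:.4f} WU/s")
```

With these made-up numbers the pinned clients finish more WUs per second in total, but each individual WU takes roughly three times as long to come back, which is the latency cost the post describes.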

codysluder · Post by **codysluder** » Sat Jun 20, 2009 12:16 am

Maybe I can explain it in a different way than either Bruce or ShatteredSilicon.

Suppose two donors have different hardware and one gets a very good PPD/Watt and the other gets a very poor PPD/Watt. The Pande Group is going to assign a project to both of them indiscriminately.

You see, the Pande Group has a different perspective than an individual user. They use donated resources and they don't have any way to know how many PPD/Watt is being used. They're only concerned about the turn-around time for each assignment. You, on the other hand, measure your performance primarily by the points you earn. Optimizing with two different goals sometimes leads to different measures of what is "best".

If your machine has SSE2, then it also has the single-precision instructions (SSE). If your machine is multi-cored, it is capable of running SMP. If you can do both, then the Pande Group really doesn't care which feature you use, as long as somebody else can run the "other" WU that you might have run instead.

From your perspective, however, there is probably a "best" that might include SSE2 or might include SMP, but since it won't be doing both at the same time, you'll have to choose which works best for you. Remember, though, that you cannot CHOOSE projects with SSE2; they just happen, and not on any pre-announced frequency. You CAN choose SMP, and if you do, that's the only thing you'll get.

shatteredsilicon · Post by **shatteredsilicon** » Sat Jun 20, 2009 2:29 pm

Another thing worth pointing out is that SSE2-capable units will still run on SSE1-only hardware; they'll just run slower. The core auto-detects what can be used and uses whatever is available. Any floating point features that are missing from the hardware being used get executed using the 387 FPU calls. In fact, you don't even need SSE - you can run the classic non-SMP client on a Pentium Pro, but it'll run _really_ slowly.
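As a rough illustration of the fallback described above, the choice could look something like the sketch below. This is not the actual core's detection code: `pick_float_path` and its arguments are hypothetical, and the flag names `sse` and `sse2` are simply the ones Linux lists in `/proc/cpuinfo`.

```python
def pick_float_path(cpu_flags, wu_needs_double):
    """Pick the fastest floating point path the hardware supports.

    cpu_flags: iterable of feature names, e.g. {"fpu", "sse", "sse2"}.
    wu_needs_double: True if the work unit requires double precision.
    """
    flags = {flag.lower() for flag in cpu_flags}
    if wu_needs_double:
        # Double precision vector math needs SSE2; without it the
        # work still runs, but on the slower x87 FPU.
        return "sse2" if "sse2" in flags else "x87"
    if "sse" in flags:
        return "sse"    # single precision vector math
    return "x87"        # e.g. a Pentium Pro: works, but slowly


print(pick_float_path({"fpu", "sse", "sse2"}, wu_needs_double=True))
print(pick_float_path({"fpu", "sse"}, wu_needs_double=True))
print(pick_float_path({"fpu"}, wu_needs_double=False))
```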

7im · Post by **7im** » Sat Jun 20, 2009 7:05 pm

One should also note that an SSE2 work unit will fold at the same Points Per Day as an SSE work unit on CPUs that only have SSE.

cheechi · Post by **cheechi** » Sun Jun 21, 2009 12:54 am

codysluder wrote:Maybe I can explain it in a different way than either Bruce or ShatteredSilicon.

Suppose two donors have different hardware and one gets a very good PPD/Watt and the other gets a very poor PPD/Watt. The Pande Group is going to assign a project to both of them indiscriminately.

You see, the Pande Group has a different perspective than an individual user. They use donated resources and they don't have any way to know how many PPD/Watt is being used. They're only concerned about the turn-around time for each assignment. You, on the other hand, measure your performance primarily by the points you earn. Optimizing with two different goals sometimes leads to different measures of what is "best".

I think this is the most helpful answer. Although Bruce and shattered have done a good job of addressing the specifics, I think you see what I was getting at. If I'm understanding you correctly, the Pande Group doesn't have any plans to further diversify clients to take advantage of this because the need is not present. The only metric they are using for the project's efficiency is the turnaround time in which work returns to their servers.

I was mistakenly under the impression that they would approach the donators' machines as a large cluster, and actively approach issues of efficiency and reduction of overhead by releasing cores & work that would take the best advantage of the hardware currently being used in the cluster. Instead it sounds more like a magic box where work goes in and results come out, and they don't look inside the box. I'm not looking for Stanford to micro-manage or trying to give them more work to do. But my idea was that since there are so many dual cores, something could be released to take particular advantage of them in addition to the existing work. I'm certain that it would shift things around (current SMP units not being done on those duals) but may provide some reduced overhead (dual core work could be used to 'check' other projects) allowing Stanford to work with more projects and potentially wait for fewer runs to get the same result.

The reason I brought this up from the start is I am considering a new standalone folding system but there isn't a 'best' choice in terms of cost effectiveness of the parts I would be buying vs overall impact they will have to the project over their lifetime. A PS3 is probably the best investment of PPD/$ but if everyone just goes to PS3's then the project suffers in the long run. Diversifying the hardware the donors use is good to an extent, but to some degree as well it would be ideal as a donor to make sure that I am providing lasting worth to the project in my purchases. I am not looking for any 'official hardware list.' This gives me some insight to some underlying goals that may otherwise not be very clear.

Zagen30 · Post by **Zagen30** » Sun Jun 21, 2009 6:41 am

A PS3 would not actually be the best PPD/$. You could easily build a computer for $400 that would greatly outperform the 900 PPD of the PS3 (a dual-core Pentium w/ a 96-shader 9600 GSO would produce a few thousand PPD, for instance). Of course the PS3 is useful, and I am in no way discounting its folding capabilities.

shatteredsilicon · Post by **shatteredsilicon** » Sun Jun 21, 2009 10:30 am

I rather suspect the best PPD/$ and PPD/W figure would likely go to one of these:
http://www.supermicro.com/products/moth ... .cfm?typ=H (£115+VAT in UK)
coupled with something like a good old 9800GX2. This mobo has a PCIe x8 slot, so it would probably work reasonably well.

cheechi · Post by **cheechi** » Sun Jun 21, 2009 6:14 pm

That was just one example. The point is if we all start using the same thing to the exclusion of other clients then the project suffers in the long term. So if all users are going after the best PPD/whatever then we may be doing more harm than good.

And when you consider the $ you pay for power, PS3's are pretty close to the top in overall production per investment (purchase + wattage).
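To make "production per investment (purchase + wattage)" concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and apart from the 900 PPD figure quoted earlier in the thread, every price, wattage and electricity rate below is a placeholder, not a measurement.

```python
HOURS_PER_YEAR = 365 * 24


def energy_cost_usd(watts, usd_per_kwh, years):
    """Electricity cost of running a machine 24/7 for the given period."""
    return watts / 1000.0 * HOURS_PER_YEAR * years * usd_per_kwh


def ppd_per_dollar(ppd, purchase_usd, watts, usd_per_kwh, years):
    """PPD divided by total spend (purchase price plus electricity)."""
    return ppd / (purchase_usd + energy_cost_usd(watts, usd_per_kwh, years))


# Placeholder inputs: substitute your own prices, wattage and power rate.
candidates = {
    "console (900 PPD)": dict(ppd=900, purchase_usd=400, watts=150),
    "PC with GPU": dict(ppd=3000, purchase_usd=400, watts=250),
}
for name, spec in candidates.items():
    value = ppd_per_dollar(usd_per_kwh=0.10, years=2, **spec)
    print(f"{name}: {value:.2f} PPD per dollar over 2 years")
```

The ranking depends heavily on the local electricity price and on how long the machine stays in service, so it is worth rerunning with your own figures.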

7im · Post by **7im** » Mon Jun 22, 2009 1:28 am

I wouldn't worry about PPD causing any work to be excluded, cheechi. Not everyone buys their home or work computer based solely on getting the best PPD. Sure, a lot of enthusiasts on this forum do, but many people buy a PC for many other reasons, and then just happen to also fold. Also consider that in this economy, few people are upgrading, so they keep folding on older hardware. There are still more than 100,000 CPU clients folding, and they don't get the best PPD but still keep going.

cheechi · Post by **cheechi** » Mon Jun 22, 2009 3:30 am

Sounds like a plan. Thanks.

bruce · Post by **bruce** » Mon Jun 22, 2009 4:40 am

cheechi wrote:I was mistakenly under the impression that they would approach the donators' machines as a large cluster, and actively approach issues of efficiency and reduction of overhead by releasing cores & work that would take the best advantage of the hardware currently being used in the cluster. Instead it sounds more like a magic box where work goes in and results come out, and they don't look inside the box. I'm not looking for Stanford to micro-manage or trying to give them more work to do. But my idea was that since there are so many dual cores, something could be released to take particular advantage of them in addition to the existing work. I'm certain that it would shift things around (current SMP units not being done on those duals) but may provide some reduced overhead (dual core work could be used to 'check' other projects) allowing Stanford to work with more projects and potentially wait for fewer runs to get the same result.

The donators' machines are a large cluster and they DO actively approach issues of efficiency just as you suggest . . . but only where it makes sense. For example, SMP and GPU and CPU assignments are managed separately and work is assigned to take advantage of those features. Not everything is worth managing. For example, suppose the CPU client could detect cache size and report that back to the server. The server could then assign different WUs based on that, but the system-wide throughput improvement would be so small compared to the amount of server logic required to manage it that it makes no sense to do it.

The client does report how many cores you have and how much RAM you have. That information is sometimes used to customize assignments, but putting resources into managing assignments at this level is less productive than putting that same effort into making the GPU and/or SMP cores more reliable.

Archangelboy · Post by **Archangelboy** » Thu Jul 09, 2009 6:09 pm

I am awed by the depth of understanding present in this forum. I stand to learn a lot! I suppose I would like, for a moment, to dumb this conversation down and ask the point-blank question:

Other than in terms of pure PPD, am I gaining anything (besides a heftier power bill) by using an Intel VT/HT-capable CPU to SMP fold on (NotFreds clients, since they're easy to set up and, um, I'm not the most computer literate guy in the world), or would I be better off running some sort of simple CPU client, leaving GPU folding aside? Where, other than raw consumption of wattage and subsequent costs (ok, and in terms of CPU cost, although that's really moot for me since I'm not buying an extreme chip), would I see a gain? I'm considering a $50 single core without VT/HT vs a $220 quad with VT/HT.

My purpose is to build a (just a single, for now) folding rig, with PPD as one of my goals (hence GPU folding, but as I said, another discussion). As new as I am to the entire folding scene, I'm not sure I know how to configure my NotFreds SMPs to do the most good (this is, after all, philanthropy on my part), but given a baseline configuration, do I gain anything besides PPD from running a chip that is more than 4x as expensive?

shatteredsilicon · Post by **shatteredsilicon** » Thu Jul 09, 2009 6:37 pm

You'll need to define your goals more precisely. Are you looking for maximum PPD/$ in terms of initial investment? Or PPD/W (ongoing electricity costs)? There are threads elsewhere on this forum for PPD/$ and PPD/W figures for most commonly available hardware.
(Somebody also threatened to start PPD/m^3 lists, but thankfully, it didn't happen.)

Archangelboy · Post by **Archangelboy** » Thu Jul 09, 2009 6:55 pm

I'm fairly confident in my understanding of my PPD/$ ratios on these various levels of technology. I guess my question revolves directly around a lack of understanding of efficiency/architecture. What I'm gathering is that a quad core is more capable than a dual core (2x cores = 2x work), although as you mentioned, the Core2 architecture outstrips anything previous, and i7 (presumably) exceeds Core2. I guess I'd amend my question to read, 'Is there any benefit to be gained by running a Core2 Duo vs a Core2 Quad?'

Any meaningful PPD I achieve will be through GPU clients once I'm set up. In terms of what architecture is most efficient (not necessarily the highest PPD, although they may coincide), how do the available technologies rank? From the perspective of the Pande Group, for example, which doesn't have to worry about my electricity bill, what would they prefer, so to speak, that I use? (The duh answer being that anything is better than nothing, but beyond that...)
