domboy wrote: Ya know that is curious. Since they developed the GPU client using ATI hardware, I wonder why it ends up being faster on nVidia hardware. I would have expected the opposite to be true... i.e. initially optimized for ATI due to the fact it was developed on ATI hardware...
The code paths for AMD and NV are completely separate, and each was developed in collaboration with programmers from each company. Keep in mind that the AMD code is in general older; a combination of architectural features of the hardware and lessons learned mean that the NV code is more efficient in its current state. Also keep in mind that although the AMD code may be slower in terms of ns/day than NV, it's still a lot faster than the CPU.
FaaR wrote: Nvidia boards that are years old and clearly technically far inferior to today's ATI boards still fold much faster than any ATI board available today. Including the 2.1 *teraflops* Radeon 5870.
This is something I see a lot, and it reflects a lack of understanding of GPU architecture and how peak FLOP counts are measured, so let me lay some wisdom down on y'all. For comparison, we'll consider the Nvidia GT200 GPU (GeForce GTX 280) and the AMD RV770 GPU (Radeon HD 4870); older and newer cards on both sides are substantially similar in terms of what I'll describe here. I'm less familiar with the RV770 architecture than the others, so I apologize in advance for any mistakes.
The first ambiguity to resolve is what a "core" is - NV claims their chip has 240 "shader cores", AMD claims 800 "stream processors", and Intel gamely claims only 4 cores in their latest CPUs. For the purposes of this post, I'm going to go with common CS terminology, and call a core a functional unit on the chip that has full instruction-fetching-and-decoding abilities -- in other words, a core is only a core if it can process different instructions from those being run by other cores. Under this definition, the GT200 has 30 cores (NV's "streaming multiprocessors"), RV770 has 10, and Core 2 has 4 (or 2).
The structure of the cores in each of these is very different. Nvidia's cores are structured as clusters of 8 "Scalar Processors" (SPs) each. Each SP executes the same instruction as every other SP in the same SM, but on different data (what's known as "Single Instruction, Multiple Data", or SIMD). Importantly, because each SP is scalar, the programming model does not require that computations be expressed in "explicitly vectorized" form. In such a form, one must code up the computation so that it maps directly onto operations on fixed-length (usually between 2 and 16 elements) lists of numbers (vectors). Since Nvidia's hardware is structured around scalar, not vector, SIMD units, it can be easier to express arbitrary parallel computations on GT200. The disadvantage is that since the programmer isn't helping the hardware out as much with explicit parallelism, the device can't run as many computations (adds/multiplies/etc.) per clock cycle, so Nvidia compensates by running a relatively narrow chip at very high clock speeds (over a GHz) for high throughput.
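To make the scalar-SIMD programming model concrete, here's a toy Python emulation (my own sketch, nothing from the actual Folding@home code): the "kernel" is a purely scalar, per-element function, and the lockstep execution across lanes is supplied by the runtime, not by the way the computation is written.

```python
# Toy emulation of scalar SIMD ("one instruction stream, many lanes").
# LANES = 8 mirrors GT200's 8 SPs per SM; the number is illustrative.
LANES = 8

def simd_map(per_element_op, data):
    """Apply the same scalar operation across lanes, LANES at a time.

    The caller never writes vector types: the parallelism comes from
    the execution model, not from how the computation is expressed.
    """
    out = []
    for base in range(0, len(data), LANES):
        chunk = data[base:base + LANES]               # one element per lane
        out.extend(per_element_op(x) for x in chunk)  # same op, different data
    return out

# A purely scalar "kernel" (double each element), run over 10 elements:
print(simd_map(lambda x: x * 2, list(range(10))))
```

The point of the sketch is only that the per-element function stays scalar; mapping it across lanes is the machine's job, which is roughly the convenience the GT200 model offers.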
AMD's cores are 16-wide SIMD VLIWs with 5 ALUs per VLIW unit. Let's break down the acronyms. At a high level, AMD's chip is similar to Nvidia's: both have a moderate number of wide SIMD cores. Nvidia has 30 cores with a SIMD width of 8, and AMD has 10 with a SIMD width of 16. But what's this VLIW business, you ask? Remember that Nvidia's cores are 8-wide scalar SIMD; AMD's cores, in contrast, are 16-wide and use an instruction format called VLIW, for "very long instruction word". For the purposes of this discussion, you can consider VLIW a generalization of a vector architecture. The upshot is as follows: AMD's architecture packs multiple operations into each "instruction" of the "single instruction, multiple data" stream. Because the chip can run many operations at the same time, AMD clocks it lower than Nvidia does and still gets comparable peak throughput.
For example, let's say that we're adding two 4-dimensional vectors. This might be a common operation in graphics (say, blending two images in RGBA format, where the channels are red, green, blue, and alpha/transparency). Call the inputs v1 and v2, each with components r, g, b, and a, and say we want to store the result in vector v3. Clearly, there are four independent operations here. Let's label them as follows:
Code:
OP1: v3.r <-- v1.r + v2.r
OP2: v3.g <-- v1.g + v2.g
OP3: v3.b <-- v1.b + v2.b
OP4: v3.a <-- v1.a + v2.a
The Nvidia architecture would need to issue four independent instructions here*:
Code:
Instruction 1: OP1
Instruction 2: OP2
Instruction 3: OP3
Instruction 4: OP4
In contrast, the AMD architecture, because the code makes the vectorization explicit, can roll these all up into one instruction:
Code:
Instruction 1: OP1 | OP2 | OP3 | OP4 | NOP
This example illustrates two important things about the difference between the two architectures. First, when you can load all those VLIW slots with work to do, the ATI chips can be very, very fast. Even though RV770 usually runs at around half the clock rate of GT200, if you can execute four or five operations per clock cycle, then overall ATI throughput will be 2-2.5x that of Nvidia. But that's the catch, and it brings us to the second point. Note that in the RV770 example code, the fifth VLIW operation slot is occupied by a NOP (no operation), because we didn't have any work for it to do. VLIW architectures require explicit vectorization (or lots of compiler magic to infer vectorization) to keep all those slots occupied. It's not easy to express all your computations as vector operations (and sometimes it doesn't make sense to). Basically, RV770 takes a much larger hit from code that isn't explicitly vectorized because of the way its processing is structured. It's easy to turn vector code into scalar code (to run it on GT200); it's much, much harder to go the other way. In fact, that's an open problem in computer science and has been for some time -- Intel was counting on compiler magic to rescue performance on its VLIW-architecture Itanium series of processors, but it never quite happened. AMD's cards work great for graphics because graphics computations are easily vectorized.
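To put a number on the occupancy problem, here's a toy 5-slot bundle packer (an invented helper, far cruder than any real VLIW compiler): independent ops can share a bundle, an op that needs the previous op's result must start a new one, and every unfilled slot becomes a NOP.

```python
# Toy VLIW "scheduler": pack a straight-line sequence of operations
# into 5-slot bundles. Purely illustrative.
WIDTH = 5

def pack(ops):
    """ops: list of (name, depends_on_previous) pairs.
    Returns (bundles, wasted_nop_slots)."""
    bundles, current = [], []
    for name, depends in ops:
        if (depends and current) or len(current) == WIDTH:
            bundles.append(current)   # close the bundle; a dependency or
            current = []              # full bundle forces a new one
        current.append(name)
    if current:
        bundles.append(current)
    nops = sum(WIDTH - len(b) for b in bundles)
    return bundles, nops

# The four independent RGBA adds above: one bundle, one wasted slot.
vector_code = [("OP1", False), ("OP2", False), ("OP3", False), ("OP4", False)]
print(pack(vector_code))

# Four chained ops (each needs the previous result): four bundles and
# 16 NOPs, so 4/5 of the machine sits idle on this code.
serial_code = [("OP1", False), ("OP2", True), ("OP3", True), ("OP4", True)]
print(pack(serial_code))
```

The scalar-SIMD machine issues four instructions in both cases; the VLIW machine goes from one bundle to four, which is the asymmetry described above.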
There are certainly other differences in the chips that affect folding performance significantly (in particular, restrictions in how on-chip memory can be used), but my objective here is to deal with comparisons that rest solely on peak FLOP counts. AMD's 2.1 teraflop number for Cypress (Radeon HD 58xx) isn't a lie - it just assumes that every VLIW slot on every SIMD is fully occupied. Nvidia's claims are the same - they assume that every SP on every SM is always occupied. What makes the difference in sustained (rather than theoretical peak) throughput is how well it's possible to actually make use of the hardware. AMD's hardware is intrinsically harder to program for, which makes those peak numbers a bit more fluffy from our perspective. Before anyone in the know bites my head off: memory throughput is also extremely important for performance, but I'm leaving that part of the story out for now.
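For the curious, the peak figures themselves are just multiplication. The numbers below are approximate public specs for the two chips discussed above (exact clocks vary by board, and Nvidia's quoted peak counts a multiply-add plus an extra multiply per SP per clock), so treat this as back-of-envelope arithmetic rather than authoritative specs:

```python
# Peak FLOPS = ALU count x flops counted per ALU per clock x clock (GHz)
def peak_gflops(alus, flops_per_alu_per_clock, clock_ghz):
    return alus * flops_per_alu_per_clock * clock_ghz

# Radeon HD 4870 (RV770): 10 cores x 16 lanes x 5 VLIW slots = 800 ALUs,
# each counted as one multiply-add (2 flops) per clock at ~750 MHz.
rv770 = peak_gflops(10 * 16 * 5, 2, 0.75)   # -> 1200.0 GFLOPS

# GeForce GTX 280 (GT200): 30 SMs x 8 SPs = 240 ALUs; the quoted peak
# counts a multiply-add plus an extra multiply (3 flops) at ~1.3 GHz.
gt200 = peak_gflops(30 * 8, 3, 1.296)       # -> ~933 GFLOPS

print(rv770, gt200)
```

Both figures assume every slot on every unit is busy every clock; the whole point of the discussion above is how hard that assumption is to satisfy in practice.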
So after a very long discussion, what's the upshot? Don't compare hardware on the basis of peak teraflops. While the Nvidia hardware has a lower peak teraflop count, for the applications we're interested in, that does not make it "clearly technically far inferior", because its architecture makes it easier to actually get close to that peak performance than the AMD architecture does.
AtwaterFS wrote: Does PG seem to care?
Short answer: no. As has been explained ad infinitum in this thread and others, our focus is now on the next-generation GPU3 client and OpenMM core. The programming language and model used by the old ATI code have been made largely obsolete by the introduction of OpenCL, so we're focusing our efforts there rather than trying to revamp older code.
Besides, if we didn't care, I wouldn't have spent an hour writing this post rather than preparing my talk for group meeting...