Is it worth folding on ATI cards?

Moderators: mhouston, Site Moderators, PandeGroup

Re: Is it worth folding on ATI cards?

Postby bambihunter » Fri Nov 06, 2009 5:30 pm

FaaR wrote:
bruce wrote:If you're explicitly asking for help, then we need to know which problem you want help with.

I think his point is that ATI hardware isn't performing to its full capabilities, and - perhaps implied - he'd like help with that.

We're a lot of people who'd like to have help getting our ATI boards performing up to expectations. It's not just a matter of Nvidia folding faster like some CPUs fold faster than other CPUs - Nvidia boards that are years old and clearly technically far inferior to today's ATI boards still fold much faster than any ATI board available today. Including the 2.1 *teraflops* Radeon 5870.

That's the problem we'd like fixed. :)


Thank you FaaR, that was a much more eloquent answer than I could come up with. Now I have a "board warning issued" PM. I guess some people don't care for real opinions based on real experience. I guess I should just stay away from this forum. Apparently it is modded by my mom. Her motto always was "If you don't have anything nice to say, don't say anything at all".

I guess that includes trying to get peer-level help getting MY equipment to run using MY electricity, and MY configuration time, to participate in a VOLUNTEER program to better mankind.
bambihunter
 
Posts: 59
Joined: Fri Apr 03, 2009 4:09 pm

Re: Is it worth folding on ATI cards?

Postby Nathan_P » Fri Nov 06, 2009 9:38 pm

@The OP

I am an NV board user and was very surprised when the 787-point WUs came out with a long deadline; when I saw the deadline I thought they would be really big WUs, but that is not the case. As for your comment on the 511-point WUs, the NV boards have them as well; they take a lot longer to complete than any others and cause all sorts of problems for us NV owners (heat in particular). Their deadlines are only 3 days as well.

IIRC, part of the problem is that the older ATI cards use Brook and the newer cards use OpenCL; this backwards compatibility is killing performance on the newer boards. I for one would like to see this sorted, as the ATI cards with 800 stream processors should smoke the NV competition at a lower cost. The Pande Group is working on it, and on several other new cores, at the moment; I can only recommend patience. You aren't the only ones suffering: win SMP client users are in a similar boat, as the win SMP client is vastly outperformed by the Linux/Mac OS one, and there are some NV users out there with low-end cards that can't meet some of their deadlines either. Everyone needs to take a breath and remember that PG has limited resources and is trying to please a lot of different user groups at the same time.
Nathan_P
 
Posts: 1584
Joined: Wed Apr 01, 2009 9:22 pm
Location: Jersey, Channel islands

Re: Is it worth folding on ATI cards?

Postby AtwaterFS » Fri Nov 06, 2009 10:26 pm

Short answer: no.
Does PG seem to care?
Short answer: no.

I will say that it was rewarding finally getting my ATI card folding after months of VPU recovers... but that was only because there was so much pain/effort involved in the process.
User avatar
AtwaterFS
 
Posts: 124
Joined: Wed Jan 21, 2009 9:08 pm

Re: Is it worth folding on ATI cards?

Postby ihaque » Fri Nov 06, 2009 10:41 pm

domboy wrote:Ya know that is curious. Since they developed the GPU client using ATI hardware, I wonder why it ends up being faster on nVidia hardware. I would have expected the opposite to be true... i.e. initially optimized for ATI due to the fact it was developed on ATI hardware...


The code paths for AMD and NV are completely separate, and each was developed in collaboration with programmers from the respective company. Keep in mind that the AMD code is in general older; a combination of architectural features of the hardware and lessons learned means that the NV code is more efficient in its current state. Also keep in mind that although the AMD code may be slower in terms of ns/day than NV, it's still a lot faster than the CPU.

FaaR wrote: Nvidia boards that are years old and clearly technically far inferior to today's ATI boards still fold much faster than any ATI board available today. Including the 2.1 *teraflops* Radeon 5870.


This is something I see a lot, and it reflects a lack of understanding of GPU architecture and how peak FLOP counts are measured, so let me lay some wisdom down on y'all. For comparison, we'll consider the Nvidia GT200 GPU (GeForce GTX 280) and the AMD RV770 GPU (Radeon HD 4870); older and newer cards on both sides are substantially similar in terms of what I'll describe here. I'm less familiar with the RV770 architecture than the others, so I apologize in advance for any mistakes.

The first ambiguity to resolve is what a "core" is - NV claims their chip has 240 "shader cores", AMD claims 800 "stream processors", and Intel gamely claims only 4 cores in their latest CPUs. For the purposes of this post, I'm going to go with common CS terminology, and call a core a functional unit on the chip that has full instruction-fetching-and-decoding abilities -- in other words, a core is only a core if it can process different instructions from those being run by other cores. Under this definition, the GT200 has 30 cores (NV's "streaming multiprocessors"), RV770 has 10, and Core 2 has 4 (or 2).

The structure of the cores in each of these is very different. Nvidia's cores are structured as clusters of 8 "Scalar Processors" each. Each SP executes the same instruction as every other one in the same SM, but on different data (what's known as "Single-Instruction, Multiple Data", or SIMD). Notably, this programming model does not require that one express computations in what's called "explicitly-vectorized" format. In such a format, one must code up the computation so that it maps directly to operations on fixed-length (usually between 2 and 16) lists of numbers (vectors). Since Nvidia's hardware is structured around scalar, not vector, SIMD units, it can be easier to express arbitrary parallel computations on GT200. The disadvantage is that since the programmer isn't helping the hardware out as much with explicit parallelism, the device can't run as many computations (adds/multiplies/etc.) per clock cycle, so Nvidia's system runs a relatively narrow chip at very high clock speeds (over a GHz) for high throughput.

AMD's cores are 16-wide SIMD VLIWs with 5 ALUs per VLIW unit. Let's break down the acronyms :). At a high level, AMD's chip is similar to Nvidia's - they both have a moderate number of wide SIMD cores - Nvidia has 30 cores with a SIMD width of 8, and AMD has 10 with a SIMD width of 16. But what's this VLIW business, you ask? Remember that Nvidia's cores were 8-wide scalar SIMD; AMD cores, in contrast, are 16-wide and use an instruction format called VLIW, for "very long instruction word". For the purposes of this discussion, you can consider VLIW to be a generalization of a vector architecture. The upshot is as follows: AMD's architecture packs multiple operations into each "instruction" of the "single instruction, multiple data" format. Because they can run a lot of operations at the same time, they clock their chips lower than Nvidia does, yet can reach comparable peak throughput.

For example, let's say we're adding two 4-dimensional vectors (a common operation in graphics, e.g. when blending two images in RGBA format: red, green, blue, alpha (transparency)). Call them v1 and v2, each with components r, g, b, and a, and say we want to store the results in vector v3. Clearly, there are four independent operations here. Let's write them as follows:
Code:
OP1: v3.r <-- v1.r + v2.r
OP2: v3.g <-- v1.g + v2.g
OP3: v3.b <-- v1.b + v2.b
OP4: v3.a <-- v1.a + v2.a


The Nvidia architecture would need to issue four independent instructions here:
Code:
Instruction 1: OP1
Instruction 2: OP2
Instruction 3: OP3
Instruction 4: OP4


In contrast, the AMD architecture, because the code makes the vectorization explicit, can roll these all up into one instruction:
Code:
Instruction 1: OP1 | OP2 | OP3 | OP4 | NOP


This example illustrates two important things about the difference between the two architectures. First, when you can load all those VLIW slots with work to do, the ATI chips can be very, very fast. Even though RV770 usually runs at around half the clock rate of GT200, if you can run four or five operations per clock cycle, then overall ATI throughput will be 2-2.5x that of Nvidia. But that's the catch, which brings us to the second point. Note that in the RV770 example code, the fifth VLIW operation slot is occupied by a NOP (meaning no operation), because we didn't have any work for it to do. VLIW architectures require explicit vectorization (or lots of compiler magic to infer vectorization) to keep all those slots occupied. It's not easy to express all your computations as vector operations (and sometimes it doesn't make sense). Basically, RV770 takes a much larger hit from code that isn't explicitly vectorized because of the way its processing is structured. It's easy to turn vector code into scalar code (to run it on GT200); it's much, much harder to go the other way. In fact, it's an open problem in computer science and has been for some time -- Intel was counting on compiler magic to help performance on its VLIW-architecture Itanium series of processors, but it never quite happened. AMD's cards work great for graphics because graphics computations can be easily vectorized.

There are certainly other differences in the chips that affect folding performance significantly (in particular, restrictions in how on-chip memory can be used), but my objective here is to deal with comparisons that rest just on peak FLOP counts. AMD's 2.1 teraflop number for Cypress (Radeon HD 58xx) isn't a lie - it just assumes that every VLIW unit on every SIMD is fully occupied. Nvidia's claims are the same - they assume that every SP on every SM is always occupied. What makes a difference in sustained (rather than theoretical peak) throughput is how well it's possible to make use of the hardware. AMD's hardware is intrinsically harder to program for, which makes those peak numbers a bit more fluffy from our perspective. Before anyone in the know bites off my head - memory throughput is also extremely important for performance, but I'm leaving that bit of the story out for now.

So after a very long discussion, what's the upshot?

Don't compare hardware on the basis of peak teraflops. While the Nvidia hardware has a lower peak teraflop count, for the applications we're interested in that does not make it "clearly technically far inferior", because its architecture makes it easier to actually get close to that peak performance than the AMD architecture does.

AtwaterFS wrote:Does PG seem to care?
Short Answer, No


As has been explained ad infinitum in this thread and others, our focus is now on the next generation GPU3 client and OpenMM core. The programming language and model used by the old ATI code have been made largely obsolete by the introduction of OpenCL, so we're focusing our efforts there, rather than trying to revamp an older code.

Besides, if we didn't care, I wouldn't have spent an hour writing this post rather than preparing my talk for group meeting...
User avatar
ihaque
Pande Group Member
 
Posts: 239
Joined: Mon Dec 03, 2007 4:20 am
Location: Stanford

Re: Is it worth folding on ATI cards?

Postby bruce » Fri Nov 06, 2009 11:00 pm

domboy wrote:Ya know that is curious. Since they developed the GPU client using ATI hardware, I wonder why it ends up being faster on nVidia hardware. I would have expected the opposite to be true... i.e. initially optimized for ATI due to the fact it was developed on ATI hardware...


The ATI client was developed before either CAL or CUDA existed. It was initially designed to use Brook as an interface with DirectX and then further adapted in GPU2 to avoid the problems associated with DirectX. The NVidia core was developed without the Brook interface, which avoids a certain amount of extra overhead.

I'm reasonably confident that both ATI and NVidia are working on a deep rewrite to support the newer open interface, and there's reason to suspect that this might even the score. Of course, this is all just speculation, since the details of what either company or the Pande Group is doing are only hinted at in public announcements. We won't really know anything until new cores are ready to be tested.
bruce
Site Admin
 
Posts: 20177
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Is it worth folding on ATI cards?

Postby Zagen30 » Sat Nov 07, 2009 2:00 am

Thank you, ihaque, for that enlightening summary.
Zagen30
 
Posts: 1814
Joined: Tue Mar 25, 2008 12:45 am

Re: Is it worth folding on ATI cards?

Postby bruce » Sat Nov 07, 2009 2:31 am

It will be interesting to see how it all plays out in OpenMM. There are several unknowns, such as the degree to which "compiler magic" can minimize ATI's NOPs in the fundamental FahCore code and approach peak throughput, compared with the ease with which the compiler can load NV's scalar processors.

It reminds me of the early days of GROMACS, when hand-optimized code was able to stuff a lot of scalar operations into the vector-oriented SSE instruction format with very few NOPs. At that time, "compiler magic" was not able to come close to using SSE effectively. Some of that was relatively easy, since FAH in fact operates on a lot of actual vectors, but the person who coded it did some additional optimizations that were spectacularly important to the effective use of SSE.
bruce
Site Admin
 
Posts: 20177
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Is it worth folding on ATI cards?

Postby John Naylor » Sat Nov 07, 2009 2:59 am

Thank you for taking the time to write that post, ihaque :) Very interesting!
Folding whatever I'm sent since March 2006 :) Beta testing since October 2006. www.FAH-Addict.net Administrator since August 2009.
User avatar
John Naylor
 
Posts: 1039
Joined: Mon Dec 03, 2007 4:36 pm
Location: University of Birmingham, UK

Re: Is it worth folding on ATI cards?

Postby Grandpa_01 » Sat Nov 07, 2009 6:34 am

Thanks for the explanation ihaque
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
User avatar
Grandpa_01
 
Posts: 1863
Joined: Wed Mar 04, 2009 7:36 am

Re: Is it worth folding on ATI cards?

Postby EduardoS » Sun Nov 08, 2009 6:37 am

ihaque wrote:This is something I see a lot, and it reflects a lack of understanding of GPU architecture and how peak FLOP counts are measured, so let me lay some wisdom down on y'all. For comparison, we'll consider the Nvidia GT200 GPU (GeForce GTX 280) and the AMD RV770 GPU (Radeon HD 4870); older and newer cards on both sides are substantially similar in terms of what I'll describe here. I'm less familiar with the RV770 architecture than the others, so I apologize in advance for any mistakes.

ihaque, the problem here is the lack of transparency. Is it a problem with the code? Precision? Politics? You don't know? Whatever. But hey... be honest: you made a comparison, so let me make another. 9600GT vs HD5870: the difference in PPD between them (supposed to reflect the difference in scientific value) isn't big, yet it's 208 Gflops vs ~2700 Gflops. OK, it's harder for the HD5870 to use all of them, so let's say you can only get 20% out of the HD5870; that's still almost 3 times more flops than the 9600GT. It can't be just because ATI is harder to program, right?

ihaque wrote:VLIW architectures require explicit vectorization

Here you mixed things up a bit: VLIW is more flexible than a vector unit, in the sense that not all slots must execute the same instruction. If we have a = b + c and then d = e * f, both operations may be packed into the same instruction; vectors like SSE and other SIMD extensions require explicit vectorization.

A VLIW compiler can pack instructions efficiently when a lane executes multiple operations, as, for example, in OpenMM :wink:

ihaque wrote:Besides, if we didn't care, I wouldn't have spent an hour writing this post rather than preparing my talk for group meeting...

I believe you, but could you share more about how things are going at PG? I don't think anybody here expects PG to have something to hide; on the other hand, PG doesn't show the expected transparency...
EduardoS
 
Posts: 10
Joined: Thu Dec 18, 2008 10:41 pm

Re: Is it worth folding on ATI cards?

Postby cristipurdel » Sun Nov 08, 2009 3:10 pm

Thank you ihaque, for the explanation in your post. Now I can sleep more easily without refreshing the 'gpu comparison' page on wiki every day.

But I don't get why the deadlines for the ATI cards are so short compared to NVIDIA's.

A little bit of transparency every now and then doesn't hurt.
cristipurdel
 
Posts: 77
Joined: Wed Mar 12, 2008 11:40 pm

Re: Is it worth folding on ATI cards?

Postby Amaruk » Mon Nov 09, 2009 5:59 am

First of all, thank you ihaque for taking the time to share that info with us. :)


cristipurdel wrote:But I don't get why the deadlines for the ATI cards are so short compared to NVIDIA's.
Most of the Nvidia deadlines are 2-3 days. The 1888-point WUs (5911-5915) are 3-6 days but take 3x longer, so they are comparable as well. That leaves just the newer 787-point WUs (5787-5798), which have a deadline of 10-15 days. These latest WUs were released after many users had requested longer deadlines. It is possible that the deadlines for newer AMD WUs will be longer as well, but I'm not sure when new WUs will be released for AMD. That might not happen until GPU3 - only time will tell.

EduardoS wrote:ihaque, the problem here is the lack of transparency, is it a problem with code? Precision? Politics? You don't know? Whatever, but hey... Be honest...
The main contributors to AMD's performance relative to Nvidia are card architecture and small protein size, both of which PG has publicly acknowledged since before the Nvidia client was released. That was well over a year ago. Seems pretty open and honest to me. The GPU3 client should provide improved utilization of the hardware, and the size of the proteins will increase in time. As for comparing the 9600GT to the HD5870, flops != PPD for a number of reasons, rendering your comparison irrelevant.

EduardoS wrote:
ihaque wrote:VLIW architectures require explicit vectorization

Here you mixed things up a bit: VLIW is more flexible than a vector unit, in the sense that not all slots must execute the same instruction. If we have a = b + c and then d = e * f, both operations may be packed into the same instruction; vectors like SSE and other SIMD extensions require explicit vectorization.

A VLIW compiler can pack instructions efficiently when a lane executes multiple operations, as, for example, in OpenMM :wink:
When quoting fractions of sentences, one runs the risk of taking them out of context. Here is the complete sentence (emphasis is mine):

ihaque wrote:VLIW architectures require explicit vectorization (or lots of compiler magic to infer vectorization) to keep all those slots occupied.
Looks like you both agree that a good compiler would prove very useful. ;)

EduardoS wrote:I think everybody here don't expects PG to have something to hide, on the other hand PG doesn't show the expected transparency...
There will in fact be those who think PG is hiding something.

And no matter how 'open' PG is, they will still be accused of not being 'transparent' enough.
User avatar
Amaruk
 
Posts: 510
Joined: Fri Jun 20, 2008 3:57 am
Location: Watching from the Woods

Re: Is it worth folding on ATI cards?

Postby ihaque » Mon Nov 09, 2009 7:33 am

You (all) are quite welcome - I'm glad you found my description informative! Amaruk already answered many of the things to which I was going to respond, but there's just one point I'd like to clarify a bit:

Amaruk wrote:
EduardoS wrote:
ihaque wrote:VLIW architectures require explicit vectorization

Here you mixed things up a bit: VLIW is more flexible than a vector unit, in the sense that not all slots must execute the same instruction. If we have a = b + c and then d = e * f, both operations may be packed into the same instruction; vectors like SSE and other SIMD extensions require explicit vectorization.

A VLIW compiler can pack instructions efficiently when a lane executes multiple operations, as, for example, in OpenMM :wink:
When quoting fractions of sentences, one runs the risk of taking them out of context. Here is the complete sentence (emphasis is mine):

ihaque wrote:VLIW architectures require explicit vectorization (or lots of compiler magic to infer vectorization) to keep all those slots occupied.
Looks like you both agree that a good compiler would prove very useful. ;)


Both of you are correct. EduardoS is right - vectorization is actually a stricter condition than is required for multiple instruction dispatch in a VLIW. What's needed to pack multiple operations into different lanes of a single VLIW instruction is that the operations must be independent. Informally speaking, you can't have one operation read data that is the result of another one (obviously, because you're trying to compute them at the same time). Vectorization is an easy way to achieve this, because structuring your computation as a vector operation is an explicit statement to the compiler that all the vector components are being handled independently.

Of course, it's possible to have independent instructions even if you don't explicitly vectorize your code. As a very simple example, consider the same 4-component add as in my previous post, but written as an addition on four separate variables, not 4 components of a vector. It's the same operation, just expressed slightly differently. The key aspect here is that the programmer has no longer explicitly marked these instructions as independent, so now it's up to the compiler (or the hardware) to discover this parallelism. Believe it or not, almost all current desktop CPUs will do a limited form of this in hardware - it's called out-of-order superscalar execution, and it's present on essentially every Intel and AMD CPU out there except for the Atom (CPUs use it for a different reason - to avoid pipeline stalls - but the fundamental idea of finding instruction-level parallelism is the same).

The problem is that the effectiveness of this sort of parallelism-finding is often rather limited. As Amaruk pointed out, a good compiler would be useful for a VLIW. It's stronger than that - in the absence of explicit programmer directives (like vectorization), a good compiler is essential for a VLIW to perform well, because this kind of chip depends on the compiler to give it good instruction scheduling. The trouble is that historically, we've had a lot more success in getting people to write vectorized code than in writing awesome VLIW compilers - I wasn't kidding when I said it was an open problem in CS. The AMD architecture is great for graphics work because the math operations underlying graphics are easily vectorized or manually scheduled in the driver, which handles compiling DirectX/OpenGL instructions into GPU operations. It's just harder to program for GPGPU work like folding (which doesn't necessarily mean worse, mind you!), because it potentially depends more on programmer effort (vectorization) and compiler quality (instruction scheduling and autovectorization) to get top performance.

(I had been hoping not to get too far into a discussion of instruction-level parallelism and the difference between vector and VLIW processors but EduardoS called me on it :P )
User avatar
ihaque
Pande Group Member
 
Posts: 239
Joined: Mon Dec 03, 2007 4:20 am
Location: Stanford

Re: Is it worth folding on ATI cards?

Postby yurexxx » Mon Nov 09, 2009 3:16 pm

The GPU3 client could change how PPD compares between Nvidia and ATI hardware!

Even today we can see that in Elcomsoft software:
[benchmark screenshots comparing Nvidia and ATI cards in Elcomsoft's GPU-accelerated software]
yurexxx
 
Posts: 61
Joined: Mon Sep 22, 2008 9:38 pm
Location: Russia, Ufa city

Re: Is it worth folding on ATI cards?

Postby SidVicious » Mon Nov 09, 2009 6:50 pm

Last edited by SidVicious on Mon Nov 09, 2009 10:20 pm, edited 1 time in total.
I'm doing science and I'm still folding
I feel FANTASTIC and I'm still folding
While you are dying I'll still be folding
and when you're dead I'll still be folding
STILL FOLDING, still folding
SidVicious
 
Posts: 51
Joined: Sun Jan 13, 2008 10:14 pm
