ATI Client limited to using 320 Shaders? [No, it is not]

Moderators: mhouston, Site Moderators, PandeGroup


Postby Mechromancer » Mon Feb 02, 2009 12:06 am

ATI GPUs are currently heavily limited in performance, all according to this guy from TR:

HurgyMcGurgyGurg wrote:Okay about ATI Vs. Nvidia.

For low-end cards, get Nvidia. They definitely outdo anything currently offered by ATI, mainly because the low-end ATI cards aren't much of an improvement over the 3850 that is used for benchmarking, and thus sit around the 1.5k ppd area.

However, if you did have the money, it might be best to think twice about choosing Nvidia over ATI. Sure, the GTX 260 gets 7-9k ppd and the HD 4870 only gets 4-5k, but it's important to understand how points are assigned.

The benchmark machine is a Radeon HD 3850, and points are determined so that no matter what WU the 3850 does, it will get 1.5k ppd. For instance, the 3850 will do about three of those 511-point WUs per day. Pretty basic, yes?

Now, Nvidia does well because the current range of WUs is inefficient on ATI GPUs. All GPU work units are very small at the moment. (You know how when you select WU size, you have the option of small, medium, or large? There are only small WUs and a few medium ones out yet.) This is due to the GPU client still being in its early days. So Nvidia cards are essentially optimized for this kind of WU, as they have a smaller number of faster shaders. This is compounded by the fact that the ATI client has huge inefficiencies, so that it can't use any more shaders than were on the HD 3870.

So where am I going with this?

1. Once larger WUs are released, ATI GPUs will do much better, as they are more suited to this with their greater number of shaders. Nvidia GPUs will actually slow down and earn fewer points because the benchmark 3850 will do better.
2. The ATI client still has a lot of optimization to do. I have an HD 4870; its ppd has gone from around 3k to about 5k with the newest client and the largest WU. CPU usage has dropped from maxing out a core to using only 25%. Nvidia has already done pretty much all the optimization they can do client-wise. (Remember how much they marketed CUDA with folding; they put a lot more effort into folding than ATI did.) Basically this is the idea: the HD 4870 has 800 shader processors, the HD 3870 has only 320. That's about 2.5 times the performance if it can all be accessed. So do the math: 4 or 5 times 2.5 is 10-12.5k ppd! Of course this is over-optimistic, but it obviously shows there is more power in the HD 4870 for folding than is recognized.
3. Theoretically, ATI GPUs have much greater number-crunching power: 1.2 TFLOPS for the 4870, compared to 500-600 GFLOPS for the GTX 260-280.

That's about it, now all we have to do is wait for new ATI clients to get their job done...
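
For reference, the arithmetic in the quoted post can be written out as a short Python sketch. It assumes, as the post does, that throughput scales linearly with shader count and that the current client only reaches 320 of the 800 shaders; both are the poster's assumptions, and the replies in this thread dispute the second one. The function names are illustrative only.

```python
# Shader counts as stated in the quoted post.
HD3870_SHADERS = 320
HD4870_SHADERS = 800


def shader_ratio(new_shaders: int, old_shaders: int) -> float:
    """Ratio of shader counts; equals the speedup only under linear scaling."""
    return new_shaders / old_shaders


def projected_ppd(current_ppd: float, ratio: float) -> float:
    """Optimistic PPD estimate if every shader were used (assumed linear scaling)."""
    return current_ppd * ratio


ratio = shader_ratio(HD4870_SHADERS, HD3870_SHADERS)
low = projected_ppd(4000, ratio)   # 4k ppd figure quoted for the HD 4870
high = projected_ppd(5000, ratio)  # 5k ppd figure quoted for the HD 4870
print(f"ratio = {ratio}, projected ppd = {low:.0f} to {high:.0f}")
```

This only reproduces the "4 or 5 times 2.5" estimate; it says nothing about whether the 320-shader premise is correct.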


Are these statements true? Are ATI GPUs really that underutilized with the current client (only 320 of my 800 shaders being used)?!

Honestly, this would explain the tiny increase in PPD going from my 2900XT to a 4870. If anything that is proof that something is unoptimized right now in the ATI client.
Mechromancer
 
Posts: 30
Joined: Thu Feb 21, 2008 5:47 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby John Naylor » Mon Feb 02, 2009 12:14 am

I don't know if there is a specific limit (I really doubt it), but there is certainly a lot of unused potential in the 48xx series for F@H, and the ATI staff (mhouston on this forum) will freely admit that. The ATI team are working very hard to get that extra power utilised, but they have to do it while keeping the stability the same or better, otherwise the complaints about stability will outnumber the current complaints about under-utilisation. So better utilisation is coming, but don't expect it immediately.
Folding whatever they send me since March 2006 :) Beta testing since October 2006. www.FAH-Addict.net Administrator since August 2009.
John Naylor
 
Posts: 1268
Joined: Mon Dec 03, 2007 5:36 pm
Location: University of Birmingham, UK

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby MESeidel » Mon Feb 02, 2009 1:31 am

It is no longer true (if it ever was).
The HD4870 runs at double the PPD of an HD3870 or more, depending on the WU.
HD4800 cards have seen very good improvement with the latest core and driver updates,
while the HD3870 is still around the level where it was even before the nVidia GPU2 client arrived.

The HD4870 still doesn't reach the cheap Geforce 8800GT/9800GT with some WUs.
But that is a different topic.
Discussions on it are very often driven by superficial knowledge.
MESeidel
 
Posts: 124
Joined: Sat Nov 01, 2008 1:49 am

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby Mechromancer » Mon Feb 02, 2009 2:11 am

^Which is why I am seeking clarity here in these forums.
Mechromancer
 
Posts: 30
Joined: Thu Feb 21, 2008 5:47 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby toTOW » Mon Feb 02, 2009 2:17 am

Here's an answer I got from Mike to the question "What is the real SP usage?":

mhouston wrote:Well, on the 4870, it depends a lot on the size of the protein. Basically, the force calculations are well behaved and can be unrolled to maximize processor utilization and we are using all of the 10 SIMD arrays (each with 16 5-wide VLIW ALUs), and our ALU pack rate is high. However, there are parts of the algorithm, like the update steps, that do not handle wide issue all that well, at least in their current form on ATI hardware, and may only make use of 1 or 2 SIMD arrays well. With large proteins, the force calculation dwarfs the other calculations, so the processor is used well. On small proteins, like lambda, the code is tuned up to the point now where the force kernels represent only a little more than half of the computation time, so the processor is not being used all that well.

However, the unrolling required to maximize usage during the force calculations causes overheads in other kernels, so there is a balance there and at least on the ATI core, we have some code that tries to make this tradeoff automatically, although it's only working well on the larger proteins.
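
As a rough illustration of the numbers in mhouston's reply, here is a minimal Python sketch of a time-weighted utilization estimate. The hardware figures (10 SIMD arrays, 16 VLIW units each, 5 lanes wide) come from the quote; the time fractions (0.55 and 0.95), the choice of 2 arrays for the non-force steps, and the assumption of perfect ALU packing during force kernels are placeholders picked for illustration, not measured values.

```python
# Hardware layout of the HD 4870 as described in the quote.
SIMD_ARRAYS = 10
VLIW_UNITS_PER_ARRAY = 16
LANES_PER_VLIW = 5

TOTAL_ALUS = SIMD_ARRAYS * VLIW_UNITS_PER_ARRAY * LANES_PER_VLIW


def average_utilization(force_time_fraction: float, arrays_used_elsewhere: int) -> float:
    """Time-weighted fraction of the chip in use.

    Assumes force kernels keep all SIMD arrays busy and every other step
    keeps only `arrays_used_elsewhere` arrays busy (a simplification).
    """
    other_time_fraction = 1.0 - force_time_fraction
    other_utilization = arrays_used_elsewhere / SIMD_ARRAYS
    return force_time_fraction + other_time_fraction * other_utilization


# Hypothetical inputs: force kernels take "a little more than half" of the
# time on a small protein, and nearly all of it on a large one.
small_protein = average_utilization(0.55, 2)
large_protein = average_utilization(0.95, 2)
print(TOTAL_ALUS, round(small_protein, 2), round(large_protein, 2))
```

Under these made-up inputs the small-protein case lands well below full use while the large-protein case is close to it, which is the qualitative point of the quote.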
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

toTOW
Super Moderator
 
Posts: 9387
Joined: Sun Dec 02, 2007 11:38 am
Location: Bordeaux, France

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby slavas » Mon Feb 02, 2009 2:30 am

http://www.techpowerup.com/reviews/Powe ... 30/20.html
I don't know which WU, core and CAL versions were used (they were quite old), but at least at the time of that article, that WU, core and CAL combination didn't use more than 560 shaders.
slavas
 
Posts: 33
Joined: Sun Jun 29, 2008 12:25 am

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby mhouston » Mon Feb 02, 2009 4:23 am

In early cores, the processor was not well used for most proteins. We have gotten progressively better with each core update. To update toTOW's quote, lambda in the next release is using the processor much better now, but there are still some narrow-issue points we are looking at, as well as overall tuning of the algorithm. If a 48XX was used to its full ability, you would expect a 2.5X performance increase clock for clock over a 38XX running the same code. However, there are some things that a 4XXX can potentially do that a 3XXX/2XXX board cannot that are also being looked at, which may make the performance multiplier higher. Right now the focus is on getting the current code paths tuned all the way up.
mhouston
 
Posts: 1211
Joined: Sun Dec 02, 2007 9:19 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby 7im » Mon Feb 02, 2009 4:37 am

mhouston wrote:In early cores, the processor was not well used for most proteins. We have gotten progressively better with each core update. To update toTOW's quote, lambda in the next release is using the processor much better now, but there are still some narrow-issue points we are looking at, as well as overall tuning of the algorithm. If a 48XX was used to its full ability, you would expect a 2.5X performance increase clock for clock over a 38XX running the same code. However, there are some things that a 4XXX can potentially do that a 3XXX/2XXX board cannot that are also being looked at, which may make the performance multiplier higher. Right now the focus is on getting the current code paths tuned all the way up.


Just to confirm, you are trying to tune up the current code to reach that 2.5X level, and will then pursue the additional tweaks through new code, correct? And to inquire further, you are taking this approach because it potentially helps increase the performance of all ATI cards, and not just the 4xxx series like that 2nd step would?
7im
 
Posts: 7379
Joined: Thu Nov 29, 2007 5:30 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby mhouston » Mon Feb 02, 2009 5:27 am

We are trying to improve performance of ALL boards, i.e. the general code path. When that tops out, we can start looking at 4XXX and beyond specific features.
mhouston
 
Posts: 1211
Joined: Sun Dec 02, 2007 9:19 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby Mechromancer » Mon Feb 02, 2009 6:27 am

mhouston wrote:In early cores, the processor was not well used for most proteins. We have gotten progressively better with each core update. To update toTOW's quote, lambda in the next release is using the processor much better now, but there are still some narrow-issue points we are looking at, as well as overall tuning of the algorithm. If a 48XX was used to its full ability, you would expect a 2.5X performance increase clock for clock over a 38XX running the same code. However, there are some things that a 4XXX can potentially do that a 3XXX/2XXX board cannot that are also being looked at, which may make the performance multiplier higher. Right now the focus is on getting the current code paths tuned all the way up.


Thank you for the clarity. I'm glad to know my 4870 will pay off eventually. F@H is saving the best for last.
Mechromancer
 
Posts: 30
Joined: Thu Feb 21, 2008 5:47 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby HurgyMcGurgyGurg » Mon Feb 02, 2009 11:55 am

Erm yea that's me,

I will admit most of my statements are speculation. I pretty much pieced together various statements from other people (such as mhouston) and presented them to someone who was asking which GPU to buy for folding. Yes, I did oversimplify, since beyond the basics it gets very, very speculative. I'm not a journalist and I can't get the facts, so I'm sure I made a few mistakes.

mhouston wrote:In early cores, the processor was not well used for most proteins. We have gotten progressively better with each core update. To update toTOW's quote, lambda in the next release is using the processor much better now, but there are still some narrow-issue points we are looking at, as well as overall tuning of the algorithm. If a 48XX was used to its full ability, you would expect a 2.5X performance increase clock for clock over a 38XX running the same code. However, there are some things that a 4XXX can potentially do that a 3XXX/2XXX board cannot that are also being looked at, which may make the performance multiplier higher. Right now the focus is on getting the current code paths tuned all the way up.


Good to know I was in the general ballpark though. :D

Good luck with those optimizations. I'm sorry if any of my statements made you cringe due to inaccuracy or some blatant misunderstanding.
HurgyMcGurgyGurg
 
Posts: 22
Joined: Sun Oct 26, 2008 2:06 am

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby CDG » Wed Feb 04, 2009 7:45 pm

How 'bout this?

http://theovalich.wordpress.com/2008/11/04/amd-folding-explained-future-reveale/

At least it gives a fairly reasonable explanation of ATI's issue.
It's a rather old article, but mhouston should know better ... :mrgreen:
CDG
 
Posts: 5
Joined: Fri Aug 15, 2008 2:13 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby mhouston » Wed Feb 04, 2009 10:16 pm

There is a fair amount of speculation in that article. While it's true that 4XXX hardware can do things 2XXX/3XXX cannot, from the time that article was written to now, we have improved performance without resorting to family-specific optimizations. We have been heavily CPU bound, especially on the small proteins, which was holding back the GPU, and that has been our main focus in the drivers and Brook. We continue to chip away at that so that GPU tuning work will show benefit. We will only dig deeply into 4XXX-specific optimizations once we have mined out the potential optimizations that span all GPUs.
mhouston
 
Posts: 1211
Joined: Sun Dec 02, 2007 9:19 pm

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby ParrLeyne » Thu Feb 05, 2009 2:19 am

mhouston wrote:...We have been heavily CPU bound, especially on the small proteins, which was holding back the GPU, and that has been our main focus in the drivers and Brook. We continue to chip away at that so that GPU tuning work will show benefit...

First, thanks for your reply and your work on Folding, and GPU/Brook technologies in general. It is appreciated by many, including myself.

I understand the CPU vs. GPU trade-off (have been writing PC software for 20+ years so I know there "is no free lunch") but the "we have been heavily CPU bound" comment in your reply is something that I can't quite reconcile, based on my own experience as well as other forum postings I have read.

{Context: I have a HD4850 (core slightly OCd to 665), Q6600 (OC'd to 2.96Ghz) and I run WinXP64}

  • My GPU utilization is 99+% and Task Manager shows the FAH client using 100% of 1 CPU (I use SetAffinity to assign the client to a specific CPU). These percentages are unchanged since before I OC'd the GPU {w/no other applications running}
  • Other forum postings have comments which outline that if I were to install more than 2 GPUs in my system (say 2 x HD4870x2 cards), regardless of CPU type/speed or host OS, the overall folding rate would drop across all GPUs as the number of GPUs increases beyond 2
These suggest to me that there is something else at play in the CPU vs. GPU tug of war, since the utilization rates (%) should move, more or less, inversely to each other (High GPU/Low CPU or Low GPU/High CPU).

What am I missing?
ParrLeyne
 
Posts: 311
Joined: Sun Dec 14, 2008 11:59 pm
Location: Toronto, Canada

Re: ATI Client limited to utilizing only 320 Shaders?!?!?!

Postby mhouston » Thu Feb 05, 2009 5:12 am

When the article was written, the GPU utilization reported was lower for most people, sometimes in the high 80%/low 90% range (smaller proteins, and some current core/driver optimizations were not on then that are on now). However, the sample rate of that counter is not sufficiently high to show the ups and downs in actual utilization (many effects are sub-millisecond). On processors like the 4850, there are some compute kernels that run in ~10usec, and the CPU can be racing to keep up and submit the next kernel (FLUSH_INTERVAL also relates to this), so we can get idle "bubbles" in the GPU. You could see this amplified last year, when performance on small proteins increased almost linearly with CPU performance. (We had originally tuned for lambda [larger protein] on a 3870.) If we submit larger command streams, or the proteins are larger so that the GPU takes longer to process them, the CPU side (core code + Brook + CAL + driver + OS) has more time to work through things. The "big oh" complexity is O(N^2), so there is a large difference in GPU runtime between a 544 length protein and a 1392 length protein. Another way to say that is that as we drop CPU load, there will be larger performance increases on smaller proteins and smaller increases on larger proteins. The other side of this is that the utilization reported is that of the front-end of the processor working on the command stream. Also, a more optimized command stream will make better use of the backend hardware.
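
To put rough numbers on the O(N^2) remark and the idle "bubbles", here is a toy Python sketch. The protein lengths (544 and 1392) and the ~10 usec kernel time are from the post; the 15 usec CPU submission cost is a made-up figure, and the model assumes that CPU submission overlaps GPU execution in a steady state and that the kernel in question scales purely as N^2, neither of which is confirmed by the post.

```python
def relative_gpu_work(n_small: int, n_large: int) -> float:
    """How much more GPU work the larger protein needs under O(N^2) scaling."""
    return (n_large / n_small) ** 2


def gpu_busy_fraction(kernel_usec: float, cpu_submit_usec: float) -> float:
    """Toy pipeline model: the GPU idles whenever the CPU needs longer to
    submit the next kernel than the GPU needs to run the current one."""
    return kernel_usec / max(kernel_usec, cpu_submit_usec)


work_ratio = relative_gpu_work(544, 1392)

CPU_SUBMIT_USEC = 15.0  # hypothetical per-kernel CPU overhead
small_busy = gpu_busy_fraction(10.0, CPU_SUBMIT_USEC)
large_busy = gpu_busy_fraction(10.0 * work_ratio, CPU_SUBMIT_USEC)
print(f"work ratio {work_ratio:.2f}, busy {small_busy:.2f} vs {large_busy:.2f}")
```

In this toy model the small protein leaves the GPU waiting on the CPU about a third of the time, while the larger protein hides the submission cost entirely, which matches the direction of the effect described above.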

Most of the performance increases visible to date have actually not come from the code running on the GPU, but from tuning how we submit to the GPU and, lately, from work on how we "unroll" the algorithm to better use the processors. Not that there hasn't been lots of GPU code tuning; it's just that much of the effect of that work, especially on smaller proteins on high-end hardware, has been masked by CPU overheads. However, as we better utilize the GPU, a large parallel array, any serial paths in the code, i.e. CPU side or GPU front-end, become the bottlenecks to speedup (see Amdahl's law), and we need to work on those. Another way to think about it is: tune the GPU code until you aren't getting the speedups you expect, and then start improving the serial portions of the code.
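
Since Amdahl's law is mentioned, here is a minimal Python sketch of the formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime that benefits from the faster GPU and s is how much faster that part gets. The 2.5X figure is the clock-for-clock ratio quoted earlier in the thread; the values of p (0.55 and 0.95) are placeholders for a small and a large protein, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime
    is accelerated by a factor of `parallel_speedup` (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Hypothetical fractions of runtime spent in well-parallelized GPU code.
small_protein = amdahl_speedup(0.55, 2.5)
large_protein = amdahl_speedup(0.95, 2.5)

# Even an infinitely fast GPU cannot beat 1 / serial_fraction.
small_protein_limit = 1.0 / (1.0 - 0.55)
print(round(small_protein, 2), round(large_protein, 2), round(small_protein_limit, 2))
```

With these placeholder fractions, the small protein gains far less than 2.5X, which is why shrinking the serial (CPU and front-end) share matters before further GPU tuning can show up.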
mhouston
 
Posts: 1211
Joined: Sun Dec 02, 2007 9:19 pm
