Unuseful FLOPS in ATI GPU Client?

Postby jbq.junior » Fri Jun 19, 2009 3:56 pm

Hello guys!

According to Stanford "FLOP FAQ":

"Due to a difference in the implementation (in part due to hardware differences), the ATI code must do two force calculations where the x86, Cell, and NVIDIA hardware need only do one. This increases the overall native FLOP count for ATI hardware, but since these are not useful FLOPS in a sense, we did not include them in the x86 count."

So ATI GPUs need to do redundant math to achieve the same result as the others? Could this be why the ATI GPU client is so much slower than the nVidia client in PPD, even though ATI GPUs theoretically have more compute power?

Sorry for my bad English.

Thanks in advance.
jbq.junior
 
Posts: 3
Joined: Fri Jun 19, 2009 3:11 pm
Location: Goiania, Brazil.

Re: Unuseful FLOPS in ATI GPU Client?

Postby bruce » Fri Jun 19, 2009 4:07 pm

Welcome to the foldingforum, jbq.junior.

There are many reasons for differences in performance and this is, in fact, one of them. At the present time the force calculations do need to be done twice, which means that there's a potential to improve the speed of the ATI code by some factor less than two. Improvements are constantly being made to both ATI and NVidia code, so this might change in the future. (I don't know the details of why this difference exists or how difficult it would be to eliminate this duplication.)
bruce
 
Posts: 8730
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Unuseful FLOPS in ATI GPU Client?

Postby 7im » Fri Jun 19, 2009 9:02 pm

Yes, ATI cards may have to do some calculations twice, but they are faster in other calculations because of the difference in hardware architecture. Both ATI and NV have unique hardware advantages and disadvantages, and one should not give too much weight to this one example. ATI does not do ALL calculations twice, just a select few.
7im
 
Posts: 7067
Joined: Thu Nov 29, 2007 5:30 pm

Re: Unuseful FLOPS in ATI GPU Client?

Postby EduardoS » Sat Jun 20, 2009 2:56 pm

This is a subject I would like mhouston to say more about. I'm not sure why the force calculation needs to be done twice on ATI and only once on nVidia (one pass to find sizes and another to fill? But nVidia lacks global sync too... I can't imagine another reason). It looks like just an excuse to explain why nVidia cards get twice as many points, and the real reason is something else...

Before someone says ppd isn't important and the science is, transparency is important too, even on small things.
EduardoS
 
Posts: 10
Joined: Thu Dec 18, 2008 11:41 pm

Re: Unuseful FLOPS in ATI GPU Client?

Postby jbq.junior » Sat Jun 20, 2009 3:07 pm

Thanks Bruce!

7im, but if only a few calculations are done twice, shouldn't the x86 flops be higher, or at least of the same order as for the nVidia and PS3 clients? According to the F@H Client Status page, 1 ATI native flop equals 1.055 x86 flops, while the nVidia and PS3 figures are exactly 2 times greater (~2.11 x86 flops).
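
To make the ratio concrete, here is a minimal sketch of the conversion. The two factors are the ones quoted above from the Client Status page; the function name and the 100 TFLOPS input are hypothetical and only illustrate the arithmetic.

```python
# x86 FLOPS credited per native FLOP, as quoted in this post.
X86_PER_NATIVE = {"ATI": 1.055, "NVIDIA": 2.11, "PS3": 2.11}


def native_to_x86(platform: str, native_tflops: float) -> float:
    """Convert a native TFLOPS figure to x86-equivalent TFLOPS."""
    return native_tflops * X86_PER_NATIVE[platform]


if __name__ == "__main__":
    # Hypothetical input: the same native rate on every platform.
    for platform in X86_PER_NATIVE:
        print(platform, native_to_x86(platform, 100.0))
    # Ratio between the NVIDIA and ATI conversion factors.
    print(X86_PER_NATIVE["NVIDIA"] / X86_PER_NATIVE["ATI"])
```

For the same native rate, the ATI figure comes out at half the nVidia/PS3 figure, which is the factor of two being asked about.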

Thanks.
jbq.junior
 
Posts: 3
Joined: Fri Jun 19, 2009 3:11 pm
Location: Goiania, Brazil.

Re: Unuseful FLOPS in ATI GPU Client?

Postby kobib » Sat Jun 20, 2009 3:50 pm

As I run an ATI 4850 almost 24/7 for folding, I have also wondered why there is such a HUGE difference.

I for one don't give a hoot about points. The numbers that catch my attention are how many WUs are turned in per day and how many WUs I have completed in total, because the WUs we do now make the next batch of WUs, and the more we can turn in per day, the faster the research goes overall. My little 4850 trudges through about 4 WUs every day. This is an average, without regard to which WU it's chewing on. And assuming the nVidia guys are getting the same points per WU that we are... they are turning in more WUs every day. (duh)

Using my subprime math skills, the GTX285 has to turn in at least 14 WUs per day to get the PPD that the nVidia guys are putting up.

DANG!
14-ish WUs a day!
98-ish WUs a week!
420-ish WUs a month!
"And now," cried Max, "Let the wild rumpus start!"
kobib
 
Posts: 27
Joined: Sun Jul 06, 2008 11:24 pm
Location: Las Vegas NV, Kunsan ROK

Re: Unuseful FLOPS in ATI GPU Client?

Postby mhouston » Sat Jun 20, 2009 4:00 pm

It's a difference in algorithm choice and scaling. The easiest way to think about it is to look at the physics. Consider two particles, A and B. To calculate the force on A, you add up all the partial forces on A from all of the other particles, including B. To calculate the force on B, you add up all the partial forces on B from all of the other particles, including A. Now you have calculated the force between A and B twice.
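
As a minimal sketch of the double counting described above (the inverse-square toy force and all function names are assumptions for illustration, not the client's code), the following compares an all-pairs loop, where each particle gathers its own total, with a loop that evaluates each pair once and writes the result to both particles:

```python
import math


def pair_force(pi, pj):
    """Toy inverse-square force on particle i due to particle j (hypothetical force law)."""
    dx = [a - b for a, b in zip(pi, pj)]
    r2 = sum(d * d for d in dx)
    inv_r3 = 1.0 / (r2 * math.sqrt(r2))
    return [d * inv_r3 for d in dx]


def forces_full(pos):
    """Each particle sums over all others, so every pair is evaluated twice."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evals = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            f = pair_force(pos[i], pos[j])
            evals += 1
            for k in range(3):
                forces[i][k] += f[k]  # only writes to particle i's own slot
    return forces, evals


def forces_symmetric(pos):
    """Each pair is evaluated once and applied to both particles with opposite sign."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evals = 0
    for i in range(n):
        for j in range(i + 1, n):
            f = pair_force(pos[i], pos[j])
            evals += 1
            for k in range(3):
                forces[i][k] += f[k]
                forces[j][k] -= f[k]  # write into another particle's slot
    return forces, evals
```

For N particles the first loop makes N(N-1) pair evaluations and the second N(N-1)/2, but the second has to write into another particle's slot, which is where the storage side of the trade comes in.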

There is a tradeoff between calculating the force twice and storing the result and reloading it, i.e. a tradeoff between ALU load and memory system load. That being said, not everything is calculated twice, since there is other math besides the force pairs (such as the acceleration and velocity calculations done after the partial forces are calculated, as well as the update to the particle positions). There is also a different constant factor for each of the algorithms. If you look at really massive proteins, the performance difference between ATI and Nvidia is small, ~18% comparing a GTX280 and a 4870, despite us "doing 2X the work".

You can read about the difference in implementations and the performance scaling on different proteins in a paper from Vijay's group that Scott LeGrand (Nvidia) and I worked on as well: "Accelerating molecular dynamic simulation on graphics processing units" in the Journal of Computational Chemistry, Volume 30, Issue 6, Pages 864-872.
mhouston
 
Posts: 1210
Joined: Sun Dec 02, 2007 9:19 pm

Re: Unuseful FLOPS in ATI GPU Client?

Postby kobib » Sat Jun 20, 2009 4:03 pm

I'm guessing that we are not working on the massive proteins?
"And now," cried Max, "Let the wild rumpus start!"
kobib
 
Posts: 27
Joined: Sun Jul 06, 2008 11:24 pm
Location: Las Vegas NV, Kunsan ROK

Re: Unuseful FLOPS in ATI GPU Client?

Postby mhouston » Sat Jun 20, 2009 4:42 pm

I don't get to control the distribution of proteins, but nothing as large as spectrin used in the above cited paper has gone outside of the lab.
mhouston
 
Posts: 1210
Joined: Sun Dec 02, 2007 9:19 pm

Re: Unuseful FLOPS in ATI GPU Client?

Postby 7im » Sat Jun 20, 2009 7:52 pm

jbq.junior wrote:Thanks Bruce!

7im, but if only a few calculations are done twice, shouldn't the x86 flops be higher, or at least of the same order as for the nVidia and PS3 clients? According to the F@H Client Status page, 1 ATI native flop equals 1.055 x86 flops, while the nVidia and PS3 figures are exactly 2 times greater (~2.11 x86 flops).

Thanks.



Summing the number of FLOPS is not a very accurate way to gauge actual performance or actual science production. Rough estimates only... As this ATI example shows, ATI GPUs do some calculations twice, so their actual FLOP count is somewhat inflated, but that inflated FLOP count does not translate into more science work being done.

Also, FLOP count varies because some calculations use implicit models and some use explicit models. (I forget which is which, but the CPUs do mostly one kind, and the GPUs ONLY do the other...) One type of model is more thorough (again, I forget which of the two), or more inclusive of the environment surrounding the protein, while the other only works on the protein itself and makes assumptions about the surrounding environment. (Sorry, you can look all this explicit/implicit stuff up on Wikipedia if you must know more; it's also discussed in the GPU FAQs.)

As such, it's not easy to answer the question in the opening post about useful FLOPS... not all FLOPS are created equally, or even calculated equally. Vijay made a post about "deceptive" FLOPS a while back. Search for it. Since then, PG has taken steps to better represent FLOP counts, but nothing perfect. ;)
7im
 
Posts: 7067
Joined: Thu Nov 29, 2007 5:30 pm

Re: Unuseful FLOPS in ATI GPU Client?

Postby mephistopheles » Sun Jun 21, 2009 7:51 pm

Here is the paper mhouston refers to: http://www3.interscience.wiley.com/cgi- ... /HTMLSTART
I guess the 'performance' section is the most interesting for most, while the 'implementation' section says something about why the difference exists.

The performance benchmarks are for an Nvidia GTX 280 and an ATI 4870. On paper, the 4870 has the advantage with higher theoretical peak FLOPS, but for this folding implementation the GTX 280 is
  • 100% quicker for small proteins (~500 atoms)
  • 40% quicker for medium (~1200 atoms, the largest we are currently folding)
  • 20% quicker for large proteins (~5000 atoms)
This is despite the ATI card doing up to twice as many FLOPS during the calculations.
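
To put the percentages above on a common scale, here is a rough sketch. It assumes "X% quicker" means the GTX 280 completes (1 + X/100) times as much simulation per unit time as the 4870, and it treats "up to twice as many FLOPS" as a factor of exactly 2, which is an upper bound. Both are assumptions, and the function names are hypothetical.

```python
# "Quicker" fractions quoted above for the GTX 280 relative to the 4870.
QUICKER = {
    "small (~500 atoms)": 1.00,
    "medium (~1200 atoms)": 0.40,
    "large (~5000 atoms)": 0.20,
}


def ati_relative_throughput(quicker: float) -> float:
    """4870 simulation throughput as a fraction of the GTX 280's (assumed reading)."""
    return 1.0 / (1.0 + quicker)


def ati_relative_raw_flop_rate(quicker: float, flop_multiplier: float = 2.0) -> float:
    """4870 executed-FLOP rate relative to the GTX 280, if it does flop_multiplier times the FLOPs."""
    return flop_multiplier * ati_relative_throughput(quicker)


if __name__ == "__main__":
    for size, q in QUICKER.items():
        print(size, ati_relative_throughput(q), ati_relative_raw_flop_rate(q))
```

Under these assumptions the 4870 executes at least as many raw FLOPS per second as the GTX 280 in every case, but a smaller share of them counts toward the result.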

The primary architectural difference seems to be that Nvidia can store intermediate results in fast short-term memory (like a cache, but managed by the program rather than the hardware) while ATI cannot. For the ATI implementation it is quicker to repeat the calculations than to store to and retrieve from the GPU main memory.

Although I guess it is part of the story that both cards are 60 (small proteins) to 700 (large proteins) times quicker than a single-core CPU client.
And it is definitely part of the story that doing this kind of calculation on a graphics card in the first place is still so new that it warrants publication in a scientific journal.
mephistopheles
 
Posts: 111
Joined: Tue Apr 07, 2009 8:51 am

Re: Unuseful FLOPS in ATI GPU Client?

Postby jbq.junior » Mon Jun 22, 2009 1:53 am

7im wrote:Summing the number of FLOPS is not a very accurate way to gauge actual performance or actual science production. Rough estimates only... As this ATI example shows, ATI GPUs do some calculations twice, so their actual FLOP count is somewhat inflated, but that inflated FLOP count does not translate into more science work being done.

Also, FLOP count varies because some calculations use implicit models and some use explicit models. (I forget which is which, but the CPUs do mostly one kind, and the GPUs ONLY do the other...) One type of model is more thorough (again, I forget which of the two), or more inclusive of the environment surrounding the protein, while the other only works on the protein itself and makes assumptions about the surrounding environment. (Sorry, you can look all this explicit/implicit stuff up on Wikipedia if you must know more; it's also discussed in the GPU FAQs.)

As such, it's not easy to answer the question in the opening post about useful FLOPS... not all FLOPS are created equally, or even calculated equally. Vijay made a post about "deceptive" FLOPS a while back. Search for it. Since then, PG has taken steps to better represent FLOP counts, but nothing perfect. ;)


Thanks, 7im!

mephistopheles wrote:The primary architectural difference seems to be that Nvidia can store intermediate results in fast short-term memory (like a cache, but managed by the program rather than the hardware) while ATI cannot. For the ATI implementation it is quicker to repeat the calculations than to store to and retrieve from the GPU main memory.


But, as far as I know, the RV770 (HD4800) can also store data locally. It has much more register file space (2.5 MB, while the GT200 has 640 KB), 16 KB of Local Data Share per SIMD array (the same size as in Nvidia's solution), and even a 16 KB Global Data Share.

So why can't ATI store locally?

As for the article, I cannot pay to read it. If you have access to it, could you give us some details from the 'implementation' section? I think it would be interesting for many.

Thanks!
jbq.junior
 
Posts: 3
Joined: Fri Jun 19, 2009 3:11 pm
Location: Goiania, Brazil.

Re: Unuseful FLOPS in ATI GPU Client?

Postby mhouston » Mon Jun 22, 2009 4:45 am

We do use our register file for sharing, on both R6XX and R7XX, but that limits sharing to within a "thread", i.e. we do multiple particles per thread. We do not currently use LDS.
mhouston
 
Posts: 1210
Joined: Sun Dec 02, 2007 9:19 pm

Re: Unuseful FLOPS in ATI GPU Client?

Postby mephistopheles » Mon Jun 22, 2009 7:18 pm

jbq.junior wrote:As for the article, I cannot pay to read it. If you have access to it, could you give us some details from the 'implementation' section? I think it would be interesting for many.

It's too long for a forum post, and copyrighted by the journal, but I suppose a couple of quotes fall under "fair use":
ATI implementation
For the generations of ATI boards that were available while the software was under development, scatter capability (i.e. indirect writes such as a[i] = x) was not available. To circumvent this limitation, the ... calculations are carried out using two kernels:
1. A computational kernel to calculate the force ... and output the results to the frame buffer.
2. For each computational kernel a corresponding helper kernel to gather and sum the results stored in the frame buffer and update the final atom-indexed force array.
...
The computational effort used in the helper kernels is minimal. However the overhead associated with the launch of each kernel is significant relative to the overall computational time for the relatively small systems considered here and in aggregate the overhead required is a significant fraction of the total time.
...
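
The two-kernel structure quoted above can be illustrated with a small sketch. It is not the paper's code: the 1-D toy pair term, the per-pair intermediate buffer, and the function names are all assumptions chosen to show the gather-only pattern, in which each output element is written only at its own fixed position and never at an index computed at run time.

```python
def pair_term(xi, xj):
    """Toy 1-D contribution on atom i from atom j (hypothetical stand-in for the real force)."""
    return xi - xj


def compute_kernel(x):
    """First pass: element (i, j) of the buffer is produced only by its own 'thread'."""
    n = len(x)
    return [
        [0.0 if i == j else pair_term(x[i], x[j]) for j in range(n)]
        for i in range(n)
    ]


def helper_kernel(buffer):
    """Second pass: gather each row of the intermediate buffer and sum it per atom."""
    return [sum(row) for row in buffer]


def forces_two_pass(x):
    """Run both passes; the result is an atom-indexed force array."""
    return helper_kernel(compute_kernel(x))


if __name__ == "__main__":
    print(forces_two_pass([0.0, 1.0, 3.0]))
```

The second pass does very little arithmetic, which matches the remark above that the helper kernels are cheap and that the cost lies in launching them.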

Nvidia implementation
... exploiting architectural features of CUDA allowed for significantly more efficient execution.
...
Due to the existence of scatter, thread synchronization, and a 16 K of high-speed shared memory in each processor within CUDA-compatible GPUs, each nonbond kernel can exploit the symmetry of the force calculation matrix... This reduces the magnitude of the overall calculation by a factor of ~2
....
Because all the threads in a warp are guaranteed to execute synchronously, each thread can interact with one atom's data in shared memory at a time for p iterations without any fear of overlap or any need for overt synchronization
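
One schedule consistent with this description is a rotation, in which thread t touches slot (t + step) mod p at each step. The excerpt does not spell out the indexing, so the rotation and the function names below are assumptions; the sketch only checks that such a schedule never has two threads on the same slot at once.

```python
def staggered_schedule(p):
    """schedule[step][thread] is the shared-memory slot that thread touches at that step."""
    return [[(t + step) % p for t in range(p)] for step in range(p)]


def is_conflict_free(schedule):
    """True if, at every step, all threads touch distinct slots."""
    return all(len(set(row)) == len(row) for row in schedule)


def visits_every_slot_once(schedule):
    """True if each thread touches every slot exactly once over the p steps."""
    p = len(schedule)
    return all(
        sorted(schedule[step][t] for step in range(p)) == list(range(p))
        for t in range(p)
    )


if __name__ == "__main__":
    sched = staggered_schedule(4)
    for row in sched:
        print(row)
    print(is_conflict_free(sched), visits_every_slot_once(sched))
```

Because the threads advance in lockstep, a schedule like this needs no explicit locking: at any given step each slot has exactly one writer.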
....
Unlike the ATI client, merging the nonbond kernel with the first loop of the implicit solvent kernel was a big win, improving performance by 20%. The difference here lies in the ability of the shared memory to hold a sufficient number of intermediate values to make this a net win.
...

Future work
Because our GPU implementations have been developed over a period of time, some of the latest advances in GPU hardware have not been fully exploited.



jbq.junior wrote:So why can't ATI store locally?

I'm hardly an expert on GPU programming; I just had a quick look at CUDA and Stream over the weekend. But here is my guess.
The ATI Stream overview says:
The Local Data Store is a write-private, read-public model: A thread can only write to its own memory space.

Unless I am very much mistaken, the "half the number of calculations" trick used in the Nvidia implementation requires that all threads in a group write to the same local memory (and that the threads move in lockstep synchronization, which in the Nvidia model is implicitly enforced by the hardware) to be fully efficient.
mephistopheles
 
Posts: 111
Joined: Tue Apr 07, 2009 8:51 am

Re: Unuseful FLOPS in ATI GPU Client?

Postby deekey777 » Mon Jun 22, 2009 9:18 pm

Stanford uses Brook+ for the GPU2 client, and Brook+ has supported LDS since Stream SDK 1.4 (March), but only on the HD4000 series.
deekey777
 
Posts: 84
Joined: Tue Jan 29, 2008 12:28 am
