jbq.junior wrote:Thanks Bruce!
7im, but if only a few calculations are done twice, shouldn't the ATI x86 FLOPS be higher, or at least of the same order as those of the nVidia and PS3 clients? According to the F@H Client Status page, 1 ATI native FLOP equals 1.055 x86 FLOPS, while the nVidia and PS3 factors are exactly twice as large (~2.11 x86 FLOPS).
Thanks.
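For anyone who wants to check the arithmetic in that question, here is a minimal sketch. It uses only the conversion factors quoted in the post (reading the decimal commas as decimal points); the dictionary and function names are made up for the example and are not part of any F@H tool.

```python
# Conversion factors as quoted in the post above (native FLOP -> x86 FLOPS).
# The names below are hypothetical, chosen only for this illustration.
X86_PER_NATIVE = {"ATI": 1.055, "NVIDIA": 2.11, "PS3": 2.11}


def x86_flops(native_flops, client):
    """Convert a native FLOP count to x86-equivalent FLOPS for a client type."""
    return native_flops * X86_PER_NATIVE[client]


# The same native count is reported about twice as large on NVIDIA as on ATI.
ratio = X86_PER_NATIVE["NVIDIA"] / X86_PER_NATIVE["ATI"]
print(round(ratio, 3))
```

So for equal native FLOP counts, the quoted factors alone make the nVidia and PS3 x86 figures about double the ATI figure.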
7im wrote:Summing the # of FLOPS is not a very accurate way to conclude actual performance, or actual science production. Rough estimates only... As shown by this ATI example, ATI GPUs do some calculations twice, so their actual FLOP count is somewhat raised, but that raised FLOP count does not translate into more science work being done.
Also, FLOP count varies because some calculations use implicit models, and some use explicit models. (I forget which is which, but the CPUs do mostly one kind, and the GPUs ONLY do the other...) One type of model is more thorough (again, I forget which of the two) or more inclusive of the environment surrounding the protein, while the other only works on the protein itself and makes assumptions about the surrounding environment. (Sorry, you can look all this explicit/implicit stuff up on Wikipedia if you must know more; it's also discussed in the GPU FAQs.)
As such, it's not easy to answer the question in the opening post about useful FLOPS... not all FLOPS are created equally, or even calculated equally. Vijay made a post about "deceptive" FLOPS a while back. Search for it. Since then, PG has taken steps to better represent FLOP counts, but nothing perfect.
mephistopheles wrote:The primary architectural difference seems to be that Nvidia can store intermediate results in fast short-term memory (like a cache, but managed by the program rather than the hardware) while ATI cannot. For the ATI implementation it is quicker to repeat the calculations than to store to and retrieve from the GPU main memory.
jbq.junior wrote:Regarding the article, I cannot pay to read it. If you have access to it, could you give us details about the 'implementation' section? I think it would be interesting to many.
ATI implementation
For the generations of ATI boards that were available while the software was under development, scatter capability (i.e. indirect writes such as a[i] = x) was not available. To circumvent this limitation, the ... calculations are carried out using two kernels:
1. A computational kernel to calculate the force ... and output the results to the frame buffer.
2. For each computational kernel, a corresponding helper kernel to gather and sum the results stored in the frame buffer and update the final atom-indexed force array.
...
The computational effort used in the helper kernels is minimal. However the overhead associated with the launch of each kernel is significant relative to the overall computational time for the relatively small systems considered here and in aggregate the overhead required is a significant fraction of the total time.
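The two-kernel scheme in the ATI excerpt can be sketched on the CPU in plain Python. This is only an illustration of the gather-only idea, not the actual client code: the names `pair_force`, `computational_kernel`, and `helper_kernel`, the toy 1-D force law, and the list-of-lists standing in for the frame buffer are all assumptions made for the example.

```python
# Illustrative CPU sketch of a gather-only, two-pass force calculation.
# All names and the toy force law are hypothetical, not taken from the paper.

def pair_force(xi, xj):
    """Toy 1-D force on atom i due to atom j (antisymmetric in i and j)."""
    return xj - xi


def computational_kernel(positions):
    """Pass 1: each 'thread' (i, j) writes only to its own output slot.

    No indirect write is needed, because slot [i][j] belongs to thread (i, j).
    In this sketch every pair is evaluated twice: once as (i, j), once as (j, i).
    """
    n = len(positions)
    frame_buffer = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                frame_buffer[i][j] = pair_force(positions[i], positions[j])
    return frame_buffer


def helper_kernel(frame_buffer):
    """Pass 2: gather and sum each row into the atom-indexed force array."""
    return [sum(row) for row in frame_buffer]


positions = [0.0, 1.0, 3.0]
forces = helper_kernel(computational_kernel(positions))
print(forces)
```

The second pass does very little arithmetic, which matches the excerpt's remark that the helper kernels are cheap; on real hardware the cost is in launching the extra kernel, which a CPU sketch like this cannot show.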
...
Nvidia implementation
... exploiting architectural features of CUDA allowed for significantly more efficient execution.
...
Due to the existence of scatter, thread synchronization, and 16 KB of high-speed shared memory in each processor within CUDA-compatible GPUs, each nonbond kernel can exploit the symmetry of the force calculation matrix... This reduces the magnitude of the overall calculation by a factor of ~2
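The factor of ~2 in that excerpt comes from the force on atom j due to atom i being the negative of the force on i due to j, so each pair only needs to be evaluated once if the result can be written to both atoms. Here is a rough CPU sketch in Python, assuming a toy 1-D force; the function names are hypothetical and this is not how the CUDA kernels are actually organized.

```python
# Illustrative comparison: full N x N loop versus symmetric half loop.
# Function names and the toy force law are assumptions for this example.

def pair_force(xi, xj):
    """Toy 1-D force on atom i due to atom j (antisymmetric in i and j)."""
    return xj - xi


def forces_full(positions):
    """Each atom sums over all others; every pair is evaluated twice."""
    n = len(positions)
    forces = [0.0] * n
    evaluations = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                forces[i] += pair_force(positions[i], positions[j])
                evaluations += 1
    return forces, evaluations


def forces_symmetric(positions):
    """Each pair is evaluated once and applied to both atoms.

    The update of forces[j] is an indirect write (scatter) to another
    atom's entry, which is the capability the excerpt says CUDA provides.
    """
    n = len(positions)
    forces = [0.0] * n
    evaluations = 0
    for i in range(n):
        for j in range(i + 1, n):
            f = pair_force(positions[i], positions[j])
            forces[i] += f
            forces[j] -= f  # equal and opposite force on the partner atom
            evaluations += 1
    return forces, evaluations


positions = [0.0, 1.0, 3.0, 6.0]
full, n_full = forces_full(positions)
sym, n_sym = forces_symmetric(positions)
print(n_full, n_sym)
```

Both versions give the same forces, but the symmetric loop performs N(N-1)/2 pair evaluations instead of N(N-1).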
....
Because all the threads in a warp are guaranteed to execute synchronously, each thread can interact with one atom's data in shared memory at a time for p iterations without any fear of overlap or any need for overt synchronization
....
Unlike the ATI client, merging the nonbond kernel with the first loop of the implicit solvent kernel was a big win, improving performance by 20%. The difference here lies in the ability of the shared memory to hold a sufficient number of intermediate values to make this a net win.
...
Future work
Because our GPU implementations have been developed over a period of time, some of the latest advances in GPU hardware have not been fully exploited.
jbq.junior wrote:So, why can't ATI store locally?
The Local Data Store is a write-private, read-public model: A thread can only write to its own memory space.
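To make the write-private, read-public idea concrete, here is a toy Python model. It only illustrates that one sentence; the class `WritePrivateStore` and its methods are invented for the example and say nothing about how the real hardware or its API works.

```python
# Toy model of write-private, read-public memory.
# The class and method names are hypothetical, for illustration only.

class WritePrivateStore:
    def __init__(self, n_threads):
        # One slot per thread; slot k is owned by thread k.
        self._slots = [0.0] * n_threads

    def read(self, reader_id, slot):
        """Any thread may read any slot (read-public)."""
        return self._slots[slot]

    def write(self, writer_id, slot, value):
        """A thread may write only to the slot it owns (write-private)."""
        if writer_id != slot:
            raise PermissionError("thread %d cannot write slot %d" % (writer_id, slot))
        self._slots[slot] = value


store = WritePrivateStore(2)
store.write(0, 0, 1.5)                      # thread 0 writes its own slot
value_seen_by_thread_1 = store.read(1, 0)   # thread 1 can read it

try:
    store.write(0, 1, -1.5)                 # thread 0 tries to update thread 1's slot
    cross_write_allowed = True
except PermissionError:
    cross_write_allowed = False

print(value_seen_by_thread_1, cross_write_allowed)
```

Under a model like this, a thread that has just computed a pair force cannot deposit the opposite force into its partner's slot, which is one plausible reading of why the symmetric trick described for Nvidia was not used on ATI.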
