One thing to note about FAH and GROMACS is that the developers spend a lot of time hand-optimizing the code so that it runs incredibly efficiently on a given platform.
If you compare NVidia's development history with AMD's, it really is no surprise that NVidia is so far ahead...
Take the past three generations of AMD cards and you have three fundamentally different architectures: VLIW5, VLIW4 and, most recently, a big shift to GCN. Big changes each step of the way. To be fair to AMD, they actually got the utilisation of VLIW5 pretty high, apparently between 4 and 4.5 of the 5 VLIW slots according to mhouston, which is a pretty good effort as that architecture was very challenging to drive anywhere near peak efficiency (a rough sketch of what that utilisation figure means for throughput is below).
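As a quick illustration of why that figure matters, here is a back-of-the-envelope sketch, assuming the 4-4.5 number means the average count of the 5 VLIW slots filled per instruction bundle; the peak-GFLOPS value is made up purely for illustration, not a measurement of any real card.

    # Back-of-the-envelope: how VLIW5 slot utilisation translates into effective
    # throughput. The 4.0-4.5 slot figure comes from the discussion above; the
    # peak-GFLOPS value is hypothetical, not a spec of any actual card.
    VLIW_WIDTH = 5  # VLIW5: five ALU slots per instruction bundle

    def effective_gflops(peak_gflops, avg_slots_filled):
        """Scale peak throughput by the fraction of VLIW slots actually filled."""
        return peak_gflops * (avg_slots_filled / VLIW_WIDTH)

    peak = 2700.0  # hypothetical single-precision peak for a VLIW5-era card
    for slots in (4.0, 4.25, 4.5):
        print(f"{slots}/5 slots filled -> ~{effective_gflops(peak, slots):.0f} GFLOPS "
              f"({slots / VLIW_WIDTH:.0%} of peak)")

Even at that healthy utilisation, you only ever see 80-90% of the headline number, and real kernels that pack fewer independent operations per bundle fare much worse.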
Compare that with NVidia: up until the latest series, where there was a relatively big shift (ironically, Nvidia moved from fewer, more powerful 'cores' to many weaker 'cores' while AMD has done the reverse), the underlying architecture had evolved over recent generations rather than radically changed. And while NVidia did make a large jump between the 500 series and the 600 series, software- and firmware-wise it has been a much simpler path forward. If AMD's hardware architecture has been all over the place, its firmware/software evolution has been even worse (and its historical reputation on drivers is pretty poor, though from the forums it looks like mhouston has put a lot of effort into resolving driver issues in the past). When results need to uphold such scientific integrity and rigor, development doesn't happen as fast as it does on Web 2.0 sites...
So while the latest generation of architecture is much different, getting a workable client out for Nvidia has been much easier. Any OpenCL folding core on the GPU is going to be far less mature than the CUDA implementation. Furthermore, to simply say "AMD outperforms NVidia in some other GPGPU applications" is incredibly disingenuous, because so much depends on the actual calculations being undertaken; the rough sketch below shows how the 'faster' card can flip depending on the workload.
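To make that last point concrete, here is a minimal roofline-style sketch (a standard way of reasoning about GPU kernels, not anything specific to FAH or GROMACS). Every hardware number and kernel intensity below is invented purely for illustration; the point is only that which GPU comes out ahead depends on the arithmetic intensity of the calculation.

    # Roofline-style sketch: attainable throughput is bounded by either compute
    # peak or memory bandwidth times arithmetic intensity (FLOPs per byte moved).
    # All numbers here are invented for illustration only.
    def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
        """Classic roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
        return min(peak_gflops, mem_bw_gbs * flops_per_byte)

    # Two hypothetical GPUs: one compute-heavy, one bandwidth-heavy.
    gpu_a = {"peak": 3000.0, "bw": 150.0}
    gpu_b = {"peak": 2000.0, "bw": 250.0}

    for name, intensity in (("bandwidth-bound kernel", 2.0),
                            ("compute-bound kernel", 40.0)):
        a = attainable_gflops(gpu_a["peak"], gpu_a["bw"], intensity)
        b = attainable_gflops(gpu_b["peak"], gpu_b["bw"], intensity)
        winner = "A" if a > b else "B"
        print(f"{name}: GPU A ~{a:.0f} GFLOPS, GPU B ~{b:.0f} GFLOPS -> GPU {winner} wins")

On the bandwidth-bound kernel the bandwidth-heavy card wins, on the compute-bound kernel the compute-heavy card wins, so a benchmark from one application tells you very little about another.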
Finally, it has been said for a while now that ATI folding has been 'waiting' on larger WUs in order to take better advantage of all the stream processors. I personally don't think hardware should dictate the science - they shouldn't release large implicit solvent WUs just because they have the GPUs to do so. On the CPU side of the ledger, though, much larger explicit solvent proteins do exist (BigAdv, anyone?), which is what makes this quote most interesting to me:
"With recent advances in both cores and completion of our testing of these capabilities to ensure agreement, we are now confident we can do the same work on both cores."
In other words, there is the possibility that larger, formerly CPU-only explicit solvent proteins may soon be able to be crunched on the GPU. That fact seems to have been somewhat overlooked in the broader analysis of that blog post. What the post means in terms of supported platforms, computational efficiency, availability and numerous other aspects is purely speculation until the new cores roll out...
All in all, I am optimistic that the relative 'quiet' lately has hidden what is hopefully a large amount of development work behind the scenes. I for one am keeping my eyes on the developments ahead, as hopefully all the different components are finally lining up in a way that lets GPU folding really make a big leap forward.