Jonathan wrote: I think it would be a HUGE persuasion point if you can get 2.5 PFLOPS for just $525,000, which is the equivalent of China's supercomputer, which cost $88 million (they also used a combination of GPUs and CPUs).
The Chinese computer costs $20 million a year to operate. The estimated power consumption of the build I'm thinking of would be 198 kW, which would cost $208,137 a year in California for electricity (at 12 cents per kWh).
Statements like this are exactly why it's a bad idea to measure a system in FLOPs. Let me explain.
The Wikipedia page you quote for the 560 Ti claims 1263.4 GFLOP/s for the board. That figure is the theoretical peak - the absolute maximum number of floating-point operations the hardware can perform per second at its clock speed. The question of "native" vs "x86" FLOP/s doesn't even apply here - the conversion we use for FAH exists because certain operations that take multiple steps on x86 can be done in a single hardware step on GPUs. The operations counted for the Wikipedia number, however, are plain multiplies and adds, which are a single operation on both x86 and GPU, so there's no x86-vs-native difference here.
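For the curious, that peak is nothing more than arithmetic on the board's published specs. A quick sketch (assuming the usual Fermi counting: 384 CUDA cores, a 1645 MHz shader clock, and one fused multiply-add - i.e. 2 FLOPs - per core per cycle) reproduces the Wikipedia figure:

    # Theoretical peak for a GTX 560 Ti, assuming the published reference specs:
    # 384 CUDA cores, 1645 MHz shader clock, 1 FMA (= 2 FLOPs) per core per cycle.
    cuda_cores = 384
    shader_clock_hz = 1645e6
    flops_per_core_per_cycle = 2  # one multiply + one add per fused multiply-add

    peak_flops = cuda_cores * shader_clock_hz * flops_per_core_per_cycle
    print(peak_flops / 1e9)  # ~1263.4 GFLOP/s

No real workload ever hits that number; it is simply what the silicon could do if every unit retired an FMA on every cycle.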
The first problem is that peak is, generally, a useless number to quote. Even highly optimized software that is computationally "easy" to map to the hardware typically performs at much less than peak. To use your own example, Rpeak for Tianhe-1 (the Chinese supercomputer) is 4.7 PFLOP/s, but the sustained performance is only 2.6 PFLOP/s (and you can be sure that they tuned the heck out of the software for that!).
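To put rough numbers on that (using the figures quoted above, not a benchmark of mine):

    # Sustained vs. peak for Tianhe-1 on LINPACK, using the figures quoted above.
    rpeak_pflops = 4.7   # theoretical peak
    rmax_pflops = 2.6    # sustained LINPACK performance
    print(rmax_pflops / rpeak_pflops)  # ~0.55, i.e. about 55% of peak

And that is LINPACK on a machine tuned specifically for it; a consumer GPU running real molecular dynamics will generally sit far further below its 1263.4 GFLOP/s peak.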
The second problem is that you have to look at not just sustained performance, but sustained performance in a useful program. The TOP500 numbers are for LINPACK, a linear algebra benchmark that is (as far as these things go) relatively easy to parallelize and map to supercomputer architectures. Problems that are harder (like molecular dynamics) will not show the same number of GFLOP/s because it's not possible to use the hardware as efficiently. You can't just say, "Oh, we're at 1/500 the performance we need, so I'll buy 500x the hardware and call it a day" - as more parallel resources are added, it's much harder to utilize them effectively in a traditional supercomputing context. The computational architecture of FAH is nice in that we *can* use more computers with almost no effort, but in many ways it is unfair to compare directly to a traditional supercomputer like Jaguar or Tianhe; they're running different problems and have different measurements.
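Here's a toy example of why scaling up a tightly coupled code is hard (this is just Amdahl's law, which is not the whole story for a real supercomputer, but it captures the flavor): if even a small fraction of the work is serial, extra processors buy you less and less.

    # Toy Amdahl's-law sketch (illustrative only, not a model of any specific code):
    # speedup(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction.
    def speedup(p, n_procs):
        return 1.0 / ((1.0 - p) + p / n_procs)

    for n in (10, 100, 1000, 10000):
        print(n, round(speedup(0.95, n), 1))
    # With 95% parallel / 5% serial work: 6.9x, 16.8x, 19.6x, 20.0x.
    # Ten thousand processors buy you barely a 20x speedup.

FAH avoids that wall by handing out many independent work units, which is why it scales so easily - and also why the head-to-head comparison is unfair.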
So. Could you build a machine that performs the same number of MD FLOP/s as Tianhe does LINPACK FLOP/s, and do it cheaper? Sure, but only because FAH parallelizes more effectively than LINPACK and the two benchmarks are doing completely different things. FLOP/s are not the apples-to-apples comparison you're looking for here. What you really want is "time to solution", or the cost-effectiveness of getting there. That's admittedly a slipperier notion, but it's what underlies the common PPD and PPD/$ metrics people use on the forum - FAH points as a proxy for MD productivity - and I suspect it's a far more reasonable metric in this case.
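If you want to see what such a cost-effectiveness comparison might look like, here is a minimal sketch; every number in it is a placeholder, not a benchmark - substitute measured PPD and real prices before drawing any conclusions:

    # Hypothetical points-per-dollar sketch: FAH points produced over a machine's
    # life, divided by hardware plus electricity cost. All inputs are placeholders.
    def points_per_dollar(ppd, hardware_cost, watts, years=3.0, dollars_per_kwh=0.12):
        days = years * 365
        energy_cost = (watts / 1000.0) * 24 * days * dollars_per_kwh
        return ppd * days / (hardware_cost + energy_cost)

    # e.g. a hypothetical 15,000 PPD card costing $500 and drawing 200 W:
    print(points_per_dollar(ppd=15000, hardware_cost=500, watts=200))

Time to solution is harder to fold into a one-liner, but the same idea applies: measure what the hardware actually produces on the problem you care about, not what it could theoretically do.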