QRB: WU Speed vs. WU Quantity

rwh202
Posts: 425
Joined: Mon Nov 15, 2010 8:51 pm
Hardware configuration: 8x GTX 1080
3x GTX 1080 Ti
3x GTX 1060
Various other bits and pieces
Location: South Coast, UK

QRB: WU Speed vs. WU Quantity

Post by rwh202 »

I think it's fairly obvious that big Pascal cards aren't being used efficiently - either down to drivers, cores or both. If they were, then performance would scale linearly from the 1280-shader 1060, through the 2560-shader 1080, to the 3584-shader 1080 Ti. The QRB would then help further elevate the higher cards.
As it stands, a 1080 Ti barely outperforms a 1080, which in turn barely outfolds a 1070.
Power draw is a clear measure of this - folding struggles to pull the full TDP from a 1080Ti, instead drawing about the same as a 1080 (which is fair, since it does about the same output). There is another 40% on tap for optimisations to draw on.
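
To put rough numbers on the linear-scaling argument, here is a minimal sketch that uses only the shader counts quoted above. It assumes throughput is proportional to shader count and ignores clock speed and memory bandwidth, so it is an idealised upper bound rather than a measurement; the function name is illustrative.

```python
# Shader counts as quoted in the post above.
SHADERS = {"GTX 1060": 1280, "GTX 1080": 2560, "GTX 1080 Ti": 3584}


def ideal_speedup(card: str, baseline: str = "GTX 1060") -> float:
    """Relative throughput if performance scaled linearly with shader count."""
    return SHADERS[card] / SHADERS[baseline]


if __name__ == "__main__":
    for card in SHADERS:
        print(f"{card}: {ideal_speedup(card):.2f}x a GTX 1060")
    # Headroom of the 1080 Ti over the 1080 under the same assumption.
    headroom = ideal_speedup("GTX 1080 Ti", "GTX 1080") - 1
    print(f"1080 Ti over 1080: {headroom:.0%}")
```

Under that assumption the 1080 Ti has about 40% more shader capacity than the 1080, which matches the "another 40% on tap" figure.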
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: QRB: WU Speed vs. WU Quantity

Post by bruce »

GPU processing may be limited by the drivers, but my guess is that the PCIe bus isn't providing the data fast enough to keep the shaders busy 100% of the time. That's a difficult thing to evaluate (at least on Windows) when the drivers run with a spin-wait, so you can't tell whether the CPU is busy calculating things or busy waiting.

There is a certain amount of processing that's off-loaded to the CPU. We like to think that the FAHCore process doesn't do anything except move data between RAM and VRAM to provide shaders with more work before they finish whatever they're doing on the previous block of work, but that's only partly true. OpenCL does off-load certain calculations to the CPU -- especially on Windows. "FAHCore-next" has plans to make improvements in this area but a lot still depends on how the manufacturer writes the drivers.
rwh202

Re: Is there going to be new cores to take advantage of new

Post by rwh202 »

Yeah, Linux performs better but there's still room for improvement - there it only uses 1-2% of the PCIe bus, and with 10 GB/sec available, that shouldn't be a limiting factor (if it is, something should be optimised in the code).
Folding uses only a few hundred MB of RAM and VRAM, and with 10 GB available, there are resources to use and abuse if it improves performance.

It seems like either the projects or routines just don't scale across the sheer number of cores - has anyone tried, or indeed succeeded, to run multiple WUs simultaneously on the same card? Even then, it'll be a trade-off between two tasks at best-case 50% performance each vs a single one getting 70% performance.

I know that some other distributed computing projects effectively interweave 2 or more WUs together that get processed at the same time as part of a super-WU, ensuring that the GPU is always busy (1 task shuffling data or checkpointing on CPU whilst the other is crunching).
rwh202

Re: QRB: WU Speed vs. WU Quantity

Post by rwh202 »

bruce wrote:Running multiple WUs on the same device slows down both of them. If one WU uses ~90% of the GFLOPS, running two of them might make use of the other 10%, but each of the WUs would get ~55% of the original 90% so each one will take 60% longer to finish. As far as FAH is concerned, faster results are ALWAYS better than MORE results if those results are delayed.
If it were 90% then I'd agree. It's just that I believe it's around 70% of available FLOPS being used on big Pascal for one reason or another, and I'm looking to use as many of them as possible for science. Even if points were to go down, science would increase, since I think we all agree that the QRB doesn't give an accurate scaling for science at the high end.
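
The disagreement here comes down to how much of the card a single WU actually uses. A minimal sketch, assuming that two concurrent WUs together reach full utilisation and split it evenly (a best case; the function names are illustrative and not part of any FAH tool):

```python
def per_wu_slowdown(single_wu_utilisation: float) -> float:
    """How many times longer each WU takes when two share the GPU.

    Assumes one WU alone uses `single_wu_utilisation` of the card and
    two WUs together use all of it, half each.
    """
    return single_wu_utilisation / 0.5


def throughput_gain(single_wu_utilisation: float) -> float:
    """Total WUs completed per hour, relative to running one at a time."""
    return 1.0 / single_wu_utilisation


if __name__ == "__main__":
    for u in (0.9, 0.7):
        print(f"{u:.0%} utilised: each WU takes {per_wu_slowdown(u):.2f}x as long, "
              f"total output {throughput_gain(u):.2f}x")
```

Under these assumptions the 90% case gains little total output for a large per-WU delay, while the 70% case gains noticeably more. The exact slowdown depends on the model, so it need not match the ~60% estimate quoted above, and real behaviour would also include scheduling overhead that the sketch ignores.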
bruce

QRB: WU Speed vs. WU Quantity

Post by bruce »

rwh202 wrote:If it were 90% then I'd agree. It's just that I believe it's around 70% of available FLOPS being used on big Pascal for one reason or another, and I'm looking to use as many of them as possible for science. Even if points were to go down, science would increase, since I think we all agree that the QRB doesn't give an accurate scaling for science at the high end.
I question those assumptions.

1) Getting more WUs done doesn't help science as much as getting them done faster. At some point it becomes a matter of supply and demand -- based both on the number of GPUs available and the number of alternate projects that are currently running. If there are N people competing for the "best" WUs and those projects only need X*N Run-Clone-Gens (for some 0<X<1.0), then somebody has to generate (1-X)*N more WUs -- probably low-priority WUs, at that. The total number of WUs that can run on Pascal needs to always be greater than N or people complain that there aren't enough WUs. It's reasonable to assume there are always low-priority projects that could be started provided there's somebody available to generate them and servers with enough bandwidth to manage them, but there are seasonal variations in people who are available to do just that, just as there are seasonal variations in the value of N.

FAH is not massively parallel (like SETI) but has some very significant needs to manage the serialization of limited numbers of trajectories, hence actual scientific value leads to [bonus points >> per WU points].

2) Even if we assume that all of those issues are insignificant, what percentage of Pascal owners would choose more science with lower PPDs? ... and ... what tools do they have to express their desire to bias their assignments contrary to the bias of the majority of other Donors?
rwh202

Re: Is there going to be new cores to take advantage of new

Post by rwh202 »

bruce wrote:
rwh202 wrote:If it were 90% then I'd agree. It's just that I believe it's around 70% of available FLOPS being used on big Pascal for one reason or another, and I'm looking to use as many of them as possible for science. Even if points were to go down, science would increase, since I think we all agree that the QRB doesn't give an accurate scaling for science at the high end.
I question those assumptions.

1) Getting more WUs done doesn't help science as much as getting them done faster. At some point it becomes a matter of supply and demand -- based both on the number of GPUs available and the number of alternate projects that are currently running. If there are N people competing for the "best" WUs and those projects only need X*N Run-Clone-Gens (for some 0<X<1.0), then somebody has to generate (1-X)*N more WUs -- probably low-priority WUs, at that. The total number of WUs that can run on Pascal needs to always be greater than N or people complain that there aren't enough WUs. It's reasonable to assume there are always low-priority projects that could be started provided there's somebody available to generate them and servers with enough bandwidth to manage them, but there are seasonal variations in people who are available to do just that, just as there are seasonal variations in the value of N.

FAH is not massively parallel (like SETI) but has some very significant needs to manage the serialization of limited numbers of trajectories, hence actual scientific value leads to [bonus points >> per WU points].

2) Even if we assume that all of those issues are insignificant, what percentage of Pascal owners would choose more science with lower PPDs? ... and ... what tools do they have to express their desire to bias their assignments contrary to the bias of the majority of other Donors?
If demand exceeds supply, then assignment should block slower cards. Fortunately, we're a way off from that happening and there must be plans in case the 1,000,000 active donors materialise.

I still contest that a 1080Ti could complete two entire trajectories quicker by running them in parallel than sequentially - same science in less time = better.
QRB is there to promote 24/7 folding and running SMP instead of uniprocessor clients. Whether a WU with a 10 day deadline gets completed in 50 minutes or 60 minutes is neither here nor there.

As to what tools, well none from Stanford that I'm aware of, but I'm sure something could be bashed together. As to choosing points or science, it needn't be an either/or if the QRB was 'fixed'. Until it is, I and the majority will still likely optimise for points.
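
For a sense of scale on the 50-minute versus 60-minute comparison: the QRB is usually described as base points multiplied by max(1, sqrt(k * deadline / elapsed time)). The sketch below assumes that form. The base points and k value are made-up illustrative numbers, not taken from any real project, and the function names are hypothetical.

```python
import math


def qrb_points(base_points: float, k: float, deadline_days: float,
               elapsed_days: float) -> float:
    """Points for one WU under the commonly cited QRB form (an assumption here)."""
    bonus = max(1.0, math.sqrt(k * deadline_days / elapsed_days))
    return base_points * bonus


def ppd(base_points: float, k: float, deadline_days: float,
        elapsed_minutes: float) -> float:
    """Points per day when WUs of this kind are folded back to back."""
    elapsed_days = elapsed_minutes / (24 * 60)
    wus_per_day = 1.0 / elapsed_days
    return wus_per_day * qrb_points(base_points, k, deadline_days, elapsed_days)


if __name__ == "__main__":
    BASE, K, DEADLINE = 10_000, 0.75, 10  # hypothetical project settings
    fast = ppd(BASE, K, DEADLINE, 50)
    slow = ppd(BASE, K, DEADLINE, 60)
    print(f"50 min WU: {fast:,.0f} PPD")
    print(f"60 min WU: {slow:,.0f} PPD")
    print(f"ratio: {fast / slow:.2f}")
```

If the formula has this shape, PPD scales with elapsed time to the power of -1.5 while the bonus is active, so the 10-minute difference is rewarded fairly heavily even though both times are far inside a 10 day deadline.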
bruce

Re: QRB: WU Speed vs. WU Quantity

Post by bruce »

rwh202 wrote:If demand exceeds supply, then assignment should block slower cards. Fortunately, we're a way off from that happening and there must be plans in case the 1,000,000 active donors materialise.
At the present time, the servers don't have any kind of dynamic assignment modulation methodology. A particular project is either assignable to a particular GPU or it's not. I don't remember seeing a request for such an enhancement. You might submit one. Nevertheless, the FAH Development team does work on unannounced system level enhancements so there might or might not be something like that in the pipeline.

Unfortunately, it would probably be prohibitively complex: categorizing GPUs into smaller groups, updating those groups whenever new hardware comes to market, managing project priority, and somehow optimizing all that against both the current spectrum of active GPUs and the active spectrum of projects.

Nevertheless, you can open an enhancement request.
rwh202 wrote:I still contest that a 1080Ti could complete two entire trajectories quicker by running them in parallel than sequentially - same science in less time = better.
Quicker? I don't think so. The duration of each WU is from the time it's assigned until it's returned. The next WU from each trajectory cannot be issued until the current WU is completed.

It's impossible to halve the duration of both WUs if you allocate fewer GFLOPS to each one.
rwh202 wrote:QRB is there to promote 24/7 folding and running SMP instead of uniprocessor clients. Whether a WU with a 10 day deadline gets completed in 50 minutes or 60 minutes is neither here nor there.
The classic example came from CPU WUs. When hyperthreading was invented, folks were able to run a pair of uniprocessor WUs on the shared FPU. Two WUs would take ~150% as long as running a single WU on both halves of the FPU. More WUs were completed but they all took longer, slowing down projects that needed fast turnaround. (The 150% number is approximate, depending on many factors.)

QRB does promote SMP because that's dedicating more GFLOPS to fewer WUs.
It does not promote 24/7 folding. It only promotes making sure that the results are returned ASAP. That's why there's a FINISH setting -- to be used if you're going to suspend FAH for some reason.
rwh202 wrote:As to what tools, well none from Stanford that I'm aware of, but I'm sure something could be bashed together. As to choosing points or science, it needn't be an either/or if the QRB was 'fixed'. Until it is, I and the majority will still likely optimise for points.
Let's not start another discussion about how QRB needs to be fixed. There are dozens of topics here from a number of years ago and FAH management decided not to accept any of those suggestions.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Is there going to be new cores to take advantage of new

Post by _r2w_ben »

Some smaller projects are excluded from running on newer GPUs because of low PPD, e.g. p9842.

The current restriction seems to be based on GPU family, and there is a large range in performance between the fastest and slowest cards within a generation. Could GPUs.txt and the assignment server logic be enhanced to include GFLOPS as a more accurate predictor of performance? Historical data is available for AMD and NVIDIA.
bruce

Re: Is there going to be new cores to take advantage of new

Post by bruce »

I did give the PIs the ability to cut off some GPUs based on GFLOPS (which is why those numbers appeared in the comments section). Doing that for all GPUs is a major project with very little pay-back over the current semi-random assignment process.

GFLOPS is a crude approximation of GPU performance, given the way that a kernel has to handle large/small proteins in relation to the speeds of various segments of VRAM. We don't want to go too far down that road.

If a low PPD project is excluded from a specific class of GPUs, that increases the chances of those GPUs being idle due to getting no assignment when there's nothing else available. Conditions like that do happen from time to time, and idle GPUs are not good for either Donors or Science. Either a project is assignable or it's not. Setting the project to a low priority is a better option, but that priority applies to all eligible GPUs.

FAH maintains project priorities which don't consider PPD on specific platforms. Some low PPD assignments may have a higher priority than higher PPD projects for a given subclass of GPUs. In any case, all projects need some percentage of assignments even if that percentage varies based on supply & demand & priority.
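
The idle-GPU risk described above can be sketched with a toy assignment rule. Everything here is hypothetical: the `Project` fields, the `max_gflops` cut-off, the `assign` function and the GFLOPS values are illustrative only and do not reflect the real assignment server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    name: str
    priority: int                       # higher is assigned first
    max_gflops: Optional[float] = None  # hypothetical cut-off; None = no limit
    has_work: bool = True


def assign(gpu_gflops: float, projects: list[Project]) -> Optional[Project]:
    """Return the highest-priority project this GPU may run, or None (idle)."""
    eligible = [
        p for p in projects
        if p.has_work and (p.max_gflops is None or gpu_gflops <= p.max_gflops)
    ]
    return max(eligible, key=lambda p: p.priority, default=None)


if __name__ == "__main__":
    projects = [
        Project("small-protein", priority=1, max_gflops=5000),
        Project("large-protein", priority=2, has_work=False),  # temporarily empty
    ]
    for gflops in (4000, 9000):  # illustrative slow and fast GPUs
        chosen = assign(gflops, projects)
        print(gflops, "->", chosen.name if chosen else "idle")
```

In this toy setup the fast GPU goes idle whenever the only project with work is the one it has been cut off from, whereas lowering that project's priority instead of excluding it would keep the card busy.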
rwh202

Re: Is there going to be new cores to take advantage of new

Post by rwh202 »

bruce wrote:
rwh202 wrote:If demand exceeds supply, then assignment should block slower cards. Fortunately, we're a way off from that happening and there must be plans in case the 1,000,000 active donors materialise.
At the present time, the servers don't have any kind of dynamic assignment modulation methodology. A particular project is either assignable to a particular GPU or it's not. I don't remember seeing a request for such an enhancement. You might submit one. Nevertheless, the FAH Development team does work on unannounced system level enhancements so there might or might not be something like that in the pipeline.
As I said, I don't think this is needed as an enhancement yet - the priority should be to ensure that the supply of WUs exceeds demand. It was a counter to your suggestion that the finite supply of WUs would be an issue if trying to optimise big cards.
bruce wrote:
rwh202 wrote:I still contest that a 1080Ti could complete two entire trajectories quicker by running them in parallel than sequentially - same science in less time = better.
Quicker? I don't think so. The duration of each WU is from the time it's assigned until it's returned. The next WU from each trajectory cannot be issued until the current WU is completed.

It's impossible to halve the duration of both WUs if you allocate fewer GFLOPS to each one.
I don't think you fully understand what I am proposing here - I am well aware that the next WU of a trajectory can't be issued until the previous one is completed and returned.
Imagine that there are two whole trajectories in the supply chain and a single 1080 Ti donor to crunch them.
They could fold
Trajectory 1 WU1
Trajectory 2 WU1
Trajectory 1 WU2
Trajectory 2 WU2
Trajectory 1 WU3
Trajectory 2 WU3
etc.
But these WUs don't make full use of the card and take an hour each (6 hours to complete both runs to Gen3).

Alternatively, it runs:
Trajectory 1 WU1, Trajectory 2 WU1
Trajectory 1 WU2, Trajectory 2 WU2
Trajectory 1 WU3, Trajectory 2 WU3
etc.
This time, the card is fully utilised (GFLOPS are improved) and each pair of WUs takes 1 hour 40 mins. Total time to Gen3 is 5 hours - same work in less time.
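
The two schedules above can be compared with a short sketch. It uses only the figures from this post (60 minutes per WU run alone, 100 minutes per concurrent pair) and ignores upload, download and assignment gaps; the function names are illustrative.

```python
def sequential_finish_times(n_traj: int, gens: int, wu_minutes: int) -> list[int]:
    """Minute at which each trajectory reaches its last Gen when WUs are
    run one at a time, alternating between trajectories."""
    clock = 0
    finished = [0] * n_traj
    for _gen in range(gens):
        for traj in range(n_traj):
            clock += wu_minutes
            finished[traj] = clock
    return finished


def paired_finish_times(n_traj: int, gens: int, batch_minutes: int) -> list[int]:
    """Minute at which each trajectory reaches its last Gen when one WU
    from every trajectory runs concurrently in each batch."""
    return [gens * batch_minutes] * n_traj


if __name__ == "__main__":
    print("one at a time:", sequential_finish_times(2, 3, 60))
    print("in pairs:     ", paired_finish_times(2, 3, 100))
```

With these numbers the first trajectory reaches Gen3 at the 5 hour mark either way; the saving comes from the second trajectory finishing at 5 hours instead of 6.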
bruce

Re: Is there going to be new cores to take advantage of new

Post by bruce »

You are looking at FAH from the perspective of one individual, not from a global perspective. FAH has many, many Donors and collectively, they can do work FASTER than an individual can.

Trajectory 1 WU1,
Trajectory 1 WU2,
Trajectory 1 WU3,
Finishes in 3 hours, whether you do them or somebody else picks them up. That's faster than 3 x 1h40m.

Somebody else can complete the other 3 WUs in 3 hours. That's a win for both trajectories.

Now suppose a project needs trajectories that take 6 months (4320 Gens per trajectory). Your method requires 10 months and the grad student running the project doesn't graduate on time. (Some projects are run by grad students.)
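
The 6 month and 10 month figures follow from the per-WU times already used in this thread, since the Gens of one trajectory must run in sequence. A minimal sketch, assuming 30-day months and no gaps between returning one WU and starting the next (the function name is illustrative):

```python
HOURS_PER_MONTH = 30 * 24  # 30-day months, matching the 4320-Gen figure


def months_to_finish(gens: int, hours_per_gen: float) -> float:
    """Wall-clock months for one trajectory whose Gens must run in sequence."""
    return gens * hours_per_gen / HOURS_PER_MONTH


if __name__ == "__main__":
    # One WU at a time, 1 hour each.
    print(f"{months_to_finish(4320, 1.0):.1f} months")
    # Paired WUs, 1 hour 40 minutes each.
    print(f"{months_to_finish(4320, 100 / 60):.1f} months")
```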

That's why spending 5 hours on
Trajectory 1 WU1, Trajectory 2 WU1
Trajectory 1 WU2, Trajectory 2 WU2
Trajectory 1 WU3, Trajectory 2 WU3

gets lower PPD than spending 5 hours on
Trajectory 1 WU1
Trajectory 2 WU2
Trajectory x WU3
Trajectory y WU4
Trajectory z WU5.
rwh202

Re: Is there going to be new cores to take advantage of new

Post by rwh202 »

bruce wrote:You are looking at FAH from the perspective of one individual, not from a global perspective. FAH has many, many Donors and collectively, they can do work FASTER than an individual can.

Trajectory 1 WU1,
Trajectory 1 WU2,
Trajectory 1 WU3,
Finishes in 3 hours, whether you do them or somebody else picks them up. That's faster than 3 x 1h40m.

Somebody else can complete the other 3 WUs in 3 hours. That's a win for both trajectories.

Now suppose a project needs trajectories that take 6 months (4320 Gens per trajectory). Your method requires 10 months and the grad student running the project doesn't graduate on time. (Some projects are run by grad students.)

That's why spending 5 hours on
Trajectory 1 WU1, Trajectory 2 WU1
Trajectory 1 WU2, Trajectory 2 WU2
Trajectory 1 WU3, Trajectory 2 WU3

gets lower PPD than spending 5 hours on
Trajectory 1 WU1
Trajectory 2 WU2
Trajectory x WU3
Trajectory y WU4
Trajectory z WU5.
But there are more trajectories than donors, so you can't assume that the other trajectory will be completed. Having 2 trajectories completed in 5 hours has to be better than the same 2 in 6 hours. Even if the other trajectory was able to be resourced from the pool of donors in the same period, the chances are that it would take more than 5 hours to return based on average performance of the pool.

If your example of a 6-month deadline, with my method taking 10, was realistic, then the whole system would be flawed because it would need a performance level equivalent to 60% of a 1080Ti from every donor all of the time. Every lower specced card, and every dumped, errored or timed-out WU, would see their graduation delayed. I've not seen the stats for folding, but for other projects, the ratio of successful, on-time results to the number assigned is appalling.