Inherent problem of small Work Units running on big GPUs

Moderators: Site Moderators, FAHC Science Team

FaaR
Posts: 66
Joined: Tue Aug 19, 2008 1:32 am

Inherent problem of small Work Units running on big GPUs

Post by FaaR »

We've all run into this, I suppose - or at least those of us with a high-end GPU have.

We get a small work unit that contains comparatively few atoms, and the GPU doesn't have enough data to crunch to fill all the available shader processors on the chip. The result is a GPU running full tilt, burning full power, but doing only a fraction of the total amount of work it is capable of.

This is WASTEFUL, from several points of view, not least the wasted electricity, and particularly if the source of that electricity emits large amounts of pollution (say, a coal-fired power plant). I have two AMD Vega 64 GPUs; they can do about 1-1.2M PPD each, give or take, with a big work unit, but often they get stuck with a smaller one, burning upwards of 250 watts each for only 700k PPD or, as frustratingly happens sometimes, a mere 450k PPD. That's just not okay. That's taking advantage of me, my time, my computer, my electricity, my money. I don't mind freely donating electricity and the lifespan of my electronics for a good cause, or I wouldn't have kept running F@H for years, but for grud's sake, at least put my gear to good use!

There's a possible solution... or actually, there are two. Either let users with big GPUs refuse small WUs (easier to implement, but not an ideal alternative), or let large GPUs work on two small WUs at once, using what AMD calls "asynchronous compute shaders". The latter should fill all - or almost all, in the case of a ludicrously large GPU - of the shader processors, and let the hardware run much more efficiently.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Inherent problem of small Work Units running on big GPUs

Post by bruce »

FAH's development team is aware of the requests to run pairs of small WUs on wide GPUs. The pros and cons are being evaluated. They are as interested in changes that will increase FAH's total production as you are.

I have no information about how the deliberations are going and no way to predict whether anything will come of this idea or, if it turns out to be successful, when it might actually go into production. They do have a backlog of improvement suggestions, and each will get a fair hearing on cost/benefit to establish a ranking.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: Inherent problem of small Work Units running on big GPUs

Post by PantherX »

In addition to what bruce stated, note these two factors:

1) AMD drivers could benefit from optimizations, similar to how Nvidia GPUs benefited from the CUDA implementation. It would be fantastic to get AMD's support to help ensure that FahCore_22 is as optimized as possible for their architecture. Do keep in mind that we have recently had volunteers join in and help track down some AMD bugs in OpenMM; those fixes were ported to FahCore_22 version 0.0.13, but there's more room to improve.

2) There's a possibility that future algorithms might be more intensive on the GPU due to using AI/ML/MM/QML, so the current state is bound to change. Given the many pending features, bug-fixes, etc., they will be prioritized so that the ones that provide the most scientific bang for the buck are addressed first.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Inherent problem of small Work Units running on big GPUs

Post by ThWuensche »

I don't know about your VEGA 64 GPUs, but at least my Radeon VIIs do consume less power while they run smaller WUs, which do not fully utilize them.

As for the solutions you propose:
- Assigning only larger WUs to wide GPUs: that is being worked on. It makes sense only as long as WUs for large systems are available, which is not always the case. At the moment I'm also running projects that give only 1.1M PPD, while the GPUs on average make 1.5M PPD, but it seems other projects are not available right now. If your PPD is only about half of the average, please report those projects here; the project owners can exclude such projects from assignment to AMD GPUs (this has been done before after I reported one, so I can confirm it works).
- As for executing more than one WU on a GPU: it will raise hardware utilization and the number of calculations delivered, but I doubt it will bring more points. The different WUs will compete for hardware resources, slowing down the individual WUs, which may hurt the quick-return bonus more than the parallel WUs improve the point outcome. Of course, for the project, more calculations done would be helpful, even if it leads to reduced PPD.

A few months ago I was also in a situation where something did not work for me, and it could be fixed. OpenMM, the system underlying the FAH core22, is open source, so you're welcome to help. The team behind OpenMM and core22 is pushing things ahead, but it is also very small. What they have provided is a really great system, but resources are limited. So if you want improvements, I would invite you to help.

Disclaimer: I'm not a representative of the project, so I can't speak for it - these are just my thoughts.
Last edited by ThWuensche on Mon Oct 19, 2020 4:46 pm, edited 1 time in total.
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Inherent problem of small Work Units running on big GPUs

Post by foldy »

ThWuensche wrote:- As for executing more than one WU on a GPU: it will raise hardware utilization and the number of calculations delivered, but I doubt it will bring more points. The different WUs will compete for hardware resources, slowing down the individual WUs, which may hurt the quick-return bonus more than the parallel WUs improve the point outcome. Of course, for the project, more calculations done would be helpful, even if it leads to reduced PPD.
I tried running 2 work units in parallel on a GTX 1080 Ti on Windows, and also running BOINC GPUGrid concurrently with FAH on the same GPU. You need to manually configure the 2nd GPU slot on the same GPU with an index value of 0; the 1st GPU slot on the same GPU can stay at the default -1 (a sketch of the config is below). The disadvantage is that it now takes 2 CPU threads to feed the GPU, and RAM usage doubles. On Linux it had issues with the GPU indexes, and after a PC reboot FAHControl complained about the duplicate GPU slots and I had to reconfigure them again. But if your GPU is half idle, then adding a 2nd project is a good idea. Yes, it may delay both work units if they are bigger again, but it doesn't matter to the project whether you deliver the 1st work unit and then the 2nd, or deliver both after the same amount of time - the bonus points just get lower.
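
For anyone who wants to try it anyway, here is a rough sketch of what the two slots might look like in config.xml. This is an unsupported setup, and the slot ids and gpu-index value are assumptions that depend on how the client enumerates your own hardware, so verify against your installation.

Code:

<!-- inside <config> ... </config> of FAHClient's config.xml -->

<!-- 1st slot: leave GPU selection at the default (-1, auto-detect) -->
<slot id='1' type='GPU'/>

<!-- 2nd slot: pinned manually to the same physical GPU (index 0 here) -->
<slot id='2' type='GPU'>
  <gpu-index v='0'/>
</slot>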
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Inherent problem of small Work Units running on big GPUs

Post by PantherX »

FYI, attempting to configure 2 GPU Slots on a single GPU might work for some and not others as this is an unsupported method. With the new GPU enumeration scheme in V7.6.20, that may or may not work... if it doesn't work, there won't be any "fix" as it isn't officially supported.

When it comes to running multiple WUs on a large GPU, that would not be done in a traditional manner, i.e. multiple GPU Slots for one GPU. Instead, it would be done by the researchers themselves if they can ensure that all the drawbacks are being addressed. Thus, you would see a "single WU" being downloaded/uploaded. Currently, it's all theoretical so there's no commitment.
FaaR
Posts: 66
Joined: Tue Aug 19, 2008 1:32 am

Re: Inherent problem of small Work Units running on big GPUs

Post by FaaR »

Thanks to those who work on the project for providing some info; it's good to know this issue is being looked at. I would like to stress (perhaps needlessly ;)) that it's one increasingly in need of a fix: GPUs only keep getting bigger and bigger, so the proportion of lost computing power will also keep growing.
ThWuensche wrote:I don't know about your VEGA 64 GPUs, but at least my Radeon VIIs do consume less power while they run smaller WUs, which do not fully utilize them.
Alas, the power consumption of my Vegas does not drop appreciably even when running very small WUs. It's rather vexing, actually.
ThWuensche wrote:If your PPD is only about half of the average, please report those projects here; the project owners can exclude such projects from assignment to AMD GPUs
They're too common to keep doing that. I can't babysit my PC the whole time. :) Also, I *want* to contribute... That's my whole point for starting this thread! :D I want to be able to contribute as much as possible.
ThWuensche wrote:As for executing more than one WU on a GPU: it will raise hardware utilization and the number of calculations delivered, but I doubt it will bring more points.
Maybe. Maybe not. The points aren't really, um, the point, though. :) I just used some PPD figures I see frequently to illustrate the problem I'm describing. And the total amount of science done per unit of time would still go up even if the early-return bonus is lower. But I don't think the bonus would go down much if the GPU is only a little over half occupied, as is often the case for me, or sometimes considerably less than half.
ThWuensche wrote:OpenMM, the system underlying the FAH core22, is open source, so you're welcome to help. The team behind OpenMM and core22 is pushing things ahead, but it is also very small. What they have provided is a really great system, but resources are limited. So if you want improvements, I would invite you to help.
Yeah, you teach me computer programming and vector maths and whatnot, and I will! I'm warning you though, I'm an Old these days, so I'm a hard learner. Sorry!

Please. I'm spending probably in excess of a thousand kilowatt-hours of electricity yearly on behalf of F@H (my rig can draw well over 800W from the wall at full burn), so I don't think it's too forward of me to make a suggestion for improvements.
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Inherent problem of small Work Units running on big GPUs

Post by MeeLee »

PantherX wrote:FYI, attempting to configure 2 GPU Slots on a single GPU might work for some and not others as this is an unsupported method. With the new GPU enumeration scheme in V7.6.20, that may or may not work... if it doesn't work, there won't be any "fix" as it isn't officially supported.

When it comes to running multiple WUs on a large GPU, that would not be done in a traditional manner, i.e. multiple GPU Slots for one GPU. Instead, it would be done by the researchers themselves if they can ensure that all the drawbacks are being addressed. Thus, you would see a "single WU" being downloaded/uploaded. Currently, it's all theoretical so there's no commitment.
I think this is the way it should be.
If people manually run multiple WUs per GPU, there's no control over which WUs they'll get. One may download 2 WUs that each put 80-90% load on the GPU, which obviously would overload the GPU and lower PPD.
If one is able to run 2 small WUs, each using 40-50% of the GPU and resulting in an 80-100% GPU load, then as soon as one WU finishes, the next WU to come in could require more resources.
For this reason, I think multi-WU support should be handled from FAH's side (or the researchers').
Big Navi is on the way, RTX 3000 GPUs are on the way... and after this, we can only expect even faster and bigger GPUs to find their way in!
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Inherent problem of small Work Units running on big GPUs

Post by PantherX »

FaaR wrote:...Please. I'm spending probably in excess of a thousand kilowatt-hours of electricity yearly on behalf of F@H (my rig can draw well over 800W from the wall at full burn), so I don't think it's too forward of me to make a suggestion for improvements.
Ideas and suggestions are always welcomed here so feel free to post accordingly.

The idea of multiple WUs on a single GPU was discussed a few months ago (along with a whole bunch of other ideas) with a focus on using high-end GPUs and bringing them to their knees (yes, I want my GPUs to be pushed to their limit). There are some really interesting decisions to be made (by the F@H Team), and this is an area close to the heart of many researchers, since it would unlock new methods of research. Alas, there's no ETA, but work is happening in the background to ensure that all donor hardware (not just the high-end kind) is utilized as efficiently as possible.

The usage of CUDA on Nvidia GPUs is a massive boost there. While there's nothing like that yet on AMD GPUs, let's wait and see what the future brings :)
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Inherent problem of small Work Units running on big GPUs

Post by bruce »

FaaR wrote:We get a small work unit that contains comparatively few atoms, and the GPU doesn't have enough data to crunch to fill all the available shader processors on the chip. The result is a GPU running full tilt, burning full power, but doing only a fraction of the total amount of work it is capable of.

There's a possible solution... or actually, there are two. Either let users with big GPUs refuse small WUs (easier to implement, but not an ideal alternative), or let large GPUs work on two small WUs at once, using what AMD calls "asynchronous compute shaders". The latter should fill all - or almost all, in the case of a ludicrously large GPU - of the shader processors, and let the hardware run much more efficiently.
FAHClient is designed to run one WU, start to finish, then request another assignment and return the finished WU. Altering that sequence of events and replacing it with a manual operation on two WUs means you'll be doing a lot of manual operations and a lot of extra "babysitting", so I doubt this will go anywhere without a major change by the FAH development team.

@foldy You've done that. Have you figured out a convenient way to make it work?

What happens if you get a small WU that finishes in 1 hour and the next WU takes 6 hours? Do you have to keep replacing the first WU every hour?

FAH should be able to construct bigger WUs, but it's extra work. Suppose a double project takes the atoms for project P R C G and the atoms for P (R+1) C G, and they are placed side by side in space with some kind of constraints that prevent WU R and WU R+1 from interacting. Both would finish at about the same wall-clock time, avoiding the WU-synchronization problem I mentioned above. The extra atoms would keep twice as many shaders busy, and twice as much work would be completed in less than twice as much wall-clock time. Again, this makes the scientist do extra work to keep your GPU busy.

As for a small WU taking almost the same amount of power as a bigger one: you need to define a power limit for your GPU. The GPU will continue to work, but at a reduced power level. Internally, some idle calculation units will get powered down.
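
For example, something along these lines (the wattage and flags are illustrative; run with admin/root rights and check your own driver's documentation):

Code:

# Nvidia (Windows or Linux): cap board power at 180 W on GPU 0
nvidia-smi -i 0 -pl 180

# AMD (Linux with ROCm): set the Power OverDrive cap to 180 W on GPU 0
rocm-smi -d 0 --setpoweroverdrive 180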
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Inherent problem of small Work Units running on big GPUs

Post by gunnarre »

If small atom-count work units use the same amount of power as a large one, that might be due to the card's architecture, but it may also be due to a sub-optimal driver that is optimized for maximum gaming performance instead of the best computing results per watt. That said, many users have experienced that reducing the power limit on the card can significantly reduce power consumption with only tiny reductions in PPD, so maybe try that? Manually under-volting the card is not something we officially recommend, because any sort of under- or over-clocking might introduce instability - but reducing the power target in the driver control panel is worth trying.
FaaR wrote:
ThWuensche wrote:If your PPD is only about half of the average, please report those projects here; the project owners can exclude such projects from assignment to AMD GPUs
They're too common to keep doing that. I can't babysit my PC the whole time. :)
Instead of babysitting the computer, you can analyze the logs after the fact to see what PPD you get on each work unit. There are tools for this, and you might also send the log files to someone to look at. Edit: One alternative is to install the unofficial Dark Mode web client, which gathers statistics for you: https://folding.lar.systems/#get_FITDAT
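
If you'd rather script it yourself, here is a minimal sketch of the idea in Python. It assumes the default log file name and the usual "Project: ..." and "Final credit estimate" log lines; adjust the path and patterns to your own installation, and note it reports credit per WU, not PPD (for PPD you'd also need each WU's wall time).

Code:

# Sketch: pull per-project credit estimates out of a FAHClient log.
import re

project_by_wu = {}  # e.g. "WU01" -> "16600"
with open("log.txt") as log:  # path is an assumption
    for line in log:
        m = re.search(r"(WU\d+):.*Project: (\d+)", line)
        if m:
            project_by_wu[m.group(1)] = m.group(2)
        m = re.search(r"(WU\d+):.*Final credit estimate, ([\d.]+) points", line)
        if m:
            wu, points = m.group(1), float(m.group(2))
            print(f"project {project_by_wu.get(wu, '?')}: {points:.0f} points")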
Online: GTX 1660 Super, GTX 1080, GTX 1050 Ti 4G OC, RX580 + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 960, GTX 950
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Inherent problem of small Work Units running on big GPUs

Post by ThWuensche »

FaaR wrote:
OpenMM, the system underlying the FAH core22, is open source, so you're welcome to help. The team behind OpenMM and core22 is pushing things ahead, but it is also very small. What they have provided is a really great system, but resources are limited. So if you want improvements, I would invite you to help.
Yeah, you teach me computer programming and vector maths and whatnot, and I will! I'm warning you though, I'm an Old these days, so I'm a hard learner. Sorry!

Please. I'm spending probably in excess of a thousand kilowatt-hours of electricity yearly on behalf of F@H (my rig can draw well over 800W from the wall at full burn), so I don't think it's too forward of me to make a suggestion for improvements.
Sorry, I did not want to suggest that you should not make suggestions - they are very welcome! I just wanted to point out the possibility of contributing to the system beyond hardware resources and electricity, since that is not so openly visible. Of course that's a road not everyone can take. But the team behind it can only handle so much, so one way to get quicker development is to extend the team's resources (by contributing development time, or by making money available to buy development time).
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Inherent problem of small Work Units running on big GPUs

Post by MeeLee »

In Linux you can set a hard limit on GPU frequency.
It won't allow the GPU to spike past a certain MHz or GHz value.
This prevents the GPU from drawing more power to reach maximum performance.
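
On recent Nvidia drivers, for example (the clock values are just placeholders - pick numbers for your own card):

Code:

# Lock GPU clocks into a 300-1400 MHz window on GPU 0
nvidia-smi -i 0 --lock-gpu-clocks=300,1400

# Remove the lock again
nvidia-smi -i 0 --reset-gpu-clocks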

Other than that, if a small-atom WU uses only 60% of the GPU, you'll also see roughly 60% power usage.
Just because a 3080 draws nearly 320W at full load doesn't mean it always draws 320W.
The GPU itself draws fewer watts the less load it has.

What doesn't really change, is the rest of the system. The overhead.
If your GPU runs at 275W, your CPU at 65W, your PSU has 20% overhead losses, and your RAM, SSD, monitor, USB peripherals, WiFi or LAN, and fans use a few watts, let's say your system runs at 400W with 275W used by the GPU, leaving 125W of system overhead.
Dropping that 275W to 200W (a 27% drop) with a low-atom WU won't drop the remaining overhead.
Your system would still be consuming 325W instead of 400W. That's about a 19% drop in overall system power.
It's better, but not the full 27% power drop the GPU experiences.
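
The same arithmetic in a few lines of Python, if you want to plug in your own numbers (the wattages are the example values above):

Code:

# Whole-system effect of a GPU power drop
gpu_big, gpu_small = 275.0, 200.0  # GPU draw on a big vs. small WU (watts)
overhead = 125.0                   # CPU, PSU losses, RAM, fans, ... (watts)

gpu_drop = 1 - gpu_small / gpu_big                            # ~27%
sys_drop = 1 - (gpu_small + overhead) / (gpu_big + overhead)  # ~19%
print(f"GPU: -{gpu_drop:.0%}  whole system: -{sys_drop:.0%}")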
FaaR
Posts: 66
Joined: Tue Aug 19, 2008 1:32 am

Re: Inherent problem of small Work Units running on big GPUs

Post by FaaR »

bruce wrote: What happens if you get a small WU that finishes in 1 hour and the next WU takes 6 hours? Do you have to keep replacing the first WU every hour?
Ideally, at some point, the client would be able to detect that there are unused shader processors on the GPU and automatically request a second WU dynamically, as needed, without user intervention. If one WU is "longer" than another, the client would fetch another WU when the short WU runs out, assuming the longer WU cannot fill the GPU on its own. Ideally.

If there's no API call or hardware register to read to identify unused shader processors, then there's software like GPU-Z, which keeps a database of GPUs with information on the number of processors in each model of GPU. The client could compare the WU against the database info for the user's GPU(s) and then act accordingly.
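
On Nvidia, at least, such an API does exist: NVML reports a utilization figure (the share of time kernels were running, which is only a rough proxy for how full the shaders are). A minimal sketch of the kind of check the client could do, using the nvidia-ml-py bindings; the 75% threshold is an arbitrary assumption:

Code:

# Sketch: poll GPU utilization via NVML and flag headroom for a second WU.
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
try:
    for _ in range(10):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
        if util < 75:  # arbitrary threshold
            print(f"GPU only {util}% busy - could fit a second WU?")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
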
bruce wrote:As for a small WU taking almost the same amount of power as a bigger one: you need to define a power limit for your GPU. The GPU will continue to work, but at a reduced power level. Internally, some idle calculation units will get powered down.
Alas, my Vegas just downclock when you restrict their power limit. They don't power down unused shaders and keep the rest going as normal. From what others have posted, it seems Nvidia GPUs (or maybe even Navi AMD GPUs) handle this better than mine. I've already restricted my GPUs to a -50% power limit because I got tired of all the heat and noise of everything going full tilt. This means they run at about 1050-1100MHz, pulling about 110W each on the GPU core (instead of upwards of 250W), regardless of whether they're running a big or a small WU.

So I decided to just not worry about it any more, as I've tamed the beasts regardless of how much work they're really doing. ;) Also, later today is Big Navi reveal day. I'm very excited, honestly!
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Inherent problem of small Work Units running on big GPUs

Post by MeeLee »

FaaR wrote:
bruce wrote: What happens if you get a small WU that finishes in 1 hour and the next WU takes 6 hours? Do you have to keep replacing the first WU every hour?
Ideally, at some point, the client would be able to detect that there are unused shader processors on the GPU and automatically request a second WU dynamically, as needed, without user intervention. If one WU is "longer" than another, the client would fetch another WU when the short WU runs out, assuming the longer WU cannot fill the GPU on its own. Ideally.

If there's no API call or hardware register to read to identify unused shader processors, then there's software like GPU-Z, which keeps a database of GPUs with information on the number of processors in each model of GPU. The client could compare the WU against the database info for the user's GPU(s) and then act accordingly.
Considering that there are only 2 major GPU manufacturers, one of which (Nvidia) uses the same drivers from the GT 710 all the way up to the newest models and exposes the same GPU-usage information across all of them, it shouldn't be too hard to write a small program to make it work.