Launch of new NVIDIA 3000 series GPU's

Moderators: Site Moderators, FAHC Science Team

empleat
Posts: 34
Joined: Fri Apr 03, 2020 10:11 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by empleat »

bruce wrote:
empleat wrote:... Problem was: the WU always crashed, except once I almost finished a project!!! But I tried that many times and it had always crashed at some point... I don't know if it was because of instability, or too many errors piled up. And whether or not these data would be useful, even if the WU finished! That was only on the 780. I have a 2070 Super now - unfortunately, but I didn't test it yet. But on the 3000 series, something like a 3080/90, one WU could take like 1-2 hours, or less. On the 780 one WU took 6-8 hours! I think that could finish! Currently no new projects are available, so I'll test it later...
Ordinarily, it's fine to learn by testing, but it's not OK to intentionally crash perfectly good production WUs. Every WU is unique and is assigned to ONE person. If it crashes, it has to be assigned to someone else, and if it fails too many times, it is withdrawn from distribution. Such failures are costly to scientific research. WUs are not "free".
Stop trolling... I didn't crash anything intentionally, I only tried that a couple of times... Not my problem that you can't adjust GPU usage and your computer is unusable during crunching... Btw, if this is such a big deal, they should mention it in the tutorials or in the program. I can't avoid something I don't know about...
Neil-B wrote: […]
I did not know that you can crunch a completed WU. But don't worry, I only did that a couple of times, and it was a different WU it gave me! I have a GeForce 2070 Super now and it shows an ETA of 2 hours for one WU. On the 3000 series that could be even faster, probably under 1 hour! Someone should test that on a completed WU when the 3080/90 comes out!

I don't know if this is a coincidence, but I am using a demanding video renderer when watching TV shows, which already causes around 80% GPU usage. Previously I couldn't watch TV shows even when GPU usage was 10%, because the WU maxed GPU usage out at 99%! But now the ETA increased from 2 hours to 2 days and the TV show is completely smooth without fps drops! So it seems FAH did something to address that, or new GPUs are better at scheduling! Interestingly, my GPU usage is around 33-35% on the default setting. Even on full power, GPU usage is still "35%". So it seems it doesn't take advantage of a high-end GPU; I don't mind that at all! I could spare more "GPU %", as I will only be using it when watching TV shows anyway, but at least my computer is usable :D

Also, FAH could have an option to send data in bursts, which they experimented with! Amateurs could easily make a utility to generate GPU load so the GPU won't heat up and cool down rapidly, which was the problem...
Last edited by empleat on Sat Sep 19, 2020 4:39 pm, edited 1 time in total.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Launch of new NVIDIA 3000 series GPU's

Post by Neil-B »

Please read your own posts ... "But I tried that many times and it had always crashed at some point", but now you say you tried it "only a couple of times"... I can see why bruce posted as he did - tbh I felt the same.

If you copy a WU data file to a separate folder you can then directly run the cores against it to your heart's content - because you're not using the client, it doesn't try to send anything back, it just creates the packages that would be sent ... for testing/experimental purposes this doesn't cause damage/issues to the science.

FaH tried finding a way of sending data in bursts and using GPUs in a time-sliced manner - the approach you are espousing caused issues for the hardware iirc - and saying you could fill in the gaps with something else to load the GPU to minimise heat cycling would be just as problematic as running FaH full time ... I know you want it to be a simple issue that any amateur can resolve - but it simply isn't ... You can rail as much as you like about it, but until the way GPUs are handled by the OSs and the GPU vendors changes (and the vendors are finally making moves in that direction), most "solutions" are going to be inelegant "bodges" that don't actually resolve the issue but find some form of bypass that sometimes works for specific circumstances.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
empleat
Posts: 34
Joined: Fri Apr 03, 2020 10:11 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by empleat »

Neil-B wrote:Please read your own posts ... "But I tried that many times and it had always crashed at some point", but now you say you tried it "only a couple of times"... I can see why bruce posted as he did - tbh I felt the same.

If you copy a WU data file to a separate folder you can then directly run the cores against it to your heart's content - because you're not using the client, it doesn't try to send anything back, it just creates the packages that would be sent ... for testing/experimental purposes this doesn't cause damage/issues to the science.

FaH tried finding a way of sending data in bursts and using GPUs in a time-sliced manner - the approach you are espousing caused issues for the hardware iirc - and saying you could fill in the gaps with something else to load the GPU to minimise heat cycling would be just as problematic as running FaH full time ... I know you want it to be a simple issue that any amateur can resolve - but it simply isn't ... You can rail as much as you like about it, but until the way GPUs are handled by the OSs and the GPU vendors changes (and the vendors are finally making moves in that direction), most "solutions" are going to be inelegant "bodges" that don't actually resolve the issue but find some form of bypass that sometimes works for specific circumstances.
OK, that's fair. A couple of times, like 5-10, is enough times to know whether it works or not. It's just a matter of interpretation...

I don't see how that would be problematic. If the GPU runs at its maximum performance state, the GPU temp will already be sitting at a decent level! Even if usage goes from, say, 25% to 80% and then drops back, I have no idea how much difference that would make, but GPU temp doesn't drop that quickly, especially if you stress it to a similar usage level whenever the load drops. I bet you could keep the temp stable within about 10°C. You say this would be problematic, but you don't say why. They could offer it as an experimental feature, but whatever...

Hmm, and GPU usage is actually 99% constantly - Task Manager shows incorrect values. Yet I can watch TV shows without problems. Interesting, so that's great! And I don't even have Windows 2004 installed or GPU scheduling turned on.
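As an aside, Task Manager's GPU graphs usually default to the 3D/Copy engines, so a compute load may not show up unless you switch one of the graphs to "Cuda". A minimal sketch for cross-checking against the driver instead, assuming an NVIDIA card with nvidia-smi on the PATH:

```python
# Minimal sketch: ask the NVIDIA driver for utilization/temperature/power
# instead of relying on Task Manager's default GPU graphs.
import subprocess

query = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=utilization.gpu,temperature.gpu,power.draw",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(query.stdout.strip())  # e.g. "99 %, 66, 215.30 W"
```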
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Launch of new NVIDIA 3000 series GPU's

Post by bruce »

Suppose the FAHCore uses 100% of some resource that produces too much heat. Suppose some App suspends FAH's use of that resource for 50% of the time to enable the device to cool down. That sounds like a good solution, right?

Suppose there is some other App running that happens to use that same resource during the 50% of the time when FAH is not using it. No cooling takes place, since FAH doesn't know about the second App.

The driver can be aware of every App that's heating up the resource; FAH cannot.
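A toy model of that argument, with made-up numbers rather than real FAH or driver behaviour: a 50% duty cycle only lets the GPU cool if nothing else grabs the idle slots.

```python
# Toy illustration only (arbitrary temperature model, not real GPU code):
# FAH runs every other 100 ms slot; another app may or may not fill the gaps.
def simulate(slots, fah_on, other_load):
    temp = 40.0                               # arbitrary idle temperature in C
    history = []
    for t in range(slots):
        fah = 1.0 if fah_on(t) else 0.0
        load = max(fah, other_load(t))        # whoever wants the GPU gets it
        temp = max(temp + 5.0 * load - 3.0, 40.0)  # crude heat-in minus heat-out
        history.append(round(temp, 1))
    return history

duty_cycle = lambda t: t % 2 == 0             # FAH: 100 ms on, 100 ms off
idle = lambda t: 0.0                          # nothing else uses the GPU
boinc = lambda t: 1.0 if t % 2 == 1 else 0.0  # a second app fills the gaps

print(simulate(10, duty_cycle, idle))   # temperature levels off: the gaps really cool
print(simulate(10, duty_cycle, boinc))  # temperature climbs as if FAH ran full time
```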
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by MeeLee »

From what I saw today, the GPU doesn't seem to heat up much at all.
Postings show 69°C with a 75% fan curve at close to 100% load.
I wonder how much OC headroom the 3080 has, or if it's already running optimally.
And whether increasing or decreasing the power would do much to PPD scores?
+300W seems like an awful lot to me...
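For anyone curious about the headroom on their own card, the driver reports the current and maximum power limits and clocks; a minimal sketch using standard nvidia-smi query fields (the exact fields available vary a little by driver version):

```python
# Sketch: report power draw vs. limits, SM clocks, fan speed and temperature
# to gauge how much headroom a card has. Assumes an NVIDIA GPU and a recent driver.
import subprocess

fields = ("power.draw,power.limit,power.max_limit,"
          "clocks.sm,clocks.max.sm,fan.speed,temperature.gpu")
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```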
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Launch of new NVIDIA 3000 series GPU's

Post by Neil-B »

Actually the temperatures on silicon do heat up and cool down very quickly depending on the architecture, be it a CPU or GPU wafer ... the tin around it buffers the temps and makes it seem slower both to heat up and cool down ... a stable temp at whatever level is better than spiky ... but the workarounds to lower the temp don't just lower it for one usage but for all ... so yes you can limit TDP or lower max temp settings, but that doesn't change how GPUs schedule loads one iota - it just limits the speed at which they actually process those loads (potentially making delays worse).

If (note big if) work is scheduled from various sources in such a manner that it keeps the GPU fully loaded without interruptions (even small ones) then yes, you achieve your goal ... but since OS GPU scheduling doesn't lend itself to this type of rapid slicing it doesn't work particularly well - a loss of performance for all tasks (I am led to believe) and actually not as thermally stable as letting FaH get on with it - and because FaH works the GPU hard, finding loads that keep that amount of loading balanced could be tricky.

... and as I understand it, this type of approach is a bodge workaround at a level that would be susceptible to being regularly broken by OS and driver updates and would just cause dev issues.

The solution is for OS/GPU vendors to agree on/adopt a different approach to GPU scheduling than has generally been used to date, where multiple workloads can be processed simultaneously rather than sequentially and load-balanced by priorities ... Not sure if GPU vendors can do it all on their own (nvidia has various approaches that do this a bit - amd may well have other approaches) and I'm fairly sure coordination with OS vendors will actually be needed to get this truly working ... In the meantime, efforts by FaH are simply not at the level at which this issue needs to be resolved (imo) - obviously you believe differently - and tbh I'd really like to be wrong on this :)
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
empleat
Posts: 34
Joined: Fri Apr 03, 2020 10:11 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by empleat »

bruce wrote:Suppose the FAHCore uses 100% of some resource that produces too much heat. Suppose some App suspends FAH's use of that resource for 50% of the time to enable the device to cool down. That sounds like a good solution, right?

Suppose there is some other App running that happens to use that same resource during the 50% of the time when FAH is not using it. No cooling takes place, since FAH doesn't know about the second App.

The driver can be aware of every App that's heating up the resource; FAH cannot.
Wait, what? Now you are not making sense. You don't want to cool it down - you want to prevent the heating-up and cooling-down cycles. As long as there is a constant load, that won't happen, I think. I don't know the exact numbers, but let's say around 50% load, so it wouldn't go through these cycles. FAH doesn't even have to know about other loads. As long as there is some load, say (again, I don't know the exact numbers) "25%"+, it wouldn't cool down that much. In the worst case your TV show would lag, so you could change the load. And TV shows have a constant GPU usage, unlike PC games! There are many GPUs, but people could make their own stress test and tune it for their individual GPU. But luckily, it seems new GPUs can schedule better. I have around 96-99% constant GPU usage when crunching, yet when I launch a TV show, which causes around 80% usage, I have no fps drops at all! Only higher rendering times and still decent fps! Previously on the 780, even using a renderer with 10% GPU usage, it was completely unwatchable! So it seems this is a problem of the past. Maybe for PC gaming GPU scheduling will help, so you can do both at the same time. I don't even have Windows 2004 installed yet!
MeeLee wrote:From what I saw today, the GPU doesn't seem to heat up much at all.
Postings show 69°C with a 75% fan curve at close to 100% load.
I wonder how much OC headroom the 3080 has, or if it's already running optimally.
And whether increasing or decreasing the power would do much to PPD scores?
+300W seems like an awful lot to me...
Yeah, I have 66°C max under continuous 99% load, but I have a triple-fan card, not the reference version of the GPU, and it's an OC edition. Don't know how other GPUs fare today.
Neil-B wrote:Actually the temperatures on silicon do heat up and cool down very quickly depending on the architecture, be it a CPU or GPU wafer ... the tin around it buffers the temps and makes it seem slower both to heat up and cool down ... a stable temp at whatever level is better than spiky ... but the workarounds to lower the temp don't just lower it for one usage but for all ... so yes you can limit TDP or lower max temp settings, but that doesn't change how GPUs schedule loads one iota - it just limits the speed at which they actually process those loads (potentially making delays worse).

If (note big if) work is scheduled from various sources in such a manner that it keeps the GPU fully loaded without interruptions (even small ones) then yes, you achieve your goal ... but since OS GPU scheduling doesn't lend itself to this type of rapid slicing it doesn't work particularly well - a loss of performance for all tasks (I am led to believe) and actually not as thermally stable as letting FaH get on with it - and because FaH works the GPU hard, finding loads that keep that amount of loading balanced could be tricky.

... and as I understand it, this type of approach is a bodge workaround at a level that would be susceptible to being regularly broken by OS and driver updates and would just cause dev issues.

The solution is for OS/GPU vendors to agree on/adopt a different approach to GPU scheduling than has generally been used to date, where multiple workloads can be processed simultaneously rather than sequentially and load-balanced by priorities ... Not sure if GPU vendors can do it all on their own (nvidia has various approaches that do this a bit - amd may well have other approaches) and I'm fairly sure coordination with OS vendors will actually be needed to get this truly working ... In the meantime, efforts by FaH are simply not at the level at which this issue needs to be resolved (imo) - obviously you believe differently - and tbh I'd really like to be wrong on this
Wait, what. You too :D ? No one wanted to lower it, but to keep it stable with a continuous load.

Then the second paragraph - not 100% sure how you mean it. Yes, there would probably be a loss of performance, but that's better than people not using it at all :D

Yeah, at this point some technology enabling this will arrive sooner anyway, so it is whatever at this point... And the Windows 2004 build has GPU scheduling, but programs have to be written to take advantage of it, from what I understand.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Launch of new NVIDIA 3000 series GPU's

Post by Neil-B »

I was simply pointing out that most methods people currently use to "manage" GPUs don't change how prioritisation/workload management works - to head off yet another segue on your part :)

Actually, another issue for the loading may be that different loads use different parts of the silicon in different ways - I am not aware whether heat cycling different parts of the silicon could cause in some way a torsional-style heat effect (simplistically speaking) - hard to describe what I am thinking of, but think of a four-burner stove: keeping any three burners on full may keep the same overall heat output, but there is a difference between leaving the same three on all the time and moving the unused burner around - the second would have more heat cycling going on ... not quite sure how this impacts utilising silicon - need to talk to one of my deep specialists and see if anyone has done research on this - my guess would be probably not (yet).

Adjustments in GPU scheduling are slowly happening - imo (as stated a number of times before) this is an OS/vendor driver matter and only once this is sorted will it be easy/sensible for programs to utilise these advances ... It may be possible for programs to bespoke themselves before this matures, but that would be like hardcoding CPU core management into application software rather than letting the OSs and the vendor firmware handle it - it can be done that way, but for many, many scenarios letting the OS/vendor firmware handle it is what application developers choose to do :) ... but if there is a value case for doing so and the development time is available I am sure FaH will consider this alongside the other work packages.

My second paragraph was acknowledging that IF you could keep the GPU constantly loaded (somehow) by timeslicing FaH and filling in with other load (either real usage or some other fill-in) you might just (in some highly inefficient way) manage to achieve the aim of avoiding FaH constantly loading up the GPU itself (which would simply mean significantly less science being done) ... I believe you will argue that less science being done is better than none - my counter would be that many people don't have the issues you are trying to avoid (impact of the GPU on other usage) as they use other methods such as turning off hardware acceleration, pausing folding, etc. - you will probably counter this with ?? ... and this debate will thus continue :wink: ... tbh from my perspective this discussion has gone full circle at least a few times over the course of the last weeks/months so I'll more than happily let you have the last word and bow out of this :)

... over to you :D
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Launch of new NVIDIA 3000 series GPU's

Post by bruce »

This was explained earlier but I don't think you understood. Suppose FAH is configured to process for 100 ms and then not process for 100 ms. This would be intended to cool the GPU, but it only works if nothing else is running on your system. Suppose you are also running some other distributed computing app that also uses the GPU (e.g., BOINC). During those 100 ms while FAH is trying to allow the GPU to cool off, BOINC will step in and run in those 100 ms idle gaps and no cooling will take place, keeping the GPU constantly heated ... even though FAH is now running at 50% productivity.
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Launch of new NVIDIA 3000 series GPU's

Post by foldy »

I tried that 100 ms ON and 100 ms OFF for FAH on the GPU to test the cooling effect, and it worked badly. PPD goes down by half and the temp gets lower but still stays high. So for cooling, reducing the power limit is the best option.
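For anyone wanting to try the power-limit route, a minimal sketch of capping it via nvidia-smi (the 220 W value is only an example, the valid range is bounded by the card's reported min/max limits, and setting the limit needs admin/root rights):

```python
# Sketch: check GPU 0's allowed power-limit range, then cap it (requires admin/root).
# 220 is an example wattage, not a recommendation; stay within the reported range.
import subprocess

subprocess.run(
    ["nvidia-smi", "-i", "0",
     "--query-gpu=power.min_limit,power.max_limit", "--format=csv,noheader"],
    check=True,
)
subprocess.run(["nvidia-smi", "-i", "0", "-pl", "220"], check=True)
```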
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by MeeLee »

Cooling and heating up does not damage any GPU. And if you're running with limited power (power cap from 90% to 60% total power), even the capacitors won't really break a sweat.
The GPU die won't cool rapidly enough, because the large heat sink above it provides heat when the GPU core is in sleep mode (unused), or when frequencies dip in between WUs.
As for turning cores on and off by inserting a sleep pattern, you're just wasting power.
It takes power to bring a sleeping core out of its sleep state into an active state.
You preferably want them to never go to sleep, so introducing sleep states isn't recommended.
Like Foldy says, PPD goes down drastically, but I bet power consumption remains high.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Launch of new NVIDIA 3000 series GPU's

Post by bruce »

I'm not sure what the driver coders decided to do when they detect that power might exceed the established limit, but they do have the ability to manage various hardware settings, whereas trying to manage heat/power externally is a very crude tool. Maybe they're just adjusting the boost clock settings to overclock/underclock the hardware. At the very least, that would allow more subtle adjustments, made in real time.

Using the power cap setting is strongly recommended.
ipkh
Posts: 175
Joined: Thu Jul 16, 2015 2:03 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by ipkh »

The drivers reduce clock speed and disable SM units as temp and power limits are reached.
A heatsink will not generally transfer heat back to the silicon. Metal heatsinks dissipate the heat quite quickly. Heatsinks are designed to be the path of least resistance for heat energy, and the flow doesn't reverse.
Going from 0% to 100% load really quickly will put stress on the power delivery, including the PSU. The power ramp-down will generate heat as the power has to go to ground.
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Launch of new NVIDIA 3000 series GPU's

Post by MeeLee »

I also believe they cut power to unused VRAM modules, which is why a 2080 Ti can still fold at 220W at close to maximum performance, while during benchmarks it needs higher wattages to keep the frequency high.
I'm not sure about the 3000 series, but from what it seems, the driver might not fully control the VRAM modules.
Especially so, if you can't get nearly identical performance out of a 3080 at 225-250W as at the stock 350-380W (depending on the model).
I personally believe the 3000 series could have been a lot more efficient, but I may be basing my presumption on the data currently shared among the few people who actually got one.
Post Reply