
Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Fri Mar 05, 2021 3:04 pm
by Neil-B
PRP_R148H wrote:Well, if we can't go on holiday right now, at least our GPUs can! I'll see how the SSD fares. Yes it's quite a farce what's happened to the GPU market. I managed to grab a pair of fairly (?) well priced 3070s from a store and as soon as I bought them, they raised the price $200 for the next batch. Wow.

Also thanks Joe_H for that explanation of the WU:atom business and how that affects checkpointing.
Just a heads up ... Have been watching my temperatures/usage patterns quite carefully and can advise the following for comparison:

Checkpointing on my system appears to last maybe a second - GPU usage shows a slight dip of maybe 5% but temperature and clock speeds don't change.

At the changeover of WUs my system preloads at 99% so next WU is ready to run - GPU usage drops to effectively 0 for maybe 3 seconds (occasionally possibly 10 seconds) as the client focuses on wrapping/packing/shutting down one core and spinning the next one up ... In this time the GPU gets a bigger chance to cool and drops maybe 10C (15C with the longer pause) with clocks also spinning down for a shorter while - this drop is some 30% to 50% of the difference between folding temp and ambient.

Now my kit is fast and cools well - the changeover shows that temp drops do happen if the GPU is not loaded, but the minimal drops I am seeing on checkpointing imply it doesn't need to happen to the same extent as you have seen - the pausing that you have spotted will undoubtedly be a major factor in the spiking you have observed ... hopefully the move from a USB install to an SSD install resolved the majority of this ... Do let us know how this goes/has gone.

Re: Temperature/GPU Usage unstable; drops align with checkpo

Posted: Sat Mar 06, 2021 8:33 am
by bruce
Those dips are normal. There will always be dips, but the time spent in each dip is dependent on both the speed of the HD and the speed of your CPU.

Here's what's happening:
(1) the energy level of the WU computed by the GPU is checked against the energy level as computed by your CPU. (While this process is happening the progress toward 100% is briefly suspended.) If they differ by a large amount there is something wrong and the calculation will be aborted.

Errors are always possible but they're relatively rare if your kit is functioning correctly and not overclocked. If an error has occurred you really don't want to continue to compute something that is certain to fail. Completing the rest of the WU before aborting the calculation and getting a notification would be a bad plan.

(2) the state of the WU is stored on disk so that if something goes wrong during the next segment of the computation (including a pause, which isn't really something "wrong") the computation can resume from that state.

You can PAUSE the calculation at any time and you really don't want to have to start the calculation from the beginning again. The checkpoints need to be frequent enough that you only have to repeat a small part of the calculation -- that part computed since the last checkpoint was stored.
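
For anyone who wants to see the two steps above in concrete form, here is a rough Python sketch. It is only an illustration of the general check-then-save pattern, not how the folding core actually does it: the function names, the JSON file, the tolerance and the placeholder energy values are all made up for the example.

```python
import json
import os


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_checkpoint(path, step):
    """Write to a temp file and rename it, so a crash in the middle of
    the write cannot leave a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
        f.flush()
        os.fsync(f.fileno())  # this disk wait is where a slow drive hurts
    os.replace(tmp, path)


def energies_agree(gpu_energy, cpu_energy, rel_tol=1e-3):
    """Hypothetical sanity check; the tolerance is an assumption."""
    scale = max(abs(gpu_energy), abs(cpu_energy), 1.0)
    return abs(gpu_energy - cpu_energy) <= rel_tol * scale


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the last checkpoint and compute up to total_steps.
    stop_after simulates a pause or crash after that many steps."""
    step = load_checkpoint(path)
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return step  # interrupted: work since last checkpoint is lost
        step += 1  # stand-in for one chunk of GPU work
        done_this_run += 1
        if step % checkpoint_every == 0:
            gpu_e = cpu_e = -1234.5  # placeholder values, not real data
            if not energies_agree(gpu_e, cpu_e):
                raise RuntimeError("energy mismatch, aborting WU")
            save_checkpoint(path, step)
    return step
```

If a run is interrupted at step 25 with a checkpoint every 10 steps, the next run picks up from step 20, so only 5 steps get repeated. The GPU sits idle during the check and the write, which is why a slow disk or CPU makes the dip longer.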