Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Mon Apr 13, 2020 5:59 pm
by Neil-B
The CPU load on GPU WUs comes from the checkpoint sanity checks, iirc. Some projects have fairly heavy compute on these.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Mon Apr 13, 2020 6:01 pm
by Joe_H
RMCholewa wrote:From time to time (some minutes), the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;
The GPU folding core writes a checkpoint every few percent of progress, typically every 2-5%. As part of that, it also runs a sanity check on the CPU against the data produced by the GPU. That sanity check can use one or more cores intensely for a brief period before work resumes on the GPU. If the sanity check fails, the WU is restarted from the previous checkpoint.
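
To make the checkpoint-and-restart behavior concrete, here is a minimal sketch in Python. It is not FahCore_22 code: the function name run_with_checkpoints, the callbacks step_fn and sanity_ok, and the max_rollbacks limit are all hypothetical, chosen only to illustrate "check at each checkpoint, roll back to the last good one on failure".

```python
import copy


def run_with_checkpoints(total_steps, checkpoint_every, step_fn, sanity_ok,
                         state, max_rollbacks=10):
    """Advance `state` step by step, checkpointing at fixed intervals.

    All names here are illustrative assumptions, not the real folding core.
    At each checkpoint boundary `sanity_ok(state)` is called; on failure the
    run rolls back to the last checkpoint that passed the check.
    """
    last_good = (0, copy.deepcopy(state))  # (step number, saved state)
    step = 0
    rollbacks = 0
    while step < total_steps:
        state = step_fn(state, step)
        step += 1
        at_boundary = step % checkpoint_every == 0 or step == total_steps
        if not at_boundary:
            continue
        if sanity_ok(state):
            # Check passed: this state becomes the new restart point.
            last_good = (step, copy.deepcopy(state))
        else:
            # Check failed: discard work since the previous checkpoint.
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many failed sanity checks")
            step, state = last_good[0], copy.deepcopy(last_good[1])
    return state, rollbacks
```

In this sketch a single transient failure only costs the work done since the previous checkpoint, whereas hardware that fails every check hits the rollback limit and the run is abandoned.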

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Mon Apr 13, 2020 6:22 pm
by RMCholewa
Thank you Joe!

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Sat Apr 18, 2020 3:55 am
by flarbear
Update on my problems from the OP.

I was able to get some stability as mentioned before by switching drivers, but that only lasted about 10-15 WUs and 3-4 days before going back to CL errors on every task.

I then downclocked by 300MHz and that was pretty stable so I started looking into different parts that might be causing it (with the GPU being the most likely suspect).

I swapped the PSU cables because that was easy - no help.

At that point I was successful in getting nVidia to give me an RMA, but I haven't returned the card yet because I need to get the stock cooler back on first (and I want to run some more tests just in case).

I then swapped to an old GTX980 I had lying around and that has been happily running 5 hour FAHBench runs and folding for 2-3 days now in the same rig. It's slow, but it works fine. No errors (yet, knocking on wood).

Now I need to get the old cooler back on the 2080 for the RMA, but I also want to run it in my previous build as well to see if it gets the same errors in that environment. Unfortunately I'm stuck on trying to get something that resembles the stock nVidia "blue foam" thermal material for the VRM/inductors (you can see it in this image: https://imgur.com/6NnY8D9) as that stuff pulled apart and was mostly lost when I separated the cooler. I have spare thermal pads that came with the GPU cooler, but they don't look even remotely the same (more like a plastic stick of gum than a space-filling foam-like material). I have some pads on order from Amazon that might resemble the stock material better than this stuff; they will come in on Sunday. If anyone has any input on how strict nVidia will be on whether the thermal material exactly matches the stock material or not - or how to get some of the material they use on the stock cards, I'd love some input there.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Sat Apr 18, 2020 12:06 pm
by ipkh
The strictness of Nvidia in their RMA process is unknown to me. I usually deal with EVGA support for GPUs as they officially support removing the cooler.

Did you tell Nvidia support that you removed the stock cooler? Nvidia themselves will have to remove it to check for pcb damage anyway so it shouldn't be a problem. I've had some techs even say not to bother with pad replacement.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Sat Apr 18, 2020 5:43 pm
by flarbear
ipkh wrote:The strictness of Nvidia in their RMA process is unknown to me. I usually deal with EVGA support for GPUs as they officially support removing the cooler.

Did you tell Nvidia support that you removed the stock cooler? Nvidia themselves will have to remove it to check for pcb damage anyway so it shouldn't be a problem. I've had some techs even say not to bother with pad replacement.
I carefully avoided mentioning that, but I did say in the chat that the temps never exceeded 50C, which would be a hard result to achieve without water cooling. There were no warranty stickers involved so it doesn't seem like they are actively targeting cooler removal, but I didn't want to press the issue.

I assume the techs who told you not to bother were from EVGA?

At this point I'm more concerned with firing it up in my old rig for sanity testing than what the RMA inspectors think, but if I put an odd thermal pad in, then that might call their attention to "facts not already admitted". And putting an incompatible thermal pad into the stock cooler may raise an issue of "tampering" that might encourage them to look for physical damage with too fine a threshold.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Sun Apr 19, 2020 11:26 pm
by ipkh
MSI actually. Their pads disintegrated when I installed the water block, they were more like a red paste. EVGA uses different pads but I imagine if I asked them they'd say the same thing.
EVGA at least explicitly supports removing the cooler for replacing thermal paste and water/ln2 blocks/pots.
As long as you didn't damage any components it is ok. US warranties can't be voided this way. It definitely pays to check with the RMA department though. It certainly saved me some money as I didn't have to replace them.

Having dealt with both companies, EVGA wins hands down. They are much more communicative about the status of cards sent in. Plus they allow you to purchase up to 10 years total warranty on their cards.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Mon Apr 20, 2020 4:42 am
by flarbear
Just an update, the pads arrived yesterday. They were not really the same color and not at all the same consistency as the stock material, but they were enough for me to put it back together and test it out. Same errors with the stock cooler in my older rig, so I boxed it up and dropped it off at FedEx. Now I just have to wait out the RMA process and hope for a better entry in the silicon lottery.

Meanwhile, someone is folding up a storm and passing me in the rankings of the team I'm on. Fold, little spare 980. Fold like the wind!

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Wed Jul 15, 2020 7:08 am
by bruce
RMCholewa wrote:From time to time (some minutes), the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;
This phenomenon is well understood but rarely explained.

FAH writes checkpoints from time to time (some minutes). Before the checkpoint is written, FAH performs what is commonly called a "sanity check" to make sure the state of the atoms is coherent. To do that, the free energy is calculated twice, once by the GPU and once by the CPU. If the energy sums are (almost) identical, the checkpoint is written and calculations proceed normally. Otherwise an error message is issued.

You're seeing the GPU pause while it waits for the CPU to generate the other energy sum.
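
As a rough illustration of the "(almost) identical" comparison, here is a minimal Python sketch. The function name energies_agree and the relative tolerance of 1e-4 are assumptions made for this example only; the tolerance FAH actually uses is not stated in this thread.

```python
import math


def energies_agree(gpu_energy, cpu_energy, rel_tol=1e-4):
    """Return True if two energy sums match within a relative tolerance.

    Hypothetical helper: the name and the default tolerance are assumptions,
    not values taken from the folding core.
    """
    # A NaN or infinite energy means the simulation state is already broken.
    if not (math.isfinite(gpu_energy) and math.isfinite(cpu_energy)):
        return False
    return math.isclose(gpu_energy, cpu_energy, rel_tol=rel_tol)


# Small rounding differences between GPU and CPU arithmetic pass the check.
print(energies_agree(-1234.5678, -1234.5679))
# A large disagreement, such as one caused by a faulty GPU, does not.
print(energies_agree(-1234.5, -1300.0))
```

An exact equality test would not work here, since the two sums come from different hardware and different summation orders, so some tolerance is needed.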

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Posted: Wed Jul 15, 2020 1:14 pm
by foldy
And the checkpoint calculation is now done multithreaded in FahCore_22, so it uses as many CPU cores as it can get.