RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.0?)

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by Neil-B »

The CPU load on GPU WUs comes from the checkpoint sanity checks, iirc … Some projects have fairly heavy compute on these.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
Joe_H
Site Admin
Posts: 7856
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by Joe_H »

RMCholewa wrote:From time to time (some minutes), the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;
The GPU folding core writes a checkpoint every few percent of progress, typically every 2-5%. As part of that, it also runs a sanity check on the CPU against the data produced by the GPU. That sanity check can use one or more cores intensely for a brief period before work resumes on the GPU. If the sanity check fails, the WU is restarted from the previous checkpoint.
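To make the checkpoint-and-restart behaviour concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not FahCore_22's actual code: `run_with_checkpoints`, `step`, and `is_sane` are hypothetical names, and the real core works on simulation state rather than a plain Python object.

```python
import copy


def run_with_checkpoints(state, step, is_sane, total_steps, checkpoint_every):
    """Advance `state` for `total_steps`, checkpointing periodically.

    Hypothetical illustration only. At each checkpoint boundary the state is
    passed to `is_sane`; if the check fails, work resumes from the last good
    checkpoint. Returns (final_state, number_of_restarts).
    Note: a state that always fails the check would loop forever; a real
    client would give up after some number of retries.
    """
    checkpoint = copy.deepcopy(state)  # last known-good state
    checkpoint_step = 0
    current = 0
    restarts = 0
    while current < total_steps:
        state = step(state)
        current += 1
        at_boundary = current % checkpoint_every == 0 or current == total_steps
        if at_boundary:
            if is_sane(state):
                # Sanity check passed: record a new checkpoint.
                checkpoint = copy.deepcopy(state)
                checkpoint_step = current
            else:
                # Sanity check failed: roll back to the previous checkpoint.
                state = copy.deepcopy(checkpoint)
                current = checkpoint_step
                restarts += 1
    return state, restarts
```

The point of the sketch is that a failed check costs only the work done since the last checkpoint, which is why an unstable GPU shows up as repeated restarts rather than an immediate dump of the WU.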

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
RMCholewa
Posts: 29
Joined: Fri Mar 27, 2020 2:25 pm
Hardware configuration: Lenovo Y540 Notebook with an Intel Core i7-9750H, nvidia RTX2060 Mobile and 32GB RAM
Location: Recife, Pernambuco / BRAZIL

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by RMCholewa »

Thank you Joe!
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

Update on my problems from the OP.

I was able to get some stability as mentioned before by switching drivers, but that only lasted about 10-15 WUs and 3-4 days before going back to CL errors on every task.

I then downclocked by 300MHz and that was pretty stable so I started looking into different parts that might be causing it (with the GPU being the most likely suspect).

I swapped the PSU cables because that was easy - no help.

At that point I was successful in getting nVidia to give me an RMA, but I won't return the card until I can get the stock cooler back on (and I want to run some more tests just in case).

I then swapped to an old GTX980 I had lying around and that has been happily running 5 hour FAHBench runs and folding for 2-3 days now in the same rig. It's slow, but it works fine. No errors (yet, knocking on wood).

Now I need to get the old cooler back on the 2080 for the RMA, but I also want to run it in my previous build to see if it gets the same errors in that environment. Unfortunately I'm stuck on trying to get something that resembles the stock nVidia "blue foam" thermal material for the VRM/inductors (you can see it in this image: https://imgur.com/6NnY8D9), as that stuff pulled apart and was mostly lost when I separated the cooler. I have spare thermal pads that came with the GPU cooler, but they don't look even remotely the same (more like a plastic stick of gum than a space-filling, foam-like material). I have some pads on order from Amazon that might resemble the stock material better than this stuff; they will come in on Sunday. If anyone has any input on how strict nVidia will be about whether the thermal material exactly matches the stock material, or on how to get some of the material they use on the stock cards, I'd love to hear it.
ipkh
Posts: 175
Joined: Thu Jul 16, 2015 2:03 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by ipkh »

The strictness of Nvidia in their RMA process is unknown to me. I usually deal with EVGA support for GPUs as they officially support removing the cooler.

Did you tell Nvidia support that you removed the stock cooler? Nvidia themselves will have to remove it to check for pcb damage anyway so it shouldn't be a problem. I've had some techs even say not to bother with pad replacement.
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

ipkh wrote:The strictness of Nvidia in their RMA process is unknown to me. I usually deal with EVGA support for GPUs as they officially support removing the cooler.

Did you tell Nvidia support that you removed the stock cooler? Nvidia themselves will have to remove it to check for pcb damage anyway so it shouldn't be a problem. I've had some techs even say not to bother with pad replacement.
I carefully avoided mentioning that, but I did say in the chat that the temps never exceeded 50C, which would be a hard result to achieve without water cooling. There were no warranty stickers involved, so it doesn't seem like they are actively targeting cooler removal, but I didn't want to press the issue.

I assume the techs who told you not to bother were from EVGA?

At this point I'm more concerned with firing it up in my old rig for sanity testing than what the RMA inspectors think, but if I put an odd thermal pad in, then that might call their attention to "facts not already admitted". And putting an incompatible thermal pad into the stock cooler may raise an issue of "tampering" that might encourage them to look for physical damage with too fine a threshold.
ipkh
Posts: 175
Joined: Thu Jul 16, 2015 2:03 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by ipkh »

MSI actually. Their pads disintegrated when I installed the water block, they were more like a red paste. EVGA uses different pads but I imagine if I asked them they'd say the same thing.
EVGA at least explicitly supports removing the cooler for replacing thermal paste and water/ln2 blocks/pots.
As long as you didn't damage any components it is ok. US warranties can't be voided this way. It definitely pays to check with the RMA department though. It certainly saved me some money as I didn't have to replace them.

Having dealt with both companies, EVGA wins hands down. They are much more communicative about the status of cards sent in. Plus they allow you to purchase up to 10 years total warranty on their cards.
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

Just an update, the pads arrived yesterday. They were not really the same color and not at all the same consistency as the stock material, but they were enough for me to put it back together and test it out. Same errors with the stock cooler in my older rig, so I boxed it up and dropped it off at FedEx. Now I just have to wait out the RMA process and hope for a better entry in the silicon lottery.

Meanwhile, someone is folding up a storm and passing me in the rankings of the team I'm on. Fold, little spare 980. Fold like the wind!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by bruce »

RMCholewa wrote:From time to time (some minutes), the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;
This phenomenon is well understood but rarely explained.

FAH writes checkpoints from time to time (some minutes). Before the checkpoint is written, FAH performs what is commonly called a "sanity check" to make sure the state of the atoms is coherent. To do that, the free energy is calculated twice, once by the GPU and once by the CPU. If the energy sums are (almost) identical, the checkpoint is written and calculations proceed normally. Otherwise an error message is issued.

You're seeing the GPU pause while it waits for the CPU to generate the other energy sum.
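As a rough illustration of the "(almost) identical" comparison, here is a minimal Python sketch. The function name `energies_agree` and the tolerance value are assumptions made for the example; the actual criterion used by FahCore_22 is not stated in this thread.

```python
import math


def energies_agree(gpu_energy, cpu_energy, rel_tol=1e-4):
    """Return True if the two energy sums match within a relative tolerance.

    Hypothetical sketch: `rel_tol` is an assumed value, not the core's real
    threshold. Non-finite values (NaN or infinity) always fail the check,
    since they indicate the simulation state has blown up.
    """
    if not (math.isfinite(gpu_energy) and math.isfinite(cpu_energy)):
        return False
    return math.isclose(gpu_energy, cpu_energy, rel_tol=rel_tol)


# Example: a small rounding difference passes, a large discrepancy does not.
print(energies_agree(-125000.0, -125000.5))
print(energies_agree(-125000.0, -124000.0))
```

A tolerance is needed because the GPU and CPU sum floating-point terms in different orders and at different precisions, so exact equality would fail even on healthy hardware.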
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by foldy »

And the checkpoint calculation is now multithreaded in FahCore_22, so it uses as many CPU cores as it can get.