RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.0?)

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby Neil-B » Mon Apr 13, 2020 6:59 pm

The CPU load on GPU WUs comes from the checkpoint sanity checks, iirc … Some projects have fairly heavy compute on these
1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent, Quadro K420 1GB, FAH 7.6.13
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro, Quadro M1000M 2GB, FAH 7.6.13
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro, GTX 750Ti 2GB, FAH 7.6.13
Neil-B
 
Posts: 1210
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby Joe_H » Mon Apr 13, 2020 7:01 pm

RMCholewa wrote:From time to time (some minutes), the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;

The GPU folding core writes a checkpoint every few percent of progress, typically every 2-5%. As part of that, it also runs a sanity check, on the CPU, of the data produced by the GPU. That sanity check can use 1 or more cores intensely for a brief period before work resumes on the GPU. If the sanity check fails, the WU is restarted from the previous checkpoint.
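
A rough sketch of that checkpoint-and-rollback flow, in Python. This is not FahCore_22's actual code: the function name, the `advance` and `is_sane` callbacks, and the `max_rollbacks` limit are all hypothetical, made up only to illustrate "check before writing a checkpoint, go back to the previous one on failure".

```python
from typing import Callable, List

def run_with_checkpoints(
    total_steps: int,
    checkpoint_every: int,
    advance: Callable[[int], int],
    is_sane: Callable[[int], bool],
    max_rollbacks: int = 3,  # hypothetical limit, not a documented FAH setting
) -> List[int]:
    """Run to total_steps and return the steps at which checkpoints were written."""
    checkpoints = [0]  # step 0 counts as the initial checkpoint
    step = 0
    rollbacks = 0
    while step < total_steps:
        target = min(step + checkpoint_every, total_steps)
        while step < target:
            step = advance(step)  # stands in for the work done on the GPU
        if is_sane(step):  # stands in for the sanity check done on the CPU
            checkpoints.append(step)
        else:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many failed sanity checks")
            step = checkpoints[-1]  # resume from the previous checkpoint
    return checkpoints
```

With a check that fails once, the run simply repeats the segment since the last good checkpoint, which is why a failed check costs a few percent of progress rather than the whole WU.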

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 6451
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby RMCholewa » Mon Apr 13, 2020 7:22 pm

Thank you Joe!
RMCholewa
 
Posts: 22
Joined: Fri Mar 27, 2020 3:25 pm
Location: Recife, Pernambuco / BRAZIL

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby flarbear » Sat Apr 18, 2020 4:55 am

Update on my problems from the OP.

I was able to get some stability as mentioned before by switching drivers, but that only lasted about 10-15 WUs and 3-4 days before going back to CL errors on every task.

I then downclocked by 300MHz and that was pretty stable so I started looking into different parts that might be causing it (with the GPU being the most likely suspect).

I swapped the PSU cables because that was easy - no help.

At that point I was successful in getting nVidia to give me an RMA, but I won't return it until I can get the stock cooler back on (and I want to run some more tests just in case).

I then swapped to an old GTX980 I had lying around and that has been happily running 5 hour FAHBench runs and folding for 2-3 days now in the same rig. It's slow, but it works fine. No errors (yet, knocking on wood).

Now I need to get the old cooler back on the 2080 for the RMA, but I also want to run it in my previous build as well to see if it gets the same errors in that environment. Unfortunately I'm stuck on trying to get something that resembles the stock nVidia "blue foam" thermal material for the VRM/inductors (you can see it in this image: https://imgur.com/6NnY8D9) as that stuff pulled apart and was mostly lost when I separated the cooler. I have spare thermal pads that came with the GPU cooler, but they don't look even remotely the same (more like a plastic stick of gum than a space-filling foam-like material). I have some pads on order from Amazon that might resemble the stock material better than this stuff, which will come in on Sunday. If anyone has any input on how strict nVidia will be on whether the thermal material exactly matches the stock material or not - or how to get some of the material they use on the stock cards, I'd love some input there.
flarbear
 
Posts: 25
Joined: Fri Apr 03, 2020 8:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby ipkh » Sat Apr 18, 2020 1:06 pm

The strictness of Nvidia in their RMA process is unknown to me. I usually deal with EVGA support for GPUs as they officially support removing the cooler.

Did you tell Nvidia support that you removed the stock cooler? Nvidia themselves will have to remove it to check for pcb damage anyway so it shouldn't be a problem. I've had some techs even say not to bother with pad replacement.
ipkh
 
Posts: 134
Joined: Thu Jul 16, 2015 3:03 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby flarbear » Sat Apr 18, 2020 6:43 pm

ipkh wrote:The strictness of Nvidia in their RMA process is unknown to me. I usually deal with EVGA support for GPUs as they officially support removing the cooler.

Did you tell Nvidia support that you removed the stock cooler? Nvidia themselves will have to remove it to check for pcb damage anyway so it shouldn't be a problem. I've had some techs even say not to bother with pad replacement.

I carefully avoided mentioning that, but I did say in the chat that the temps never exceeded 50C, which would be a hard result to achieve without water cooling. There were no warranty stickers involved so it doesn't seem like they are actively targeting cooler removal, but I didn't want to press the issue.

I assume the techs who told you not to bother were from EVGA?

At this point I'm more concerned with firing it up in my old rig for sanity testing than what the RMA inspectors think, but if I put an odd thermal pad in, then that might call their attention to "facts not already admitted". And putting an incompatible thermal pad into the stock cooler may raise an issue of "tampering" that might encourage them to look for physical damage with too fine a threshold.
flarbear
 
Posts: 25
Joined: Fri Apr 03, 2020 8:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby ipkh » Mon Apr 20, 2020 12:26 am

MSI actually. Their pads disintegrated when I installed the water block, they were more like a red paste. EVGA uses different pads but I imagine if I asked them they'd say the same thing.
EVGA at least explicitly supports removing the cooler for replacing thermal paste and water/ln2 blocks/pots.
As long as you didn't damage any components it is ok. US warranties can't be voided this way. It definitely pays to check with the RMA department though. It certainly saved me some money as I didn't have to replace them.

Having dealt with both companies, EVGA wins hands down. They are much more communicative about the status of cards sent in. Plus they allow you to purchase up to 10 years total warranty on their cards.
ipkh
 
Posts: 134
Joined: Thu Jul 16, 2015 3:03 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby flarbear » Mon Apr 20, 2020 5:42 am

Just an update, the pads arrived yesterday. They were not really the same color and not at all the same consistency as the stock material, but they were enough for me to put it back together and test it out. Same errors with the stock cooler in my older rig, so I boxed it up and dropped it off at FedEx. Now I just have to wait out the RMA process and hope for a better entry in the silicon lottery.

Meanwhile, someone is folding up a storm and passing me in the rankings of the team I'm on. Fold, little spare 980. Fold like the wind!
flarbear
 
Posts: 25
Joined: Fri Apr 03, 2020 8:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby bruce » Wed Jul 15, 2020 8:08 am

RMCholewa wrote:From time to time (some minutes), the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;


This phenomenon is well understood but rarely explained.

FAH writes checkpoints from time to time (some minutes). Before the checkpoint is written, FAH performs what is commonly called a "sanity check" to make sure the state of the atoms is coherent. To do that, the free energy is calculated twice, once by the GPU and once by the CPU. If the energy sums are (almost) identical, the checkpoint is written and calculations proceed normally. Otherwise an error message is issued.

You're seeing the GPU pause while it waits for the CPU to generate the other energy sum.
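
To make the "(almost) identical" comparison concrete, here is a minimal Python sketch. It is only an illustration, not how FahCore_22 implements it: the function name and the `rel_tol` value are assumptions picked for the example, and the real tolerance is not stated in this thread.

```python
import math

def energies_agree(gpu_energy: float, cpu_energy: float, rel_tol: float = 1e-4) -> bool:
    """Return True when the two energy sums are (almost) identical.

    rel_tol is a hypothetical tolerance chosen for illustration only.
    """
    # A NaN or infinite sum from either side means the state is not coherent.
    if not (math.isfinite(gpu_energy) and math.isfinite(cpu_energy)):
        return False
    # Compare relative to the larger magnitude, so big energies allow a
    # proportionally bigger absolute difference.
    return math.isclose(gpu_energy, cpu_energy, rel_tol=rel_tol)
```

If this returns True the checkpoint would be written and the run continues; if it returns False you would get the error instead.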
bruce
 
Posts: 19690
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Postby foldy » Wed Jul 15, 2020 2:14 pm

And the checkpoint calculation is now done multithreaded in FahCore_22, so it uses as many CPU cores as it can get
foldy
 
Posts: 1939
Joined: Sat Dec 01, 2012 4:43 pm
