RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.0?)

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.


Roadpower
Posts: 71
Joined: Mon Mar 16, 2020 5:11 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by Roadpower »

flarbear wrote:
Roadpower wrote:A weak PSU can cause vexing and confusing issues. I would look there.
It's a Corsair SF750, 80 PLUS Platinum rated, and the UPS indicates that the maximum draw under load with F@H is under 500 W, including the 2 monitors and the NAS that are plugged into the same UPS. With the computer asleep, the draw is 70 W before the monitors enter their sleep state, so the computer would be drawing under 400 W on a 750 W PSU.

I have custom cables, but they look at least as beefy as the stock cables that came with the PSU - 18 awg and shorter than the stock cables by almost half.
Hmm, I only found one user who had two of those models fail within two weeks, but his post over on Reddit was deleted, so I couldn't learn more. If you can't swap in another PSU to test with, the only things that seem to remain are the motherboard and the GPU. I'm sorry I couldn't be of more help. One last stupid question, though: have you reseated the connections yet? That is, simply unsocket or unplug them, then resocket and plug them back in.
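The power-budget reasoning quoted above can be written out as a few lines of arithmetic. This is only a rough sketch: the 500 W and 70 W figures are the upper bounds mentioned in the post, while the 90% conversion efficiency is an assumption for a Platinum-class unit at this load, not a measurement. The function name `psu_headroom` is made up for illustration.

```python
# Rough PSU headroom estimate from UPS readings. The UPS figures are the
# upper bounds quoted in the thread; the efficiency is an ASSUMED value.

def psu_headroom(ups_peak_w, other_devices_w, psu_rating_w, efficiency=0.90):
    """Return (wall_draw_w, dc_load_w, load_fraction) for the computer alone."""
    wall_draw_w = ups_peak_w - other_devices_w   # AC power drawn by the PC
    dc_load_w = wall_draw_w * efficiency         # DC power the PSU delivers
    load_fraction = dc_load_w / psu_rating_w     # share of the PSU rating
    return wall_draw_w, dc_load_w, load_fraction

wall, dc, frac = psu_headroom(ups_peak_w=500, other_devices_w=70, psu_rating_w=750)
print(f"PC at the wall: <= {wall} W, PSU output: <= {dc:.0f} W ({frac:.0%} of rating)")
```

Under these assumptions the PSU would be delivering well under its rating on average. A UPS reading is an average, though, so it says nothing about very short load spikes.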
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

Roadpower wrote:Hmm, I only found one user who had two of those models fail within two weeks, but his post over on Reddit was deleted, so I couldn't learn more. If you can't swap in another PSU to test with, the only things that seem to remain are the motherboard and the GPU. I'm sorry I couldn't be of more help. One last stupid question, though: have you reseated the connections yet? That is, simply unsocket or unplug them, then resocket and plug them back in.
I can try reseating them, or at least touching them to see if they feel especially overheated. The build log with all of the component details is posted here on SFF.net.

Grr, the side panels on that side are binding and hard to slide off. I won't be able to reach into the GPU side until I can unplug it and lay it down on a workbench. All PSU cables feel cold wherever I can reach them and visually inspecting the modular cable connectors all look fully seated.

Is there a way to see if it is being power starved from looking at MSI Afterburner? The voltage gauge on the main page seems to always read 0mv and I found a power and voltage limit graph that look like they spend a lot of time at 1 (since those are just 0->1, are they binary indicators?).

I've now got HWInfo monitoring the GPU voltage and power in addition to its temp. I had previously hidden those sensors...
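For anyone who prefers a log file over a sensor panel, the same readings can be pulled from `nvidia-smi`, which ships with the NVIDIA driver. The sketch below only parses its CSV output; the sample text is invented for illustration and is not data from this card, and the 2% threshold is an arbitrary choice. Some cards report `[N/A]` for certain fields, which this simple parser does not handle.

```python
import csv
import io

# Standard `nvidia-smi --query-gpu` properties. COMMAND shows how the tool
# would be invoked (one sample per second); it is not executed here.
FIELDS = ["power.draw", "power.limit", "clocks.sm", "temperature.gpu"]
COMMAND = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
           "--format=csv,noheader,nounits", "-l", "1"]

def parse_samples(text):
    """Turn nvidia-smi CSV lines into dicts of floats keyed by field name."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, map(float, row))) for row in rows if row]

# Made-up sample output: watts, watts, MHz, degrees C.
sample_output = "248.10, 260.00, 1905, 47\n259.80, 260.00, 1890, 48\n"
samples = parse_samples(sample_output)
near_limit = [s for s in samples if s["power.draw"] >= 0.98 * s["power.limit"]]
print(f"{len(near_limit)} of {len(samples)} samples within 2% of the power limit")
```

To use it on live data, the output of `COMMAND` could be captured with `subprocess` and passed to `parse_samples`.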
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

The Ghost S1 uses a pretty high end and beefy riser cable. I'd have to check the seating of that on both ends as well...
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by PantherX »

flarbear wrote:...Is there a way to see if it is being power starved from looking at MSI Afterburner? The voltage gauge on the main page seems to always read 0mv and I found a power and voltage limit graph that look like they spend a lot of time at 1 (since those are just 0->1, are they binary indicators?)...
The power and voltage limit graphs are binary, as in:
Did the GPU reach the power limit of the board: Yes (1) / No (0)
Did the GPU reach the voltage limit of the board: Yes (1) / No (0)

The same logic applies to other values that are Yes (1) / No (0) in the graph.
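Because these flags are plain 0/1 samples, a logged trace can be boiled down to two numbers: how often the limit was hit, and for how long at a stretch. A minimal sketch follows, assuming the flags have been exported as a list with one entry per polling interval. The sample data and function names are hypothetical.

```python
def limit_duty_cycle(flags):
    """Fraction of samples in which a 0/1 limit flag was set."""
    flags = list(flags)
    if not flags:
        return 0.0
    return sum(flags) / len(flags)

def longest_run(flags):
    """Length of the longest unbroken run of 1s (sustained limiting)."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

# Hypothetical samples: 1 = limit reached, 0 = limit not reached.
power_limit = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
voltage_limit = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
for name, flags in (("power", power_limit), ("voltage", voltage_limit)):
    print(f"{name}: at limit {limit_duty_cycle(flags):.0%} of samples, "
          f"longest run {longest_run(flags)}")
```

A flag that sits at 1 most of the time simply means that limit is the one currently capping the clocks. On its own it does not indicate a fault.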

If you want to test the power to the GPU, try running a power intensive application and see if it crashes or not. I do know that in the past FurMark was very power hungry on the GPUs but I have read that Nvidia/AMD modified their drivers to detect FurMark running and then used software to limit the power usage on the GPU. Not sure if that's still the case or not.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Roadpower
Posts: 71
Joined: Mon Mar 16, 2020 5:11 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by Roadpower »

flarbear wrote:I can try reseating them, or at least touching them to see if they feel especially overheated.
I'm sorry, I confused you. The purpose of reseating is not to manually check for heat; it is to exercise the connections in an attempt to clear away any material that might be interfering with a proper connection. Professional techs commonly resort to this technique when troubleshooting. Some techs also use contact or electronics cleaner, but I doubt you need those unless you happen to be a smoker.
flarbear wrote:All PSU cables feel cold wherever I can reach them and visually inspecting the modular cable connectors all look fully seated.
I suppose if you really want to be sure, you could check the cables with a continuity tester. Again, though, you cannot rely on whether the connection looks seated; you want to "exercise" the connection. Not bending or anything like that, just unplugging and replugging will do it.
flarbear wrote:Is there a way to see if it is being power starved from looking at MSI Afterburner? The voltage gauge on the main page seems to always read 0mv and I found a power and voltage limit graph that look like they spend a lot of time at 1 (since those are just 0->1, are they binary indicators?).
The response by PantherX was instructive; I wasn't aware of that binary aspect. The platform I am using is Linux, so I haven't bothered to look for MSI's Afterburner control panel.

EDIT: I forgot to add that FAHBench exists, I assume that it can equally give you a proper stress test.
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

Roadpower wrote:
flarbear wrote:I can try reseating them, or at least touching them to see if they feel especially overheated.
I'm sorry, I confused you. The purpose of reseating is not to manually check for heat; it is to exercise the connections in an attempt to clear away any material that might be interfering with a proper connection. Professional techs commonly resort to this technique when troubleshooting. Some techs also use contact or electronics cleaner, but I doubt you need those unless you happen to be a smoker.
(chuckle) I understood what you meant, but I can't quite reach those connections until I take the system down and remove the side panels (which aren't screwed on, but they can bind and need a little force to dislodge them at times). Until then, the next best thing I could do was to reach in and feel the wires. They don't seem to be heating up, which is what would indicate that they were not up to the job of carrying the necessary current. The PSU and the cables are brand new as of a few weeks ago, the card was bought just last June, and the cables were plugged and replugged about a dozen times over the course of March as I built and rebuilt the rig.
flarbear wrote:All PSU cables feel cold wherever I can reach them and visually inspecting the modular cable connectors all look fully seated.
I suppose if you really want to be sure, you could check the cables with a continuity tester. Again, though, you cannot rely on whether the connection looks seated; you want to "exercise" the connection. Not bending or anything like that, just unplugging and replugging will do it.
The cables were hand-made to custom lengths and the person doing it tested them with a PSU cable tester configured for my PSU and GPU, so there shouldn't be any connection problems there. I also verified the wiring before I asked him about that because I was curious how he made all the connections reinforce the custom bends.
EDIT: I forgot to add that FAHBench exists, I assume that it can equally give you a proper stress test.
I was running that before I under-clocked the card. It failed sometimes and succeeded at other times. One interesting thing is that once it failed, it would always fail immediately on starting the test until I quit the benchmark and reran it, which suggests that some internal state got messed up by the error. I should probably try quitting and cleaning FaHClient when I get an error, but I know that once it errors, even running DDU and reinstalling the nVidia drivers (which involves 2 reboots) and then reinstalling FaHClient doesn't clear the errors. I suppose I could try it again now that it appears stable, but it is busy folding several WUs to prove again that it is working, so I'd rather just leave it to that for now.
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

PantherX wrote:
flarbear wrote:...Is there a way to see if it is being power starved from looking at MSI Afterburner? The voltage gauge on the main page seems to always read 0mv and I found a power and voltage limit graph that look like they spend a lot of time at 1 (since those are just 0->1, are they binary indicators?)...
The power and voltage limit graphs are binary, as in:
Did the GPU reach the power limit of the board: Yes (1) / No (0)
Did the GPU reach the voltage limit of the board: Yes (1) / No (0)

The same logic applies to other values that are Yes (1) / No (0) in the graph.
Great, thanks! So I assume that this isn't necessarily a bad thing as long as it throttles to keep things stable? It is running at a -300MHz offset now, but that graph still showed that it was hitting the power and voltage limits. I can only imagine what was happening when it didn't have the clock offsets...?

One question this raises is whether or not raising the power limits on the card might be another way to achieve stability and then let Boost go back to doing its thing. The watercooling may have let it start overclocking into territories where it was simply running out of power...? Normally watercooling is supposed to be good for stability, but after looking into the Boost 4.0 stuff I fear it may have shifted the limiting factors on the card from something it was well designed to manage (thermals) to something else that it is less practiced at managing (power).

I should probably take that discussion to a dedicated watercooling or overclocking board unless someone here has some specific experience on that front that they'd like to share. And, unless nVidia wants to cross-ship an RMA replacement, I'm reasonably happy running it with this offset for the near future.
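To put rough numbers on the clock-versus-power question, the usual first-order model for dynamic power in CMOS logic is P ≈ C·V²·f. The sketch below applies it to a -300 MHz offset. The operating points are hypothetical, not readings from this card, and the model ignores leakage and memory power, so treat the output as a trend rather than a prediction.

```python
def relative_dynamic_power(freq_mhz, volts, ref_freq_mhz, ref_volts):
    """Dynamic power relative to a reference point, using P ~ C * V^2 * f."""
    return (volts / ref_volts) ** 2 * (freq_mhz / ref_freq_mhz)

# Hypothetical operating points (NOT measured on the card in this thread).
ref_f, ref_v = 1950.0, 1.05
same_voltage = relative_dynamic_power(ref_f - 300, ref_v, ref_f, ref_v)
lower_voltage = relative_dynamic_power(ref_f - 300, 0.95, ref_f, ref_v)
print(f"-300 MHz, same voltage: {same_voltage:.0%} of reference power")
print(f"-300 MHz at 0.95 V:     {lower_voltage:.0%} of reference power")
```

The point of the model is that power falls linearly with clock but quadratically with voltage, so the voltage a boost bin demands matters more than the clock itself.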
If you want to test the power to the GPU, try running a power intensive application and see if it crashes or not. I do know that in the past FurMark was very power hungry on the GPUs but I have read that Nvidia/AMD modified their drivers to detect FurMark running and then used software to limit the power usage on the GPU. Not sure if that's still the case or not.
I've run Furmark and Cinebench and Blender and Prime95 and something like GeekBench? or RealBench? on the system. I also ran something called OCCT5 which had a GPU memory stress tester back when I thought it was a memory problem on the card (mine has the Samsung VRAM chips which aren't suspected of having issues, but I wanted some testing to double-check).
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by PantherX »

I am aware of GPU Boost, and I refreshed my memory by reading this page: https://www.tweaktown.com/reviews/8738/ ... dex11.html

If you set the temperature value at X, power at Y, and fan at Z, then the GPU should, in theory, regulate itself automatically without any issues and be 100% stable. My understanding is that this is not the case with your GPU. It would be interesting to post on a hardware forum and see whether this behavior of your GPU is abnormal or expected.

Throttling occurs when any of the X, Y, or Z values is reached, hence the True/False indicators. Hitting one of those values means that your GPU clocks can't go any higher and, depending on the setting, may start dropping in order to stay within the X, Y, Z targets.

I have a GTX 1080 Ti where I set the fan to 100% and power to max, and left it as is. I have seen my clock speed vary from 1607 MHz to 2025 MHz depending on the ambient temperature and the WU assigned.
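The limit-driven behaviour described in this post can be mimicked with a toy control loop. To be clear, this is not NVIDIA's actual Boost algorithm: the step size, the limits, and the linear power model are all invented for illustration. It only shows how a card held at a low temperature ends up being capped by its power limit instead of its thermal limit.

```python
def boost_step(clock_mhz, temp_c, power_w, limits, step_mhz=15):
    """One pass of a TOY boost loop: back off if any limit is hit, else step up."""
    flags = {
        "temp": temp_c >= limits["temp_c"],
        "power": power_w >= limits["power_w"],
    }
    if any(flags.values()):
        clock_mhz -= step_mhz
    else:
        clock_mhz = min(clock_mhz + step_mhz, limits["max_clock_mhz"])
    return clock_mhz, flags

# Hypothetical limits and a made-up linear power model (watts per MHz).
limits = {"temp_c": 84, "power_w": 260, "max_clock_mhz": 2100}
clock, seen = 1800, set()
for _ in range(40):
    power = 0.135 * clock            # invented model, for illustration only
    clock, flags = boost_step(clock, temp_c=48, power_w=power, limits=limits)
    seen.update(name for name, hit in flags.items() if hit)
print(f"settled near {clock} MHz; limits that triggered: {sorted(seen)}")
```

In this toy run the clock climbs until the power flag trips and then hovers around that point, which corresponds to a binary power-limit graph spending much of its time at 1.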
olems
Posts: 8
Joined: Sun Apr 05, 2020 4:10 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by olems »

I wonder if I have a similar issue with my RTX2070 Super (see viewtopic.php?f=19&t=33988 ), but instead of CL errors in the log, my GPU outright crashes and I have to power cycle the PC.
It has only happened on certain work units, but always within seconds of starting and at the same point each time. My suspicion now is that these work units end up making my GPU boost so high that it draws too much power.
Roadpower
Posts: 71
Joined: Mon Mar 16, 2020 5:11 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by Roadpower »

olems wrote:I wonder if I have a similar issue with my RTX2070 Super (see viewtopic.php?f=19&t=33988 ), but instead of CL errors in the log, my GPU outright crashes and I have to power cycle the PC.
It has only happened on certain work units, but always within seconds of starting and at the same point each time. My suspicion now is that these work units end up making my GPU boost so high that it draws too much power.
I also have a 2070 Super, supplied by a 750 W PSU. Thankfully, I've not seen this issue with the GPU. Normally I suspect PSU issues when I experience crashes; I've just seen a weak PSU be the culprit one too many times. I hope this helps.
olems
Posts: 8
Joined: Sun Apr 05, 2020 4:10 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by olems »

Yeah, mine is a 600W, which should technically be enough, but after reading up on it over the last few days I'm starting to suspect a full OpenCL load might be enough to make this PSU unstable. It's odd that only some WUs do it; I guess only some of them have a configuration that pushes the card high enough to trigger the issue.

I've now ordered an 850W PSU along with some other stuff I needed, which should be plenty.
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

PantherX wrote:If you set temperature value at X, power at Y, and fan at Z, then the GPU in theory should automatically self regulate itself without any issues and should be 100% stable. My understanding is that is not the case with your GPU.
It would seem that I get errors when the GPU overdrives itself, yes.

My premise is that with most stock cards using the factory cooling solution, they all end up primarily running into the thermal limit when Boost 4.0 decides to stop bumping the clocks and the other limits only rarely cause it to throttle. With my water block doing such a good job at keeping the temps under 50c regardless of load, my particular card is now running into other limits more often than it had been, and more often than most of the cards do with the stock cooler.

And I'm also supposing that Boost 4.0 isn't necessarily as perfectly stable when using those other limits to throttle the clocks as it is when using the thermal limit. For one thing, the thermal limit has some wiggle room in terms of rate of change compared to the other limits.

As you said, this is probably something better explored in a more hardware-oriented forum.

Thanks!
ipkh
Posts: 175
Joined: Thu Jul 16, 2015 2:03 pm

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by ipkh »

Boost 4.0 is guaranteed stable by Nvidia, and the adjusted boost clock is guaranteed by the card manufacturer. Boost and factory overclocking should not cause failed work units, period.
In Windows, please open the NVIDIA Control Panel and enable debug mode. This forces stock clocks and can be used to take the factory overclock out of the mix. If you still get errors at default clocks, either start an RMA or diagnose other potential causes such as CPU and RAM instability.

TL;DR: Boost does not cause WU failures; RMA the card.
flarbear
Posts: 27
Joined: Fri Apr 03, 2020 7:45 am

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by flarbear »

Thanks ipkh for the suggestions, but I'm already past those points. I've already investigated and found answers about stability with reduced boost clocks and I'm looking into potential issues related to power delivery.
RMCholewa
Posts: 29
Joined: Fri Mar 27, 2020 2:25 pm
Hardware configuration: Lenovo Y540 Notebook with an Intel Core i7-9750H, nvidia RTX2060 Mobile and 32GB RAM
Location: Recife, Pernambuco / BRAZIL

Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.

Post by RMCholewa »

Hi there,

I have been folding since March with 3 notebooks at home. My comments below concern a Lenovo Y540 ("gamer") notebook with an Intel i7-9750H, an RTX2060 Mobile and 32GB of RAM.

It used to throttle all the time (GPU at 68 °C and CPUs at 91 °C); sometimes the CPUs also power throttle, especially when the GPU is idle. I bought one of those notebook cooler stands and two of those fans that you can attach to the hot air exits. They are yet to arrive.

Quarantine is a weird thing: you start tweaking... looking for something to do. Then, I decided to try to push this notebook a little further.

My first idea was to undervolt both the CPU and GPU to keep them cooler at higher frequencies. It worked, and performance increased 10 to 15%. Then the problems started: clWaitForEvents on the GPU slot. It was not happening with every WU, but on 30 to 50% of them. HFM states that I have failed 11. That's when I started digging through the forums and found this thread.

With a lot of spare time, I started troubleshooting it one variable at a time. My first thought was that the GPU was to blame. I restored it to default settings, but GPU WUs were still failing with clWaitForEvents. When I set the CPU undervolting back to normal, the errors went away. No failed WUs since then.
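The one-variable-at-a-time approach described above can be sketched as a small loop: start from the tweaked configuration, revert a single setting to stock, and re-test. Everything here is hypothetical, including the setting names, the millivolt values, and the `is_stable` stand-in. In practice the stability check means folding several WUs and watching for clWaitForEvents errors.

```python
def isolate_culprits(stock, tweaked, is_stable):
    """Revert each tweaked setting to stock on its own; report which reverts restore stability."""
    culprits = []
    for key in tweaked:
        if tweaked[key] == stock[key]:
            continue                                  # setting was never changed
        trial = dict(tweaked, **{key: stock[key]})    # revert only this setting
        if is_stable(trial):
            culprits.append(key)
    return culprits

# Hypothetical settings; offsets in millivolts relative to stock.
stock = {"cpu_undervolt_mv": 0, "gpu_undervolt_mv": 0}
tweaked = {"cpu_undervolt_mv": -120, "gpu_undervolt_mv": -80}

def is_stable(cfg):
    """Stand-in for a real test run; pretends the CPU offset is the problem."""
    return cfg["cpu_undervolt_mv"] > -100

print(isolate_culprits(stock, tweaked, is_stable))
```

With intermittent failures like the 30 to 50% rate mentioned above, each trial needs enough WUs to tell a real fix from a lucky streak.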

Some things I have noticed:

Every few minutes, the FahCore_22 process stops using the GPU and uses a lot of CPU (I suppose it is loading the task to be processed by the GPU). That was exactly when the WU was failing;

I used MSI Afterburner to scan the GPU for overclocking (instead of undervolting it). I got an increase in the frequency curve and in overall performance. The CPU is running with default settings;

Temperatures are higher: GPU is now topping at 76c and CPU at 94c;

Tools used: ThrottleStop, MSI Afterburner (for tweaking) and HWiNFO and rainmeter for monitoring.

special notes:
When I undervolted, I used XTU, prime95, FurMark and PCMark to stress the system and all of them ran without a single problem;

I use 4 additional monitors on this notebook and keep lots of applications open. Every time FahCore_22 used the CPU, the system started stuttering. Process Lasso solved the responsiveness problem.