Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Sun Mar 04, 2018 12:56 am
by ntsarb
Hello,

I've used F@H on a system with Quad 1080Ti + ASRock X99 WS + i7 6850K, for several months without any major issue.

I recently upgraded the motherboard and CPU to an ASUS X299 SAGE WS + i9 7900X, and I can no longer run F@H on all GPUs: when I do, the system freezes for 1-2 minutes before crashing to a BSOD with error code DPC_WATCHDOG_VIOLATION. The OS is Windows 10 Pro 64-bit.

Last year, I experienced the same problem with an ASUS X99-E WS motherboard, using the same GPUs and system components. Notably, both ASUS motherboards feature a PLX chip, i.e. a PCI-E switch.

The RAM has passed the latest MemTest86, and F@H doesn't have any issues with any two GPUs on the motherboard. Blender3D doesn't have any issues rendering with all four GPUs using CUDA, either.

Is anyone else having similar issues with this motherboard in a quad-GPU configuration?

Is this something that needs reporting to F@H developers as a potential bug? Could be an NVIDIA driver issue, too, but I presume F@H developers would be in a much better position to communicate the issue with NVIDIA's Tech Support?

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Sun Mar 04, 2018 8:58 am
by foldy
FAH may be the trigger but something else could be the reason.

Do you have the latest BIOS for the mainboard, and are the drivers up to date?

The Windows event log should show what caused the DPC_WATCHDOG_VIOLATION.

Here are some general solutions for the problem: https://thewindowsplus.org/dpc_watchdog_violation/

Does your system freeze and crash instantly when you start FAH on all GPUs or does it first run for some time without problem?

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Sun Mar 04, 2018 3:24 pm
by toTOW
Something may be wrong with one of your GPUs or your PSU ...

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Sun Mar 04, 2018 9:36 pm
by ntsarb
Hi foldy, toTOW,

Thanks for your responses. Here's the info you asked for:

- Already using the latest UEFI/BIOS firmware and drivers from ASUS's and NVIDIA's web sites.
- I've already gone through the common problems and solutions for dpc_watchdog_violation. None of these appear to be relevant to or help with the particular configuration.
- The system can freeze for several seconds, at random times. When the freeze lasts long enough (about 1-2 minutes), the dpc_watchdog_violation BSOD is triggered.
- The PSU is an EVGA 1600P2. The same PSU and the same other components (GPUs, CPU, RAM, SSD) have been used without any problem at all on an ASRock X99 WS motherboard; that was rock solid!

Worth noting:

Facts:
* The problem has been confirmed on two brand new ASUS X99-E WS motherboards and another brand new ASUS X299 SAGE motherboard. These motherboards employ Broadcom PEX8747 PCI-E switches to provide x16 PCI-E 3.0 lanes on each slot (which is useful for Deep Learning applications).

Opinion:
I suspect there's an incompatibility between the NVIDIA GPUs or Kernel Driver and the PLX PEX8747 switch... but I can't confirm, debug and resolve this. Only NVIDIA could do this, as they have the source code for their Kernel Driver.

Facts:
* All Kernel Memory MiniDump files (multiple BSOD minidumps investigated with WinDBG) indicate "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )". Temperatures of the GPUs, PCH (chipset) and CPU are within limits, most often far below the limits (e.g. 50-60 degrees C).
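
For anyone who wants to repeat this check across many dumps, here is a minimal sketch of tallying the "Probably caused by" line. It assumes the WinDBG analysis text of each minidump has already been saved as a string. The sample strings are illustrative only: the first two reuse the line quoted above and the third is a hypothetical report with no culprit line.

```python
import re
from collections import Counter

# Matches lines such as:
#   Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )
CAUSE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)")

def tally_causes(reports):
    """Count which module each analysis report blames."""
    counts = Counter()
    for text in reports:
        match = CAUSE_RE.search(text)
        counts[match.group(1) if match else "unknown"] += 1
    return counts

# Illustrative input, not real dump output.
sample_reports = [
    "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )",
    "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )",
    "no culprit line in this report",
]

print(tally_causes(sample_reports).most_common())
```

If every dump blames the same module, that supports a single driver being involved, though it does not say why the driver stalled.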

Opinion:
I suspect the NVIDIA Kernel Driver loses communication with one of the GPUs. Maybe the PLX chip freezes. Whatever happens, it's always related to NVIDIA's Kernel Driver, on a PLX-based motherboard, which is why I suspect an incompatibility with the PLX switch.

Facts:
* Each GPU has been tested on its own on the ASUS motherboards, no problem at all.
* Each pair of GPUs (all combinations) was tested on the ASUS motherboards, with no problem at all either.
* Add a third GPU or a fourth GPU and the system becomes unstable.
* I've tested all (4) combinations of 3 GPUs, in the hope of singling out the one that may cause the instability. All four combinations lead to an unstable system.

Opinions:
One could theorise that one of the GPUs has a hardware defect that prevents it from working well in a triple or quad GPU setup, but this should be exhibited on the ASRock motherboard, too, where 3 and 4 GPUs were working perfectly fine for about 6 months (prior to upgrading to the ASUS X299 Sage).

Hence, I'm quite confident the GPUs are good. Both ASUS motherboards have exactly the same issue and they were tested with X99 CPU (same i7 6850K that was used on the ASRock motherboard) and X299 CPU (i9 7900X).

I've reported these findings to ASUS UK, which doesn't reply to my support requests (doesn't pick up the phone and doesn't respond to web form tickets), and to NVIDIA's Tech Support. NVIDIA's tech support are still asking me to reinstall drivers and try other basics, which have been performed numerous times.

If there are other users of the same hardware configuration who don't experience this issue, I'd like to hear from them, so as to better understand if it's a more general problem. From another forum, of 3D Rendering professionals, I've so far only found people with the same setup who suffer from the same problem.

I'm hopeful this is a software driver issue that can be fixed but I don't know how to persuade NVIDIA and/or ASUS to look into this. If there's a F@H developer or an NVIDIA or ASUS employee herein, who can help towards this direction, I'd be very grateful.

Regards

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Mon Mar 05, 2018 12:22 pm
by foldy
"From another forum, of 3D Rendering professionals, I've so far only found people with the same setup who suffer from the same problem."

"If there's a F@H developer or an NVIDIA or ASUS employee herein, who can help towards this direction, I'd be very grateful."
None of them are here. I guess we cannot solve this issue.

Last ideas: FAH has very high PCIe bus usage. Maybe there are some BIOS settings that can change the PCIe behaviour somehow, e.g. reduce it to PCIe gen 2 or change the clock speed. Or the PLX chip still gets too hot; try to find it on the mobo and feel the temperature with your finger.
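
If a BIOS change to the PCIe generation is tried, it may help to confirm what link each card actually negotiated. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and accepts the `pcie.link.gen.current` and `pcie.link.width.current` query fields. The sample text is made up for illustration and is not output from the system in this thread.

```python
import csv
import io
import subprocess

QUERY = "index,name,pcie.link.gen.current,pcie.link.width.current"

def parse_links(csv_text):
    """Parse nvidia-smi CSV rows (no header, no units) into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        if len(rec) != 4:
            continue  # skip blank or malformed lines
        idx, name, gen, width = (field.strip() for field in rec)
        rows.append({"index": int(idx), "name": name,
                     "gen": int(gen), "width": int(width)})
    return rows

def query_links():
    """Ask nvidia-smi for the current PCIe link of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_links(out)

# Made-up sample so the parser can be tried without a GPU present.
sample = "0, GeForce GTX 1080 Ti, 3, 16\n1, GeForce GTX 1080 Ti, 2, 16\n"
for row in parse_links(sample):
    print(row)
```

The current link generation can drop while a card is idle, so the reading is more meaningful when taken with the GPUs under load.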

(You did put in the 2x extra 8pin plugs for the mainboard?)

Or, for a roughly similar bluescreen where the GPUs freeze: increase TdrDelay from 2 to 10.
https://docs.microsoft.com/de-de/window ... d-recovery
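
As a minimal sketch of that TdrDelay change, the snippet below only generates the text of a .reg file and does not touch the registry. It assumes the value lives under the GraphicsDrivers key as a REG_DWORD measured in seconds. The function name is made up for this example.

```python
# Registry key assumed to hold the TDR settings.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"

def tdr_delay_reg(seconds):
    """Return .reg file text that sets TdrDelay to `seconds`."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        f'"TdrDelay"=dword:{seconds:08x}',  # .reg files store DWORDs as hex
    ]
    return "\r\n".join(lines) + "\r\n"

print(tdr_delay_reg(10))
```

The printed text can be saved as a .reg file and imported with administrator rights. A reboot is normally needed before the new value applies.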

I also found some registry values for DPC_WATCHDOG_VIOLATION from Windows 2012. Increasing the timeout may or may not help.
https://support.symantec.com/en_US/arti ... 36958.html

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Mon Mar 05, 2018 7:01 pm
by bruce
ntsarb wrote:Is this something that needs reporting to F@H developers as a potential bug? Could be an NVIDIA driver issue, too, but I presume F@H developers would be in a much better position to communicate the issue with NVIDIA's Tech Support?
It's highly unlikely that FAH developers can do anything about this problem. Each FAHCore runs independently of the others and on a different GPU. It's up to your hardware and to the OS to supply the necessary resources to all of them.

The WATCHDOG VIOLATION is a generic Windows problem indicating that one of your drivers is hogging resources, but it's not smart enough to give a meaningful diagnosis of which one. (In fact, early watchdog violations were due to an SSD driver that Windows included in its list of approved drivers -- but it can be any one of your drivers.) FAH is not the problem! While we will cooperate with you in getting it fixed, ultimately it's a Windows problem that will be best solved by a site with more experience in diagnosing general driver problems.

Foldy has already suggested that it's probably a problem in distributing PCIe resources when there's a lot of contention ... and that sounds reasonable to me. FAH can't do anything about that.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Mon Mar 05, 2018 11:43 pm
by ntsarb
Foldy, bruce, thanks for your constructive feedback, very much appreciated.

- Regarding thermals:
* There are 2 PLX chips ( PEX8747 ) under the same heatsink as the PCH (X299 chipset), which means the temperature of the PLX chips cannot deviate much from the heatsink's, which is measured in real time.
* The PCH temperature doesn't exceed 65C and the PLX chips are meant to operate up to 100C.
* There are 2x industrial-grade 140mm NOCTUA fans (up to 1800rpm) located at the front of the computer case, blowing cool air (ambient temperature 20-21C) over the motherboard. 3x NOCTUA fans (same type) are used for exhaust, one at the back and two more at the top.

- TDRDelay. Indeed, I'm aware of this setting and how it affects the operation of the computer. I've seen it in action and I don't think TDR is triggered.

- Regarding contention of PCI-E resources: I expect F@H to be exchanging small amounts of data with the GPUs. Furthermore, each GPU completes its work at a different time. There shouldn't be an issue there, except if there is a defect, which can't be true for 3 brand new motherboards.

- More Testing - turning things around!

Last night, I installed Ubuntu Linux 17.10.1 with NVIDIA's closed source drivers. I loaded all 4x GPUs with workload from Folding@Home for 6+ hours and Linux did not "panic" at all, i.e. no kernel panic (the equivalent of a Windows BSOD). This is great news, but I need to run lots more tests before I'm confident.

If more tests in Linux pass successfully, that would mean the issue affects Windows in particular. In this case, it could be one or more of the following:

- Microsoft Windows 10 kernel
- Intel Chipset driver
- NVIDIA driver

Unfortunately, ASUS Tech UK does not respond to calls or to the support requests (tickets) that I submitted. As a matter of fact, their drivers are actually Intel's (chipset) drivers, but they should still respond, talk to Intel about the problem and let me know if/when a solution can be provided. NVIDIA is still looking for possible common causes. Microsoft blamed the ASUS motherboard for being incompatible with the Windows 10 Pro Creators Update (yet they still allow ASUS to advertise its motherboard as being compatible with Windows 10) and closed the ticket. I think it's time to open a ticket with Intel, too.

Engineers from these companies need to talk to each other, otherwise there's little hope for this issue to be resolved.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Tue Mar 06, 2018 11:20 am
by foldy
So Linux is the solution. It often also has better FAH performance in PPD. Do you need to switch back to Windows?

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Sat Mar 17, 2018 11:37 am
by Jimboc
ntsarb wrote:Foldy, bruce, thanks for your constructive feedback, very much appreciated.

Last night, I installed Ubuntu Linux 17.10.1 with NVIDIA's closed source drivers. I loaded all 4x GPUs with workload from Folding@Home for 6+ hours and Linux did not "panic" at all, i.e. no kernel panic (the equivalent of a Windows BSOD). This is great news, but I need to run lots more tests before I'm confident.
Hi ntsarb,

I was sorry to learn of the difficulties you faced with this issue especially when your hardware is so high-end. I know that frustration only too well myself.

I will be upgrading from Windows 8.1 to Windows 10 and using an Asus X299 motherboard in the next 2 to 3 months. Since I will have 2 GPUs, I probably won’t experience this but please do let us know how the testing on Linux you mentioned works out.

Many thanks for this information.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Mon Mar 19, 2018 1:07 am
by Kuno
You've stated that you are testing in Linux and having no issues, which would mean it's more than likely an issue with the drivers in Windows or the way that Windows is handling the data on the bus. You should stick with Linux anyway, as you will be able to get more work done and have less overhead on your folding rig. Windows should not be used for folding, as there is seriously just too much overhead and you end up losing about 15%-20% of the performance of your cards.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Thu Apr 05, 2018 12:11 am
by networkingdude
I am experiencing the exact same error with the same motherboard. I have 2 GTX 1080 Tis and an Intel X520 10Gb card. The 2 GPUs are installed in slots 1 and 3 and the 10Gb card is in port 7.

The BIOS is up to date, and the drivers were updated to the latest versions using SDI.

The crash occurs when benching both GPU's at the same time.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Thu Apr 05, 2018 3:08 am
by bruce
Kuno wrote:Windows should not be used for folding as there is seriously just too much overhead and you end up losing about 15%-20% of the performance of your cards.
It's true that Windows has more overhead than Linux, but folding with whatever you have is much more important than not folding just because you have a Windows system. Continuing to fold with Windows is a no-brainer. Replacing a Windows system with a Linux system may be easy for those with a good computer background, or those with a desire to learn something new, but for those with moderate (or less) computer skills it can become a non-trivial system upgrade.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Thu Apr 05, 2018 12:33 pm
by foldy
Does the Broadcom/Avago/PLX PEX 8747 chip have any drivers listed in Windows Device Manager => System devices?

Try to edit Windows Power Options => Change plan settings and set PCI Express Link State Power Management to Off.

I found this Windows SDK package which is for developers only but maybe it has some magic to fix the FAH issue?
https://docs.broadcom.com/docs/SDK-Complete-Package

ASUS X299-e WS Multi GPU problem solved

Posted: Tue May 08, 2018 11:24 am
by ilxli
Hey guys, I solved the multi GPU problem at my end.

My system is:
Win 10
ASUS X299-E WS
i7 5930K
2 x 1080 TI
2 x Titan X
64 GB Kingston memory.
All Sata ports in use.

I had the same problem: a DPC_WATCHDOG_VIOLATION every 5 minutes.
The problem started after the Win10 January update.

What I did to solve it:
Updated Win10 to the most recent version (this took a long while, with loads of WATCHDOGs in between).
When that was finally done, my computer was still very unstable, with a DPC_WATCHDOG_VIOLATION every few minutes.
Then I uninstalled all Nvidia drivers and installed the following drivers from Nvidia: 382.53-desktop-win10-64bit-international-whql

And that did the trick for me : )
It's a week later now and it's still running smoothly without any crash!

Here is a link to the drivers:

http://www.nvidia.com/download/driverResults.aspx/119914/en-us
For me this is the only driver version that is stable.

This is my first post ever; it was too big a problem for me to let other people suffer from it.

Hope it helps some of you : )

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Posted: Thu May 17, 2018 9:42 am
by windbeutel
Hello guys,

I do 3D rendering on GPUs (Octane and Redshift). My machine is the same ASUS X99-E WS with 4x 1070.
I experience the same problems on Win10 with the latest drivers: random crashes/freezes and the BSOD watchdog violation.

I can confirm, the latest stable driver version is 382.53.
Unfortunately I need to install the latest drivers (supporting CUDA 9) in order to fully use functionality in the newest render builds.
Linux is no option for me since there is no Linux version of my 3D program (C4D).

There are many other people having this issue with PLX mainboards.
Everyone should contact NVIDIA support and open a ticket to push them to fix the driver.
I did this a month ago. But I have the feeling that, since no more than two GPUs are supported for gaming any longer,
NVIDIA thinks people like us (professional 3D rendering, folding, deep learning etc.) should go with "professional" cards like the Quadro or Tesla series.