Odd crashing issue

ftb28064212
Posts: 2
Joined: Sun Apr 12, 2020 7:31 pm

Odd crashing issue

Post by ftb28064212 »

Hello, I've been having some odd issues and wanted to see if anyone else has had a similar issue. I'd like to find a solution to this.

Okay, so I'm successfully folding on three PCs with zero issues at all. Those three are:

1) Desktop PC with an i5-9400F and an RTX 2060 Super
2) Dell PowerEdge T110 II with a Xeon E31220
3) Dell PowerEdge R710 with dual Xeon X5550s

I've had no issues with those machines at all.

Now the machines I'm having an issue with are:

1) Dell PowerEdge R710 with dual Xeon X5647s
2) Desktop PC with i7-8700k and GTX 1080 Ti

The two failing machines have the exact same issue. While folding, they will randomly reboot and will come up and say no boot device found. The only way to get them to boot again is to power them completely off and on. Once I do that, both will boot without any issue and will continue folding just fine. Event logs do not indicate any issue other than an unexpected shutdown. If this were one machine or the other, I might blame the machine. Considering the two have very different hardware configurations and experience the same issue, it's really confusing me. I've already given up on the 8700k system, since it's my daily driver. I'm also about to give up on the second R710 because it's also my Plex server and it's annoying when it goes down and I have to go to it and force a reboot.

Any ideas?

Mod Edit: Moved To Correct Forum - PantherX
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Odd crashing issue

Post by PantherX »

Welcome to the F@H Forum ftb28064212,

Considering that the error message is "no boot device found", I am tempted to think that it could be:
1) Warning of an HDD/SSD starting to fail - Run diagnostics to see if you have any failures (bad sectors for an HDD, an increase in reallocated cells for an SSD)
2) Loose/degraded SATA cables between your storage devices and your motherboard - Reseat the cables and/or swap them out
3) Failing motherboard

Initially, those systems were folding fine and then something changed. For how long did they fold without issues? Did anything change around the time the crashes started (relocated to a new area, added new components, etc.)?
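
As a rough illustration of the drive check in point 1, here is a minimal Python sketch that picks the sector-health counters out of a SMART attribute report. It assumes an ATA/SATA drive and the table layout printed by smartmontools' `smartctl -A`; the sample report below is made up for illustration and is not from any machine in this thread. NVMe drives and drives behind a RAID controller report health differently, so this would not apply to them as written.

```python
# Attribute names as they appear in `smartctl -A` output for ATA drives.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(report: str) -> dict:
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def suspicious(values: dict) -> list:
    """Names of watched attributes whose raw count is above zero."""
    return sorted(name for name, raw in values.items() if raw > 0)


# Illustrative sample only (hypothetical values, not real diagnostics).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4821
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

if __name__ == "__main__":
    print(suspicious(parse_smart_attributes(SAMPLE)))
```

A non-zero raw count does not prove a drive is dying, but a count that keeps rising between runs is a reasonable cue to back up and replace it.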
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
ftb28064212
Posts: 2
Joined: Sun Apr 12, 2020 7:31 pm

Re: Odd crashing issue

Post by ftb28064212 »

PantherX wrote: Welcome to the F@H Forum ftb28064212,

Considering that the error message is "no boot device found", I am tempted to think that it could be:
1) Warning of an HDD/SSD starting to fail - Run diagnostics to see if you have any failures (bad sectors for an HDD, an increase in reallocated cells for an SSD)
2) Loose/degraded SATA cables between your storage devices and your motherboard - Reseat the cables and/or swap them out
3) Failing motherboard

Initially, those systems were folding fine and then something changed. For how long did they fold without issues? Did anything change around the time the crashes started (relocated to a new area, added new components, etc.)?

I'll elaborate below but here are the quick and dirty answers to your three suggestions:

1) The drives are perfectly fine and have passed diagnostics with zero issues.
2) There are no SATA cables. The server has drives that connect directly to a SAS backplane, connected to a Dell PERC H200 RAID controller. The PC uses M.2 NVMe drives.
3) No issues were present on the desktop before folding and none have been present after removing the folding client. I have suspended folding on the server to see how it responds. The longest it has gone without crashing is four days. If it makes it to a week without crashing, now that I have removed the folding client, I can only assume the client was causing the problem somehow.

The server lasted a day or two before the first crash (and anywhere from 2 hours to several days between crashes after that). After the first crash, I suspected an issue with the SSD that the host OS runs off of (the virtual machines are running on a RAID10 array), so I removed it from its sled and ran diagnostics on it from another PC. It came back fine. Dell OMSA reports no hardware issues. Recently, it went four days without any crashes. There are no crash dumps, which tells me it didn't blue-screen but simply hard rebooted. For now, I've completely disabled folding on this server. If it goes a week or more without any issues, then I can only assume F@H was the cause.

The desktop with the 8700k only folded for about 30 minutes before it crashed the first time and it would crash at random after that, usually after no more than an hour. All of the drives are fine and nothing else makes it crash. At first, I suspected it could have something to do with the overclock, so I disabled CPU folding and did GPU-only. I still had the same issue, so I gave up on that one. It's been completely fine since I uninstalled the folding client.

This machine did blue-screen and the crash dump shows "DPC_WATCHDOG_VIOLATION" with the faulting module listed as "ndis.sys". I'm unsure why the network driver would have anything to do with the issue, since I regularly perform network/disk intensive operations on this PC with no issues at all. I can also run torture tests, such as Prime95 for the CPU and Furmark for the GPU. Both are cooled using AIOs and will not overheat or crash even under full load for several hours. I am currently at 5 days of uptime on this PC with zero issues after removing the client.

I suppose it's possible that folding could be stressing the system in some way that all of the other things I run do not, and that it's potentially exposing a previously unknown issue. It's just odd to me that nothing else will cause this issue. Because I overclock my system, I've performed many different benchmarks and stress tests with no issues.

The "no boot device" error only happens after a folding-related crash. I can reboot either machine from the OS and they will both boot just fine with zero issues. If it only happened on the server, I would start looking at the RAID controller, backplane, etc. It's just odd that it happened on an otherwise perfectly fine desktop PC.

OH! I actually almost forgot to mention that it happened on a third PC as well. I almost forgot about it because it's just an old PC I almost never use and I just kind of wrote it off as being due to its age. I have an old retro-gaming PC I put together, with an Athlon II X4 640 and a Radeon HD 7750. I started to run it a couple of weeks ago and checked in on it after a few hours to find it sitting on a screen that also said no boot device was found. Powering it off and back on allowed it to boot again.
PantherX

Re: Odd crashing issue

Post by PantherX »

For the servers, check whether any firmware updates are available for the devices (HDD, BIOS, etc.) and, if you can, check the desktops for new firmware/BIOS updates as well.

If a system crashes while folding, it is an indication of a hardware issue. F@H is built on scientific software (GROMACS) that is very specialized and highly optimized. Thus, while other benchmarks or torture tests focus on either generating extra heat or artificial load, F@H stresses your system with real scientific calculations. It is therefore entirely possible to have no issues while gaming or during normal use, but to face issues while folding. Do note that while a misplaced or discolored pixel can be insignificant in a game, an incorrect decimal value in a scientific calculation will be noticed.
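
To make the point about incorrect values concrete, here is a small Python sketch showing how much a single flipped bit can change a floating-point number. It is only an illustration of why marginal hardware matters for numerical work; it is not a description of how F@H or GROMACS detect errors, and the helper name `flip_bit` is made up for this example.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit IEEE 754 representation inverted."""
    # Reinterpret the double as an unsigned 64-bit integer, flip the bit,
    # then reinterpret the result as a double again.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    value = 1.0
    # Lowest mantissa bit: an error far too small to ever see on screen.
    print(flip_bit(value, 0) - value)
    # Highest mantissa bit: the same kind of fault now changes 1.0 into 1.5.
    print(flip_bit(value, 51))
```

In a game either error might affect one pixel for one frame. In a simulation that feeds each step's result into the next, the larger one can derail the whole calculation.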

The weird part is folding tends to stress out the CPU and GPU. It doesn't really use a lot of RAM or storage so having the "no boot device found" message appear only during a crash caused by folding is weird :|

Silly idea, have you attempted to simulate a crash by simply yanking the power cable on the system while it isn't folding and see if the same error message appears or not? I am guessing that your PSUs are of a known reliable brand and have enough wattage?
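
For the wattage question, a back-of-the-envelope check could look like the Python sketch below. Every number in it is a placeholder rather than a measurement from the systems in this thread, and the 20% safety margin is just an assumed rule of thumb. Overclocked parts can draw well above their rated power, so measured figures are better than spec-sheet ones.

```python
def psu_headroom(psu_watts: float, component_watts: dict, margin: float = 0.2) -> dict:
    """Compare the summed component draw against the PSU rating minus a margin."""
    total = sum(component_watts.values())
    budget = psu_watts * (1.0 - margin)  # keep `margin` of the rating in reserve
    return {
        "total_watts": total,
        "budget_watts": budget,
        "within_budget": total <= budget,
    }


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    estimate = {"cpu": 150, "gpu": 280, "rest_of_system": 75}
    print(psu_headroom(650, estimate))
```

If the estimated total sits at or above the budget, the PSU becomes a more plausible suspect for hard resets under sustained full load.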
uyaem
Posts: 222
Joined: Sat Mar 21, 2020 7:35 pm
Location: Esslingen, Germany

Re: Odd crashing issue

Post by uyaem »

ftb28064212 wrote:Because I overclock my system, I've performed many different benchmarks and stress tests with no issues.
To reduce noise I reduced Vmax, so like with overclocking, it was about "finding the limit".
I went far under, all stress tests were fine. But FAHClient crashed after a minute.

As for the random time it takes until a crash happens for you, I believe some projects are even more demanding than others, so one of those could be the last straw.
CPU: Ryzen 9 3900X (1x21 CPUs) ~ GPU: nVidia GeForce GTX 1660 Super (Asus)
HugoNotte
Posts: 70
Joined: Tue Apr 07, 2020 7:09 pm

Re: Odd crashing issue

Post by HugoNotte »

Very few benchmarking programs and stress tests compare to F@H WUs. The WUs make extensive use of 256-bit AVX extensions, which put much more load on the components than the average stress test does, even though the stress test might put 100% load on the CPU. Running F@H creates higher temperatures than most stress tests, which means it taxes the cooling system and the PSU more.
Is it possible that the PSUs struggle to keep up?
Even though your CPU and GPU cooling might be good enough, is it possible that other, passively cooled components on the MB get very hot and haven't cooled down sufficiently by the time the system reboots after a system crash? A few seconds later temperatures might have dropped sufficiently and therefore the reboot initiated by you would be successful.