[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H

pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

@aetch @Neil-B

My folds are never rejected by the F@H servers. I would not agree the system is unstable.

There is a long-standing discussion on this exact error message on the kernel Bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=206903
aetch
Posts: 447
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by aetch »

A stable system does not reboot itself every few days.

My own systems typically go 3-4 weeks between reboots, and even then it's only because I'm manually instigating one as part of maintenance to update the OS/drivers and clean the system of dust.
Folding Rigs - None (25-Jun-2022)

Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Neil-B »

Whatever is causing it, be it a bug, a manufacturing issue, or an act of a higher being, a system that reboots itself at seemingly random intervals cannot be called stable. It may not be an instability caused by clocks or any of the normally attributed causes, but it is still not stable.

Whilst it is operational it may work perfectly and fold properly, but the fact that it reboots when not specifically commanded to ... that is not stable.

It may not be a hardware-driven instability; firmware, OS, or even applications may be the trigger ... but it is unstable.

The Tacoma Narrows Bridge wasn't unstable under most conditions; however, under the wrong conditions it was critically unstable ;)

.. but to get away from semantics, what really matters is helping you and Marius get your kit and folding to the point where uncommanded reboots do not happen ... I'll keep digging around and see if I can find any other possible tests/adjustments that might be worth trying in case they stop the reboots from occurring.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

Good points, All. Thank you sincerely for your suggestions based upon your obvious experience.

Still, as Bugzilla demonstrates, it is a false flag and not harmful to hardware. Kindly read the thread to the end. I will run my system as I see fit; it is a home desktop and not running any nuclear missiles or fusion reactors. Today's Stable Branch update suddenly restored my Intel AX210 M.2 Bluetooth function, without notice, after it had been broken since kernel 5.10, and I am confident the errors I recorded are kernel related, not the result of hardware failure or the F@H program. I will be satisfied to wait patiently for the Arch/Manjaro developers to eventually get around to the fix.

-Phil
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@pcwolf @Neil-B

Thanks for the info, but it doesn't seem to apply in my case. I haven't been running Kernel 5.11 for many months, I'm currently on 5.15.x. And in my case, the problem also happens when running the Windows version. It just simply reboots, and corrupts the CMOS in the process. I don't ever get any MCE errors in the log. And I have _only_ seen it do that so far with F@H. No other stress tests I threw at it had any problems, be it Linux or Windows.
pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

Took my own advice and re-read closely that Bugzilla kernel bug report.
After digesting things I got elbow deep into my BIOS today and altered a good handful of power settings.
Knocking wood but the performance of the system today is leaving me hopeful that error code is long in my rearview mirror.
pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

@Marius

Two straight days without the MCE! I went into BIOS and reset some power-saving settings that are set on by default.

Read that Bugzilla thread closely. There are some settings suggested that worked for me.

Turns out, the errors were most likely due to a power sag, whether intentional for "power saving" or perhaps related to load-line settings. Basically, the PCIe GPU bus would pause while the Ryzen was polling, and with the missing data, the Ryzen would pop the MCE.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@pcwolf
Thanks for the info. Yes, the ASPM logic will pause the PCIe bus for power-saving reasons, but I have that disabled. The problem still occurs, though.
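For anyone who wants to double-check the same thing from a running Linux system, here is a minimal sketch. It assumes the kernel exposes its ASPM policy at /sys/module/pcie_aspm/parameters/policy, with the active entry shown in square brackets; the function names are made up for illustration, and a BIOS-level ASPM switch is a separate setting that this does not read.

```python
from pathlib import Path

# Assumption: kernel built with PCIe ASPM support, so this sysfs file exists.
ASPM_POLICY_FILE = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry, or None if nothing is marked."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path=ASPM_POLICY_FILE):
    """Read the active policy from sysfs; None if the file cannot be read."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None
```

Calling read_aspm_policy() returns the name of the policy the kernel is currently using, or None where the file is missing (for example on Windows or on a kernel without ASPM support).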
pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

AMD "Cool'n'Quiet" switch seems to be one culprit.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@pcwolf

Thanks for the info. I think "Cool'n'Quiet" is not relevant for Zen and above anymore. Power saving when idle is now controlled by the C-States technology. The Gigabyte BIOS still has the old tech enabled by default, but I had it disabled just in case. That's not the problem, as the crash/reboot + CMOS corruption still occurs. The crashes I have heard of being caused by enabled C-States happen when the CPUs are mostly idle. That's not the case here. This system crashes only with F@H, with GPU active or not, be it running on Windows or Linux. No solution in sight yet. In the meantime, I'm running BOINC + Rosetta and MilkyWay, without any problems.
calro
Posts: 1
Joined: Fri Dec 31, 2021 6:09 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by calro »

@Marius

I've had the same problem myself as of late. It's been a while since I've run my computer (5950X on an Asus Crosshair Impact VIII X570 and an EVGA 3080 XC and a Corsair SF750) on F@H, and it just has random restarts, mainly when folding on the CPU. It takes maybe a couple of hours. In some previous builds of this machine, and so previous versions of F@H, I'd get the same sort of random crash: no BSOD, just an immediate restart to the loading screen, but as of a few months ago, it would run fine.

Just wanted to let you know that I don't think you're alone in this issue, but I seriously didn't know how else to describe what was happening. I'm just trying to finish a quite long (24 hour) folding job, but with a crash every hour or two, it's really taking its time. The only thing that's not similar for me is no CMOS corruption. I'll be playing around with my system, and if I find anything that seems to stop it, I'll be sure to share.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@calro

Thanks for the info! I have tried everything I could on my side, configuration wise. The problem only manifests with F@H, so it's good to know that I'm not crazy and other people are having similar problems. I hope we can figure this out soon.
pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

Got tired of the steady "mce [Hardware Error]:" messages, so I switched my CPU crunching from F@H to BOINC and the errors disappeared. GPU remains with F@H.
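To compare how often these messages show up before and after a change like this, a small filter over the kernel log is enough. This is only a sketch: the function name is hypothetical, and the pattern assumes each report contains "mce" followed by "[Hardware Error]", with or without a colon after "mce".

```python
import re

# Assumption: MCE reports contain "mce" (optionally followed by a colon)
# and then "[Hardware Error]"; adjust the pattern if your log differs.
MCE_PATTERN = re.compile(r"\bmce:?\s*\[Hardware Error\]")


def find_mce_lines(log_text):
    """Return the lines of log_text that look like MCE reports."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]
```

On a systemd-based distribution, the kernel log from journalctl -k can be piped into a script that passes sys.stdin.read() to find_mce_lines and prints the count; comparing that count across a day of F@H and a day of BOINC gives a rough before/after picture.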
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

Gentoo finally updated the F@H client to 7.6.21, up from 7.6.13. I just tried F@H again after a couple of months, having run BOINC only for that time. I let F@H run for about 40 hours in my Gentoo Linux environment, and there were no problems.
My Gentoo Linux OS is updated almost daily, and there have been several updates of the NVidia driver and kernel, between the time I stopped F@H because of the silent reset issue and now. The current nvidia-drivers version is 510.64.
I also switched kernels from the standard gentoo-sources to zen-sources, and set some options that are not available in the former, such as the PDS CPU scheduler. I'm on the zen-sources-5.16.10 release.
I don't know if any of that was the fix for the reset problem. I had tested F@H 7.6.21 on Windows last year, and hit the silent reset as well. What I'm sure of now, is that it is _not_ a hardware issue.
I will keep testing with F@H for some time before I close this topic.
Update 3/9/22:
UGGGH! The bug is still there. I left it running tonight for a couple of hours, and it happened again. It silently reset and corrupted the CMOS. Back to BOINC for now.
FrankMB
Posts: 7
Joined: Sat Mar 12, 2022 12:38 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by FrankMB »

@Marius
You are definitely not alone with that bug. I also have an AMD 5950X on a Gigabyte X570 Aorus Master (not the new X570s model) with an nVidia GTX 1080. System seems ultra stable on everything except FAH (version 7.6.21). Memtest86, Linpack, and 3DMark are all stable. With FAH, Win 10 or Win 11 (fresh install) crashes and reboots automatically without any blue screen. I do not have any crash dump, as the reboot is too fast. I have had these crashes for months.
On my system, if I only fold on the GPU the system is stable. But anything on the CPU will crash within hours (10, 15, 30 CPU slots, with or without GPU). It only takes longer before crashing if I have fewer CPU slots. I also observed a CMOS corruption once and had to reset the BIOS.
I tried:
1) Changing PSU
2) Changing SSD
3) Changing RAM (2x16 GB, listed on the Gigabyte RAM support list)
4) Changing GPU (ZOTAC GTX 1080 instead of EVGA GTX 1080)
5) RAM on stock (2100 MHz) setting
6) RAM on XMP speed (3200 or 3500 MHz)
7) Removing any USB peripherals except mouse and keyboard
8) Windows 10 or Windows 11 (fresh install)
9) CPU at stock, no overclock
10) Maximum fan speed, CPU temps below 70 degrees Celsius (Noctua NH-D15 CPU cooler)

Every single one of those changes was unsuccessful. I just ordered a big liquid CPU cooler to lower the CPU temps even more, but I honestly do not think it will make a difference.
I will keep checking that thread hoping someday someone will find the solution.