Strange crash/reboot and CMOS corruption only with F@H

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby pcwolf » Fri Dec 10, 2021 8:40 pm

@aetch @Neil-B

My folds are never rejected by the F@H servers. I would not agree the system is unstable.

There is a long-standing discussion on this exact error message on the kernel Bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=206903
pcwolf
 
Posts: 26
Joined: Fri Apr 03, 2020 5:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby aetch » Fri Dec 10, 2021 8:52 pm

A stable system does not reboot itself every few days.

My own systems typically go 3-4 weeks between reboots, and even then it's only because I'm manually instigating it as part of maintenance to update the OS/drivers and clean the system of dust.
AMD Ryzen 9 3900X, 16GB, RTX 2070 Super, Win 10 Pro, F@H 7.6.21
Intel i5-7600K, 16GB, GTX 1080 Ti, Win 10 Pro, F@H 7.6.21
aetch
 
Posts: 328
Joined: Thu Jun 25, 2020 4:04 pm
Location: Between chair and keyboard

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby Neil-B » Fri Dec 10, 2021 9:40 pm

Whatever is causing it, be it a bug, a manufacturing issue, or an act of a higher being, a system that reboots itself at seemingly random intervals cannot be called stable. It may not be an instability caused by clocks or any of the normally attributed causes, but it is still not stable.

Whilst it is operational it may work perfectly and fold properly, but the fact that it reboots when not specifically commanded to ... that is not stable.

It may not be a hardware-driven instability; firmware, OS, even applications may be the trigger ... but it is unstable.

The Tacoma Narrows Bridge wasn't unstable under most conditions; however, under the wrong conditions it was critically unstable ;)

.. but to get away from semantics, what really matters is helping you and Marius get your kit and folding to the point where uncommanded reboots do not happen ... I'll keep digging around and see if I can find any other possible tests/adjustments that might be worth trying in case they stop the reboots from occurring.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
Neil-B
 
Posts: 1980
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby pcwolf » Sat Dec 11, 2021 1:21 am

Good points, All. Thank you sincerely for your suggestions based upon your obvious experience.

Still, as Bugzilla demonstrates, it is a false flag and not harmful to hardware. Kindly read the thread to the end. I will run my system as I see fit; it is a home desktop and not running any nuclear missiles or fusion reactors. Today's Stable Branch update suddenly restored my Intel AX210 M.2 Bluetooth function, without notice, after it had been broken since kernel 5.10, and I am confident the errors I recorded are kernel related, not the result of hardware failure or of the F@H program. I will be satisfied to wait patiently for the Arch/Manjaro developers to eventually get around to the fix.

-Phil
pcwolf
 
Posts: 26
Joined: Fri Apr 03, 2020 5:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby Marius » Sat Dec 11, 2021 7:49 am

@pcwolf @Neil-B

Thanks for the info, but it doesn't seem to apply in my case. I haven't been running Kernel 5.11 for many months; I'm currently on 5.15.x. And in my case, the problem also happens when running the Windows version. It simply reboots, and corrupts the CMOS in the process. I don't ever get any MCE errors in the log. And I have _only_ seen it do that so far with F@H. No other stress test I threw at it had any problems, be it on Linux or Windows.
Marius
 
Posts: 18
Joined: Thu Nov 04, 2021 4:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby pcwolf » Sun Dec 12, 2021 4:55 am

Took my own advice and re-read closely that Bugzilla kernel bug report.
After digesting things I got elbow deep into my BIOS today and altered a good handful of power settings.
Knocking wood but the performance of the system today is leaving me hopeful that error code is long in my rearview mirror.
pcwolf
 
Posts: 26
Joined: Fri Apr 03, 2020 5:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby pcwolf » Mon Dec 13, 2021 9:56 pm

@Marius

Two straight days without the MCE! I went into BIOS and reset some power-saving settings that are set on by default.

Read that Bugzilla thread closely. There are some settings suggested that worked for me.

Turns out, the errors were most likely due to a power sag, whether intentional for "power saving" or perhaps related to load-line settings. Basically, the PCIe GPU bus would pause while the Ryzen was polling, and with the missing data, the Ryzen would pop the MCE.
pcwolf
 
Posts: 26
Joined: Fri Apr 03, 2020 5:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby Marius » Mon Dec 13, 2021 11:43 pm

@pcwolf
Thanks for the info. Yes, the ASPM logic will pause the PCIe bus for power-saving reasons, but I have that disabled. The problem still occurs, though.
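
For anyone on Linux who wants to double-check what the kernel itself reports for ASPM, here is a minimal sketch. It assumes a kernel built with PCIe ASPM support, which exposes the policy list in sysfs with the active entry shown in square brackets. The function names are made up for illustration, and this only shows the kernel-side policy, not what the BIOS option is set to.

```python
from pathlib import Path

# sysfs file exposed by Linux kernels built with PCIe ASPM support.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the policy file; return None if it is missing or unreadable."""
    try:
        return active_aspm_policy(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("Active ASPM policy:", read_aspm_policy())
```

If the file is missing, the sketch prints None, which only means the kernel does not expose the setting on that machine.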
Marius
 
Posts: 18
Joined: Thu Nov 04, 2021 4:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby pcwolf » Fri Dec 17, 2021 10:08 pm

AMD "Cool'n'Quiet" switch seems to be one culprit.
pcwolf
 
Posts: 26
Joined: Fri Apr 03, 2020 5:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby Marius » Sat Dec 18, 2021 12:45 pm

@pcwolf

Thanks for the info. I think "Cool'n'Quiet" is not relevant for Zen and above anymore. Power saving when idle is now controlled by C-States. The Gigabyte BIOS still has the old tech enabled by default, but I had it disabled just in case. That's not the problem, as the crash/reboot + CMOS corruption still occurs. The crashes I have heard of being caused by enabled C-States happen when the CPUs are mostly idle. That's not the case here. This system crashes only with F@H, with the GPU active or not, be it running on Windows or Linux. No solution in sight yet. In the meantime, I'm running BOINC with Rosetta and MilkyWay, without any problems.
Marius
 
Posts: 18
Joined: Thu Nov 04, 2021 4:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby calro » Fri Dec 31, 2021 6:40 pm

@Marius

I've had the same problem myself as of late. It's been a while since I've run my computer (5950X on an Asus Crosshair Impact VIII X570, with an EVGA 3080 XC and a Corsair SF750) on F@H, and it just has random restarts, mainly when running the CPU. It takes maybe a couple of hours. In some previous builds of this machine, and so with previous versions of F@H, I'd get the same sort of random crash: no BSOD, just an immediate restart to the loading screen. As of a few months ago, though, it would run fine.

Just wanted to let you know that I don't think you're alone in this issue, but I seriously didn't know how else to describe what was happening. I'm just trying to finish a quite long (24 hour) folding job, but with a crash every hour or two, it's really taking its time. The only thing that's not similar for me is no CMOS corruption. I'll be playing around with my system, and if I find anything that seems to stop it, I'll be sure to share.
calro
 
Posts: 1
Joined: Fri Dec 31, 2021 7:09 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby Marius » Sun Jan 02, 2022 1:57 pm

@calro

Thanks for the info! I have tried everything I could on my side, configuration-wise. The problem only manifests with F@H, so it's good to know that I'm not crazy and other people are having similar problems. I hope we can figure this out soon.
Marius
 
Posts: 18
Joined: Thu Nov 04, 2021 4:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Postby pcwolf » Tue Jan 11, 2022 1:11 am

Got tired of the steady mce [Hardware Error]: messages, so I switched my CPU crunching from F@H to BOINC and the errors disappeared. The GPU remains with F@H.
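
To compare before and after a change like this, a rough sketch along these lines can count the error lines in a saved kernel log. It assumes the log has been dumped to a text file first, and it simply matches the "[Hardware Error]" marker quoted above. The script and function names are hypothetical.

```python
import sys
from pathlib import Path

# Marker text as quoted in the error messages discussed in this thread.
MARKER = "[Hardware Error]"


def hardware_error_lines(log_text, marker=MARKER):
    """Return every line of log_text that contains the marker."""
    return [line for line in log_text.splitlines() if marker in line]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name is an example): python3 count_mce.py kernel.log
    text = Path(sys.argv[1]).read_text(errors="replace")
    hits = hardware_error_lines(text)
    print(f"{len(hits)} line(s) containing {MARKER}")
```

On a systemd distro, the kernel log could be saved with something like `journalctl -k > kernel.log` before running the script on it.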
pcwolf
 
Posts: 26
Joined: Fri Apr 03, 2020 5:49 pm
