[Solved, kinda] Strange crash/reboot and CMOS corruption only with F@H

gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Strange crash and CMOS corruption after switching to 308

Post by gunnarre »

Marius wrote:For example, I ran FurMark, which really taxes the GPU, while at the same time running its "CPU Burner" benchmark in the background, with 32 threads @100% utilization, for 10 hours.
Unless FurMark's CPU Burner runs AVX, AVX_256 and AVX_512 loads, it's not a realistic benchmark for folding or other vector operations. Please try Prime95's AVX tests, alone or at the same time as the FurMark GPU benchmark, if you want to replicate a load similar to folding.
Online: GTX 1660 Super, GTX 1080, GTX 1050 Ti 4G OC, RX580 + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 960, GTX 950
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash and CMOS corruption after switching to 308

Post by Marius »

@gunnarre

I just ran a 17-hour stress test as you suggested, with Prime95 AVX on the CPU and FurMark on the GPU. The PC was drawing about 650W as measured by the AX1600i PSU. There were no problems, and the PC didn't crash/reboot as it does with the F@H client. So it seems really stable hardware-wise. But thanks for yet another idea for testing.
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: Strange crash and CMOS corruption after switching to 308

Post by JimF »

Marius wrote:I ran memtest86 overnight to make sure the memory timing was correct, and found no problem. As soon as I configured the system and started Folding, the crash/reboot/corrupt cmos problem happened again. And the symptoms were the same; I had to clear the CMOS to be able to reboot.
It is the memory. Memtest may find a faulty module, but instability is another bag.

Do you have four modules? If you use the XMP profiles, reduce to two modules.
If you need four, then try the motherboard defaults; for Folding, you certainly will not need much memory.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash and CMOS corruption after switching to 308

Post by Marius »

@JimF

[UPDATE 11/23/21]: So the system fooled me into thinking it was memory timing. :lol: A few hours later, the system crashed again. With DDR4 2100 settings. Case NOT closed!

Yes, I have 4 sticks of DDR4 that were set for XMP 3200. After I set it back to 2100, it has been running for 42 hours without problems.
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by gunnarre »

Try running with just two sticks, then the other two sticks.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by bruce »

Also try un-overclocking.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

gunnarre wrote:Try running with just two sticks, then the other two sticks.
Yes, testing that now. We'll see.
bruce wrote:Also try un-overclocking.
This system is not overclocked.

Thanks for all the ideas!
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

Well, I split the original 4 sticks of 32GB dual rank DDR4 into 2 groups, and tested each separately at 2100 timings. No deal, both tests crashed and rebooted after a few hours. The CMOS was again corrupted, resulting in an unbootable system. CMOS clearing is necessary to get past that. I also tested a different set of 2 sticks of 8GB single rank DDR4 3600, but at 2100. Nope, that didn't work either. It rebooted in a mere 10 minutes, while I was looking at the screen monitoring temps, which were OK. So at this point I can say for sure it's not memory instability, or a bad stick of RAM, or any kind of hardware problem on my side. Again, the crash only occurs when using F@H. Since I have already replaced and tested every single system component, I'm running out of ideas. Anyway, happy Thanksgiving to you guys in the US.
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by toTOW »

If I were you, I'd get my motherboard replaced ... it's not a normal behaviour.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

I already replaced the mobo! This is the second one that displays the same behavior, and it is totally unrelated to the first, which was an Asus Zenith Extreme for the AMD Threadripper. The current one is a Gigabyte Aorus Master X570S that came out in June, which I use with an AMD 5950X. So, different processor and mobo!! It looks like I have a transmissible Gremlin!
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by JimF »

You haven't mentioned the disk drive, which I assume is an SSD.
I don't think that I have ever had one go bad on me, but it looks like you have eliminated all other possibilities.
(It isn't FAH).

Also, I always use a write cache to protect the SSD. On Ubuntu, I use the built-in Linux write cache.
https://lonesysadmin.net/2013/12/22/bet ... rty_ratio/

On Win10, I use PrimoCache.
https://www.romexsoftware.com/en-us/pri ... index.html

It is not necessary for FAH, which is very light on writes, but for some of the BOINC projects it is.
It looks like the Rosetta pythons are bad, though I don't know how bad yet.
I use at least 2 GB of memory for the cache, and at least 30 minutes of latency (write-delay), though less will help.
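
For anyone wanting to try the Linux side of this, here is a minimal Python sketch that turns the two numbers above (2 GB of cache, 30 minutes of write delay) into kernel `vm.*` sysctl values. The sysctl names are real kernel tunables, but mapping "cache size" to `vm.dirty_bytes` and "write delay" to `vm.dirty_expire_centisecs` is an assumption, as is the choice to start background flushing at half the limit; the post does not give exact settings. The function names are hypothetical.

```python
# Sketch: convert a desired write-cache size and write delay into vm.* sysctl values.
# Assumption: cache size -> vm.dirty_bytes, write delay -> vm.dirty_expire_centisecs.

GIB = 1024 ** 3


def write_cache_sysctls(cache_gib: float, delay_minutes: float) -> dict[str, int]:
    """Return vm.* sysctl values for a dirty-page budget and an expiry delay."""
    dirty_bytes = int(cache_gib * GIB)
    expire_cs = int(delay_minutes * 60 * 100)  # the kernel unit is centiseconds
    return {
        # Hard limit: writers are throttled once this much dirty data is pending.
        "vm.dirty_bytes": dirty_bytes,
        # Background flushing starts at half the hard limit (arbitrary choice).
        "vm.dirty_background_bytes": dirty_bytes // 2,
        # Dirty pages older than this become eligible for writeback.
        "vm.dirty_expire_centisecs": expire_cs,
    }


def as_sysctl_conf(settings: dict[str, int]) -> str:
    """Format the settings as lines suitable for a sysctl configuration file."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(as_sysctl_conf(write_cache_sysctls(2, 30)))
```

Keep in mind that anything still sitting in the cache is lost if the machine reboots unexpectedly, so a long delay trades SSD wear against data at risk.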

PS - You may have transmitted it to my Ryzen 3950X, which started seizing up a few days ago, but it has 128 GB of memory and I suspect one stick may be bad.
But try to deal with it at your end, please.
Marius
Posts: 34
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@JimF

The drive is an Areca SSD RAID card, with battery back-up and a bank of super-capacitors to protect the write-cache. That card hasn't given any problems yet. Removing it and testing with another drive will be a chore that I don't have time for yet. Maybe over the holiday break.

But yes, I also have 128GB of RAM. I tested the memory configuration extensively, as I documented above. Even with other memory sticks. It's not that either. I'm not sure what it is at this point. The problem is really elusive and difficult to trace. Ugh! I hope you haven't caught the Gremlin! It's a really nasty one!

Other than the RAID card above, my system config is an AMD 5950X on the Gigabyte Aorus Master X570S mobo, 4x 32GB Corsair Vengeance LPX 3200 (128GB total), an EVGA 3080Ti with 12GB, a Corsair H170i Capellix CPU AIO cooler, and a Corsair AX1600i PSU.

I really just over-dimensioned it for my needs, expecting to not have any problems with power delivery or CPU overheating. I used the Kryonaut thermal paste from Thermal Grizzly. The CPU stayed in the low 70's (Celsius) when running F@H, with 32 threads. That is well below the 95C max temp expected by AMD. I haven't overclocked it yet, and I don't expect to have to.

Well Happy Holidays everybody. Hopefully I will find the issue some time soon.
pcwolf
Posts: 36
Joined: Fri Apr 03, 2020 4:49 pm

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by pcwolf »

No solution to offer, just another data point for you:

I run Manjaro Linux on a Gigabyte X570S Aorus Master with a Ryzen 5950X, an NVidia RTX 2070 and a GTX 1650, 16c/32t, all F@H.
About every 2 to 4 days the system will spontaneously reboot, with errors I will add here to the end of my post. CMOS (ver 53c) is unaffected by reboot.
I have de-rated the RTX 2070 using the "Coolbits" setting to run at a 160W power limit.
I have also set the UEFI/BIOS to run the Ryzen in ECO mode.
Both power settings give me nearly the same PPD but run significantly cooler: CPU @ 52C and GPU @ 81C.

Errors:
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000001000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 7fbdc4a4bae1 MISC d012000200000000 SYND 4d000000 IPID 500b000000000
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1639063104 SOCKET 0 APIC 0 microcode a201016
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 8a3a8a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 09 15:18:27 Ryzen5950X kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1639063104 SOCKET 0 APIC 18 microcode a201016
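
A small aid for reading the status words in those lines: the top bits of a machine-check status register are architectural flags, and the Python sketch below decodes just those. It covers bits 63 to 57 only; the lower fields are model-specific and are deliberately left undecoded here. The helper name `decode_status` is made up for illustration.

```python
# Sketch: decode the architectural flag bits (63..57) of an x86 MCi_STATUS value.
# Lower, model-specific fields are intentionally not interpreted.

STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_status(hex_status: str) -> list[str]:
    """Return the names of the flag bits set in a hex status string."""
    value = int(hex_status, 16)
    return [name for bit, name in STATUS_FLAGS.items() if (value >> bit) & 1]


# The two status words from the log lines above.
for status in ("bea0000001000108", "bea0000000000108"):
    print(status, decode_status(status))
```

Both words decode to VAL, UC, EN, MISCV, ADDRV and PCC, i.e. an uncorrected error with the processor context flagged as corrupt, which fits a spontaneous reboot.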

-Phil
Yorktown, Virginia, USA
aetch
Posts: 447
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by aetch »

pcwolf wrote:About every 2 to 4 days the system will spontaneously reboot, with errors I will add here to the end of my post.
I really hate to say this but your system is unstable.
You really need to look at it and fix it before you do any more folding.
You should be able to run it continuously for weeks/months on end without issue.
Folding Rigs - None (25-Jun-2022)

Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Neil-B »

@marius @pcwolf

crop from https://wiki.archlinux.org/title/Ryzen ...

With Ryzen 5, particularly the enthusiast models of 5950X and 5900X there seem to be some slight instability issues under Linux, related possibly to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet I discovered that out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies, compared to the stock Linux kernel, which depending on your draw from the silicon lottery could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
lightbringer kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000 
lightbringer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016
The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies.

Might be worth trying slightly higher voltages, to either confirm the above as the cause or rule it out.
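
To check whether a log contains this signature, and which cores it hits, something like the following Python sketch could be run over saved `dmesg` or journal output. The regular expression is written against the line format quoted in this thread and may need adjusting for other kernel versions; `summarize_mce` and `SAMPLE` are illustrative names, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)


def summarize_mce(log_text: str) -> Counter:
    """Count machine-check reports per (cpu, bank) pair found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            counts[(int(match["cpu"]), int(match["bank"]))] += 1
    return counts


# Two of the lines quoted above, used as sample input.
SAMPLE = """\
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
"""

if __name__ == "__main__":
    for (cpu, bank), count in sorted(summarize_mce(SAMPLE).items()):
        print(f"CPU {cpu} bank {bank}: {count} event(s)")
```

If the same CPU and bank keep turning up across reboots, that points at one core, which is the pattern the voltage suggestion above is meant to address.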
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)