
Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Mon Mar 28, 2022 1:55 am
by Marius
@FrankMB

I ran Linpack Extreme in Linux at max settings for about 74 hours. Not a hiccup. This machine is very stable. The only software that causes the silent reset issue is still F@H. Not sure what else I can try.

Thanks,
Marius.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Tue Apr 12, 2022 9:56 am
by mitalapo
Reporting a similar issue: the system reboots with a Machine Check Error (MCE) after running F@H for several hours (usually less than 24 h, running with 16 threads out of 32). No reboots with other stress tests (Prime95 and Linpack Xtreme running for days). The issue occurs on Linux and Windows alike. Tested with default overclock, underclock, and memory at 3200 and 2400 MHz. All of these eventually fail with F@H, while other stress tests are stable. There is no CMOS corruption (maybe due to the different mobo).

I suspect F@H is hitting a CPU issue. Maybe it is bad luck with my specimen. I wonder whether other AMD Ryzen 9 5950X users have run into this issue.

A typical Linux message shown after reboot:

Code:

mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1643461814 SOCKET 0 APIC 14 microcode a201016
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a5a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Which translates to:

Code:

Machine check events logged
Hardware event. This is not a software error.
CPU 10 5 fixed-issue reoder  
MISC d012000100000000 ADDR 8a3a5a  
       bit55 = res23
       bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
  memory/cache 00000108 MCGSTATUS 0
SYND 4d000000 IPID 500b000000000  
mcelog: Unknown CPU type vendor 2 family 25 model 1
Hardware event. This is not a software error.
CPU 0 0 data cache  
TIME 1643461814 Sat Jan 29 15:10:14 2022
STATUS 0 MCGSTATUS 0
error 'generic error mem transaction, generic transaction, level 0'
STATUS bea00000CPUID Vendor AMD Family 25 Model 1 Step 0
SOCKET 0 APIC 14 microcode a201016
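For reference, the status value from the "Bank 5" line can also be unpacked by hand. This is only a minimal Python sketch, assuming the architectural x86 MCA layout (flag bits 57 to 63, and the 16-bit error code in the low word using the cache-hierarchy format 0000 0001 RRRR TTLL). Vendor-specific bits such as 55 and 53 are deliberately left out, and the names `STATUS_FLAGS` and `decode_status` are made up for this example.

```python
# Architectural flag bits in the upper part of an MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status):
    """Split an MCi_STATUS value into flag names and error-code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    code = status & 0xFFFF  # low 16 bits: MCA error code
    return {
        "flags": flags,
        "error_code": code,
        "request": (code >> 4) & 0xF,  # RRRR, 0 = generic error
        "transaction": (code >> 2) & 0x3,  # TT, 2 = generic
        "level": code & 0x3,  # LL, 0 = level 0
    }


decoded = decode_status(0xBEA0000000000108)
for flag in decoded["flags"]:
    print(flag)
print(hex(decoded["error_code"]), decoded["transaction"], decoded["level"])
```

The fields agree with the mcelog text above: uncorrected, processor context corrupt, generic transaction, level 0.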
A typical Windows error message:

Code:

Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          3/21/2022 4:05:37 PM
Event ID:      18
Task Category: None
Level:         Error
Keywords:      
User:          LOCAL SERVICE
Computer:      DESKTOP-UFRTUPL
Description:
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 30
Windows error event XML:

Code:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{c26c4f3c-3f66-4e99-8f8a-39405cfed220}" />
    <EventID>18</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2022-03-21T14:05:37.8429511Z" />
    <EventRecordID>2618</EventRecordID>
    <Correlation ActivityID="{b3bd9fc6-0489-4f3c-bee7-054ca83d87c7}" />
    <Execution ProcessID="2424" ThreadID="4600" />
    <Channel>System</Channel>
    <Computer>DESKTOP-UFRTUPL</Computer>
    <Security UserID="S-1-5-19" />
  </System>
  <EventData>
    <Data Name="ErrorSource">3</Data>
    <Data Name="ApicId">30</Data>
    <Data Name="MCABank">5</Data>
    <Data Name="MciStat">0xbea0000001000108</Data>
    <Data Name="MciAddr">0x7ff64e605d51</Data>
    <Data Name="MciMisc">0xd01a0ffe00000000</Data>
    <Data Name="ErrorType">9</Data>
    <Data Name="TransactionType">2</Data>
    <Data Name="Participation">256</Data>
    <Data Name="RequestType">0</Data>
    <Data Name="MemorIO">256</Data>
    <Data Name="MemHierarchyLvl">0</Data>
    <Data Name="Timeout">256</Data>
    <Data Name="OperationType">256</Data>
    <Data Name="Channel">256</Data>
    <Data Name="Length">936</Data>
    <Data Name="RawData">435045521002FFFFFFFF03000100000002000000A80300001A050E00150316140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131FE6FF5E89C91C54CBA8865ABE14913BBA31E15B52C3DD80102000000000000000000000000000000000000000000000058010000C00000000003000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000001000000000000000000000000000000000000000000000018020000800000000003000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000001000000000000000000000000000000000000000000000098020000100100000003000000000000011D1E8AF94257459C33565E5CC3F7E8000000000000000000000000000000000100000000000000000000000000000000000000000000007F010000000000000002010000000000100FA2000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001E00000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000001E00000000000000100FA2000008201E0B32F87EFFFB8B170000000000000000000000000000000000000000000000000000000000000000F50157A5EFE3DE43AC72249B573FAD2C03000000000000009F00020600000000515D604EF67F00000000000000000000000000000000000000000000000000000200000002000000AD5ACBB62C3DD8011E0000000000000000000000000000000000000005000000080100010000A0BE515D604EF67F000000000000FE0F1AD0000000001E00000000000000B00005000000004D00000000F9010000230000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003B00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data>
  </EventData>
</Event>
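If anyone wants to compare several of these events, the named fields can be pulled out of the XML with a few lines of Python. This is a rough sketch using only the standard library; `whea_fields` and the shortened `SAMPLE` event (RawData and the System block omitted) are made up for this example and are not part of any Windows tooling.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Event> element in the export above.
NS = {"ev": "http://schemas.microsoft.com/win/2004/08/events/event"}


def whea_fields(event_xml):
    """Return the <Data Name="..."> entries of one event as a dict of strings."""
    root = ET.fromstring(event_xml)
    fields = {}
    for data in root.findall("./ev:EventData/ev:Data", NS):
        fields[data.get("Name")] = data.text or ""
    return fields


# Shortened copy of the event above, for illustration only.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="ApicId">30</Data>
    <Data Name="MCABank">5</Data>
    <Data Name="MciStat">0xbea0000001000108</Data>
  </EventData>
</Event>"""

fields = whea_fields(SAMPLE)
status = int(fields["MciStat"], 16)
print(fields["ApicId"], fields["MCABank"], hex(status & 0xFFFF))
```

The bank (5) and the low 16 bits of the status (0x108) are the same as in the Linux log, so both operating systems appear to report the same class of error.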
My setup:
Mobo: GIGABYTE B550 AORUS ELITE AX (latest BIOS)
CPU: AMD Ryzen 9 5950X
Memory: Kingston HyperX Predator RGB 2x32GB DDR4 3200MHz CL16
Cooler: Arctic Liquid Freezer II 120
PSU: Corsair TX650M Modular 650W Gold Active PFC 12cm Fan

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Thu Apr 14, 2022 6:35 am
by Marius
Hi @mitalapo,

Thanks for the info. It seems there might be an issue when running F@H on 5950Xs in Gigabyte motherboards; you are the third person to report a similar problem in this thread. I don't get any MCEs; the system just reboots silently, and only when running F@H.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Sun Apr 17, 2022 8:43 am
by mitalapo
@Marius

Got this tip from the F@H discord channel:
FN4 on Discord wrote: In BIOS try setting power idle control to typical current idle. I have seen similar problems with Ryzen 3000 hardware.
My system is at the repair shop for a possible RMA, so I can't test it myself right now.

Another comment:
FN4 on Discord wrote: F@H puts load on AVX2, which some other "stress tests" don't load.
I noticed that Prime95 can be tuned to use AVX2 (the default would be AVX512 on the Ryzen 9), so this may be worth a try too.
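Before tuning Prime95, it may help to check which SIMD extensions the CPU actually reports. This is a minimal sketch that assumes Linux, where /proc/cpuinfo has a "flags" line per core. The helper names `cpu_flags` and `simd_support` and the `SAMPLE` text are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text):
    """Report whether the AVX, AVX2 and AVX-512 Foundation flags are present."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in ("avx", "avx2", "avx512f")}


# Illustrative input only; on a real system pass open("/proc/cpuinfo").read().
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx fma avx2\n"
print(simd_support(SAMPLE))
```

If `avx512f` comes back False, Prime95 cannot be using AVX512 on that machine, whatever its settings say.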

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Thu Apr 21, 2022 11:10 pm
by Marius
@mitalapo

Thanks for the info, will try those settings.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Wed Apr 27, 2022 6:27 pm
by mitalapo
I am sorry to report that setting "power idle control" to "typical current idle" in the BIOS did not prevent the crash :(
(though it took 4 days to crash compared to 1-2 days without this setting)

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Sat May 07, 2022 5:20 am
by Marius
@mitalapo
Sorry to hear that. I also tried running the Linux Prime95 AVX stress tests, but that did not generate any problems, even after several hours. The mystery continues.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Thu May 12, 2022 9:14 am
by mitalapo
@Marius
I have been running the Prime95 stress test with AVX512 disabled (assuming this implies AVX2 is used more) for the last two weeks, without any issue. F@H remains the only way I know to crash the CPU.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Tue May 17, 2022 4:32 am
by Marius
@mitalapo

Yes, that has been my experience as well. I hope we can find what the problem is. Thanks for sharing your data.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Mon Jun 06, 2022 2:12 pm
by ElephantFly
I have the same issue. I have two desktops with 5950X CPUs. One has an MSI B550 motherboard; the other has an ASUS C8DH (Thor 1200W PSU, Acer Predator 3600c14, Colorful 3090 volcan OC; all parts are top quality). I use many test tools on both computers, and both run other science programs 24/7; they only crash when running FAH on the CPU. On the C8DH machine, Windows logs WHEA18 when the computer reboots abnormally. The WHEA18 occurs on random cores. I tried an RMA for a new CPU, new sets of RAM, disabling PBO, disabling Turboboost, adjusting voltage calibration, adding voltage, updating the BIOS, changing the OS, and reinstalling the OS; none of them helped. My guess is that something about switching work status in the FAH code is incompatible with using the whole 5950X CPU. The problem happens about once per day (running FAH around the clock).

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Mon Jun 06, 2022 6:28 pm
by Marius
Thanks for sharing your data, @ElephantFly. It appears FAH code really has problems with the 5950x. I hope developers are paying attention.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Mon Jun 06, 2022 9:43 pm
by muziqaz
Marius wrote: Mon Jun 06, 2022 6:28 pm Thanks for sharing your data, @ElephantFly. It appears FAH code really has problems with the 5950x. I hope developers are paying attention.
FAH is not specifically causing this issue. FAH uses 256-bit AVX2 instructions, which tax the CPU as only a few power viruses do. The only thing that taxes the CPU more is AVX512. The only fix for the 5950x issue is an RMA with AMD: they send you a new one. Test it, and if it has the same issue, RMA again.
I am now on my 3rd 5950x, and finally this one seems to be stable under FAH.
5900x has similar issues, though far less widespread
5800x3d seems to be stable so far

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Mon Jun 06, 2022 10:18 pm
by Marius
@muziqaz
Actually, this problem started some time ago on my previous system, based on the AMD 1950x on an Asus board. It showed the same symptoms, and that's why I upgraded my system. No other software causes this problem, so I doubt this is hardware. I ran AVX benchmarks on Linux and had no problems either.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Mon Jun 06, 2022 10:47 pm
by muziqaz
No other software hits the cache like F@H does. Benchmarks are meant to benchmark very specific things, not to simulate complicated proteins. Linpack and Prime95 are ancient pieces of code with patches to include AVX2. Is there as much cache communication in Prime95 as there is in F@H? I doubt it.
I believe I've seen someone crash their 5000 series with Blender, since that can be set up to render for months on the CPU. Also, we FAH people are only a select few in the sea of 5000 series users who have the exact same WHEA error issues while doing other things, or doing nothing. The majority of WHEA Logger victims have never folded in their lives. If it were a software thing, the 5800x3d would be crashing as well, and so would my 3rd 5950x. But they are not.

Re: Strange crash/reboot and CMOS corruption only with F@H

Posted: Tue Jun 07, 2022 1:19 am
by Marius
@muziqaz
Thanks for sharing your data. I'm going to wait until Zen4 comes out and upgrade my system then. If it is a hardware issue, hopefully it will be gone in the new arch. Meanwhile, I'm contributing to other BOINC projects like Rosetta@home and GPUGRID, without problems. F@H will have to wait until then.