Strange crash/reboot and CMOS corruption only with F@H

Marius
Posts: 30
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@FrankMB

I ran Linpack Extreme in Linux at max settings for about 74 hours. Not a hiccup. This machine is very stable. The only software that causes the silent reset issue is still F@H. Not sure what else I can try.

Thanks,
Marius.
mitalapo
Posts: 4
Joined: Tue Apr 12, 2022 8:33 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by mitalapo »

Reporting a similar issue: the system reboots with a Machine Check Error (MCE) after running F@H for several hours (usually less than 24 h, running with 16 threads out of 32). No reboots occur with other stress tests (Prime95 and Linpack Xtreme ran for days). The issue occurs on Linux and Windows alike. Tested with default overclock, underclock, and memory at 3200 and 2400 MHz. All configurations eventually fail with F@H, while the other stress tests remain stable. There is no CMOS corruption (maybe due to the different mobo).

I suspect F@H is hitting a CPU issue. Maybe it is bad luck with my specimen. I wonder whether other AMD Ryzen 9 5950X users have run into the same issue.

A typical Linux message shown after reboot:

Code:

mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1643461814 SOCKET 0 APIC 14 microcode a201016
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a5a MISC d012000100000000 SYND 4d000000 IPID 500b000000000

Which translates to:

Code:

Machine check events logged
Hardware event. This is not a software error.
CPU 10 5 fixed-issue reoder  
MISC d012000100000000 ADDR 8a3a5a  
       bit55 = res23
       bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
  memory/cache 00000108 MCGSTATUS 0
SYND 4d000000 IPID 500b000000000  
mcelog: Unknown CPU type vendor 2 family 25 model 1
Hardware event. This is not a software error.
CPU 0 0 data cache  
TIME 1643461814 Sat Jan 29 15:10:14 2022
STATUS 0 MCGSTATUS 0
error 'generic error mem transaction, generic transaction, level 0'
STATUS bea00000CPUID Vendor AMD Family 25 Model 1 Step 0
SOCKET 0 APIC 14 microcode a201016
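
For readers who want to check the status word by hand, here is a minimal Python sketch that pulls the CPU, bank and status out of the kernel line above and decodes the parts of the status register that are architecturally defined on x86: the flag bits 63 to 57 and the 16-bit error code in the "memory hierarchy" format (0000 0001 RRRR TTLL). The function names `parse_mce_line` and `decode_status` are hypothetical helpers made up for this illustration, and the vendor-specific bits (for example bit 55 and the extended error code) are deliberately left undecoded.

```python
import re

# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field encodings of the "memory hierarchy" error code 0000 0001 RRRR TTLL.
REQUEST = {0: "generic", 1: "read", 2: "write", 3: "data read", 4: "data write",
           5: "instruction fetch", 6: "prefetch", 7: "eviction", 8: "snoop"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}

MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)")


def parse_mce_line(line):
    """Extract (cpu, bank, status) from a kernel 'Machine Check' log line."""
    match = MCE_LINE.search(line)
    if match is None:
        raise ValueError("not a Machine Check status line")
    cpu, bank, status = match.groups()
    return int(cpu), int(bank), int(status, 16)


def decode_status(status):
    """Decode the architectural flags and the 16-bit error code."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    decoded = {"flags": flags, "error_code": code}
    if (code & 0xFF00) == 0x0100:  # memory hierarchy error format
        decoded["memory"] = {
            "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
            "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
            "level": LEVEL[code & 0x3],
        }
    return decoded


line = "mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108"
cpu, bank, status = parse_mce_line(line)
print(cpu, bank, decode_status(status))
```

The UC and PCC flags correspond to the "error uncorrected" and "processor context corrupt" lines in the mcelog output, and the error code 0x0108 decodes to a generic request, generic transaction, level 0, which matches the "generic error mem transaction, generic transaction, level 0" line.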

A typical Windows error message:

Code:

Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          3/21/2022 4:05:37 PM
Event ID:      18
Task Category: None
Level:         Error
Keywords:      
User:          LOCAL SERVICE
Computer:      DESKTOP-UFRTUPL
Description:
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 30

Windows error event XML:

Code:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{c26c4f3c-3f66-4e99-8f8a-39405cfed220}" />
    <EventID>18</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2022-03-21T14:05:37.8429511Z" />
    <EventRecordID>2618</EventRecordID>
    <Correlation ActivityID="{b3bd9fc6-0489-4f3c-bee7-054ca83d87c7}" />
    <Execution ProcessID="2424" ThreadID="4600" />
    <Channel>System</Channel>
    <Computer>DESKTOP-UFRTUPL</Computer>
    <Security UserID="S-1-5-19" />
  </System>
  <EventData>
    <Data Name="ErrorSource">3</Data>
    <Data Name="ApicId">30</Data>
    <Data Name="MCABank">5</Data>
    <Data Name="MciStat">0xbea0000001000108</Data>
    <Data Name="MciAddr">0x7ff64e605d51</Data>
    <Data Name="MciMisc">0xd01a0ffe00000000</Data>
    <Data Name="ErrorType">9</Data>
    <Data Name="TransactionType">2</Data>
    <Data Name="Participation">256</Data>
    <Data Name="RequestType">0</Data>
    <Data Name="MemorIO">256</Data>
    <Data Name="MemHierarchyLvl">0</Data>
    <Data Name="Timeout">256</Data>
    <Data Name="OperationType">256</Data>
    <Data Name="Channel">256</Data>
    <Data Name="Length">936</Data>
    <Data Name="RawData">435045521002FFFFFFFF03000100000002000000A80300001A050E00150316140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131FE6FF5E89C91C54CBA8865ABE14913BBA31E15B52C3DD80102000000000000000000000000000000000000000000000058010000C00000000003000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000001000000000000000000000000000000000000000000000018020000800000000003000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000001000000000000000000000000000000000000000000000098020000100100000003000000000000011D1E8AF94257459C33565E5CC3F7E8000000000000000000000000000000000100000000000000000000000000000000000000000000007F010000000000000002010000000000100FA2000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001E00000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000001E00000000000000100FA2000008201E0B32F87EFFFB8B170000000000000000000000000000000000000000000000000000000000000000F50157A5EFE3DE43AC72249B573FAD2C03000000000000009F00020600000000515D604EF67F00000000000000000000000000000000000000000000000000000200000002000000AD5ACBB62C3DD8011E0000000000000000000000000000000000000005000000080100010000A0BE515D604EF67F000000000000FE0F1AD0000000001E00000000000000B00005000000004D00000000F9010000230000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003B00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data>
  </EventData>
</Event>
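
To compare several of these WHEA events, the named `Data` fields can be extracted with Python's standard `xml.etree.ElementTree`. This is only a sketch: `whea_fields` is a hypothetical helper name, and `SAMPLE` is a shortened copy of the event above (the `System` block and `RawData` are left out for brevity).

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Event> root element.
NS = {"ev": "http://schemas.microsoft.com/win/2004/08/events/event"}


def whea_fields(event_xml):
    """Return the <EventData>/<Data Name=...> entries as a dict of strings."""
    root = ET.fromstring(event_xml)
    fields = {}
    for data in root.findall("./ev:EventData/ev:Data", NS):
        fields[data.get("Name")] = data.text or ""
    return fields


# Shortened copy of the event shown above.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="ApicId">30</Data>
    <Data Name="MCABank">5</Data>
    <Data Name="MciStat">0xbea0000001000108</Data>
    <Data Name="MciAddr">0x7ff64e605d51</Data>
  </EventData>
</Event>"""

fields = whea_fields(SAMPLE)
status = int(fields["MciStat"], 16)
print("bank", fields["MCABank"], "error code", hex(status & 0xFFFF))
```

The low 16 bits of `MciStat` are 0x0108 here, the same error code as in the Linux log, and the bank number is also 5 in both cases.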

My setup:
Mobo: GIGABYTE B550 AORUS ELITE AX (latest BIOS)
CPU: AMD Ryzen 9 5950X
Memory: Kingston HyperX Predator RGB 2x32GB DDR4 3200MHz CL16
Cooler: Arctic Liquid Freezer II 120
PSU: Corsair TX650M Modular 650W Gold Active PFC 12cm Fan
Marius
Posts: 30
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

Hi @mitalapo,

Thanks for the info. It seems there might be an issue when running F@H on 5950X CPUs in Gigabyte motherboards; you are the third person to report a similar problem in this thread. I don't get any MCEs. The system just reboots silently, and only when running F@H.
mitalapo
Posts: 4
Joined: Tue Apr 12, 2022 8:33 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by mitalapo »

@Marius

Got this tip from the F@H discord channel:
FN4 on Discord wrote: In BIOS try setting power idle control to typical current idle. I have seen similar problems with Ryzen 3000 hardware

My system is at the repair shop for a possible RMA, so I can't test it myself right now.

Another comment:
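FN4 on Discord wrote: F@H puts load on AVX2, which some other "stress tests" don't load

I noticed that Prime95 can be tuned to use AVX2, so this may be worth a try too. (I had assumed the default would be AVX512 on the Ryzen 9, but the 5950X is a Zen 3 part and does not support AVX-512, so Prime95 should not be selecting it on this CPU.)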
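
Which vector extensions a CPU actually advertises can be checked on Linux from the `flags` line of `/proc/cpuinfo`. Below is a minimal Python sketch; the helper names `cpu_flags` and `simd_support` are hypothetical, and the flag names (`avx`, `avx2`, `fma`, `avx512f`) are the ones the Linux kernel uses in that file.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text, wanted=("avx", "avx2", "fma", "avx512f")):
    """Map each wanted flag name to True/False depending on CPU support."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():  # only present on Linux
        for name, present in simd_support(path.read_text()).items():
            print(f"{name:8s} {'yes' if present else 'no'}")
```

If `avx512f` is reported as missing, no stress test on that machine can be exercising AVX-512, whatever its configuration says.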
Marius
Posts: 30
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@mitalapo

Thanks for the info, will try those settings.
mitalapo
Posts: 4
Joined: Tue Apr 12, 2022 8:33 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by mitalapo »

I am sorry to report that setting "power idle control" to "typical current idle" in the BIOS did not prevent the crash :(
(though it took 4 days to crash compared to 1-2 days without this setting)
Marius
Posts: 30
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@mitalapo
Sorry to hear that. I also tried running the Linux Prime95 AVX stress tests, but that did not generate any problems, even after several hours. The mystery continues.
mitalapo
Posts: 4
Joined: Tue Apr 12, 2022 8:33 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by mitalapo »

@Marius
I have been running the Prime95 stress test with AVX512 disabled (assuming this implies AVX2 is used more) for the last two weeks, without any issues. F@H remains the only way I know to crash the CPU.
Marius
Posts: 30
Joined: Thu Nov 04, 2021 3:08 am

Re: Strange crash/reboot and CMOS corruption only with F@H

Post by Marius »

@mitalapo

Yes, that has been my experience as well. I hope we can find what the problem is. Thanks for sharing your data.