16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

Re: 16600 consistently crashing on AMD Radeon VII

Postby UofM.MartinK » Mon Aug 17, 2020 2:38 am

OK, now I have the data of the last 85 WUs processed by my RX580.
TL;DR: there are no "two states" of the card. It's all intrinsic to the WUs, as other posters already mentioned, and driver and clock speeds have no significant effect.

My original suspicion, that the card has a "working" and a "non-working" state, was a typical data artifact caused by too small a sample size and some clustering. With p16600 WUs at the "fringes" of a string of working p13421 WUs, it looked as if some randomly successful p16600 WUs had run under the same GPU "fingerprint" (power/temperature/Vdd/clocks) as the p13421 WUs, but matching the timestamps precisely showed this to be false.

Still, the clusters are very unlikely to have happened randomly, but the round-robin behavior of assignments (and re-assignments for faulty units) can be blamed for that.

Here is the basic breakdown:

p16600: 38 WUs, 4 completed, the other 34 failed. All 38 had NaN exceptions (statistically distributed over the processing time); median time to failure: 3489 seconds. The 4 that completed did so just by probability ("luck") after 4-5 hours without hitting the project-internal retry limit, but they too had at least two NaN exceptions each and resumed from a checkpoint.

p13421: 37 WUs, 7 completed, the other 30 of them discarded after 9-17 seconds, due to '0x22:ERROR:NaNs detected in forces. 0 0'.

p13423: 9 WUs, 1 completed. The other 8 of them discarded as above, after 9-17 seconds, due to '0x22:ERROR:NaNs detected in forces. 0 0'.

p16920: 1 WU, successfully completed.

Summary: In the last 48 hours alone, my RX580 (and most similar AMD cards, I assume) spent ~14 hours on useful computations, or ~18 hours if the one "completed" p16600 in this time window is actually useful (which I doubt; it might actually weaken the project results). That's 60-70% wasted time and energy. And this has been going on since at least August 3rd.
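
For reference, the 60-70% estimate can be reproduced from the numbers above. This is only a minimal sketch of the arithmetic, assuming the GPU was busy for the full 48-hour window; the function and variable names are made up for illustration.

```python
# Reproduce the "60-70% wasted" estimate from the summary.
# Assumption: the GPU was busy for the whole 48-hour window.
WINDOW_HOURS = 48

def wasted_fraction(useful_hours, window_hours=WINDOW_HOURS):
    """Fraction of the window that did not produce useful results."""
    return (window_hours - useful_hours) / window_hours

# ~14 h useful if the completed p16600 WU is not counted, ~18 h if it is.
pessimistic = wasted_fraction(14)
optimistic = wasted_fraction(18)

print(f"wasted: {optimistic:.1%} to {pessimistic:.1%}")
```

With these inputs the range comes out at roughly 62% to 71%, which matches the rounded figure in the summary.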

Update: I now run a script which tracks the log in real time, and if a "0x22:An exception occurred at step XXX: Particle coordinate is nan" is found, it dumps that WU. (Since FAHClient --dump doesn't seem to be able to communicate with a running client, the script instead pauses the slot, deletes the corresponding work folder, and then un-pauses the slot.)
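
The detection half of such a script could look roughly like the sketch below. It is not the actual script: the regular expression is an assumption based on the log line format of core 0x22 as it appears in logs posted in this thread, and `find_nan_events` is a hypothetical helper name. The pause / delete / un-pause actions are deliberately left out, since the exact commands and work-folder paths depend on the installation.

```python
import re

# Assumed format of the NaN exception line in the client log, e.g.
# 20:54:07:WU01:FS01:0x22:An exception occurred at step 140057: Particle coordinate is nan
NAN_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WU(?P<wu>\d+):FS(?P<slot>\d+):0x22:"
    r"An exception occurred at step (?P<step>\d+): Particle coordinate is nan"
)

def find_nan_events(lines):
    """Yield (wu, slot, step) for every NaN exception line in an iterable of log lines."""
    for line in lines:
        match = NAN_LINE.match(line.strip())
        if match:
            yield match.group("wu"), match.group("slot"), int(match.group("step"))

sample = [
    "20:53:56:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)",
    "20:54:07:WU01:FS01:0x22:An exception occurred at step 140057: Particle coordinate is nan",
    "20:54:07:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.",
]
events = list(find_nan_events(sample))
print(events)
```

The WU and slot IDs are kept as strings so they can be matched against the "WU01"/"FS01" prefixes and the work folder name without reformatting.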
UofM.MartinK
 
Posts: 55
Joined: Tue Apr 07, 2020 9:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Postby bruce » Mon Aug 17, 2020 6:43 am

I'm surprised that the --dump <n> doesn't work but I don't use it. You have to know which WU needs to be dumped (if you run more than one slot). Please explain how you have tested it and what happens along with the client version number.
bruce
 
Posts: 20009
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Postby UofM.MartinK » Mon Aug 17, 2020 7:14 am

I couldn't figure out how to give a running client the --dump command with 7.6.13 under Linux. It doesn't behave like --send-pause, --send-unpause, etc., and preceding it with --send-command didn't help either. Using "help" while directly connected to the client via "nc localhost 36330" didn't reveal any dump command either.

It only seems to work if the client is stopped, then FAHClient --dump <WU> is run exclusively, and then the client re-started.

This disturbs the other slots, and re-starts the log file etc, so I chose the other solution for now.

I would prefer a variant which properly communicates back to the server that the WU was dumped, though.

A related feature request: https://github.com/FoldingAtHome/fah-issues/issues/1547

Although, reading the "sibling" bug report: https://github.com/FoldingAtHome/fah-issues/issues/1549

It seems that whichever way I currently dump, it's a problem: either it wrongly counts as a "faulty" WU (which it would become an hour later anyway, so it would still be the better way if it were available as a command to a running client), or the WU has to wait for its timeout to be reassigned. A classic lose-lose situation :)

Re: 16600 consistently crashing on AMD Radeon VII

Postby UofM.MartinK » Wed Aug 19, 2020 12:18 am

I just noticed that the latest WU, project:16600 run:0 clone:1430 gen:235, allows more than 3 restarts: it has done 8 so far and is about to complete on my RX580!

I gather that this means a WU finished this way is still useful, after all?

If that can be confirmed, I will not dump 16600 anymore :)

Re: 16600 consistently crashing on AMD Radeon VII

Postby bruce » Wed Aug 19, 2020 4:47 am

If the WU is crashing because of an overclocked GPU, there's nothing FAH can do about it except to ask folks not to overclock. If it's crashing because of defective hardware, we can admonish you to RMA the hardware. If it's crashing because of a defect in the driver, we can ask you to convince the manufacturer to build good drivers.

If there's a defect in the FAHCore or in the construction of the WU, there's an ongoing project to collect the error reports returned from errors like yours and fix the associated problem(s). FAH does pay attention to those error reports and through them, science can do a better job in the future although I can't promise the fix will be rolled out soon enough to satisfy the folks making the reports.

Re: 16600 consistently crashing on AMD Radeon VII

Postby UofM.MartinK » Wed Aug 19, 2020 5:55 am

Bruce, I appreciate your very valuable contributions, but in this case, it's clear that many AMD models trash 16600 WUs. It has nothing to do with overclocking or individual hardware issues, and even the driver (version) might not be at fault: it happens across all drivers and operating systems.

It's actually confirmed that there is something like an "incompatibility" of this project with many AMD models. According to SlinkyDolphinClock on Discord the day before yesterday, it might have slipped through Beta & Advanced testing because it was tested on an old FAH core - that's at least one hypothesis very actively discussed on Slack with the lead developers, and some patch is in the works.

Back to business: My previous post stated the observation that I have now encountered at least one p16600 WU which internally has a significantly higher limit for "Max number of attempts to resume from last checkpoint reached." (usually 3, but project:16600 run:0 clone:1430 gen:235 resumed 8 times and finally completed). Later p16600 WUs had the old internal restart limit of 3 again and thus were sent back as "faulty" because they only made it to 15%, 32% or 42% before the "resumes" were used up.
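
To compare restart limits across WUs, the relevant numbers can be pulled out of the client log. The following is only a sketch under assumptions: the three line formats are taken from core 0.0.11 logs posted in this thread, other core versions may word them differently, and `summarize` is a hypothetical helper name.

```python
import re

# Assumed line formats, taken from core 0.0.11 logs in this thread.
RESTART = re.compile(r"FahCore returned: CORE_RESTART")
PROGRESS = re.compile(r"Completed \d+ out of \d+ steps \((\d+)%\)")
ABORT = re.compile(r"Max number of attempts to resume from last checkpoint \((\d+)\) reached")

def summarize(lines):
    """Return restart count, highest progress seen, and the reported resume limit (or None)."""
    restarts, best, limit = 0, 0, None
    for line in lines:
        if RESTART.search(line):
            restarts += 1
        progress = PROGRESS.search(line)
        if progress:
            best = max(best, int(progress.group(1)))
        abort = ABORT.search(line)
        if abort:
            limit = int(abort.group(1))
    return {"restarts": restarts, "max_percent": best, "resume_limit": limit}

sample = [
    "20:53:56:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)",
    "20:54:07:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)",
    "21:03:07:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)",
    "21:04:07:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)",
    "21:05:53:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)",
    "21:07:04:WU01:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.",
]
summary = summarize(sample)
print(summary)
```

Feeding it the lines of a single WU at a time gives one record per WU, so a WU with an unusually high number of restarts and no abort line stands out immediately.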

bruce wrote:If there's a defect in the FAHCore or in the construction of the WU, there's an ongoing project to collect the error reports returned from errors like yours and fix the associated problem(s). FAH does pay attention to those error reports and through them, science can do a better job in the future although I can't promise the fix will be rolled out soon enough to satisfy the folks making the reports.


Well, that seems to apply in this case. But this has been going on since August 3rd, so yes, not very satisfying - and "soon enough" is in the eye of the beholder :)

Now all I want to know is whether that was a deliberate change to let these "problematic" AMD GPU models complete p16600 WUs (perhaps because they serve some sort of purpose after all?) or if this was just a fluke and there is no value in processing them with an AMD card.

Re: 16600 consistently crashing on AMD Radeon VII

Postby Nuitari » Wed Aug 19, 2020 6:18 am

UofM.MartinK wrote:
Code:
grep -h logs/* log.txt -e '^\*' -e 'project:16600' -e 'project:13421'



I did the grep. The forum has a limit of 60000 characters, so I put it in this gist
https://gist.github.com/Nuitari/1306a2a ... 6ecaad7f6e

Rig 1 (5x rx570, 1x rx560, 1x carrizo APU)
Rig 2 (1x RX560 "OC version", 3x RX570)
Rig 3, 1x NVIDIA 660
Nuitari
 
Posts: 79
Joined: Sun Jun 09, 2019 5:03 am

Re: 16600 consistently crashing on AMD Radeon VII

Postby n_w95482 » Wed Aug 19, 2020 7:06 am

Another one here having issues with 16600 on an AMD GPU. I'm running a Sapphire Pulse RX 580 8 GB in my home theater PC, underclocked to RX 480 levels (-7%). Here's a tally of the WUs it's worked on in the last two weeks:

13421: 72 finished, 1 failed
13423: 13 finished, 0 failed
16600: 10 finished, 66 failed
16920: 1 finished, 0 failed
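
Expressed as failure rates, the tally above looks like this. A small sketch only; the dictionary layout and the `failure_rate` helper are illustrative.

```python
# Tally from the post: project -> (finished, failed)
tally = {
    13421: (72, 1),
    13423: (13, 0),
    16600: (10, 66),
    16920: (1, 0),
}

def failure_rate(finished, failed):
    """Share of work units that failed; 0.0 if the project has no WUs."""
    total = finished + failed
    return failed / total if total else 0.0

rates = {project: failure_rate(*counts) for project, counts in tally.items()}
for project, rate in rates.items():
    print(f"p{project}: {rate:.1%} failed")
```

For 16600 that is 66 of 76 WUs, or about 87% failed, against roughly 1% for 13421.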

Here's the log of one that failed this afternoon:
Code:
20:12:44:WU01:FS01:Starting
20:12:44:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:12:44:WU01:FS01:Started FahCore on PID 11412
20:12:44:WU01:FS01:Core PID:10468
20:12:44:WU01:FS01:FahCore 0x22 started
20:12:45:WU01:FS01:0x22:*********************** Log Started 2020-08-18T20:12:44Z ***********************
20:12:45:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
20:12:45:WU01:FS01:0x22:       Core: Core22
20:12:45:WU01:FS01:0x22:       Type: 0x22
20:12:45:WU01:FS01:0x22:    Version: 0.0.11
20:12:45:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:12:45:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
20:12:45:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
20:12:45:WU01:FS01:0x22:       Date: Jun 26 2020
20:12:45:WU01:FS01:0x22:       Time: 19:49:16
20:12:45:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
20:12:45:WU01:FS01:0x22:     Branch: core22-0.0.11
20:12:45:WU01:FS01:0x22:   Compiler: Visual C++ 2015
20:12:45:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:12:45:WU01:FS01:0x22:   Platform: win32 10
20:12:45:WU01:FS01:0x22:       Bits: 64
20:12:45:WU01:FS01:0x22:       Mode: Release
20:12:45:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
20:12:45:WU01:FS01:0x22:             <peastman@stanford.edu>
20:12:45:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 11412 -checkpoint 15
20:12:45:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:12:45:WU01:FS01:0x22:************************************ libFAH ************************************
20:12:45:WU01:FS01:0x22:       Date: Jun 26 2020
20:12:45:WU01:FS01:0x22:       Time: 19:47:12
20:12:45:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
20:12:45:WU01:FS01:0x22:     Branch: HEAD
20:12:45:WU01:FS01:0x22:   Compiler: Visual C++ 2015
20:12:45:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:12:45:WU01:FS01:0x22:   Platform: win32 10
20:12:45:WU01:FS01:0x22:       Bits: 64
20:12:45:WU01:FS01:0x22:       Mode: Release
20:12:45:WU01:FS01:0x22:************************************ CBang *************************************
20:12:45:WU01:FS01:0x22:       Date: Jun 26 2020
20:12:45:WU01:FS01:0x22:       Time: 19:46:11
20:12:45:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
20:12:45:WU01:FS01:0x22:     Branch: master
20:12:45:WU01:FS01:0x22:   Compiler: Visual C++ 2015
20:12:45:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:12:45:WU01:FS01:0x22:   Platform: win32 10
20:12:45:WU01:FS01:0x22:       Bits: 64
20:12:45:WU01:FS01:0x22:       Mode: Release
20:12:45:WU01:FS01:0x22:************************************ System ************************************
20:12:45:WU01:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
20:12:45:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
20:12:45:WU01:FS01:0x22:       CPUs: 12
20:12:45:WU01:FS01:0x22:     Memory: 15.95GiB
20:12:45:WU01:FS01:0x22:Free Memory: 13.43GiB
20:12:45:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
20:12:45:WU01:FS01:0x22: OS Version: 6.2
20:12:45:WU01:FS01:0x22:Has Battery: false
20:12:45:WU01:FS01:0x22: On Battery: false
20:12:45:WU01:FS01:0x22: UTC Offset: -7
20:12:45:WU01:FS01:0x22:        PID: 10468
20:12:45:WU01:FS01:0x22:        CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
20:12:45:WU01:FS01:0x22:********************************************************************************
20:12:45:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 933, Gen 384)
20:12:45:WU01:FS01:0x22:Unit: 0x000001b08f59f36f5ec36911c061f769
20:12:45:WU01:FS01:0x22:Reading tar file core.xml
20:12:45:WU01:FS01:0x22:Reading tar file integrator.xml
20:12:45:WU01:FS01:0x22:Reading tar file state.xml
20:12:46:WU01:FS01:0x22:Reading tar file system.xml
20:12:47:WU01:FS01:0x22:Digital signatures verified
20:12:47:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
20:12:47:WU01:FS01:0x22:Version 0.0.11
20:12:47:WU01:FS01:0x22:  Checkpoint write interval: 25000 steps (5%) [20 total]
20:12:47:WU01:FS01:0x22:  JSON viewer frame write interval: 5000 steps (1%) [100 total]
20:12:47:WU01:FS01:0x22:  XTC frame write interval: 20000 steps (4%) [25 total]
20:12:47:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
20:13:05:WU01:FS01:0x22:Completed 0 out of 500000 steps (0%)
20:14:33:WU01:FS01:0x22:Completed 5000 out of 500000 steps (1%)
20:15:59:WU01:FS01:0x22:Completed 10000 out of 500000 steps (2%)
20:17:25:WU01:FS01:0x22:Completed 15000 out of 500000 steps (3%)
20:18:51:WU01:FS01:0x22:Completed 20000 out of 500000 steps (4%)
20:20:17:WU01:FS01:0x22:Completed 25000 out of 500000 steps (5%)
20:21:45:WU01:FS01:0x22:Completed 30000 out of 500000 steps (6%)
20:23:12:WU01:FS01:0x22:Completed 35000 out of 500000 steps (7%)
20:24:38:WU01:FS01:0x22:Completed 40000 out of 500000 steps (8%)
20:26:05:WU01:FS01:0x22:Completed 45000 out of 500000 steps (9%)
20:27:31:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
20:28:58:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
20:30:25:WU01:FS01:0x22:Completed 60000 out of 500000 steps (12%)
20:31:51:WU01:FS01:0x22:Completed 65000 out of 500000 steps (13%)
20:33:18:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
20:34:44:WU01:FS01:0x22:Completed 75000 out of 500000 steps (15%)
20:36:12:WU01:FS01:0x22:Completed 80000 out of 500000 steps (16%)
20:37:38:WU01:FS01:0x22:Completed 85000 out of 500000 steps (17%)
20:39:04:WU01:FS01:0x22:Completed 90000 out of 500000 steps (18%)
20:40:31:WU01:FS01:0x22:Completed 95000 out of 500000 steps (19%)
20:41:57:WU01:FS01:0x22:Completed 100000 out of 500000 steps (20%)
20:43:25:WU01:FS01:0x22:Completed 105000 out of 500000 steps (21%)
20:44:52:WU01:FS01:0x22:Completed 110000 out of 500000 steps (22%)
20:46:18:WU01:FS01:0x22:Completed 115000 out of 500000 steps (23%)
20:47:44:WU01:FS01:0x22:Completed 120000 out of 500000 steps (24%)
20:49:11:WU01:FS01:0x22:Completed 125000 out of 500000 steps (25%)
20:50:39:WU01:FS01:0x22:Completed 130000 out of 500000 steps (26%)
20:52:05:WU01:FS01:0x22:Completed 135000 out of 500000 steps (27%)
20:53:56:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)
20:54:07:WU01:FS01:0x22:An exception occurred at step 140057: Particle coordinate is nan
20:54:07:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
20:54:07:WU01:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
20:54:07:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
20:54:08:WU01:FS01:Starting
20:54:08:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:54:08:WU01:FS01:Started FahCore on PID 3516
20:54:08:WU01:FS01:Core PID:8688
20:54:08:WU01:FS01:FahCore 0x22 started
20:54:08:WU01:FS01:0x22:*********************** Log Started 2020-08-18T20:54:08Z ***********************
20:54:08:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
20:54:08:WU01:FS01:0x22:       Core: Core22
20:54:08:WU01:FS01:0x22:       Type: 0x22
20:54:08:WU01:FS01:0x22:    Version: 0.0.11
20:54:08:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:54:08:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
20:54:08:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
20:54:08:WU01:FS01:0x22:       Date: Jun 26 2020
20:54:08:WU01:FS01:0x22:       Time: 19:49:16
20:54:08:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
20:54:08:WU01:FS01:0x22:     Branch: core22-0.0.11
20:54:08:WU01:FS01:0x22:   Compiler: Visual C++ 2015
20:54:08:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:54:08:WU01:FS01:0x22:   Platform: win32 10
20:54:08:WU01:FS01:0x22:       Bits: 64
20:54:08:WU01:FS01:0x22:       Mode: Release
20:54:08:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
20:54:08:WU01:FS01:0x22:             <peastman@stanford.edu>
20:54:08:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 3516 -checkpoint 15
20:54:08:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:54:08:WU01:FS01:0x22:************************************ libFAH ************************************
20:54:08:WU01:FS01:0x22:       Date: Jun 26 2020
20:54:08:WU01:FS01:0x22:       Time: 19:47:12
20:54:08:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
20:54:08:WU01:FS01:0x22:     Branch: HEAD
20:54:08:WU01:FS01:0x22:   Compiler: Visual C++ 2015
20:54:08:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:54:08:WU01:FS01:0x22:   Platform: win32 10
20:54:08:WU01:FS01:0x22:       Bits: 64
20:54:08:WU01:FS01:0x22:       Mode: Release
20:54:08:WU01:FS01:0x22:************************************ CBang *************************************
20:54:08:WU01:FS01:0x22:       Date: Jun 26 2020
20:54:08:WU01:FS01:0x22:       Time: 19:46:11
20:54:08:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
20:54:08:WU01:FS01:0x22:     Branch: master
20:54:08:WU01:FS01:0x22:   Compiler: Visual C++ 2015
20:54:08:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:54:08:WU01:FS01:0x22:   Platform: win32 10
20:54:08:WU01:FS01:0x22:       Bits: 64
20:54:08:WU01:FS01:0x22:       Mode: Release
20:54:08:WU01:FS01:0x22:************************************ System ************************************
20:54:08:WU01:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
20:54:08:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
20:54:08:WU01:FS01:0x22:       CPUs: 12
20:54:08:WU01:FS01:0x22:     Memory: 15.95GiB
20:54:08:WU01:FS01:0x22:Free Memory: 13.43GiB
20:54:08:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
20:54:08:WU01:FS01:0x22: OS Version: 6.2
20:54:08:WU01:FS01:0x22:Has Battery: false
20:54:08:WU01:FS01:0x22: On Battery: false
20:54:08:WU01:FS01:0x22: UTC Offset: -7
20:54:08:WU01:FS01:0x22:        PID: 8688
20:54:08:WU01:FS01:0x22:        CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
20:54:08:WU01:FS01:0x22:********************************************************************************
20:54:08:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 933, Gen 384)
20:54:08:WU01:FS01:0x22:Unit: 0x000001b08f59f36f5ec36911c061f769
20:54:08:WU01:FS01:0x22:Digital signatures verified
20:54:08:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
20:54:08:WU01:FS01:0x22:Version 0.0.11
20:54:08:WU01:FS01:0x22:  Checkpoint write interval: 25000 steps (5%) [20 total]
20:54:08:WU01:FS01:0x22:  JSON viewer frame write interval: 5000 steps (1%) [100 total]
20:54:08:WU01:FS01:0x22:  XTC frame write interval: 20000 steps (4%) [25 total]
20:54:08:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
20:54:27:WU01:FS01:0x22:Completed 125000 out of 500000 steps (25%)
20:55:53:WU01:FS01:0x22:Completed 130000 out of 500000 steps (26%)
20:57:20:WU01:FS01:0x22:Completed 135000 out of 500000 steps (27%)
20:58:47:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)
21:00:13:WU01:FS01:0x22:Completed 145000 out of 500000 steps (29%)
21:01:39:WU01:FS01:0x22:Completed 150000 out of 500000 steps (30%)
21:03:07:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)
21:04:06:WU01:FS01:0x22:An exception occurred at step 157627: Particle coordinate is nan
21:04:06:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
21:04:06:WU01:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
21:04:07:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
21:04:07:WU01:FS01:Starting
21:04:07:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
21:04:07:WU01:FS01:Started FahCore on PID 9160
21:04:07:WU01:FS01:Core PID:5532
21:04:07:WU01:FS01:FahCore 0x22 started
21:04:08:WU01:FS01:0x22:*********************** Log Started 2020-08-18T21:04:07Z ***********************
21:04:08:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
21:04:08:WU01:FS01:0x22:       Core: Core22
21:04:08:WU01:FS01:0x22:       Type: 0x22
21:04:08:WU01:FS01:0x22:    Version: 0.0.11
21:04:08:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:04:08:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
21:04:08:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
21:04:08:WU01:FS01:0x22:       Date: Jun 26 2020
21:04:08:WU01:FS01:0x22:       Time: 19:49:16
21:04:08:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
21:04:08:WU01:FS01:0x22:     Branch: core22-0.0.11
21:04:08:WU01:FS01:0x22:   Compiler: Visual C++ 2015
21:04:08:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
21:04:08:WU01:FS01:0x22:   Platform: win32 10
21:04:08:WU01:FS01:0x22:       Bits: 64
21:04:08:WU01:FS01:0x22:       Mode: Release
21:04:08:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
21:04:08:WU01:FS01:0x22:             <peastman@stanford.edu>
21:04:08:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 9160 -checkpoint 15
21:04:08:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
21:04:08:WU01:FS01:0x22:************************************ libFAH ************************************
21:04:08:WU01:FS01:0x22:       Date: Jun 26 2020
21:04:08:WU01:FS01:0x22:       Time: 19:47:12
21:04:08:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
21:04:08:WU01:FS01:0x22:     Branch: HEAD
21:04:08:WU01:FS01:0x22:   Compiler: Visual C++ 2015
21:04:08:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
21:04:08:WU01:FS01:0x22:   Platform: win32 10
21:04:08:WU01:FS01:0x22:       Bits: 64
21:04:08:WU01:FS01:0x22:       Mode: Release
21:04:08:WU01:FS01:0x22:************************************ CBang *************************************
21:04:08:WU01:FS01:0x22:       Date: Jun 26 2020
21:04:08:WU01:FS01:0x22:       Time: 19:46:11
21:04:08:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
21:04:08:WU01:FS01:0x22:     Branch: master
21:04:08:WU01:FS01:0x22:   Compiler: Visual C++ 2015
21:04:08:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
21:04:08:WU01:FS01:0x22:   Platform: win32 10
21:04:08:WU01:FS01:0x22:       Bits: 64
21:04:08:WU01:FS01:0x22:       Mode: Release
21:04:08:WU01:FS01:0x22:************************************ System ************************************
21:04:08:WU01:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
21:04:08:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
21:04:08:WU01:FS01:0x22:       CPUs: 12
21:04:08:WU01:FS01:0x22:     Memory: 15.95GiB
21:04:08:WU01:FS01:0x22:Free Memory: 13.43GiB
21:04:08:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
21:04:08:WU01:FS01:0x22: OS Version: 6.2
21:04:08:WU01:FS01:0x22:Has Battery: false
21:04:08:WU01:FS01:0x22: On Battery: false
21:04:08:WU01:FS01:0x22: UTC Offset: -7
21:04:08:WU01:FS01:0x22:        PID: 5532
21:04:08:WU01:FS01:0x22:        CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
21:04:08:WU01:FS01:0x22:********************************************************************************
21:04:08:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 933, Gen 384)
21:04:08:WU01:FS01:0x22:Unit: 0x000001b08f59f36f5ec36911c061f769
21:04:08:WU01:FS01:0x22:Digital signatures verified
21:04:08:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
21:04:08:WU01:FS01:0x22:Version 0.0.11
21:04:08:WU01:FS01:0x22:  Checkpoint write interval: 25000 steps (5%) [20 total]
21:04:08:WU01:FS01:0x22:  JSON viewer frame write interval: 5000 steps (1%) [100 total]
21:04:08:WU01:FS01:0x22:  XTC frame write interval: 20000 steps (4%) [25 total]
21:04:08:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
21:04:26:WU01:FS01:0x22:Completed 150000 out of 500000 steps (30%)
21:05:53:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)
21:07:04:WU01:FS01:0x22:An exception occurred at step 156623: Particle coordinate is nan
21:07:04:WU01:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
21:07:04:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
21:07:04:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
21:07:04:WU01:FS01:0x22:Saving result file science.log
21:07:04:WU01:FS01:0x22:Saving result file state.xml
21:07:07:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
21:07:07:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
21:07:07:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:933 gen:384 core:0x22 unit:0x000001b08f59f36f5ec36911c061f769
21:07:07:WU01:FS01:Uploading 19.64MiB to 143.89.243.111
21:07:07:WU01:FS01:Connecting to 143.89.243.111:8080


After that, it worked on and successfully finished five 13423's in a row. Right now, it's working on another 16600, cranking away at 582k PPD. It's almost halfway done and has restarted the core once:

Code:
04:24:03:WU01:FS01:Starting
04:24:03:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:24:03:WU01:FS01:Started FahCore on PID 10084
04:24:03:WU01:FS01:Core PID:11064
04:24:03:WU01:FS01:FahCore 0x22 started
04:24:04:WU01:FS01:0x22:*********************** Log Started 2020-08-19T04:24:03Z ***********************
04:24:04:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
04:24:04:WU01:FS01:0x22:       Core: Core22
04:24:04:WU01:FS01:0x22:       Type: 0x22
04:24:04:WU01:FS01:0x22:    Version: 0.0.11
04:24:04:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:24:04:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
04:24:04:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
04:24:04:WU01:FS01:0x22:       Date: Jun 26 2020
04:24:04:WU01:FS01:0x22:       Time: 19:49:16
04:24:04:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
04:24:04:WU01:FS01:0x22:     Branch: core22-0.0.11
04:24:04:WU01:FS01:0x22:   Compiler: Visual C++ 2015
04:24:04:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:24:04:WU01:FS01:0x22:   Platform: win32 10
04:24:04:WU01:FS01:0x22:       Bits: 64
04:24:04:WU01:FS01:0x22:       Mode: Release
04:24:04:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
04:24:04:WU01:FS01:0x22:             <peastman@stanford.edu>
04:24:04:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 10084 -checkpoint 15
04:24:04:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:24:04:WU01:FS01:0x22:************************************ libFAH ************************************
04:24:04:WU01:FS01:0x22:       Date: Jun 26 2020
04:24:04:WU01:FS01:0x22:       Time: 19:47:12
04:24:04:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
04:24:04:WU01:FS01:0x22:     Branch: HEAD
04:24:04:WU01:FS01:0x22:   Compiler: Visual C++ 2015
04:24:04:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:24:04:WU01:FS01:0x22:   Platform: win32 10
04:24:04:WU01:FS01:0x22:       Bits: 64
04:24:04:WU01:FS01:0x22:       Mode: Release
04:24:04:WU01:FS01:0x22:************************************ CBang *************************************
04:24:04:WU01:FS01:0x22:       Date: Jun 26 2020
04:24:04:WU01:FS01:0x22:       Time: 19:46:11
04:24:04:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
04:24:04:WU01:FS01:0x22:     Branch: master
04:24:04:WU01:FS01:0x22:   Compiler: Visual C++ 2015
04:24:04:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:24:04:WU01:FS01:0x22:   Platform: win32 10
04:24:04:WU01:FS01:0x22:       Bits: 64
04:24:04:WU01:FS01:0x22:       Mode: Release
04:24:04:WU01:FS01:0x22:************************************ System ************************************
04:24:04:WU01:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
04:24:04:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
04:24:04:WU01:FS01:0x22:       CPUs: 12
04:24:04:WU01:FS01:0x22:     Memory: 15.95GiB
04:24:04:WU01:FS01:0x22:Free Memory: 13.25GiB
04:24:04:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
04:24:04:WU01:FS01:0x22: OS Version: 6.2
04:24:04:WU01:FS01:0x22:Has Battery: false
04:24:04:WU01:FS01:0x22: On Battery: false
04:24:04:WU01:FS01:0x22: UTC Offset: -7
04:24:04:WU01:FS01:0x22:        PID: 11064
04:24:04:WU01:FS01:0x22:        CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
04:24:04:WU01:FS01:0x22:********************************************************************************
04:24:04:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1566, Gen 116)
04:24:04:WU01:FS01:0x22:Unit: 0x000000898f59f36f5ec36910c82d72db
04:24:04:WU01:FS01:0x22:Reading tar file core.xml
04:24:04:WU01:FS01:0x22:Reading tar file integrator.xml
04:24:04:WU01:FS01:0x22:Reading tar file state.xml
04:24:05:WU01:FS01:0x22:Reading tar file system.xml
04:24:06:WU01:FS01:0x22:Digital signatures verified
04:24:06:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
04:24:06:WU01:FS01:0x22:Version 0.0.11
04:24:06:WU01:FS01:0x22:  Checkpoint write interval: 25000 steps (5%) [20 total]
04:24:06:WU01:FS01:0x22:  JSON viewer frame write interval: 5000 steps (1%) [100 total]
04:24:06:WU01:FS01:0x22:  XTC frame write interval: 20000 steps (4%) [25 total]
04:24:06:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
04:24:24:WU01:FS01:0x22:Completed 0 out of 500000 steps (0%)
04:25:51:WU01:FS01:0x22:Completed 5000 out of 500000 steps (1%)
04:27:17:WU01:FS01:0x22:Completed 10000 out of 500000 steps (2%)
04:28:42:WU01:FS01:0x22:Completed 15000 out of 500000 steps (3%)
04:30:08:WU01:FS01:0x22:Completed 20000 out of 500000 steps (4%)
04:31:35:WU01:FS01:0x22:Completed 25000 out of 500000 steps (5%)
04:33:03:WU01:FS01:0x22:Completed 30000 out of 500000 steps (6%)
04:34:30:WU01:FS01:0x22:Completed 35000 out of 500000 steps (7%)
04:35:56:WU01:FS01:0x22:Completed 40000 out of 500000 steps (8%)
04:37:23:WU01:FS01:0x22:Completed 45000 out of 500000 steps (9%)
04:38:50:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
04:40:19:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
04:41:45:WU01:FS01:0x22:Completed 60000 out of 500000 steps (12%)
04:43:12:WU01:FS01:0x22:Completed 65000 out of 500000 steps (13%)
04:44:39:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
04:45:19:WU01:FS01:0x22:An exception occurred at step 72036: Particle coordinate is nan
04:45:19:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:45:19:WU01:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
04:45:20:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
04:45:20:WU01:FS01:Starting
04:45:20:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:45:20:WU01:FS01:Started FahCore on PID 6468
04:45:20:WU01:FS01:Core PID:5160
04:45:20:WU01:FS01:FahCore 0x22 started
04:45:21:WU01:FS01:0x22:*********************** Log Started 2020-08-19T04:45:20Z ***********************
04:45:21:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
04:45:21:WU01:FS01:0x22:       Core: Core22
04:45:21:WU01:FS01:0x22:       Type: 0x22
04:45:21:WU01:FS01:0x22:    Version: 0.0.11
04:45:21:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:45:21:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
04:45:21:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
04:45:21:WU01:FS01:0x22:       Date: Jun 26 2020
04:45:21:WU01:FS01:0x22:       Time: 19:49:16
04:45:21:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
04:45:21:WU01:FS01:0x22:     Branch: core22-0.0.11
04:45:21:WU01:FS01:0x22:   Compiler: Visual C++ 2015
04:45:21:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:45:21:WU01:FS01:0x22:   Platform: win32 10
04:45:21:WU01:FS01:0x22:       Bits: 64
04:45:21:WU01:FS01:0x22:       Mode: Release
04:45:21:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
04:45:21:WU01:FS01:0x22:             <peastman@stanford.edu>
04:45:21:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 6468 -checkpoint 15
04:45:21:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:45:21:WU01:FS01:0x22:************************************ libFAH ************************************
04:45:21:WU01:FS01:0x22:       Date: Jun 26 2020
04:45:21:WU01:FS01:0x22:       Time: 19:47:12
04:45:21:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
04:45:21:WU01:FS01:0x22:     Branch: HEAD
04:45:21:WU01:FS01:0x22:   Compiler: Visual C++ 2015
04:45:21:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:45:21:WU01:FS01:0x22:   Platform: win32 10
04:45:21:WU01:FS01:0x22:       Bits: 64
04:45:21:WU01:FS01:0x22:       Mode: Release
04:45:21:WU01:FS01:0x22:************************************ CBang *************************************
04:45:21:WU01:FS01:0x22:       Date: Jun 26 2020
04:45:21:WU01:FS01:0x22:       Time: 19:46:11
04:45:21:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
04:45:21:WU01:FS01:0x22:     Branch: master
04:45:21:WU01:FS01:0x22:   Compiler: Visual C++ 2015
04:45:21:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:45:21:WU01:FS01:0x22:   Platform: win32 10
04:45:21:WU01:FS01:0x22:       Bits: 64
04:45:21:WU01:FS01:0x22:       Mode: Release
04:45:21:WU01:FS01:0x22:************************************ System ************************************
04:45:21:WU01:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
04:45:21:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
04:45:21:WU01:FS01:0x22:       CPUs: 12
04:45:21:WU01:FS01:0x22:     Memory: 15.95GiB
04:45:21:WU01:FS01:0x22:Free Memory: 13.38GiB
04:45:21:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
04:45:21:WU01:FS01:0x22: OS Version: 6.2
04:45:21:WU01:FS01:0x22:Has Battery: false
04:45:21:WU01:FS01:0x22: On Battery: false
04:45:21:WU01:FS01:0x22: UTC Offset: -7
04:45:21:WU01:FS01:0x22:        PID: 5160
04:45:21:WU01:FS01:0x22:        CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
04:45:21:WU01:FS01:0x22:********************************************************************************
04:45:21:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1566, Gen 116)
04:45:21:WU01:FS01:0x22:Unit: 0x000000898f59f36f5ec36910c82d72db
04:45:21:WU01:FS01:0x22:Digital signatures verified
04:45:21:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
04:45:21:WU01:FS01:0x22:Version 0.0.11
04:45:21:WU01:FS01:0x22:  Checkpoint write interval: 25000 steps (5%) [20 total]
04:45:21:WU01:FS01:0x22:  JSON viewer frame write interval: 5000 steps (1%) [100 total]
04:45:21:WU01:FS01:0x22:  XTC frame write interval: 20000 steps (4%) [25 total]
04:45:21:WU01:FS01:0x22:  Global context and integrator variables write interval: disabled
04:45:39:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
04:47:06:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
04:48:32:WU01:FS01:0x22:Completed 60000 out of 500000 steps (12%)
04:49:59:WU01:FS01:0x22:Completed 65000 out of 500000 steps (13%)
04:51:25:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
04:52:51:WU01:FS01:0x22:Completed 75000 out of 500000 steps (15%)
04:54:19:WU01:FS01:0x22:Completed 80000 out of 500000 steps (16%)
04:55:45:WU01:FS01:0x22:Completed 85000 out of 500000 steps (17%)
04:57:12:WU01:FS01:0x22:Completed 90000 out of 500000 steps (18%)
04:58:38:WU01:FS01:0x22:Completed 95000 out of 500000 steps (19%)
05:00:05:WU01:FS01:0x22:Completed 100000 out of 500000 steps (20%)
05:01:33:WU01:FS01:0x22:Completed 105000 out of 500000 steps (21%)
05:02:59:WU01:FS01:0x22:Completed 110000 out of 500000 steps (22%)
05:04:26:WU01:FS01:0x22:Completed 115000 out of 500000 steps (23%)
05:05:52:WU01:FS01:0x22:Completed 120000 out of 500000 steps (24%)
05:07:19:WU01:FS01:0x22:Completed 125000 out of 500000 steps (25%)
05:08:47:WU01:FS01:0x22:Completed 130000 out of 500000 steps (26%)
05:10:14:WU01:FS01:0x22:Completed 135000 out of 500000 steps (27%)
05:11:40:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)
05:13:07:WU01:FS01:0x22:Completed 145000 out of 500000 steps (29%)
05:14:34:WU01:FS01:0x22:Completed 150000 out of 500000 steps (30%)
05:16:03:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)
05:17:29:WU01:FS01:0x22:Completed 160000 out of 500000 steps (32%)
05:18:57:WU01:FS01:0x22:Completed 165000 out of 500000 steps (33%)
05:20:24:WU01:FS01:0x22:Completed 170000 out of 500000 steps (34%)
05:21:50:WU01:FS01:0x22:Completed 175000 out of 500000 steps (35%)
05:23:19:WU01:FS01:0x22:Completed 180000 out of 500000 steps (36%)
05:24:45:WU01:FS01:0x22:Completed 185000 out of 500000 steps (37%)
05:26:12:WU01:FS01:0x22:Completed 190000 out of 500000 steps (38%)
05:27:38:WU01:FS01:0x22:Completed 195000 out of 500000 steps (39%)
05:29:04:WU01:FS01:0x22:Completed 200000 out of 500000 steps (40%)
05:30:32:WU01:FS01:0x22:Completed 205000 out of 500000 steps (41%)
05:31:58:WU01:FS01:0x22:Completed 210000 out of 500000 steps (42%)
05:33:26:WU01:FS01:0x22:Completed 215000 out of 500000 steps (43%)
05:34:53:WU01:FS01:0x22:Completed 220000 out of 500000 steps (44%)
05:36:18:WU01:FS01:0x22:Completed 225000 out of 500000 steps (45%)
05:37:45:WU01:FS01:0x22:Completed 230000 out of 500000 steps (46%)
05:39:11:WU01:FS01:0x22:Completed 235000 out of 500000 steps (47%)
05:40:36:WU01:FS01:0x22:Completed 240000 out of 500000 steps (48%)
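
The log above shows one NaN exception (step 72036) followed by a restart from the 50000-step checkpoint. As a rough illustration of how much work such a restart throws away, here is a minimal Python sketch that scans FAHClient-style log lines. The function name `find_restarts`, the regular expressions, and the shortened `LOG` excerpt are my own assumptions for illustration, not part of FAHClient; the sketch also assumes that the first "Completed" line after an exception reports the checkpoint step (as it does in this log) and that the log does not cross midnight.

```python
import re
from datetime import datetime

# Shortened excerpt of the log above (illustrative subset only).
LOG = """\
04:38:50:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
04:40:19:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
04:44:39:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
04:45:19:WU01:FS01:0x22:An exception occurred at step 72036: Particle coordinate is nan
04:45:39:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
04:47:06:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
"""

# Hypothetical patterns matching the two line types we care about.
PROGRESS = re.compile(r"^(\d\d:\d\d:\d\d):.*Completed (\d+) out of (\d+) steps")
EXCEPTION = re.compile(r"^(\d\d:\d\d:\d\d):.*An exception occurred at step (\d+)")


def parse_time(text):
    """Parse an HH:MM:SS log timestamp (the log carries no date)."""
    return datetime.strptime(text, "%H:%M:%S")


def find_restarts(log_text):
    """Return one dict per exception with the steps and seconds lost."""
    first_reached = {}  # step -> time that step was first reported
    pending = None      # (time, step) of an exception awaiting its restart
    restarts = []
    for line in log_text.splitlines():
        match = EXCEPTION.match(line)
        if match:
            pending = (parse_time(match.group(1)), int(match.group(2)))
            continue
        match = PROGRESS.match(line)
        if not match:
            continue
        when, step = parse_time(match.group(1)), int(match.group(2))
        if pending is not None:
            fail_time, fail_step = pending
            # Assumption: this first progress line is the resumed checkpoint.
            reached = first_reached.get(step)
            lost_seconds = None
            if reached is not None:
                lost_seconds = (fail_time - reached).total_seconds()
            restarts.append({
                "failed_step": fail_step,
                "resumed_step": step,
                "lost_steps": fail_step - step,
                "lost_seconds": lost_seconds,
            })
            pending = None
        first_reached.setdefault(step, when)
    return restarts


restarts = find_restarts(LOG)
for r in restarts:
    print(r)
```

For this excerpt the sketch reports 22036 lost steps and 389 seconds of repeated work, which is consistent with the 25000-step checkpoint interval shown in the log.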


Things I've tried so far over this last weekend - 18 failures since then:
    Update drivers from 20.7.2 to 20.8.2
    Undo my SoC undervolt (0.975v) - manually set to 1.1v, auto went to 1.25v, yikes!
    Set VDDP and VDDG voltages to auto - the former went down, the latter went up
    Stop CPU folding
    Update BIOS to latest version w/AGESA 1.0.0.6
    Run Memtest86 for 20+ hours - 0 errors

Until I looked at HFM's work unit history today and later saw this thread, I strongly suspected a hardware issue, hence most of the troubleshooting steps above.

My main PC's GTX 1080 Ti has had no issue processing 16600 WUs - 63 so far this month. It hasn't failed a WU since early July, even with it running slightly above 2 GHz :D.
Folding since December 2003. In memory of my mother, who lost her battle with cancer.

n_w95482
 
Posts: 65
Joined: Tue May 01, 2012 1:46 am
Location: California

Re: 16600 consistently crashing on AMD Radeon VII

Postby PantherX » Wed Aug 19, 2020 10:13 am

UofM.MartinK wrote:...Now all I want to know is whether that was a deliberate change to let these "problematic" AMD GPU models complete p16600 WUs (perhaps because they serve some sort of purpose after all?) or if this was just a fluke and there is no value in processing them with an AMD card.

Changes were made in FahCore_22 to fix the AMD issue, and early testing showed promising results. Because researchers have access to only a limited range of hardware, the fix appeared successful in their tests. Beta testing didn't surface these issues either; they only appeared after the release to Full. Reports like yours on this forum help surface these issues: testing on every single GPU is not feasible, so working with the F@H community to identify and solve them is really valuable :D

If the WU can fold successfully, that's counted towards science. If it can't, the failures can still be seen by the researcher, so it isn't in vain.

Please note that the researcher is aware of this issue on AMD GPUs and is allocating dedicated resources to investigate it further.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
PantherX
Site Moderator
 
Posts: 6765
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Wed Aug 19, 2020 10:24 am

The project has been disabled on all AMD cards except Navi. Please let us know if you still receive a new p16600 WU on an AMD GPU :)
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby Neil-B » Wed Aug 19, 2020 10:26 am

OK, so can I point out that we have one poster whose data shows an issue with 16600 only, and another whose data shows issues across the board, including 16600. So there may be an issue with 16600 in some fashion, but there may also be an issue with the setup of one of the rigs, or a wider incompatibility between the current core and that rig. If it were simply 16600 that was failing, then yes, look to the project; but it isn't, so looking to the rig or the core may need to be considered, even if that is unpalatable. Blaming one project for failures across the board seems odd.
Last edited by Neil-B on Wed Aug 19, 2020 11:09 am, edited 1 time in total.
1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent, Quadro K420 1GB, FAH 7.6.13
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro, Quadro M1000M 2GB, FAH 7.6.13
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro, GTX 750Ti 2GB, FAH 7.6.13
Neil-B
 
Posts: 1409
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Wed Aug 19, 2020 10:27 am

The failure rate of 16600 is 32%, which is very high :)
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby Neil-B » Wed Aug 19, 2020 10:42 am

... and there are high failure rates on 13421 (30 of 37 failed) and 13423 (8 of 9 failed) on the same rig. That doesn't feel like an issue with just the 16600 project as far as that rig is concerned. Yes, the 34 of 38 failures on 16600 may be down to an issue with the project, but with the wider failures it feels like a rig issue, or possibly an incompatibility between the core and the rig.
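
To put the per-project numbers side by side, here is a small Python sketch that computes failure rates from the counts in UofM.MartinK's RX580 breakdown earlier in the thread (38/37/9/1 WUs, with 4/7/1/1 completed). The dictionary layout and the `failure_rate` helper are hypothetical names chosen for illustration. These are one rig's counts, so they are not directly comparable to the 32% project-wide figure quoted above.

```python
# (total WUs, completed WUs) per project, as quoted from the RX580 breakdown.
COUNTS = {
    "p16600": (38, 4),
    "p13421": (37, 7),
    "p13423": (9, 1),
    "p16920": (1, 1),
}


def failure_rate(total, completed):
    """Fraction of work units that did not complete."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - completed) / total


rates = {name: failure_rate(t, c) for name, (t, c) in COUNTS.items()}

for name, (total, completed) in COUNTS.items():
    failed = total - completed
    print(f"{name}: {failed}/{total} failed ({rates[name]:.1%})")

# Aggregate over all projects on this one rig.
all_total = sum(t for t, _ in COUNTS.values())
all_completed = sum(c for _, c in COUNTS.values())
overall = failure_rate(all_total, all_completed)
print(f"overall: {all_total - all_completed}/{all_total} failed ({overall:.1%})")
```

On this rig the three affected projects all fail at roughly 80-90%, which is the pattern being discussed: the failures are not confined to p16600.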
Neil-B
 
Posts: 1409
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: 16600 consistently crashing on AMD Radeon VII

Postby muziqaz » Wed Aug 19, 2020 10:52 am

This is not just this particular machine
muziqaz
 
Posts: 690
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Postby PantherX » Wed Aug 19, 2020 10:53 am

Regarding Projects 13421 and 13423: since they are highly experimental, they have higher-than-normal failure rates. John is aware of this and is keeping a close eye on the failures. As long as data is successfully uploaded, that's still valuable work being done.

The next version of FahCore_22 (version 0.0.12 or higher) is planned to help with this by running some automated tests upon failure, so that cases which can't be reproduced in the researchers' labs or on their available hardware can still be addressed.
PantherX
Site Moderator
 
Posts: 6765
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud
