WU stalled...after it was done?

Posted: Tue Apr 21, 2020 10:11 am
by thecomputerdude

06:04:31:WU01:FS00:0xa7:Completed 249557 out of 250000 steps (99%)
06:25:07:WU01:FS00:0xa7:Completed 250000 out of 250000 steps (100%)
06:35:07:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
06:35:07:WARNING:WU01:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
A simple pause/restart cleared the error, though (the WU had been left stuck in the queue for a few hours):

09:59:31:FS00:Unpaused
09:59:31:WU01:FS00:Starting
09:59:31:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Administrator\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 706 -lifeline 5868 -checkpoint 15 -np 24
09:59:31:WU01:FS00:Started FahCore on PID 1612
09:59:31:WU01:FS00:Core PID:7020
09:59:31:WU01:FS00:FahCore 0xa7 started
09:59:32:WU01:FS00:0xa7:*********************** Log Started 2020-04-21T09:59:31Z ***********************
09:59:32:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
09:59:32:WU01:FS00:0xa7:       Type: 0xa7
09:59:32:WU01:FS00:0xa7:       Core: Gromacs
09:59:32:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 1612 -checkpoint 15 -np
09:59:32:WU01:FS00:0xa7:             24
09:59:32:WU01:FS00:0xa7:************************************ CBang *************************************
09:59:32:WU01:FS00:0xa7:       Date: Oct 26 2019
09:59:32:WU01:FS00:0xa7:       Time: 01:38:25
09:59:32:WU01:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
09:59:32:WU01:FS00:0xa7:     Branch: master
09:59:32:WU01:FS00:0xa7:   Compiler: Visual C++ 2008
09:59:32:WU01:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:59:32:WU01:FS00:0xa7:   Platform: win32 10
09:59:32:WU01:FS00:0xa7:       Bits: 64
09:59:32:WU01:FS00:0xa7:       Mode: Release
09:59:32:WU01:FS00:0xa7:************************************ System ************************************
09:59:32:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
09:59:32:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
09:59:32:WU01:FS00:0xa7:       CPUs: 24
09:59:32:WU01:FS00:0xa7:     Memory: 127.31GiB
09:59:32:WU01:FS00:0xa7:Free Memory: 123.93GiB
09:59:32:WU01:FS00:0xa7:    Threads: WINDOWS_THREADS
09:59:32:WU01:FS00:0xa7: OS Version: 6.2
09:59:32:WU01:FS00:0xa7:Has Battery: false
09:59:32:WU01:FS00:0xa7: On Battery: false
09:59:32:WU01:FS00:0xa7: UTC Offset: -7
09:59:32:WU01:FS00:0xa7:        PID: 7020
09:59:32:WU01:FS00:0xa7:        CWD: C:\Users\Administrator\AppData\Roaming\FAHClient\work
09:59:32:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
09:59:32:WU01:FS00:0xa7:    Version: 0.0.18
09:59:32:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
09:59:32:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
09:59:32:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
09:59:32:WU01:FS00:0xa7:       Date: Oct 26 2019
09:59:32:WU01:FS00:0xa7:       Time: 01:52:30
09:59:32:WU01:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
09:59:32:WU01:FS00:0xa7:     Branch: master
09:59:32:WU01:FS00:0xa7:   Compiler: Visual C++ 2008
09:59:32:WU01:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:59:32:WU01:FS00:0xa7:   Platform: win32 10
09:59:32:WU01:FS00:0xa7:       Bits: 64
09:59:32:WU01:FS00:0xa7:       Mode: Release
09:59:32:WU01:FS00:0xa7:************************************ Build *************************************
09:59:32:WU01:FS00:0xa7:       SIMD: avx_256
09:59:32:WU01:FS00:0xa7:********************************************************************************
09:59:32:WU01:FS00:0xa7:Project: 16802 (Run 5, Clone 290, Gen 3)
09:59:32:WU01:FS00:0xa7:Unit: 0x0000000581d59d695e95d85d111a5faa
09:59:32:WU01:FS00:0xa7:Digital signatures verified
09:59:32:WU01:FS00:0xa7:Calling: mdrun -s frame3.tpr -o frame3.trr -cpi state.cpt -cpt 15 -nt 24
09:59:37:WU01:FS00:0xa7:Steps: first=750000 total=250000
09:59:42:WU01:FS00:0xa7:Completed 250001 out of 250000 steps (100%)
09:59:42:WU01:FS00:0xa7:Saving result file ..\logfile_01.txt
09:59:42:WU01:FS00:0xa7:Saving result file ener.edr
09:59:42:WU01:FS00:0xa7:Saving result file frame3.trr
09:59:42:WU01:FS00:0xa7:Saving result file md.log
09:59:42:WU01:FS00:0xa7:Saving result file science.log
09:59:42:WU01:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
09:59:43:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
09:59:43:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:16802 run:5 clone:290 gen:3 core:0xa7 unit:0x0000000581d59d695e95d85d111a5faa
Is this just buggy behavior or something that I can correct? It's my first time seeing it happen (ever), on a new client/machine.
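
For anyone who wants to check whether this happens again, here is a minimal sketch that scans client log lines for "FahCore returned" entries and flags anything other than FINISHED_UNIT. The line format is taken from the log above; the function names are made up for illustration, and treating only FINISHED_UNIT as a normal return is an assumption, since other codes may be harmless too.

```python
import re

# Matches client lines such as:
#   06:35:07:WARNING:WU01:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
#   09:59:43:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
RETURN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:)?(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<name>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def core_returns(lines):
    """Return (time, wu, slot, name, code) for every 'FahCore returned:' line."""
    results = []
    for line in lines:
        m = RETURN_RE.match(line.strip())
        if m:
            results.append(
                (m["time"], m["wu"], m["slot"], m["name"], int(m["code"]))
            )
    return results


def abnormal_returns(lines, ok_names=("FINISHED_UNIT",)):
    """Keep only returns whose name is not in ok_names (an assumed allow-list)."""
    return [r for r in core_returns(lines) if r[3] not in ok_names]
```

Feeding it the contents of the client log (for example `abnormal_returns(open(path))`, where `path` is wherever your log lives) gives a list of the stalls or crashes it found.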

Re: WU stalled...after it was done?

Posted: Tue Apr 21, 2020 10:20 am
by PantherX
Welcome back thecomputerdude,

Do you have any anti-virus/anti-malware/anti-spyware software that could potentially "stall" the WU? If so, add exceptions to that security software. I am glad that the WU wasn't lost and that it managed to finish up. Also, it might be worth checking the HDD/SSD for any issues.

Re: WU stalled...after it was done?

Posted: Tue Apr 21, 2020 10:34 am
by thecomputerdude
PantherX wrote: Welcome back thecomputerdude,

Do you have any anti-virus/anti-malware/anti-spyware software that could potentially "stall" the WU? If so, add exceptions to that security software. I am glad that the WU wasn't lost and that it managed to finish up. Also, it might be worth checking the HDD/SSD for any issues.
No AV software installed, running Windows Server 2016 Datacenter (with nothing extra configured over a normal desktop instance). Lenovo IMM is reporting all good on I/O with no controller or drive (Micron S630DC) errors.

And thanks for the welcome, it's been so long that I didn't realize I had an account until I went to make an account again :lol:

Re: WU stalled...after it was done?

Posted: Wed Apr 22, 2020 4:23 am
by PantherX
If this was the first time it happened, I wouldn't worry about it too much. However, if it happens again, then it might be something to investigate :)

Re: WU stalled...after it was done?

Posted: Wed Apr 22, 2020 4:34 am
by bruce
PantherX wrote: I am glad that the WU wasn't lost and it managed to finish up.
I am too. This used to be a bug and I complained about it. Apparently it has been fixed.

If the last checkpoint was at, say, 98% and the WU progressed to 100%, it immediately went into the code that compresses, zips, signs and writes the results for uploading. If an error occurred during those steps, the WU was LOST, because the 98% checkpoint had already been deleted. It looks like maybe somebody rewrote that part of the process so that a recovery is possible.
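
To illustrate the ordering issue, here is a minimal sketch of a finalization step that keeps the checkpoint until the results are safely written. This is not the actual FahCore code, which I haven't seen. `finalize_unit`, `build_results` and `results.bin` are hypothetical names, and `state.cpt` is just borrowed from the mdrun command line in the log above.

```python
import os
import tempfile


def finalize_unit(work_dir, build_results,
                  checkpoint_name="state.cpt", results_name="results.bin"):
    """Write results first; delete the checkpoint only after that succeeds."""
    checkpoint = os.path.join(work_dir, checkpoint_name)
    final_path = os.path.join(work_dir, results_name)

    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=work_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            # build_results stands in for compress/zip/sign; it may raise.
            tmp.write(build_results())
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        # Packaging failed: discard the partial file, keep the checkpoint
        # so a restart can resume from it.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

    # Results are on disk, so the checkpoint is no longer needed.
    if os.path.exists(checkpoint):
        os.remove(checkpoint)
    return final_path
```

With this ordering, a failure while packaging leaves the checkpoint in place, so a later pause/restart can pick the WU up again instead of losing it.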