WU crashed, system wouldn't re-start it. Why?

danielk
Posts: 8
Joined: Sun Oct 27, 2019 1:10 pm

WU crashed, system wouldn't re-start it. Why?

Post by danielk »

I had been running a WU for more than 30 hours, and it was more than 50% complete when it suddenly crashed. Why didn't the system try to recover the process from the checkpoint? Instead, it started a new WU and I lost everything.

Is there a way to complete the old WU? Will the incomplete WU affect my bonus?

Here's the log from the moments before it crashed:

Code:

05:50:33:WU00:FS00:0xa7:Completed 1131461 out of 2500000 steps (45%)
06:15:20:WU00:FS00:0xa7:Completed 1150000 out of 2500000 steps (46%)
06:47:47:WU00:FS00:0xa7:Completed 1175000 out of 2500000 steps (47%)
07:19:39:WU00:FS00:0xa7:Completed 1200000 out of 2500000 steps (48%)
07:51:33:WU00:FS00:0xa7:Completed 1225000 out of 2500000 steps (49%)
08:23:34:WU00:FS00:0xa7:Completed 1250000 out of 2500000 steps (50%)
08:55:04:WU00:FS00:0xa7:Completed 1275000 out of 2500000 steps (51%)
09:26:48:WU00:FS00:0xa7:Completed 1300000 out of 2500000 steps (52%)
09:38:09:WARNING:WU00:FS00:FahCore returned an unknown error code which probably indicates that it crashed
09:38:09:WARNING:WU00:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
09:38:09:WU00:FS00:Starting
09:38:09:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Dido\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 705 -lifeline 3480 -checkpoint 8 -np 3
09:38:09:WU00:FS00:Started FahCore on PID 5908
09:38:09:WU00:FS00:Core PID:5460
09:38:09:WU00:FS00:FahCore 0xa7 started
09:38:10:WU00:FS00:0xa7:*********************** Log Started 2019-10-31T09:38:10Z ***********************
09:38:10:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
09:38:10:WU00:FS00:0xa7:       Type: 0xa7
09:38:10:WU00:FS00:0xa7:       Core: Gromacs
09:38:10:WU00:FS00:0xa7:    Website: https://foldingathome.org/
09:38:10:WU00:FS00:0xa7:  Copyright: (c) 2009-2018 foldingathome.org
09:38:10:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
09:38:10:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 5908 -checkpoint 8 -np 3
09:38:10:WU00:FS00:0xa7:     Config: <none>
09:38:10:WU00:FS00:0xa7:************************************ Build *************************************
09:38:10:WU00:FS00:0xa7:    Version: 0.0.17
09:38:10:WU00:FS00:0xa7:       Date: Apr 27 2018
09:38:10:WU00:FS00:0xa7:       Time: 16:19:35
09:38:10:WU00:FS00:0xa7: Repository: Git
09:38:10:WU00:FS00:0xa7:   Revision: 21359963583d09ec2063ef946399441c4df4ccd7
09:38:10:WU00:FS00:0xa7:     Branch: master
09:38:10:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
09:38:10:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:10:WU00:FS00:0xa7:   Platform: win32 10
09:38:10:WU00:FS00:0xa7:       Bits: 64
09:38:10:WU00:FS00:0xa7:       Mode: Release
09:38:10:WU00:FS00:0xa7:       SIMD: sse2
09:38:10:WU00:FS00:0xa7:************************************ System ************************************
09:38:10:WU00:FS00:0xa7:        CPU: Unknown
09:38:10:WU00:FS00:0xa7:     CPU ID: 
09:38:10:WU00:FS00:0xa7:       CPUs: 4
09:38:10:WU00:FS00:0xa7:     Memory: 3.89GiB
09:38:10:WU00:FS00:0xa7:Free Memory: 1.78GiB
09:38:10:WU00:FS00:0xa7:    Threads: WINDOWS_THREADS
09:38:10:WU00:FS00:0xa7: OS Version: 6.1
09:38:10:WU00:FS00:0xa7:Has Battery: false
09:38:10:WU00:FS00:0xa7: On Battery: false
09:38:10:WU00:FS00:0xa7: UTC Offset: 2
09:38:10:WU00:FS00:0xa7:        PID: 5460
09:38:10:WU00:FS00:0xa7:        CWD: C:\Users\Dido\AppData\Roaming\FAHClient\work
09:38:10:WU00:FS00:0xa7:         OS: Windows 7 Ultimate Service Pack 1
09:38:10:WU00:FS00:0xa7:    OS Arch: AMD64
09:38:10:WU00:FS00:0xa7:********************************************************************************
09:38:10:WU00:FS00:0xa7:Project: 14182 (Run 21, Clone 166, Gen 10)
09:38:10:WU00:FS00:0xa7:Unit: 0x0000000d0002894b5cf6845f507cc0b8
09:38:10:WU00:FS00:0xa7:Digital signatures verified
09:38:10:WU00:FS00:0xa7:Calling: mdrun -s frame10.tpr -o frame10.trr -cpi state.cpt -cpt 8 -nt 3
09:38:10:WU00:FS00:0xa7:Steps: first=25000000 total=2500000
09:38:13:WU00:FS00:0xa7:Completed 1306181 out of 2500000 steps (52%)
10:07:49:WU00:FS00:0xa7:Completed 1325000 out of 2500000 steps (53%)
10:33:26:WARNING:WU00:FS00:FahCore returned an unknown error code which probably indicates that it crashed
10:33:26:WARNING:WU00:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
10:33:26:WARNING:WU00:FS00:Too many errors, failing
10:33:27:WU00:FS00:Sending unit results: id:00 state:SEND error:FAILED project:14182 run:21 clone:166 gen:10 core:0xa7 unit:0x0000000d0002894b5cf6845f507cc0b8
10:33:27:WU00:FS00:Connecting to 155.247.166.219:8080
10:33:27:WU01:FS00:Connecting to 65.254.110.245:8080
10:33:27:WU00:FS00:Server responded WORK_ACK (400)
10:33:27:WU00:FS00:Cleaning up
10:33:28:WU01:FS00:Assigned to work server 155.247.166.219
10:33:28:WU01:FS00:Requesting new work unit for slot 00: READY cpu:3 from 155.247.166.219
10:33:28:WU01:FS00:Connecting to 155.247.166.219:8080
10:33:29:WU01:FS00:Downloading 1.36MiB
10:33:31:WU01:FS00:Download complete
10:33:32:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:14090 run:0 clone:37 gen:4 core:0xa7 unit:0x000000060002894b5dacdaef009e6ce0
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: WU crashed, system wouldn't re-start it. Why?

Post by bruce »

First of all, the failure of a single WU will not affect your bonus status. And no, you cannot complete that WU. It has already failed too many times and you really don't want the system to keep retrying and failing again and again. The system has dumped it. It has already been assigned to someone else.

Second, there's not enough information to know why the WU crashed. The message says that it stalled, which could point to a hardware issue, corrupt data on your disk, or overclocking/overheating (another form of hardware problem).

10:33:26:WARNING:WU00:FS00:Too many errors, failing
means that there were earlier failures of the same WU. What were the other messages?
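
To check a longer log for those earlier failures, something like the following could help. It is only a rough sketch: the pattern is based solely on the warning lines quoted in this thread, the function names are made up for illustration, and none of this is part of FAHClient.

```python
import re
from collections import Counter

# Pattern modeled on the warning lines quoted in this thread (an assumption;
# other client versions may format these lines differently).
FAILURE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<code>\w+) \((?P<num>\d+) = 0x[0-9a-fA-F]+\)"
)


def find_core_failures(log_text):
    """Return (time, wu, slot, code) for every 'FahCore returned' warning."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            failures.append(
                (match["time"], match["wu"], match["slot"], match["code"])
            )
    return failures


def count_failures(failures):
    """Count failures per (wu, slot, code) combination."""
    return Counter((wu, slot, code) for _, wu, slot, code in failures)


# Lines copied from the log posted above.
SAMPLE = """\
09:26:48:WU00:FS00:0xa7:Completed 1300000 out of 2500000 steps (52%)
09:38:09:WARNING:WU00:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
09:38:09:WU00:FS00:Starting
10:07:49:WU00:FS00:0xa7:Completed 1325000 out of 2500000 steps (53%)
10:33:26:WARNING:WU00:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
10:33:26:WARNING:WU00:FS00:Too many errors, failing
"""

if __name__ == "__main__":
    for key, n in count_failures(find_core_failures(SAMPLE)).items():
        print(key, n)
```

Run against the full log file rather than the short sample, this would list every core failure for the WU, including any that happened before the posted excerpt begins.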
danielk
Posts: 8
Joined: Sun Oct 27, 2019 1:10 pm

Re: WU crashed, system wouldn't re-start it. Why?

Post by danielk »

It seems like I have been credited for the crashed WU. Before that I had two units completed, now the system stats say I have finished 3 jobs. The crashed one was my third job overall. The points seem low though - 333, but I didn't finish the WU anyway, so it seems fair.

Is it a common thing for crashed jobs to be reported as completed by the system?
Joe_H
Site Admin
Posts: 7868
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: WU crashed, system wouldn't re-start it. Why?

Post by Joe_H »

It is reported as returned with an error. Partial credit is an option that is selectable by the researcher who is running that project. The WU will be assigned to others. If it fails for others too, then after a set number of failures that sequence of Run and Clone WUs will be stopped at that Gen number.

As for common, by design the client does return a WU that has failed for whatever reason. The sooner the WS gets one back that has failed, the sooner it can be assigned to others to see if that was a one-time problem or not. Otherwise the server waits until the preferred deadline before reassigning a WU. In the case of this particular project, that is 10.3 days after being assigned.

The actual report in the database is returned, but faulty:
Hi danielkostov (team 0), Your WU (P14182 R21 C166 G10) was added to the stats database on Invalid Date for 332.98 points of credit.
(ignore the "invalid date", that is a bug in the query to the database that will eventually get fixed)

Full base credit for that project is 2400, so partial credit was given.
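
For a sense of scale, here is a quick calculation using the two numbers above. The helper name is hypothetical, and this only computes the ratio; how the project actually decides the partial credit amount is not stated in this thread.

```python
def credit_fraction(awarded, base):
    """Return awarded credit as a fraction of the full base credit."""
    if base <= 0:
        raise ValueError("base credit must be positive")
    return awarded / base


# Numbers quoted above: 332.98 points awarded, 2400 points full base credit.
fraction = credit_fraction(332.98, 2400)
print(f"{fraction:.1%}")  # prints "13.9%"
```

So roughly 14% of the base credit was awarded, while the last progress line in the posted log showed 53% of the steps completed.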
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: WU crashed, system wouldn't re-start it. Why?

Post by bruce »

bruce wrote: What were the other messages?
... and are you overclocking?