Core won't start, no error displayed

If you think it might be a driver problem, see viewforum.php?f=79

frest1
Posts: 15
Joined: Wed Mar 25, 2020 1:57 pm

Re: Core won't start, no error displayed

Post by frest1 »

In addition to the log from the Z2 machine above, I found the setPositions() error in the recent logs of three pretty much identical HP Z240 workstations (i7-7700K and Quadro P4000). It seems to happen mostly with 144XX units.
Do you think the setPositions() error in one WU could cause a following WU to get stuck at 0,00%? I think there have typically been some messages about failed WUs before the WU that got stuck.

Z240 #1:

07:08:29:WU00:FS00:0x22:Project: 14460 (Run 0, Clone 165, Gen 64)
07:08:29:WU00:FS00:0x22:Unit: 0x0000006403854c135eb39e44dcb150f3
07:08:29:WU00:FS00:0x22:Reading tar file core.xml
07:08:29:WU00:FS00:0x22:Reading tar file integrator.xml
07:08:29:WU00:FS00:0x22:Reading tar file state.xml
07:08:29:WU00:FS00:0x22:Reading tar file system.xml
07:08:30:WU00:FS00:0x22:Digital signatures verified
07:08:30:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
07:08:30:WU00:FS00:0x22:Version 0.0.10
07:08:30:WU00:FS00:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
07:08:30:WU00:FS00:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
07:08:30:WU00:FS00:0x22:  XTC frame write interval: 20000 steps (1%) [100 total]
07:08:30:WU00:FS00:0x22:  Global context and integrator variables write interval: disabled
07:08:35:WU01:FS00:Upload 0.08%
07:08:39:WU00:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
07:08:39:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
07:08:39:WU00:FS00:0x22:Saving result file science.log
07:08:39:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
07:08:40:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
07:08:40:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14460 run:0 clone:165 gen:64 core:0x22 unit:0x0000006403854c135eb39e44dcb150f3


12:45:30:WU00:FS00:0x22:Project: 14458 (Run 0, Clone 627, Gen 42)
12:45:30:WU00:FS00:0x22:Unit: 0x0000003f03854c135eb39a2e511d0fb5
12:45:30:WU00:FS00:0x22:Reading tar file core.xml
12:45:30:WU00:FS00:0x22:Reading tar file integrator.xml
12:45:30:WU00:FS00:0x22:Reading tar file state.xml
12:45:30:WU00:FS00:0x22:Reading tar file system.xml
12:45:31:WU00:FS00:0x22:Digital signatures verified
12:45:31:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
12:45:31:WU00:FS00:0x22:Version 0.0.11
12:45:31:WU00:FS00:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
12:45:31:WU00:FS00:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
12:45:31:WU00:FS00:0x22:  XTC frame write interval: 20000 steps (1%) [100 total]
12:45:31:WU00:FS00:0x22:  Global context and integrator variables write interval: disabled
12:45:41:WU00:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
12:45:41:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
12:45:41:WU00:FS00:0x22:Saving result file science.log
12:45:41:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
12:45:41:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:45:41:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14458 run:0 clone:627 gen:42 core:0x22 unit:0x0000003f03854c135eb39a2e511d0fb5


23:05:39:WU00:FS00:0x22:Project: 14440 (Run 0, Clone 820, Gen 10)
23:05:39:WU00:FS00:0x22:Unit: 0x0000001a03854c135ea0a3018b606192
23:05:39:WU00:FS00:0x22:Reading tar file core.xml
23:05:39:WU00:FS00:0x22:Reading tar file integrator.xml
23:05:39:WU00:FS00:0x22:Reading tar file state.xml
23:05:39:WU00:FS00:0x22:Reading tar file system.xml
23:05:40:WU00:FS00:0x22:Digital signatures verified
23:05:40:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
23:05:40:WU00:FS00:0x22:Version 0.0.11
23:05:40:WU00:FS00:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
23:05:40:WU00:FS00:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
23:05:40:WU00:FS00:0x22:  XTC frame write interval: 20000 steps (1%) [100 total]
23:05:40:WU00:FS00:0x22:  Global context and integrator variables write interval: disabled
23:05:49:WU00:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
23:05:49:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
23:05:49:WU00:FS00:0x22:Saving result file science.log
23:05:49:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
23:05:49:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:05:49:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14440 run:0 clone:820 gen:10 core:0x22 unit:0x0000001a03854c135ea0a3018b606192

Z240 #2:

05:29:38:WU01:FS00:0x22:Project: 14456 (Run 0, Clone 1913, Gen 16)
05:29:38:WU01:FS00:0x22:Unit: 0x0000001b03854c135eb39a2f73e93baf
05:29:38:WU01:FS00:0x22:Reading tar file core.xml
05:29:38:WU01:FS00:0x22:Reading tar file integrator.xml
05:29:38:WU01:FS00:0x22:Reading tar file state.xml
05:29:38:WU01:FS00:0x22:Reading tar file system.xml
05:29:39:WU01:FS00:0x22:Digital signatures verified
05:29:39:WU01:FS00:0x22:Folding@home GPU Core22 Folding@home Core
05:29:39:WU01:FS00:0x22:Version 0.0.10
05:29:39:WU01:FS00:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
05:29:39:WU01:FS00:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
05:29:39:WU01:FS00:0x22:  XTC frame write interval: 20000 steps (1%) [100 total]
05:29:39:WU01:FS00:0x22:  Global context and integrator variables write interval: disabled
05:29:49:WU01:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
05:29:49:WU01:FS00:0x22:Saving result file ..\logfile_01.txt
05:29:49:WU01:FS00:0x22:Saving result file science.log
05:29:49:WU01:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
05:29:50:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
05:29:50:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:14456 run:0 clone:1913 gen:16 core:0x22 unit:0x0000001b03854c135eb39a2f73e93baf


05:30:56:WU02:FS00:0x22:Project: 14456 (Run 0, Clone 1260, Gen 15)
05:30:56:WU02:FS00:0x22:Unit: 0x0000001e03854c135eb39a2f84cfa39b
05:30:56:WU02:FS00:0x22:Reading tar file core.xml
05:30:56:WU02:FS00:0x22:Reading tar file integrator.xml
05:30:56:WU02:FS00:0x22:Reading tar file state.xml
05:30:56:WU02:FS00:0x22:Reading tar file system.xml
05:30:57:WU02:FS00:0x22:Digital signatures verified
05:30:57:WU02:FS00:0x22:Folding@home GPU Core22 Folding@home Core
05:30:57:WU02:FS00:0x22:Version 0.0.10
05:30:57:WU02:FS00:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
05:30:57:WU02:FS00:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
05:30:57:WU02:FS00:0x22:  XTC frame write interval: 20000 steps (1%) [100 total]
05:30:57:WU02:FS00:0x22:  Global context and integrator variables write interval: disabled
05:31:07:WU02:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
05:31:07:WU02:FS00:0x22:Saving result file ..\logfile_01.txt
05:31:07:WU02:FS00:0x22:Saving result file science.log
05:31:07:WU02:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
05:31:07:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
05:31:07:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY project:14456 run:0 clone:1260 gen:15 core:0x22 unit:0x0000001e03854c135eb39a2f84cfa39b

Z240 #3:

04:40:03:WU01:FS00:0x22:Project: 14442 (Run 0, Clone 876, Gen 66)
04:40:03:WU01:FS00:0x22:Unit: 0x0000007003854c135ea0a2fff093b28d
04:40:03:WU01:FS00:0x22:Reading tar file core.xml
04:40:03:WU01:FS00:0x22:Reading tar file integrator.xml
04:40:03:WU01:FS00:0x22:Reading tar file state.xml
04:40:03:WU01:FS00:0x22:Reading tar file system.xml
04:40:04:WU01:FS00:0x22:Digital signatures verified
04:40:04:WU01:FS00:0x22:Folding@home GPU Core22 Folding@home Core
04:40:04:WU01:FS00:0x22:Version 0.0.10
04:40:04:WU01:FS00:0x22:  Checkpoint write interval: 100000 steps (5%) [20 total]
04:40:04:WU01:FS00:0x22:  JSON viewer frame write interval: 20000 steps (1%) [100 total]
04:40:04:WU01:FS00:0x22:  XTC frame write interval: 20000 steps (1%) [100 total]
04:40:04:WU01:FS00:0x22:  Global context and integrator variables write interval: disabled
04:40:14:WU01:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
04:40:14:WU01:FS00:0x22:Saving result file ..\logfile_01.txt
04:40:14:WU01:FS00:0x22:Saving result file science.log
04:40:14:WU01:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
04:40:15:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
04:40:15:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:14442 run:0 clone:876 gen:66 core:0x22 unit:0x0000007003854c135ea0a2fff093b28d
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

The messages
23:05:49:WU00:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions
and
23:05:49:WU00:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT

look like error messages to me.

This has been tracked down to an improperly constructed project and has reportedly been corrected. You should be able to move on.
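
For anyone who wants to pull these failures out of a longer log, here is a minimal Python sketch. It assumes the "Sending unit results" line always has the layout seen in the logs posted earlier in this thread. The names `RESULT_RE` and `faulty_units` are made up for illustration and are not part of FAHClient.

```python
import re

# Pattern for the client's "Sending unit results" line, based only on the
# log excerpts posted in this thread (an assumption, not an official format).
RESULT_RE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d):WU(?P<wu>\d+):FS(?P<fs>\d+):Sending unit results: "
    r".*?error:(?P<error>\w+) project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def faulty_units(lines):
    """Return (time, project, run, clone, gen) for each unit sent back as FAULTY."""
    found = []
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m and m["error"] == "FAULTY":
            found.append(
                (m["time"], int(m["project"]), int(m["run"]),
                 int(m["clone"]), int(m["gen"]))
            )
    return found


# Three lines copied from the Z240 #1 log above.
sample = [
    "07:08:39:WU00:FS00:0x22:ERROR:exception: Called setPositions() on a Context with the wrong number of positions",
    "07:08:40:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "07:08:40:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14460 run:0 clone:165 gen:64 core:0x22 unit:0x0000006403854c135eb39e44dcb150f3",
]

for entry in faulty_units(sample):
    print(entry)
```

To run it against a real log, pass the open file object instead of `sample`; the function only needs something it can iterate over line by line.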
frest1
Posts: 15
Joined: Wed Mar 25, 2020 1:57 pm

Post by frest1 »

Resurrecting this thread from a couple of weeks back to give a status update:

Out of the 6 clients I have folding, this "stuck at 0,00%" still comes up maybe once every two days, so the problem is clearly still there. I haven't really had time to browse or collect any logs. I did look once, just long enough to see that it was again stuck at the xml part.

So maybe I'll just continue the process of deleting the slot, waiting for the core to be killed and then adding the slot back. Or are there any troubleshooting procedures I could use to get to the root cause?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

I suggest you collect the PRCG numbers where this happens. Maybe somebody can find a pattern to the projects?
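
One way to collect the PRCG numbers without reading the whole log by hand is sketched below. It assumes the core prints a "Project: P (Run R, Clone C, Gen G)" line when a WU starts, as in the logs earlier in this thread, and that the last such line for a slot belongs to the WU that is stuck. `PROJECT_RE` and `last_started` are hypothetical names used only for this example.

```python
import re

# Pattern for the core's "Project:" line, based on the log excerpts in this
# thread. The core id (0x22 there) is matched as any hex value.
PROJECT_RE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d):WU(?P<wu>\d+):FS(?P<fs>\d+):0x[0-9a-fA-F]+:Project: "
    r"(?P<project>\d+) \(Run (?P<run>\d+), Clone (?P<clone>\d+), Gen (?P<gen>\d+)\)"
)


def last_started(lines):
    """Map each folding slot (e.g. 'FS00') to the last PRCG it started."""
    latest = {}
    for line in lines:
        m = PROJECT_RE.match(line.strip())
        if m:
            slot = "FS" + m["fs"]
            # Later lines overwrite earlier ones, so the newest start wins.
            latest[slot] = (
                m["time"], int(m["project"]), int(m["run"]),
                int(m["clone"]), int(m["gen"]),
            )
    return latest


# Two lines copied from the Z240 #2 log earlier in this thread.
sample = [
    "05:29:38:WU01:FS00:0x22:Project: 14456 (Run 0, Clone 1913, Gen 16)",
    "05:30:56:WU02:FS00:0x22:Project: 14456 (Run 0, Clone 1260, Gen 15)",
]

for slot, prcg in last_started(sample).items():
    print(slot, prcg)
```

If a slot is sitting at 0,00%, the entry reported for that slot is the PRCG worth posting here.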
frest1
Posts: 15
Joined: Wed Mar 25, 2020 1:57 pm

Post by frest1 »

Here are a few of the latest ones:

PRCG 17200 (682, 3, 63) -- (inactive, but 17204 is starting early testing)
PRCG 11761 (0, 10454, 140) -- 128.252.203.10
PRCG 11759 (0, 12396, 99) -- 128.252.203.10

EDIT July 22nd: Today I got:
PRCG 11760 (0, 743, 110)

EDIT July 25th:
PRCG 11760 (0, 13115, 32)

EDIT July 26th:
PRCG 11760 (0, 8763, 149)

EDIT July 27th:
PRCG 11759 (0, 6681, 91)
And as a side note, I think all of these have been in Quadro RTX 4000 (TU104GL) machines, have not seen any incidents on Q P4000 (GP104GL) lately.

EDIT July 28th:
PRCG 11760 (0, 14483, 54)
Last edited by frest1 on Tue Jul 28, 2020 10:53 am, edited 3 times in total.
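
To look for a pattern in the list above, the reports can be tallied by project number. This is only a rough sketch: the tuples are the (project, run, clone, gen) values listed in this post, and the counts say nothing about how many WUs from each project were assigned in total, so they are not failure rates.

```python
from collections import Counter

# (project, run, clone, gen) values copied from the PRCG list above.
reports = [
    (17200, 682, 3, 63),
    (11761, 0, 10454, 140),
    (11759, 0, 12396, 99),
    (11760, 0, 743, 110),
    (11760, 0, 13115, 32),
    (11760, 0, 8763, 149),
    (11759, 0, 6681, 91),
    (11760, 0, 14483, 54),
]

# Count how many stuck WUs were reported for each project.
by_project = Counter(project for project, _run, _clone, _gen in reports)

for project, count in by_project.most_common():
    print(f"project {project}: {count} stuck WU(s)")
```

With the reports so far, project 11760 appears four times, 11759 twice, and 17200 and 11761 once each.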
frest1
Posts: 15
Joined: Wed Mar 25, 2020 1:57 pm

Post by frest1 »

Added a few more PRCGs to the message above.
frest1
Posts: 15
Joined: Wed Mar 25, 2020 1:57 pm

Post by frest1 »

After some days with no instances:

PRCG 16448 (0, 16, 178)
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

What happens to Optimus if you reconfigure Windows to disable the Intel iGPU? It's not clear whether you're spending hours trying to fold with an unsupported GPU.

Our usual answer is to adjust your BIOS so the nVidia GPU runs 24x7.

FAH cannot fold if your computer switches to power-saving mode.