
Project 13408 multiple failures

Posted: Mon Jun 01, 2020 4:16 am
by Nuitari
I've noticed a series of failures on project 13408. All but one occurred right as the WU started processing.
All were done on various RX 570s, but on the same computer:

Code:

[b]project:13408 run:445 clone:3 gen:0  [/b]
11:12:43:WU06:FS03:0x22:ERROR:Force RMSE error of 11.6803 with threshold of 5

Code:

[b]project:13408 run:572 clone:3 gen:0[/b] (the only one with progress)
11:50:01:WU07:FS07:0x22:Completed 60000 out of 1000000 steps (6%)
11:51:42:WU07:FS07:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:51:42:WU07:FS07:0x22:Following exception occured: Particle coordinate is nan
11:52:54:WU07:FS07:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:52:54:WU07:FS07:0x22:Following exception occured: Particle coordinate is nan
11:53:55:WU07:FS07:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:53:55:WU07:FS07:0x22:Following exception occured: Particle coordinate is nan
11:53:56:WU07:FS07:0x22:ERROR:114: Max Retries Reached
11:53:56:WU07:FS07:0x22:Saving result file ../logfile_01.txt
11:53:56:WU07:FS07:0x22:Saving result file badstate-0.xml
11:53:56:WU07:FS07:0x22:Saving result file badstate-1.xml
11:53:56:WU07:FS07:0x22:Saving result file badstate-2.xml
11:53:56:WU07:FS07:0x22:Saving result file checkpt.crc
11:53:56:WU07:FS07:0x22:Saving result file globals.csv
11:53:56:WU07:FS07:0x22:Saving result file science.log
11:53:56:WU07:FS07:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
11:53:57:WARNING:WU07:FS07:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:53:57:WU07:FS07:Sending unit results: id:07 state:SEND error:FAULTY project:13408 run:572 clone:3 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3dab64d0b3f

Code:

[b]project:13408 run:232 clone:6 gen:0[/b] 
16:06:31:WU16:FS06:0x22:ERROR:Force RMSE error of 39.9426 with threshold of 5
16:06:31:WU16:FS06:0x22:Saving result file ../logfile_01.txt
16:06:31:WU16:FS06:0x22:Saving result file science.log
16:06:31:WU16:FS06:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT

Code:

[b]project:13408 run:160 clone:7 gen:0[/b]
17:56:43:WU00:FS00:0x22:ERROR:Force RMSE error of 227.74 with threshold of 5

Code:

[b]project:13408 run:429 clone:8 gen:0[/b]  (reported faulty by 4 others)
22:07:01:WU02:FS04:0x22:ERROR:Potential energy error of 263.948, threshold of 10
22:07:01:WU02:FS04:0x22:ERROR:Reference Potential Energy: -1.22761e+06 | Given Potential Energy: -1.22787e+06
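
As a quick sanity check on that last log, the two printed energies can be compared by hand. This is only a sketch: it assumes the reported error is the absolute difference between the reference and given potential energies, which the log does not state outright, and `exceeds_threshold` is a hypothetical helper name. The log prints the energies to six significant figures, so the difference of the printed values (260) only approximates the reported 263.948.

```python
def exceeds_threshold(reference, given, threshold):
    """Return (difference, failed), assuming the check is a plain absolute difference."""
    difference = abs(reference - given)
    return difference, difference > threshold

# Values copied from the log lines above (printed to six significant figures).
reference_energy = -1.22761e+06
given_energy = -1.22787e+06
threshold = 10.0

difference, failed = exceeds_threshold(reference_energy, given_energy, threshold)
print(f"difference: {difference:.0f}, over threshold: {failed}")
```

Under that assumption the difference is far above the threshold of 10, which is consistent with the core rejecting the work unit.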
Overall, on that one rig I see the following for project 13408:

Code:

11:12:44:WU06:FS03:Sending unit results: id:06 state:SEND error:FAULTY project:13408 run:445 clone:3 gen:0 core:0x22 unit:0x0000000112bc7d9a5ed2f3da4f5303d1
11:53:57:WU07:FS07:Sending unit results: id:07 state:SEND error:FAULTY project:13408 run:572 clone:3 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3dab64d0b3f
16:06:31:WU16:FS06:Sending unit results: id:16 state:SEND error:FAULTY project:13408 run:232 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3db7f1ea005
17:56:43:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13408 run:160 clone:7 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3db707ea2fa
19:53:05:WU11:FS07:Sending unit results: id:11 state:SEND error:NO_ERROR project:13408 run:169 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db249fdbd0
22:06:06:WU12:FS04:Sending unit results: id:12 state:SEND error:NO_ERROR project:13408 run:465 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3daedc0e6d2
22:07:02:WU02:FS04:Sending unit results: id:02 state:SEND error:FAULTY project:13408 run:429 clone:8 gen:0 core:0x22 unit:0x0000000312bc7d9a5ed2f3da6c37b62e
00:32:53:WU17:FS06:Sending unit results: id:17 state:SEND error:NO_ERROR project:13408 run:127 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db7acd2acf
02:04:19:WU16:FS05:Sending unit results: id:16 state:SEND error:NO_ERROR project:13408 run:339 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3da8b7cd742
02:48:37:WU19:FS03:Sending unit results: id:19 state:SEND error:NO_ERROR project:13408 run:564 clone:4 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3da347f37bb
Other projects are working fine on that rig.
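
For anyone who wants to total these up from their own log, here is a minimal sketch that counts the result statuses per project. It assumes the "Sending unit results" lines look exactly like the ones pasted above; `tally_results` and `SAMPLE_LOG` are hypothetical names for illustration, not part of the client.

```python
import re
from collections import Counter

# Pattern for the client's "Sending unit results" lines, as seen in the log above.
RESULT_RE = re.compile(
    r"Sending unit results: .*?error:(?P<error>\w+) project:(?P<project>\d+)"
)

def tally_results(log_text, project):
    """Count result statuses (e.g. FAULTY, NO_ERROR) for a single project."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESULT_RE.search(line)
        if match and int(match.group("project")) == project:
            counts[match.group("error")] += 1
    return counts

# The ten result lines from the log above, copied verbatim.
SAMPLE_LOG = """\
11:12:44:WU06:FS03:Sending unit results: id:06 state:SEND error:FAULTY project:13408 run:445 clone:3 gen:0 core:0x22 unit:0x0000000112bc7d9a5ed2f3da4f5303d1
11:53:57:WU07:FS07:Sending unit results: id:07 state:SEND error:FAULTY project:13408 run:572 clone:3 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3dab64d0b3f
16:06:31:WU16:FS06:Sending unit results: id:16 state:SEND error:FAULTY project:13408 run:232 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3db7f1ea005
17:56:43:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13408 run:160 clone:7 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3db707ea2fa
19:53:05:WU11:FS07:Sending unit results: id:11 state:SEND error:NO_ERROR project:13408 run:169 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db249fdbd0
22:06:06:WU12:FS04:Sending unit results: id:12 state:SEND error:NO_ERROR project:13408 run:465 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3daedc0e6d2
22:07:02:WU02:FS04:Sending unit results: id:02 state:SEND error:FAULTY project:13408 run:429 clone:8 gen:0 core:0x22 unit:0x0000000312bc7d9a5ed2f3da6c37b62e
00:32:53:WU17:FS06:Sending unit results: id:17 state:SEND error:NO_ERROR project:13408 run:127 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db7acd2acf
02:04:19:WU16:FS05:Sending unit results: id:16 state:SEND error:NO_ERROR project:13408 run:339 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3da8b7cd742
02:48:37:WU19:FS03:Sending unit results: id:19 state:SEND error:NO_ERROR project:13408 run:564 clone:4 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3da347f37bb
"""

counts = tally_results(SAMPLE_LOG, 13408)
total = sum(counts.values())
print(f"FAULTY: {counts['FAULTY']} of {total}")  # prints "FAULTY: 5 of 10"
```

On the lines above, that works out to five faulty results out of ten for project 13408 on this rig.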

Re: Project 13408 multiple failures

Posted: Mon Jun 01, 2020 8:58 am
by PantherX
Please note that Projects 1340X are highly experimental, as they are working specifically on COVID-19 simulations/tasks. Error rates are expected to be higher, and these failures help the researchers understand what needs to be done to improve their research. You can carry on as normal, as this isn't an issue with your GPU.