Project 13408 multiple failures

Moderators: Site Moderators, FAHC Science Team

Nuitari
Posts: 80
Joined: Sun Jun 09, 2019 4:03 am
Hardware configuration: 1x Nvidia 1050ti
1x Nvidia 1660Super
1x Nvidia GTX 660
1x Nvidia 1060 3gb
1x AMD rx570
2x AMD rx560
1x AMD Ryzen 7 PRO 1700
1x AMD Ryzen 7 3700X
1x AMD Phenom II
1x AMD A8-9600
1x Intel i5-4590S

Project 13408 multiple failures

Post by Nuitari »

I've noticed a series of failures on project 13408. All but one occurred just as the WU started processing.
All occurred on various RX570s, but on the same computer:


project:13408 run:445 clone:3 gen:0
11:12:43:WU06:FS03:0x22:ERROR:Force RMSE error of 11.6803 with threshold of 5


project:13408 run:572 clone:3 gen:0 (the only one with progress)
11:50:01:WU07:FS07:0x22:Completed 60000 out of 1000000 steps (6%)
11:51:42:WU07:FS07:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:51:42:WU07:FS07:0x22:Following exception occured: Particle coordinate is nan
11:52:54:WU07:FS07:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:52:54:WU07:FS07:0x22:Following exception occured: Particle coordinate is nan
11:53:55:WU07:FS07:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
11:53:55:WU07:FS07:0x22:Following exception occured: Particle coordinate is nan
11:53:56:WU07:FS07:0x22:ERROR:114: Max Retries Reached
11:53:56:WU07:FS07:0x22:Saving result file ../logfile_01.txt
11:53:56:WU07:FS07:0x22:Saving result file badstate-0.xml
11:53:56:WU07:FS07:0x22:Saving result file badstate-1.xml
11:53:56:WU07:FS07:0x22:Saving result file badstate-2.xml
11:53:56:WU07:FS07:0x22:Saving result file checkpt.crc
11:53:56:WU07:FS07:0x22:Saving result file globals.csv
11:53:56:WU07:FS07:0x22:Saving result file science.log
11:53:56:WU07:FS07:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
11:53:57:WARNING:WU07:FS07:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:53:57:WU07:FS07:Sending unit results: id:07 state:SEND error:FAULTY project:13408 run:572 clone:3 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3dab64d0b3f


project:13408 run:232 clone:6 gen:0
16:06:31:WU16:FS06:0x22:ERROR:Force RMSE error of 39.9426 with threshold of 5
16:06:31:WU16:FS06:0x22:Saving result file ../logfile_01.txt
16:06:31:WU16:FS06:0x22:Saving result file science.log
16:06:31:WU16:FS06:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT


project:13408 run:160 clone:7 gen:0
17:56:43:WU00:FS00:0x22:ERROR:Force RMSE error of 227.74 with threshold of 5


project:13408 run:429 clone:8 gen:0 (reported faulty by 4 others)
22:07:01:WU02:FS04:0x22:ERROR:Potential energy error of 263.948, threshold of 10
22:07:01:WU02:FS04:0x22:ERROR:Reference Potential Energy: -1.22761e+06 | Given Potential Energy: -1.22787e+06

Overall, on that one rig, I see the following for project 13408:


11:12:44:WU06:FS03:Sending unit results: id:06 state:SEND error:FAULTY project:13408 run:445 clone:3 gen:0 core:0x22 unit:0x0000000112bc7d9a5ed2f3da4f5303d1
11:53:57:WU07:FS07:Sending unit results: id:07 state:SEND error:FAULTY project:13408 run:572 clone:3 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3dab64d0b3f
16:06:31:WU16:FS06:Sending unit results: id:16 state:SEND error:FAULTY project:13408 run:232 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3db7f1ea005
17:56:43:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13408 run:160 clone:7 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3db707ea2fa
19:53:05:WU11:FS07:Sending unit results: id:11 state:SEND error:NO_ERROR project:13408 run:169 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db249fdbd0
22:06:06:WU12:FS04:Sending unit results: id:12 state:SEND error:NO_ERROR project:13408 run:465 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3daedc0e6d2
22:07:02:WU02:FS04:Sending unit results: id:02 state:SEND error:FAULTY project:13408 run:429 clone:8 gen:0 core:0x22 unit:0x0000000312bc7d9a5ed2f3da6c37b62e
00:32:53:WU17:FS06:Sending unit results: id:17 state:SEND error:NO_ERROR project:13408 run:127 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db7acd2acf
02:04:19:WU16:FS05:Sending unit results: id:16 state:SEND error:NO_ERROR project:13408 run:339 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3da8b7cd742
02:48:37:WU19:FS03:Sending unit results: id:19 state:SEND error:NO_ERROR project:13408 run:564 clone:4 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3da347f37bb
Other projects are working fine on that rig.
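
For anyone who wants to summarize their own log the same way, here is a minimal Python sketch that tallies the "Sending unit results" lines by generation and outcome. The function name `tally_results` and the regular expression are illustrative assumptions based only on the line format shown above; they are not part of the FAH client.

```python
import re
from collections import Counter

# Assumed pattern: matches the fields of a "Sending unit results" line
# exactly as they appear in the log excerpt above.
RESULT_RE = re.compile(
    r"error:(?P<error>\w+) project:(?P<project>\d+) "
    r"run:(?P<run>\d+) clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def tally_results(log_text, project):
    """Count (gen, error) pairs for one project in a client log (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Sending unit results" not in line:
            continue
        m = RESULT_RE.search(line)
        if m and int(m["project"]) == project:
            counts[(int(m["gen"]), m["error"])] += 1
    return counts


# Three of the lines from the log above, copied verbatim.
SAMPLE = """\
11:12:44:WU06:FS03:Sending unit results: id:06 state:SEND error:FAULTY project:13408 run:445 clone:3 gen:0 core:0x22 unit:0x0000000112bc7d9a5ed2f3da4f5303d1
19:53:05:WU11:FS07:Sending unit results: id:11 state:SEND error:NO_ERROR project:13408 run:169 clone:3 gen:1 core:0x22 unit:0x0000000112bc7d9a5ed2f3db249fdbd0
02:04:19:WU16:FS05:Sending unit results: id:16 state:SEND error:NO_ERROR project:13408 run:339 clone:6 gen:0 core:0x22 unit:0x0000000012bc7d9a5ed2f3da8b7cd742
"""

if __name__ == "__main__":
    for (gen, error), n in sorted(tally_results(SAMPLE, 13408).items()):
        print(f"gen:{gen} {error}: {n}")
```

Counting the ten lines listed above by hand gives 5 FAULTY and 1 NO_ERROR at gen:0, and 4 NO_ERROR at gen:1. That is too few units to draw a firm conclusion from.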
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Project 13408 multiple failures

Post by PantherX »

Please note that the 1340X projects are highly experimental, as they work specifically on COVID-19 simulations. Error rates are expected to be higher, and these failures help the researchers understand what needs to be done to improve their research. You can carry on as normal, since this isn't an issue with your GPU.
