Work unit restarts at 0% if it gets paused past 99%

pyrocyborg
Posts: 36
Joined: Wed Sep 28, 2022 1:45 am
Hardware configuration: 3060 12GB, 3060 Ti, 3070, 3080, 3090, 6700 XT, 6800 XT, 6900 XT

Post by pyrocyborg »

Hi,

I've had the following issue a few times lately across multiple PCs (at least 5 or 6 of them, and only in the last week or two): the PC either hangs or won't upload the finished work unit (and the work unit was certified without error). After rebooting or doing a "Pause / Unpause" in F@H, the work unit restarts completely at 0%. It is the same work unit, but F@H forgets that it was already done and that the only thing left was to send it. F@H simply wastes a perfectly fine work unit that was already verified for errors and starts it over. This is happening across multiple PCs with different parts, most of them fairly recent. They were stable until one or two weeks ago.

In the work folders, I can see that all the viewframe.json files are there.

This is the log for the last completed unit that wasn't sent:

13:07:23:WU00:FS00:0x22:Project: 18601 (Run 10570, Clone 2, Gen 33)
13:07:23:WU00:FS00:0x22:Reading tar file core.xml
13:07:23:WU00:FS00:0x22:Reading tar file integrator.xml
13:07:23:WU00:FS00:0x22:Reading tar file state.xml.bz2
13:07:23:WU00:FS00:0x22:Reading tar file system.xml.bz2
13:07:23:WU00:FS00:0x22:Digital signatures verified
13:07:23:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
13:07:23:WU00:FS00:0x22:Version 0.0.20
13:07:23:WU00:FS00:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
13:07:23:WU00:FS00:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
13:07:23:WU00:FS00:0x22:  XTC frame write interval: 20000 steps (1.6%) [62 total]
13:07:23:WU00:FS00:0x22:  Global context and integrator variables write interval: disabled
13:07:23:WU00:FS00:0x22:There are 4 platforms available.
13:07:23:WU00:FS00:0x22:Platform 0: Reference
13:07:23:WU00:FS00:0x22:Platform 1: CPU
13:07:23:WU00:FS00:0x22:Platform 2: OpenCL
13:07:23:WU00:FS00:0x22:  opencl-device 0 specified
13:07:23:WU00:FS00:0x22:Platform 3: CUDA
13:07:23:WU00:FS00:0x22:  cuda-device 0 specified
13:07:27:WU01:FS00:Upload 88.85%
13:07:27:WU01:FS00:Upload complete
13:07:27:WU01:FS00:Server responded WORK_ACK (400)
13:07:27:WU01:FS00:Final credit estimate, 424143.00 points
13:07:27:WU01:FS00:Cleaning up
13:07:47:WU00:FS00:0x22:Attempting to create CUDA context:
13:07:47:WU00:FS00:0x22:  Configuring platform CUDA
13:07:59:WU00:FS00:0x22:  Using CUDA and gpu 0
13:07:59:WU00:FS00:0x22:Completed 0 out of 1250000 steps (0%)
13:08:02:WU00:FS00:0x22:Checkpoint completed at step 0
13:09:50:WU00:FS00:0x22:Completed 12500 out of 1250000 steps (1%)
13:11:39:WU00:FS00:0x22:Completed 25000 out of 1250000 steps (2%)
13:11:44:WU00:FS00:0x22:Checkpoint completed at step 25000
13:13:35:WU00:FS00:0x22:Completed 37500 out of 1250000 steps (3%)
13:15:21:WU00:FS00:0x22:Completed 50000 out of 1250000 steps (4%)
13:15:26:WU00:FS00:0x22:Checkpoint completed at step 50000
13:17:16:WU00:FS00:0x22:Completed 62500 out of 1250000 steps (5%)
13:19:07:WU00:FS00:0x22:Completed 75000 out of 1250000 steps (6%)
13:19:13:WU00:FS00:0x22:Checkpoint completed at step 75000
13:21:10:WU00:FS00:0x22:Completed 87500 out of 1250000 steps (7%)
13:23:10:WU00:FS00:0x22:Completed 100000 out of 1250000 steps (8%)
13:23:17:WU00:FS00:0x22:Checkpoint completed at step 100000
13:25:12:WU00:FS00:0x22:Completed 112500 out of 1250000 steps (9%)
13:27:08:WU00:FS00:0x22:Completed 125000 out of 1250000 steps (10%)
13:27:13:WU00:FS00:0x22:Checkpoint completed at step 125000
13:29:07:WU00:FS00:0x22:Completed 137500 out of 1250000 steps (11%)
13:31:00:WU00:FS00:0x22:Completed 150000 out of 1250000 steps (12%)
13:31:06:WU00:FS00:0x22:Checkpoint completed at step 150000
13:33:00:WU00:FS00:0x22:Completed 162500 out of 1250000 steps (13%)
13:34:57:WU00:FS00:0x22:Completed 175000 out of 1250000 steps (14%)
13:35:02:WU00:FS00:0x22:Checkpoint completed at step 175000
13:37:02:WU00:FS00:0x22:Completed 187500 out of 1250000 steps (15%)
13:39:02:WU00:FS00:0x22:Completed 200000 out of 1250000 steps (16%)
13:39:07:WU00:FS00:0x22:Checkpoint completed at step 200000
13:41:07:WU00:FS00:0x22:Completed 212500 out of 1250000 steps (17%)
13:43:03:WU00:FS00:0x22:Completed 225000 out of 1250000 steps (18%)
13:43:09:WU00:FS00:0x22:Checkpoint completed at step 225000
13:45:03:WU00:FS00:0x22:Completed 237500 out of 1250000 steps (19%)
13:46:52:WU00:FS00:0x22:Completed 250000 out of 1250000 steps (20%)
13:46:59:WU00:FS00:0x22:Checkpoint completed at step 250000
13:48:46:WU00:FS00:0x22:Completed 262500 out of 1250000 steps (21%)
13:50:35:WU00:FS00:0x22:Completed 275000 out of 1250000 steps (22%)
13:50:41:WU00:FS00:0x22:Checkpoint completed at step 275000
13:52:36:WU00:FS00:0x22:Completed 287500 out of 1250000 steps (23%)
13:54:25:WU00:FS00:0x22:Completed 300000 out of 1250000 steps (24%)
13:54:30:WU00:FS00:0x22:Checkpoint completed at step 300000
13:56:21:WU00:FS00:0x22:Completed 312500 out of 1250000 steps (25%)
13:58:13:WU00:FS00:0x22:Completed 325000 out of 1250000 steps (26%)
13:58:21:WU00:FS00:0x22:Checkpoint completed at step 325000
14:00:13:WU00:FS00:0x22:Completed 337500 out of 1250000 steps (27%)
14:02:04:WU00:FS00:0x22:Completed 350000 out of 1250000 steps (28%)
14:02:11:WU00:FS00:0x22:Checkpoint completed at step 350000
14:04:03:WU00:FS00:0x22:Completed 362500 out of 1250000 steps (29%)
14:05:53:WU00:FS00:0x22:Completed 375000 out of 1250000 steps (30%)
14:05:59:WU00:FS00:0x22:Checkpoint completed at step 375000
14:07:49:WU00:FS00:0x22:Completed 387500 out of 1250000 steps (31%)
14:09:40:WU00:FS00:0x22:Completed 400000 out of 1250000 steps (32%)
14:09:46:WU00:FS00:0x22:Checkpoint completed at step 400000
14:11:39:WU00:FS00:0x22:Completed 412500 out of 1250000 steps (33%)
14:13:28:WU00:FS00:0x22:Completed 425000 out of 1250000 steps (34%)
14:13:36:WU00:FS00:0x22:Checkpoint completed at step 425000
14:15:29:WU00:FS00:0x22:Completed 437500 out of 1250000 steps (35%)
14:17:22:WU00:FS00:0x22:Completed 450000 out of 1250000 steps (36%)
14:17:30:WU00:FS00:0x22:Checkpoint completed at step 450000
14:19:21:WU00:FS00:0x22:Completed 462500 out of 1250000 steps (37%)
14:21:12:WU00:FS00:0x22:Completed 475000 out of 1250000 steps (38%)
14:21:18:WU00:FS00:0x22:Checkpoint completed at step 475000
14:23:10:WU00:FS00:0x22:Completed 487500 out of 1250000 steps (39%)
14:25:00:WU00:FS00:0x22:Completed 500000 out of 1250000 steps (40%)
14:25:07:WU00:FS00:0x22:Checkpoint completed at step 500000
14:27:02:WU00:FS00:0x22:Completed 512500 out of 1250000 steps (41%)
14:28:54:WU00:FS00:0x22:Completed 525000 out of 1250000 steps (42%)
14:29:01:WU00:FS00:0x22:Checkpoint completed at step 525000
14:30:54:WU00:FS00:0x22:Completed 537500 out of 1250000 steps (43%)
14:32:49:WU00:FS00:0x22:Completed 550000 out of 1250000 steps (44%)
14:32:56:WU00:FS00:0x22:Checkpoint completed at step 550000
14:34:52:WU00:FS00:0x22:Completed 562500 out of 1250000 steps (45%)
14:36:48:WU00:FS00:0x22:Completed 575000 out of 1250000 steps (46%)
14:36:54:WU00:FS00:0x22:Checkpoint completed at step 575000
14:38:55:WU00:FS00:0x22:Completed 587500 out of 1250000 steps (47%)
14:40:56:WU00:FS00:0x22:Completed 600000 out of 1250000 steps (48%)
14:41:06:WU00:FS00:0x22:Checkpoint completed at step 600000
14:42:58:WU00:FS00:0x22:Completed 612500 out of 1250000 steps (49%)
14:44:47:WU00:FS00:0x22:Completed 625000 out of 1250000 steps (50%)
14:44:53:WU00:FS00:0x22:Checkpoint completed at step 625000
14:46:45:WU00:FS00:0x22:Completed 637500 out of 1250000 steps (51%)
14:48:36:WU00:FS00:0x22:Completed 650000 out of 1250000 steps (52%)
14:48:44:WU00:FS00:0x22:Checkpoint completed at step 650000
14:50:37:WU00:FS00:0x22:Completed 662500 out of 1250000 steps (53%)
14:52:26:WU00:FS00:0x22:Completed 675000 out of 1250000 steps (54%)
14:52:33:WU00:FS00:0x22:Checkpoint completed at step 675000
14:54:22:WU00:FS00:0x22:Completed 687500 out of 1250000 steps (55%)
14:56:11:WU00:FS00:0x22:Completed 700000 out of 1250000 steps (56%)
14:56:17:WU00:FS00:0x22:Checkpoint completed at step 700000
14:58:07:WU00:FS00:0x22:Completed 712500 out of 1250000 steps (57%)
14:59:58:WU00:FS00:0x22:Completed 725000 out of 1250000 steps (58%)
15:00:05:WU00:FS00:0x22:Checkpoint completed at step 725000
15:02:00:WU00:FS00:0x22:Completed 737500 out of 1250000 steps (59%)
15:03:57:WU00:FS00:0x22:Completed 750000 out of 1250000 steps (60%)
15:04:04:WU00:FS00:0x22:Checkpoint completed at step 750000
15:06:02:WU00:FS00:0x22:Completed 762500 out of 1250000 steps (61%)
15:07:58:WU00:FS00:0x22:Completed 775000 out of 1250000 steps (62%)
15:08:06:WU00:FS00:0x22:Checkpoint completed at step 775000
15:10:04:WU00:FS00:0x22:Completed 787500 out of 1250000 steps (63%)
15:12:04:WU00:FS00:0x22:Completed 800000 out of 1250000 steps (64%)
15:12:15:WU00:FS00:0x22:Checkpoint completed at step 800000
******************************* Date: 2022-11-25 *******************************
15:14:19:WU00:FS00:0x22:Completed 812500 out of 1250000 steps (65%)
15:16:20:WU00:FS00:0x22:Completed 825000 out of 1250000 steps (66%)
15:16:28:WU00:FS00:0x22:Checkpoint completed at step 825000
15:18:30:WU00:FS00:0x22:Completed 837500 out of 1250000 steps (67%)
15:20:36:WU00:FS00:0x22:Completed 850000 out of 1250000 steps (68%)
15:20:42:WU00:FS00:0x22:Checkpoint completed at step 850000
15:22:48:WU00:FS00:0x22:Completed 862500 out of 1250000 steps (69%)
15:24:52:WU00:FS00:0x22:Completed 875000 out of 1250000 steps (70%)
15:25:00:WU00:FS00:0x22:Checkpoint completed at step 875000
15:27:01:WU00:FS00:0x22:Completed 887500 out of 1250000 steps (71%)
15:29:02:WU00:FS00:0x22:Completed 900000 out of 1250000 steps (72%)
15:29:09:WU00:FS00:0x22:Checkpoint completed at step 900000
15:31:13:WU00:FS00:0x22:Completed 912500 out of 1250000 steps (73%)
15:33:16:WU00:FS00:0x22:Completed 925000 out of 1250000 steps (74%)
15:33:27:WU00:FS00:0x22:Checkpoint completed at step 925000
15:35:30:WU00:FS00:0x22:Completed 937500 out of 1250000 steps (75%)
15:37:30:WU00:FS00:0x22:Completed 950000 out of 1250000 steps (76%)
15:37:36:WU00:FS00:0x22:Checkpoint completed at step 950000
15:39:40:WU00:FS00:0x22:Completed 962500 out of 1250000 steps (77%)
15:41:37:WU00:FS00:0x22:Completed 975000 out of 1250000 steps (78%)
15:41:44:WU00:FS00:0x22:Checkpoint completed at step 975000
15:43:41:WU00:FS00:0x22:Completed 987500 out of 1250000 steps (79%)
15:45:35:WU00:FS00:0x22:Completed 1000000 out of 1250000 steps (80%)
15:45:41:WU00:FS00:0x22:Checkpoint completed at step 1000000
15:47:32:WU00:FS00:0x22:Completed 1012500 out of 1250000 steps (81%)
15:49:21:WU00:FS00:0x22:Completed 1025000 out of 1250000 steps (82%)
15:49:29:WU00:FS00:0x22:Checkpoint completed at step 1025000
15:51:20:WU00:FS00:0x22:Completed 1037500 out of 1250000 steps (83%)
15:53:11:WU00:FS00:0x22:Completed 1050000 out of 1250000 steps (84%)
15:53:17:WU00:FS00:0x22:Checkpoint completed at step 1050000
15:55:11:WU00:FS00:0x22:Completed 1062500 out of 1250000 steps (85%)
15:57:06:WU00:FS00:0x22:Completed 1075000 out of 1250000 steps (86%)
15:57:13:WU00:FS00:0x22:Checkpoint completed at step 1075000
15:59:09:WU00:FS00:0x22:Completed 1087500 out of 1250000 steps (87%)
16:01:10:WU00:FS00:0x22:Completed 1100000 out of 1250000 steps (88%)
16:01:18:WU00:FS00:0x22:Checkpoint completed at step 1100000
16:03:25:WU00:FS00:0x22:Completed 1112500 out of 1250000 steps (89%)
16:05:30:WU00:FS00:0x22:Completed 1125000 out of 1250000 steps (90%)
16:05:36:WU00:FS00:0x22:Checkpoint completed at step 1125000
16:07:40:WU00:FS00:0x22:Completed 1137500 out of 1250000 steps (91%)
16:09:38:WU00:FS00:0x22:Completed 1150000 out of 1250000 steps (92%)
16:09:51:WU00:FS00:0x22:Checkpoint completed at step 1150000
16:11:52:WU00:FS00:0x22:Completed 1162500 out of 1250000 steps (93%)
16:13:53:WU00:FS00:0x22:Completed 1175000 out of 1250000 steps (94%)
16:14:02:WU00:FS00:0x22:Checkpoint completed at step 1175000
16:16:06:WU00:FS00:0x22:Completed 1187500 out of 1250000 steps (95%)
16:18:08:WU00:FS00:0x22:Completed 1200000 out of 1250000 steps (96%)
16:18:15:WU00:FS00:0x22:Checkpoint completed at step 1200000
16:20:13:WU00:FS00:0x22:Completed 1212500 out of 1250000 steps (97%)
16:22:13:WU00:FS00:0x22:Completed 1225000 out of 1250000 steps (98%)
16:22:22:WU00:FS00:0x22:Checkpoint completed at step 1225000
16:24:22:WU00:FS00:0x22:Completed 1237500 out of 1250000 steps (99%)
16:24:23:WU01:FS00:Connecting to assign1.foldingathome.org:80
16:24:23:WU01:FS00:Assigned to work server 206.223.170.146
16:24:23:WU01:FS00:Requesting new work unit for slot 00: gpu:1:0 GA104 [GeForce RTX 3070] from 206.223.170.146
16:24:23:WU01:FS00:Connecting to 206.223.170.146:8080
16:24:23:WU01:FS00:Downloading 22.05MiB
16:24:27:WU01:FS00:Download complete
16:24:27:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:18601 run:9832 clone:2 gen:32 core:0x22 unit:0x0000000200000020000048a900002668
16:26:21:WU00:FS00:0x22:Completed 1250000 out of 1250000 steps (100%)
16:26:21:WU00:FS00:0x22:Average performance: 18.0753 ns/day
16:26:33:WU00:FS00:0x22:Checkpoint completed at step 1250000
This is the work unit now:

16:48:19:WU00:FS00:0x22:Project: 18601 (Run 10570, Clone 2, Gen 33)
16:48:19:WU00:FS00:0x22:Reading tar file core.xml
16:48:19:WU00:FS00:0x22:Reading tar file integrator.xml
16:48:19:WU00:FS00:0x22:Reading tar file state.xml.bz2
16:48:19:WU00:FS00:0x22:Reading tar file system.xml.bz2
16:48:19:WU00:FS00:0x22:Digital signatures verified
16:48:19:WU00:FS00:0x22:Folding@home GPU Core22 Folding@home Core
16:48:19:WU00:FS00:0x22:Version 0.0.20
16:48:19:WU00:FS00:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
16:48:19:WU00:FS00:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
16:48:19:WU00:FS00:0x22:  XTC frame write interval: 20000 steps (1.6%) [62 total]
16:48:19:WU00:FS00:0x22:  Global context and integrator variables write interval: disabled
16:48:19:WU00:FS00:0x22:There are 4 platforms available.
16:48:19:WU00:FS00:0x22:Platform 0: Reference
16:48:19:WU00:FS00:0x22:Platform 1: CPU
16:48:19:WU00:FS00:0x22:Platform 2: OpenCL
16:48:19:WU00:FS00:0x22:  opencl-device 0 specified
16:48:19:WU00:FS00:0x22:Platform 3: CUDA
16:48:19:WU00:FS00:0x22:  cuda-device 0 specified
16:48:24:FS00:Finishing
16:48:39:WU00:FS00:0x22:Attempting to create CUDA context:
16:48:39:WU00:FS00:0x22:  Configuring platform CUDA
16:48:48:WU00:FS00:0x22:  Using CUDA and gpu 0
16:48:48:WU00:FS00:0x22:Completed 0 out of 1250000 steps (0%)
16:48:51:WU00:FS00:0x22:Checkpoint completed at step 0
16:50:42:WU00:FS00:0x22:Completed 12500 out of 1250000 steps (1%)
16:52:30:WU00:FS00:0x22:Completed 25000 out of 1250000 steps (2%)
16:52:34:WU00:FS00:0x22:Checkpoint completed at step 25000
16:54:22:WU00:FS00:0x22:Completed 37500 out of 1250000 steps (3%)
16:56:09:WU00:FS00:0x22:Completed 50000 out of 1250000 steps (4%)
16:56:13:WU00:FS00:0x22:Checkpoint completed at step 50000
As stated before, everything worked fine until, for some reason, I started getting this issue frequently on multiple PCs, not only one. After speaking with some people on Discord, I'm not the only one with this issue: some people report being stuck at 99.99% and losing the work unit when pausing or restarting. "Losing" was probably a poor choice of words on their part, as the work unit wasn't lost; it simply restarted.
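
For anyone who wants to check their own log for the same pattern, here is a minimal sketch in Python. It assumes the line formats seen in the excerpts above ("Project: ... (Run ..., Clone ..., Gen ...)" and "Completed N out of M steps"); the function name find_restarted_units is made up for this example and is not part of the F@H client.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption,
# other client or core versions may format these lines differently).
PRCG_RE = re.compile(
    r"WU(\d+):FS(\d+):0x\w+:Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
STEP_RE = re.compile(r"WU(\d+):FS(\d+):0x\w+:Completed (\d+) out of (\d+) steps")


def find_restarted_units(lines):
    """Return (project, run, clone, gen) tuples that reached 100% and were
    later reported at step 0 again."""
    current = {}      # (wu, slot) -> PRCG tuple most recently announced there
    finished = set()  # PRCG tuples that have logged their final step
    restarted = []
    for line in lines:
        m = PRCG_RE.search(line)
        if m:
            wu, slot, *prcg = m.groups()
            current[(wu, slot)] = tuple(int(x) for x in prcg)
            continue
        m = STEP_RE.search(line)
        if not m:
            continue
        wu, slot, done, total = m.groups()
        prcg = current.get((wu, slot))
        if prcg is None:
            continue  # progress line seen before any Project line
        if int(done) == int(total):
            finished.add(prcg)
        elif int(done) == 0 and prcg in finished and prcg not in restarted:
            restarted.append(prcg)
    return restarted


if __name__ == "__main__":
    sample = [
        "13:07:23:WU00:FS00:0x22:Project: 18601 (Run 10570, Clone 2, Gen 33)",
        "13:07:59:WU00:FS00:0x22:Completed 0 out of 1250000 steps (0%)",
        "16:26:21:WU00:FS00:0x22:Completed 1250000 out of 1250000 steps (100%)",
        "16:48:19:WU00:FS00:0x22:Project: 18601 (Run 10570, Clone 2, Gen 33)",
        "16:48:48:WU00:FS00:0x22:Completed 0 out of 1250000 steps (0%)",
    ]
    print(find_restarted_units(sample))
```

Run against the lines quoted above, it reports Project 18601 (Run 10570, Clone 2, Gen 33) as having restarted after reaching 100%. It only looks at progress lines, so it says nothing about why the unit restarted.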
calxalot
Site Moderator
Posts: 878
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA

Re: Work unit restarts at 0% if it gets paused past 99%

Post by calxalot »

This may be a stalled GPU WU. The 99% figure is a false estimate by the client, which doesn't realize the unit has stalled. Someone else may know why this happens. I would guess the GPU was reset after system sleep.
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Work unit restarts at 0% if it gets paused past 99%

Post by toTOW »

How long did you wait after this before restarting:

16:26:21:WU00:FS00:0x22:Completed 1250000 out of 1250000 steps (100%)
16:26:21:WU00:FS00:0x22:Average performance: 18.0753 ns/day
16:26:33:WU00:FS00:0x22:Checkpoint completed at step 1250000
In this state, the WU is finishing, and if you pause or restart the core at this point, there's a high risk of corrupting the data and starting the WU over ...

But this state shouldn't last very long ... if something is preventing the core from finishing or writing the results properly, that is where you should look.
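
The waiting time can also be read off the log itself. The following is a rough sketch in Python, assuming the timestamp and message formats shown in the excerpts in this thread and a log that contains a single work unit; gap_after_final_checkpoint is a hypothetical helper name, not something provided by the client.

```python
import re
from datetime import timedelta

# Formats assumed from the log lines quoted in this thread.
TS_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):")
TOTAL_RE = re.compile(r"Completed \d+ out of (\d+) steps")
CKPT_RE = re.compile(r"Checkpoint completed at step (\d+)")
CORE_START_RE = re.compile(r":Project: \d+ \(Run \d+, Clone \d+, Gen \d+\)")


def _seconds_of_day(line):
    """Seconds since midnight for a timestamped line, or None otherwise."""
    m = TS_RE.match(line)
    if not m:
        return None
    h, mi, s = (int(x) for x in m.groups())
    return h * 3600 + mi * 60 + s


def gap_after_final_checkpoint(lines):
    """Time between the checkpoint at the last step and the next core start.

    Returns a timedelta, or None if that sequence is not found. The log has
    no dates on each line, so a gap longer than 24 hours cannot be detected.
    """
    total = None
    final_at = None
    for line in lines:
        t = _seconds_of_day(line)
        if t is None:
            continue  # e.g. the "***** Date: ... *****" separator lines
        m = TOTAL_RE.search(line)
        if m:
            total = int(m.group(1))
        m = CKPT_RE.search(line)
        if m and total is not None and int(m.group(1)) == total:
            final_at = t
            continue
        if final_at is not None and CORE_START_RE.search(line):
            # Modulo handles a restart that happens after midnight.
            return timedelta(seconds=(t - final_at) % 86400)
    return None


if __name__ == "__main__":
    sample = [
        "16:26:21:WU00:FS00:0x22:Completed 1250000 out of 1250000 steps (100%)",
        "16:26:21:WU00:FS00:0x22:Average performance: 18.0753 ns/day",
        "16:26:33:WU00:FS00:0x22:Checkpoint completed at step 1250000",
        "16:48:19:WU00:FS00:0x22:Project: 18601 (Run 10570, Clone 2, Gen 33)",
    ]
    print(gap_after_final_checkpoint(sample))  # prints 0:21:46
```

For the lines posted above, the gap between the final checkpoint (16:26:33) and the core starting again (16:48:19) is 21 minutes 46 seconds.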

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.