13500 double folded


ChristianVirtual
Posts: 1596
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

13500 double folded

Post by ChristianVirtual »

On my CentOS box with NVIDIA driver 370.23 and client 7.4.15, the same WU was folded twice. Another WU was assigned in between and got delayed like hell.

The 13500 (1,381,45) WU was interrupted once partway through and restarted from its checkpoint -> ok
Finished to 100% -> ok
Requested a new WU -> ok
Saved the WU results, which failed -> bad
Restarted the full 13500 (1,381,45) from 0% -> bad, waste of time
Received and queued a different WU -> it sat idle for the whole re-execution
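To put a number on the wasted time, here is a rough sketch that subtracts timestamps copied from the log below. The helper name `elapsed` is only for illustration, and it assumes both timestamps fall on the same day, which holds for this log.

```python
from datetime import datetime

def elapsed(start, end):
    """Elapsed time between two HH:MM:SS log timestamps.

    Assumes both timestamps are from the same day (no midnight rollover).
    """
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# Timestamps copied from the log in this post.
rerun_start = "08:42:12"      # WU02 "Starting" again after the failed save
rerun_end = "11:16:49"        # WU02 "FahCore returned: FINISHED_UNIT"
wu01_received = "08:42:31"    # WU01 "Received Unit"
wu01_started = "11:16:49"     # WU01 "Starting"

wasted = elapsed(rerun_start, rerun_end)
idle = elapsed(wu01_received, wu01_started)

print("Re-execution of WU02:", wasted)
print("WU01 waiting in queue:", idle)
```

So the repeat run cost roughly two and a half hours, and the queued WU waited about the same.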


06:06:25:WU02:FS01:Download complete
06:06:25:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13500 run:1 clone:381 gen:45 core:0x21 unit:0x0000003b8ca304f457a35923c35089a4
06:06:32:WU02:FS01:Starting
06:06:32:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 17338 -checkpoint 15 -opencl-platform 0 -gpu-vendor nvidia -gpu 1
06:06:32:WU02:FS01:Started FahCore on PID 1677
06:06:32:WU02:FS01:Core PID:1681
06:06:32:WU02:FS01:FahCore 0x21 started
06:06:33:WU02:FS01:0x21:*********************** Log Started 2016-11-24T06:06:32Z ***********************
06:06:33:WU02:FS01:0x21:Project: 13500 (Run 1, Clone 381, Gen 45)
06:06:33:WU02:FS01:0x21:Unit: 0x0000003b8ca304f457a35923c35089a4
06:06:33:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
06:06:33:WU02:FS01:0x21:Machine: 1
06:06:33:WU02:FS01:0x21:Reading tar file core.xml
06:06:33:WU02:FS01:0x21:Reading tar file system.xml
06:06:33:WU02:FS01:0x21:Reading tar file integrator.xml
06:06:33:WU02:FS01:0x21:Reading tar file state.xml
06:06:33:WU02:FS01:0x21:Digital signatures verified
06:06:33:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
06:06:33:WU02:FS01:0x21:Version 0.0.17
06:06:36:WU02:FS01:0x21:Completed 0 out of 5000000 steps (0%)
06:06:36:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900

06:08:09:WU02:FS01:0x21:Completed 50000 out of 5000000 steps (1%)
06:09:41:WU02:FS01:0x21:Completed 100000 out of 5000000 steps (2%)
<snip>
07:09:53:WU02:FS01:0x21:Completed 2050000 out of 5000000 steps (41%)
07:11:25:WU02:FS01:0x21:Completed 2100000 out of 5000000 steps (42%)
07:12:58:WU02:FS01:0x21:Completed 2150000 out of 5000000 steps (43%)
07:13:17:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
07:13:17:WU02:FS01:Starting
07:13:17:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 17338 -checkpoint 15 -opencl-platform 0 -gpu-vendor nvidia -gpu 1
07:13:17:WU02:FS01:Started FahCore on PID 1873
07:13:17:WU02:FS01:Core PID:1877
07:13:17:WU02:FS01:FahCore 0x21 started
07:13:17:WU02:FS01:0x21:*********************** Log Started 2016-11-24T07:13:17Z ***********************
07:13:17:WU02:FS01:0x21:Project: 13500 (Run 1, Clone 381, Gen 45)
07:13:17:WU02:FS01:0x21:Unit: 0x0000003b8ca304f457a35923c35089a4
07:13:17:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
07:13:17:WU02:FS01:0x21:Machine: 1
07:13:17:WU02:FS01:0x21:Digital signatures verified
07:13:17:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
07:13:17:WU02:FS01:0x21:Version 0.0.17

07:13:17:WU02:FS01:0x21: Found a checkpoint file

07:13:20:WU02:FS01:0x21:Completed 2125000 out of 5000000 steps (42%)
07:13:20:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
07:14:07:WU02:FS01:0x21:Completed 2150000 out of 5000000 steps (43%)
07:15:39:WU02:FS01:0x21:Completed 2200000 out of 5000000 steps (44%)
<snip>
08:40:36:WU02:FS01:0x21:Completed 4950000 out of 5000000 steps (99%)
08:42:08:WU02:FS01:0x21:Completed 5000000 out of 5000000 steps (100%)
08:42:09:WU01:FS01:Connecting to 171.67.108.45:80
08:42:09:WU02:FS01:0x21:Saving result file logfile_01.txt
08:42:09:WU02:FS01:0x21:Saving result file checkpointState.xml

08:42:10:WU01:FS01:Assigned to work server 171.67.108.102
08:42:10:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GM200 [GeForce GTX 980 Ti] from 171.67.108.102
08:42:10:WU01:FS01:Connecting to 171.67.108.102:8080

08:42:10:WU02:FS01:0x21:Saving result file checkpt.crc
08:42:10:WU02:FS01:0x21:Saving result file log.txt
08:42:10:WU02:FS01:0x21:Saving result file positions.xtc
08:42:11:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT

08:42:11:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)

08:42:12:WU02:FS01:Starting
08:42:12:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 17338 -checkpoint 15 -opencl-platform 0 -gpu-vendor nvidia -gpu 1
08:42:12:WU02:FS01:Started FahCore on PID 2184
08:42:12:WU02:FS01:Core PID:2188
08:42:12:WU02:FS01:FahCore 0x21 started
08:42:12:WU02:FS01:0x21:*********************** Log Started 2016-11-24T08:42:12Z ***********************
08:42:12:WU02:FS01:0x21:Project: 13500 (Run 1, Clone 381, Gen 45)
08:42:12:WU02:FS01:0x21:Unit: 0x0000003b8ca304f457a35923c35089a4
08:42:12:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
08:42:12:WU02:FS01:0x21:Machine: 1
08:42:12:WU02:FS01:0x21:Reading tar file core.xml
08:42:12:WU02:FS01:0x21:Reading tar file system.xml
08:42:12:WU02:FS01:0x21:Reading tar file integrator.xml
08:42:12:WU02:FS01:0x21:Reading tar file state.xml
08:42:12:WU02:FS01:0x21:Digital signatures verified
08:42:12:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
08:42:12:WU02:FS01:0x21:Version 0.0.17
08:42:15:WU02:FS01:0x21:Completed 0 out of 5000000 steps (0%)
08:42:15:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
08:42:29:WU01:FS01:Downloading 7.06MiB
08:42:31:WU01:FS01:Download complete
08:42:31:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13204 run:11 clone:1 gen:26 core:0x21 unit:0x00000010ab436c6657894f0cbbb7cfdd
08:43:48:WU02:FS01:0x21:Completed 50000 out of 5000000 steps (1%)
08:45:20:WU02:FS01:0x21:Completed 100000 out of 5000000 steps (2%)
08:46:53:WU02:FS01:0x21:Completed 150000 out of 5000000 steps (3%)
<snip>
11:10:35:WU02:FS01:0x21:Completed 4800000 out of 5000000 steps (96%)
11:12:07:WU02:FS01:0x21:Completed 4850000 out of 5000000 steps (97%)
11:13:40:WU02:FS01:0x21:Completed 4900000 out of 5000000 steps (98%)
11:15:13:WU02:FS01:0x21:Completed 4950000 out of 5000000 steps (99%)
11:16:46:WU02:FS01:0x21:Completed 5000000 out of 5000000 steps (100%)
11:16:47:WU02:FS01:0x21:Saving result file logfile_01.txt
11:16:47:WU02:FS01:0x21:Saving result file checkpointState.xml
11:16:48:WU02:FS01:0x21:Saving result file checkpt.crc
11:16:48:WU02:FS01:0x21:Saving result file log.txt
11:16:48:WU02:FS01:0x21:Saving result file positions.xtc
11:16:49:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
11:16:49:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
11:16:49:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13500 run:1 clone:381 gen:45 core:0x21 unit:0x0000003b8ca304f457a35923c35089a4
11:16:49:WU02:FS01:Uploading 7.31MiB to 140.163.4.244
11:16:49:WU02:FS01:Connecting to 140.163.4.244:8080
11:16:49:WU01:FS01:Starting
11:16:49:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 704 -lifeline 17338 -checkpoint 15 -opencl-platform 0 -gpu-vendor nvidia -gpu 1
11:16:49:WU01:FS01:Started FahCore on PID 2913
11:16:49:WU01:FS01:Core PID:2917
11:16:49:WU01:FS01:FahCore 0x21 started
11:16:50:WU01:FS01:0x21:*********************** Log Started 2016-11-24T11:16:49Z ***********************
11:16:50:WU01:FS01:0x21:Project: 13204 (Run 11, Clone 1, Gen 26)
11:16:50:WU01:FS01:0x21:Unit: 0x00000010ab436c6657894f0cbbb7cfdd
11:16:50:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
11:16:50:WU01:FS01:0x21:Machine: 1
11:16:50:WU01:FS01:0x21:Reading tar file core.xml
11:16:50:WU01:FS01:0x21:Reading tar file integrator.xml
11:16:50:WU01:FS01:0x21:Reading tar file state.xml
11:16:50:WU01:FS01:0x21:Reading tar file system.xml
11:16:51:WU01:FS01:0x21:Digital signatures verified
11:16:51:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
11:16:51:WU01:FS01:0x21:Version 0.0.17
11:17:00:WU02:FS01:Upload complete
11:17:00:WU02:FS01:Server responded WORK_ACK (400)
11:17:00:WU02:FS01:Final credit estimate, 55486.00 points
11:17:00:WU02:FS01:Cleaning up

11:18:14:WU01:FS01:0x21:Completed 0 out of 2000000 steps (0%)
11:18:14:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
11:19:46:WU01:FS01:0x21:Completed 20000 out of 2000000 steps (1%)
11:21:16:WU01:FS01:0x21:Completed 40000 out of 2000000 steps (2%)
11:22:47:WU01:FS01:0x21:Completed 60000 out of 2000000 steps (3%)
Please contribute your logs to http://ppd.fahmm.net
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13500 double folded

Post by bruce »

The key to what happened is in these two lines:
08:42:11:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
08:42:11:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)


It looks like the WU got to 100% but then hit an error before it reached the normal ending, which would have been
08:4x:xx:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
08:4x:xx:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)


I don't see "Sending unit results: id:02 state:SEND ...." (unless you edited out that information from a minute or so after 08:42:11).
This would have left two WUs associated with the same slot. The client then needed to restart both of them (one at a time) and upload both of them for credit.

There's a known bug: when a WU is interrupted between reaching 100% and the point where the upload starts, the result is likely to be discarded and the WU restarted from 0%. (No checkpoint is taken at 100%, so it can't restart from that point.)

The key question is: why was the WU interrupted at 08:42:11?
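
If you want to check older logs for the same pattern, a small script along these lines might help. It is only a sketch, not an official FAHClient tool: the function name `find_lost_results` is made up, and the line format it expects (HH:MM:SS:WUnn:FSnn:message) is assumed from the log lines quoted in this thread.

```python
import re

# Assumed line shape, based on the log in this thread, e.g.
# "08:42:11:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT"
LINE_RE = re.compile(r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):(.*)$")

def find_lost_results(lines):
    """Return (time, wu) pairs where the core logged FINISHED_UNIT
    but the client then logged 'FahCore returned: INTERRUPTED'."""
    core_finished = set()  # WUs whose core just reported FINISHED_UNIT
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines, <snip> markers, etc.
        time, wu, _slot, msg = m.groups()
        if msg.endswith("Core Shutdown: FINISHED_UNIT"):
            core_finished.add(wu)
        elif msg.startswith("FahCore returned:"):
            if wu in core_finished and "INTERRUPTED" in msg:
                hits.append((time, wu))
            core_finished.discard(wu)  # reset for the next run of this WU
    return hits

if __name__ == "__main__":
    sample = [
        "08:42:11:WU02:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT",
        "08:42:11:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)",
    ]
    print(find_lost_results(sample))
```

An ordinary mid-run interruption (like the one at 07:13:17) is not flagged, because no FINISHED_UNIT shutdown line comes before it.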