WU restarted- Project: 10469 (Run 0, Clone 416, Gen 26)

billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

Post by billford »

Got the following in the log:

19:23:50:WU02:FS01:0x17:Completed 1850000 out of 5000000 steps (37%)
19:28:31:WU02:FS01:0x17:Completed 1900000 out of 5000000 steps (38%)
19:29:12:WU02:FS01:0x17:ERROR:exception: Error invoking kernel findBlockBounds: clEnqueueNDRangeKernel (-5)
19:29:12:WU02:FS01:0x17:Saving result file logfile_01.txt
19:29:12:WU02:FS01:0x17:Saving result file log.txt
19:29:12:WU02:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
19:29:12:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
19:29:13:WU02:FS01:Starting
19:29:13:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17 -dir 02 -suffix 01 -version 704 -lifeline 1088 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
19:29:13:WU02:FS01:Started FahCore on PID 10765
19:29:13:WU02:FS01:Core PID:10769
19:29:13:WU02:FS01:FahCore 0x17 started
19:29:13:WU02:FS01:0x17:*********************** Log Started 2014-09-26T19:29:13Z ***********************
19:29:13:WU02:FS01:0x17:Project: 10469 (Run 0, Clone 416, Gen 26)
19:29:13:WU02:FS01:0x17:Unit: 0x0000003b538b3db9538f43797be105f3
19:29:13:WU02:FS01:0x17:CPU: 0x00000000000000000000000000000000
19:29:13:WU02:FS01:0x17:Machine: 1
19:29:13:WU02:FS01:0x17:Reading tar file state.xml
19:29:13:WU02:FS01:0x17:Reading tar file system.xml
19:29:14:WU02:FS01:0x17:Reading tar file integrator.xml
19:29:14:WU02:FS01:0x17:Reading tar file core.xml
19:29:14:WU02:FS01:0x17:Digital signatures verified
19:31:01:WU02:FS01:0x17:Completed 0 out of 5000000 steps (0%)
19:35:43:WU02:FS01:0x17:Completed 50000 out of 5000000 steps (1%)
19:40:20:WU02:FS01:0x17:Completed 100000 out of 5000000 steps (2%)
19
The question is simply: why did it start again from scratch rather than picking up from the last checkpoint? Is this deliberate for some reason (maybe due to the type of error), or is it a bug?

Configs:

14:30:04:************************* Folding@home Client *************************
14:30:04:    Website: http://folding.stanford.edu/
14:30:04:  Copyright: (c) 2009-2014 Stanford University
14:30:04:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:30:04:       Args: --child --lifeline 1086 /etc/fahclient/config.xml --run-as
14:30:04:             fahclient --pid-file=/var/run/fahclient.pid --daemon
14:30:04:     Config: /etc/fahclient/config.xml
14:30:04:******************************** Build ********************************
14:30:04:    Version: 7.4.4
14:30:04:       Date: Mar 4 2014
14:30:04:       Time: 12:02:38
14:30:04:    SVN Rev: 4130
14:30:04:     Branch: fah/trunk/client
14:30:04:   Compiler: GNU 4.4.7
14:30:04:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
14:30:04:             -fno-unsafe-math-optimizations -msse2
14:30:04:   Platform: linux2 3.2.0-1-amd64
14:30:04:       Bits: 64
14:30:04:       Mode: Release
14:30:04:******************************* System ********************************
14:30:04:        CPU: Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
14:30:04:     CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
14:30:04:       CPUs: 4
14:30:04:     Memory: 3.82GiB
14:30:04:Free Memory: 3.40GiB
14:30:04:    Threads: POSIX_THREADS
14:30:04: OS Version: 3.13
14:30:04:Has Battery: false
14:30:04: On Battery: false
14:30:04: UTC Offset: 1
14:30:04:        PID: 1088
14:30:04:        CWD: /var/lib/fahclient
14:30:04:         OS: Linux 3.13.0-36-generic x86_64
14:30:04:    OS Arch: AMD64
14:30:04:       GPUs: 1
14:30:04:      GPU 0: NVIDIA:3 GK110 [GeForce GTX 780 Ti]
14:30:04:       CUDA: 3.5
14:30:04:CUDA Driver: 6000
14:30:04:***********************************************************************

14:37:11:<config>
14:37:11:  <!-- Client Control -->
14:37:11:  <fold-anon v='true'/>
14:37:11:
14:37:11:  <!-- Folding Slot Configuration -->
14:37:11:  <gpu v='false'/>
14:37:11:
14:37:11:  <!-- HTTP Server -->
14:37:11:  <allow v='127.0.0.1 192.168.1.0/24'/>
14:37:11:
14:37:11:  <!-- Network -->
14:37:11:  <proxy v=':8080'/>
14:37:11:
14:37:11:  <!-- Remote Command Server -->
14:37:11:  <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
14:37:11:
14:37:11:  <!-- Slot Control -->
14:37:11:  <power v='full'/>
14:37:11:
14:37:11:  <!-- User Information -->
14:37:11:  <passkey v='********************************'/>
14:37:11:  <user v='[redacted]'/>
14:37:11:
14:37:11:  <!-- Work Unit Control -->
14:37:11:  <next-unit-percentage v='100'/>
14:37:11:
14:37:11:  <!-- Folding Slots -->
14:37:11:  <slot id='0' type='CPU'>
14:37:11:    <cpus v='3'/>
14:37:11:  </slot>
14:37:11:  <slot id='1' type='GPU'/>
14:37:11:</config>
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

Checkpoint intervals for GPU cores are predefined by the project developer. I don't know for sure, but I have a vague memory that the PI has set the checkpoint frequency at 2.5%, and it looks like your system had not written a checkpoint yet. All restarts resume from the most recent checkpoint; if there is none, the WU restarts from the downloaded state at 0%.
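
As a rough check on that arithmetic, here is a minimal Python sketch. It assumes the 2.5% checkpoint interval is correct (that figure is a recollection, not something confirmed for this project); the 5,000,000-step total comes from the posted log, and the function names are purely illustrative.

```python
TOTAL_STEPS = 5_000_000       # from the log: "out of 5000000 steps"
CHECKPOINT_FRACTION = 0.025   # ASSUMPTION: the recalled 2.5% interval, unconfirmed


def steps_per_checkpoint(total_steps, fraction):
    """Number of simulation steps between two consecutive checkpoints."""
    return round(total_steps * fraction)


def checkpoints_before(step, total_steps, fraction):
    """How many checkpoints should exist once `step` steps are done (0% not counted)."""
    return step // steps_per_checkpoint(total_steps, fraction)


def last_checkpoint_step(step, total_steps, fraction):
    """Step count a restart would resume from if the newest checkpoint is usable."""
    interval = steps_per_checkpoint(total_steps, fraction)
    return checkpoints_before(step, total_steps, fraction) * interval


if __name__ == "__main__":
    for done in (100_000, 1_900_000):   # 2% and 38% of the WU
        n = checkpoints_before(done, TOTAL_STEPS, CHECKPOINT_FRACTION)
        resume = last_checkpoint_step(done, TOTAL_STEPS, CHECKPOINT_FRACTION)
        print(f"{done} steps done: {n} checkpoint(s) expected, resume at {resume}")
```

Under that assumption, a WU at 2% would have no checkpoint to resume from, while one at 38% would have several.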
billford
Post by billford »

From the log:

19:28:31:WU02:FS01:0x17:Completed 1900000 out of 5000000 steps (38%)
19:29:12:WU02:FS01:0x17:ERROR:exception: Error invoking kernel findBlockBounds: clEnqueueNDRangeKernel (-5)
it would seem to have reached 38% (and had been running for about 3 hours) when it bombed out, so there should have been a checkpoint!

I had a look when I spotted it- it had reached ~7% on the new run and what I took to be a checkpoint file was present with a reasonable looking timestamp. It could have been that the file was corrupted in some way of course, or that some error occurred whilst writing it, so it couldn't use it.

(I'm not sure what caused it to error out- the P104xx WUs seem to run a bit hotter than others but it was well within limits that haven't bothered it before)

The new run won't finish until long after I go to bed, but I'll keep an eye on it until then.
bruce
Post by bruce »

I was basing my answer on the first log segment that you posted, which ended abruptly after 2%. That answer doesn't apply when you post a log that got to 38% with an ":ERROR:exception..." message. (There's a good reason we ask you to post your log.)

In fact, when your system crashed (whatever caused the kernel error, which I've never seen before), there's a good chance that only a portion of the checkpoint file had been written to disk while the rest was still in cache, resulting in a corrupt checkpoint file. (There's no way to actually know, however.) Likewise, if a system crashes without an orderly shutdown, such as in a power failure, files can be corrupted. When the client restarts a WU from a checkpoint, it validates the coherency of the checkpoint. If it determines the file is corrupt, it cannot use it. In that case (as well as the case in which no checkpoint has been written), the only option is to restart from the beginning.
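
To illustrate the general idea of a checkpoint that is validated before use, here is a minimal Python sketch. It is not how FahCore_17 stores or checks its checkpoints (that format isn't described in this thread); the file layout (a SHA-256 digest followed by the payload), the function names, and the step-count payload are all assumptions made for the example. It also shows one common way to reduce the chance of a half-written file: write to a temporary file, flush it to disk, then rename it over the old checkpoint.

```python
import hashlib
import os
import tempfile

DIGEST_LEN = 32  # length of a SHA-256 digest in bytes


def write_checkpoint(path, payload):
    """Write digest + payload to a temp file, sync it, then atomically rename."""
    digest = hashlib.sha256(payload).digest()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(digest + payload)
            f.flush()
            os.fsync(f.fileno())   # push the data out of the OS cache
        os.replace(tmp, path)      # the old checkpoint stays intact until this point
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_checkpoint(path):
    """Return the payload, or None if the file is missing or fails validation."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if len(blob) < DIGEST_LEN or hashlib.sha256(payload).digest() != digest:
        return None                # truncated or corrupt: unusable
    return payload


def resume_step(path):
    """Step to resume from: the stored step, or 0 when no usable checkpoint exists."""
    payload = read_checkpoint(path)
    return 0 if payload is None else int(payload.decode())
```

With this layout, a missing file and a truncated file lead to the same outcome described above: `resume_step` returns 0 and the work starts over from the beginning.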
billford
Post by billford »

bruce wrote: I was basing my answer on the first log segment that you posted, which ended abruptly after 2%. That answer doesn't apply when you post a log that got to 38% with an ":ERROR:exception..." message. (There's a good reason we ask you to post your log.)
Ah, I see… the log extract in the OP included the last two frames of the aborted run, the error, and the first frames of the new run, which is where your 2% came from. I'll be more careful in future to make it clear what I'm including.
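
For anyone wanting to separate the runs in an extract like that mechanically, here is a small Python sketch. It relies only on two strings visible in the posted log ("Log Started" and "Completed N out of M steps"); the function name and the idea of splitting on "Log Started" are my own assumptions, not a feature of the client.

```python
import re

# Matches progress lines such as "Completed 1900000 out of 5000000 steps (38%)"
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")


def split_runs(lines):
    """Group completed-step counts by core run; a new run begins at 'Log Started'."""
    runs = [[]]
    for line in lines:
        if "Log Started" in line:
            runs.append([])
        match = PROGRESS.search(line)
        if match:
            runs[-1].append(int(match.group(1)))
    return [run for run in runs if run]   # drop runs with no progress lines


if __name__ == "__main__":
    extract = [
        "19:23:50:WU02:FS01:0x17:Completed 1850000 out of 5000000 steps (37%)",
        "19:28:31:WU02:FS01:0x17:Completed 1900000 out of 5000000 steps (38%)",
        "19:29:12:WU02:FS01:0x17:ERROR:exception: Error invoking kernel findBlockBounds: clEnqueueNDRangeKernel (-5)",
        "19:29:13:WU02:FS01:0x17:*********************** Log Started 2014-09-26T19:29:13Z ***********************",
        "19:31:01:WU02:FS01:0x17:Completed 0 out of 5000000 steps (0%)",
        "19:35:43:WU02:FS01:0x17:Completed 50000 out of 5000000 steps (1%)",
        "19:40:20:WU02:FS01:0x17:Completed 100000 out of 5000000 steps (2%)",
    ]
    for number, run in enumerate(split_runs(extract), start=1):
        print(f"run {number}: steps {run[0]} to {run[-1]}")
```

A second run whose first step count is lower than the last step count of the previous run is the sign that the WU did not resume where it left off.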
In fact, when your system crashed (whatever caused the kernel error, which I've never seen before) there's a good chance that only a portion of the checkpoint file had been written to disk and the rest was in cache, resulting in a corrupt checkpoint file.
Thanks- whether or not that's what actually happened, it's a possible explanation, which is what I was after.

Just to round the topic off, it completed successfully overnight.