project 7520 being reactivated--post if bad WU's

Posted: Tue Jun 09, 2015 6:06 pm
by kasson
I'm about to reactivate project 7520, which was pulled when we had a RAID failure that corrupted a bunch of WUs. I've done a restore from backup, so I'm hoping that the work units should be clean again. If you get repeat failures on a WU, however, please post here. If necessary, we'll pull the project again.

Project: 7520 (Run 58, Clone 40, Gen 0)

Posted: Wed Jun 10, 2015 12:46 pm
by SKeptical_Thinker

*********************** Log Started 2015-06-10T12:41:51Z ***********************
12:41:51:************************* Folding@home Client *************************
12:41:51:    Website: http://folding.stanford.edu/
12:41:51:  Copyright: (c) 2009-2014 Stanford University
12:41:51:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:41:51:       Args: --child --lifeline 3031 /etc/fahclient/config.xml --run-as
12:41:51:             fahclient --pid-file=/var/run/fahclient.pid --daemon
12:41:51:     Config: /etc/fahclient/config.xml
12:41:51:******************************** Build ********************************
12:41:51:    Version: 7.4.4
12:41:51:       Date: Mar 4 2014
12:41:51:       Time: 12:02:38
12:41:51:    SVN Rev: 4130
12:41:51:     Branch: fah/trunk/client
12:41:51:   Compiler: GNU 4.4.7
12:41:51:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
12:41:51:             -fno-unsafe-math-optimizations -msse2
12:41:51:   Platform: linux2 3.2.0-1-amd64
12:41:51:       Bits: 64
12:41:51:       Mode: Release
12:41:51:******************************* System ********************************
12:41:51:        CPU: AMD Phenom(tm) II X6 1045T Processor
12:41:51:     CPU ID: AuthenticAMD Family 16 Model 10 Stepping 0
12:41:51:       CPUs: 6
12:41:51:     Memory: 7.55GiB
12:41:51:Free Memory: 5.66GiB
12:41:51:    Threads: POSIX_THREADS
12:41:51: OS Version: 3.13
12:41:51:Has Battery: false
12:41:51: On Battery: false
12:41:51: UTC Offset: -4
12:41:51:        PID: 3119
12:41:51:        CWD: /var/lib/fahclient
12:41:51:         OS: Linux 3.13.0-32-generic x86_64
12:41:51:    OS Arch: AMD64
12:41:51:       GPUs: 1
12:41:51:      GPU 0: UNSUPPORTED: RS880 [Radeon HD 4250]
12:41:51:       CUDA: Not detected
12:41:51:***********************************************************************
12:41:51:<config>
12:41:51:  <!-- Client Control -->
12:41:51:  <fold-anon v='true'/>
12:41:51:
12:41:51:  <!-- Folding Core -->
12:41:51:  <checkpoint v='30'/>
12:41:51:
12:41:51:  <!-- Folding Slot Configuration -->
12:41:51:  <client-type v='advanced'/>
12:41:51:
12:41:51:  <!-- HTTP Server -->
12:41:51:  <allow v='127.0.0.1 192.168.1.0/24'/>
12:41:51:
12:41:51:  <!-- Network -->
12:41:51:  <proxy v=':8080'/>
12:41:51:
12:41:51:  <!-- Remote Command Server -->
12:41:51:  <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
12:41:51:
12:41:51:  <!-- Slot Control -->
12:41:51:  <power v='full'/>
12:41:51:
12:41:51:  <!-- User Information -->
12:41:51:  <passkey v='********************************'/>
12:41:51:  <team v='31574'/>
12:41:51:  <user v='SKeptical_Thinker'/>
12:41:51:
12:41:51:  <!-- Work Unit Control -->
12:41:51:  <next-unit-percentage v='100'/>
12:41:51:
12:41:51:  <!-- Folding Slots -->
12:41:51:  <slot id='0' type='CPU'/>
12:41:51:</config>
12:41:51:Switching to user fahclient
12:41:51:Trying to access database...
12:41:51:Successfully acquired database lock
12:41:51:Enabled folding slot 00: READY cpu:6
12:41:51:WU01:FS00:Starting
12:41:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 01 -suffix 01 -version 704 -lifeline 3119 -checkpoint 30 -np 6
12:41:51:WU01:FS00:Started FahCore on PID 3128
12:41:51:WU01:FS00:Core PID:3132
12:41:51:WU01:FS00:FahCore 0xa4 started
12:41:51:WU01:FS00:0xa4:
12:41:51:WU01:FS00:0xa4:*------------------------------*
12:41:51:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
12:41:51:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
12:41:51:WU01:FS00:0xa4:
12:41:51:WU01:FS00:0xa4:Preparing to commence simulation
12:41:51:WU01:FS00:0xa4:- Ensuring status. Please wait.
12:42:00:WU01:FS00:0xa4:- Looking at optimizations...
12:42:00:WU01:FS00:0xa4:- Files status OK
12:42:00:WU01:FS00:0xa4:- Expanded 2021924 -> 3413504 (decompressed 168.8 percent)
12:42:00:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=2021924 data_size=3413504, decompressed_data_size=3413504 diff=0
12:42:00:WU01:FS00:0xa4:- Digital signature verified
12:42:00:WU01:FS00:0xa4:
12:42:00:WU01:FS00:0xa4:Project: 7520 (Run 58, Clone 40, Gen 0)
12:42:00:WU01:FS00:0xa4:
12:42:00:WU01:FS00:0xa4:Assembly optimizations on if available.
12:42:00:WU01:FS00:0xa4:Entering M.D.

This WU starts, grows to over 40 GB of virtual memory, and then crashes.

I can supply an Ubuntu crash file if you need it.

I deleted the WU and have moved on to Project: 9011 (Run 450, Clone 4, Gen 17) without issue.

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Posted: Wed Jun 10, 2015 2:59 pm
by bruce
So far, that WU has not been returned by anyone.

Project 7520, Run 58, Clone 40, Gen 0
No data back from query

The log you posted doesn't show the initialization of the FAHCore. Did it crash before messages of the following form appeared?

14:57:55:WU00:FS00:0xa4:Entering M.D.
14:58:02:WU00:FS00:0xa4:Mapping NT from 4 to 4 
14:58:05:WU00:FS00:0xa4:Completed 0 out of 200000 steps (0%)
Which drivers are you running?

Re: project 7520 being reactivated--post if bad WU's

Posted: Wed Jun 10, 2015 3:02 pm
by billford

14:28:52:WU00:FS00:0xa4:Project: 7520 (Run 64, Clone 42, Gen 0)
14:28:52:WU00:FS00:0xa4:
14:28:52:WU00:FS00:0xa4:Assembly optimizations on if available.
14:28:52:WU00:FS00:0xa4:Entering M.D.
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps  (0%)
14:43:48:WU00:FS00:0xa4:Completed 10000 out of 1000000 steps  (1%)
14:59:09:WU00:FS00:0xa4:Completed 20000 out of 1000000 steps  (2%)
15:14:08:WU00:FS00:0xa4:Completed 30000 out of 1000000 steps  (3%)
If I remember correctly, it should have only 500,000 steps?

TPF and PPD are consistent with it having twice the number of steps it should have (judging by P7520s from long ago in HFM). It's running OK so far, but should I dump it?

BTW, it's not listed in psummary.
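
For anyone who wants to check the same thing on their own machine, here is a minimal sketch that works out the time per frame (TPF) from progress lines like the ones in the log above. The function name and the regular expression are illustrative, not part of the client, and the sketch assumes the timestamps do not wrap past midnight.

```python
import re
from datetime import datetime

# Progress lines copied from the log excerpt in this post.
LOG = """\
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps  (0%)
14:43:48:WU00:FS00:0xa4:Completed 10000 out of 1000000 steps  (1%)
14:59:09:WU00:FS00:0xa4:Completed 20000 out of 1000000 steps  (2%)
15:14:08:WU00:FS00:0xa4:Completed 30000 out of 1000000 steps  (3%)
"""

# Captures the timestamp, the steps completed, and the total step count.
PROGRESS = re.compile(r"^(\d\d:\d\d:\d\d):.*Completed (\d+) out of (\d+) steps")


def time_per_frame(log_text):
    """Return (average time per 1% frame, total steps) from progress lines."""
    samples = []
    for line in log_text.splitlines():
        match = PROGRESS.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            samples.append((stamp, int(match.group(2)), int(match.group(3))))
    if len(samples) < 2:
        raise ValueError("need at least two progress lines")
    (t_first, steps_first, total), (t_last, steps_last, _) = samples[0], samples[-1]
    # One frame is 1% of the total step count.
    frames_done = (steps_last - steps_first) / (total / 100)
    return (t_last - t_first) / frames_done, total


tpf, total_steps = time_per_frame(LOG)
print("Total steps:", total_steps)
print("TPF:", tpf)
print("Estimated time for the whole WU:", tpf * 100)
```

On the lines above this works out to roughly 15 minutes per frame, or about 25 hours for the full 1,000,000 steps.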

Re: project 7520 being reactivated--post if bad WU's

Posted: Wed Jun 10, 2015 3:23 pm
by Joe_H
Looking at an old log with a 7520 WU: yes, you remember the step count correctly. It should be 500,000.

As for showing up in psummary, this is a bit odd. It shows up on the old psummary page and on the new psummaryC page, but not on the new psummary page.

Re: project 7520 being reactivated--post if bad WU's

Posted: Wed Jun 10, 2015 3:32 pm
by billford
OK, I've dumped it (deleted the slot, so it may get reassigned).

Re: project 7520 being reactivated--post if bad WU's

Posted: Wed Jun 10, 2015 5:22 pm
by parkut
I have 12 machines chewing on 7520s. In the log files, I see that two of them have 500,000 steps; the rest have 1,000,000 steps.

1,000,000 steps
Machine 033: Project: 7520 (Run 49, Clone 23, Gen 0)
Machine f1b1: Project: 7520 (Run 1, Clone 16, Gen 0)
Machine 163: Project: 7520 (Run 32, Clone 44, Gen 0)
Machine 105: Project: 7520 (Run 78, Clone 39, Gen 0)
Machine 139: Project: 7520 (Run 85, Clone 42, Gen 0)
Machine 148: Project: 7520 (Run 65, Clone 44, Gen 0)
Machine 093: Project: 7520 (Run 122, Clone 32, Gen 0)
Machine 071: Project: 7520 (Run 25, Clone 10, Gen 0)
Machine 081: Project: 7520 (Run 17, Clone 36, Gen 0)
Machine 091: Project: 7520 (Run 46, Clone 33, Gen 0)

500,000 steps
Machine 145: Project: 7520 (Run 73, Clone 21, Gen 1)
Machine 133: Project: 7520 (Run 120, Clone 5, Gen 130)
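
A tally like the one above can be pulled out of a client log automatically. The following is a rough sketch, assuming the log line formats seen in this thread (a `Project: ...` line followed later by `Completed N out of M steps` lines for the same WU queue entry). The function name is made up for illustration, and the expected count of 500,000 is the figure quoted earlier in this thread.

```python
import re

EXPECTED_STEPS = 500_000  # step count reported for older 7520 WUs in this thread

PROJECT = re.compile(
    r"(WU\d+):.*Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
STEPS = re.compile(r"(WU\d+):.*Completed \d+ out of (\d+) steps")


def find_unexpected_step_counts(log_text, expected=EXPECTED_STEPS):
    """Map (project, run, clone, gen) -> total steps for WUs that differ from expected."""
    current = {}  # WU queue id (e.g. "WU00") -> (project, run, clone, gen)
    flagged = {}
    for line in log_text.splitlines():
        match = PROJECT.search(line)
        if match:
            current[match.group(1)] = tuple(int(g) for g in match.groups()[1:])
            continue
        match = STEPS.search(line)
        if match and match.group(1) in current:
            total = int(match.group(2))
            if total != expected:
                flagged[current[match.group(1)]] = total
    return flagged


# Lines taken from billford's log earlier in this thread.
sample = """\
14:28:52:WU00:FS00:0xa4:Project: 7520 (Run 64, Clone 42, Gen 0)
14:28:52:WU00:FS00:0xa4:Entering M.D.
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps  (0%)
"""
flagged = find_unexpected_step_counts(sample)
print(flagged)
```

The sketch only reports WUs whose step count differs from the expected value, so a clean log yields an empty result.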

Re: project 7520 being reactivated--post if bad WU's

Posted: Wed Jun 10, 2015 6:52 pm
by kasson
Oh, there was an old problem here IIRC where gen 0 had 2x the number of steps. Let me check that.

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Posted: Wed Jun 10, 2015 7:53 pm
by SKeptical_Thinker
bruce wrote: So far, that WU has not been returned by anyone.

Project 7520, Run 58, Clone 40, Gen 0
No data back from query

The log you posted doesn't show the initialization of the FAHCore. Did it crash before messages of the following form appeared?

14:57:55:WU00:FS00:0xa4:Entering M.D.
14:58:02:WU00:FS00:0xa4:Mapping NT from 4 to 4 
14:58:05:WU00:FS00:0xa4:Completed 0 out of 200000 steps (0%)
Which drivers are you running?

The log that I posted is complete. That was all of the output in the log up to the time of the crash.

Which drivers are you talking about?

Moderator, please move this thread to: project 7520 being reactivated--post if bad WU's
Mod edit: Topics merged.

thanks

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Posted: Wed Jun 10, 2015 9:00 pm
by kasson
The WU was corrupted and has been regenerated. Thanks.

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Posted: Wed Jun 10, 2015 9:11 pm
by kasson
We further regenerated all gen0 clones in Run 58 (and corrected their number of steps).

Re: project 7520 being reactivated--post if bad WU's

Posted: Thu Jun 11, 2015 2:29 am
by parkut
An observation: the million-step WUs are reporting nearly one third of the PPD that the half-million-step WUs report on identical hardware.

And they take much longer to process.
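
One possible explanation for the "nearly one third" figure, offered as a sketch rather than a confirmed diagnosis: if the base points and deadline for the WU were left unchanged while the step count doubled, and the quick-return bonus follows the published form `base_points * max(1, sqrt(k * deadline / elapsed))`, then doubling the run time scales PPD by 1/(2*sqrt(2)), or about 0.35. The base points, `k`, and deadline below are placeholder values, not the real settings for project 7520.

```python
import math


def credit(base_points, k, deadline_days, elapsed_days):
    """Credit for one WU under the quick-return bonus formula (assumed to apply here)."""
    bonus = max(1.0, math.sqrt(k * deadline_days / elapsed_days))
    return base_points * bonus


def ppd(base_points, k, deadline_days, elapsed_days):
    """Points per day if the machine returns WUs back to back."""
    return credit(base_points, k, deadline_days, elapsed_days) / elapsed_days


# Placeholder values for illustration only; NOT the real settings for project 7520.
BASE, K, DEADLINE = 1000.0, 2.0, 10.0

normal = ppd(BASE, K, DEADLINE, elapsed_days=1.0)   # half-million-step WU
doubled = ppd(BASE, K, DEADLINE, elapsed_days=2.0)  # same WU with twice the steps
ratio = doubled / normal
print(round(ratio, 3))  # 0.354
```

The ratio does not depend on the placeholder values as long as the bonus factor stays above 1 in both cases.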

Re: project 7520 being reactivated--post if bad WU's

Posted: Thu Jun 11, 2015 11:55 am
by rewron
Project: 7520 (Run 81, Clone 38, Gen 0) has 1,000,000 steps. Processing time: 4+ days.

Normal WU turnaround time on this machine is 8-13 hours.

Re: project 7520 being reactivated--post if bad WU's

Posted: Thu Jun 11, 2015 7:52 pm
by kasson
We're working on this and have cleaned up a large number of WUs. The tricky part is that we don't want to change the number of steps on a WU that's already assigned; that can screw up the gen=n+1 WU.

Re: project 7520 being reactivated--post if bad WU's

Posted: Fri Jun 12, 2015 4:03 am
by bruce
Is there a way to detect if a WU has been assigned but not yet returned? Apparently if Gen 0 has been returned, that trajectory will continue normally.

Of course, there's no guarantee that a WU that has been assigned will be returned, which complicates the issue. What happens if you suspend assignments of trajectories for which Gen 0 has NOT been returned and then just wait until all of those WUs have either expired or been returned? Can you do that?