8649 (Run 238, Clone 0, Gen 39) repeated fails, dumped

Posted: Mon Jan 15, 2018 1:30 pm
by parkut
Found one of my Linux machines (CentOS 7.3/Q22/Q6600) stuck in a loop trying to start Project 8649 (Run 238, Clone 0, Gen 39) and failing immediately each time. This repeated 322 times over a five-hour period.

The issue was resolved by stopping FAHClient and deleting the work directory. On restart, the machine was immediately assigned a new WU and returned to folding as normal.

02:48:20:WU00:FS00:Starting
02:48:20:WU00:FS00:Removing old file './work/00/logfile_01-20180115-021720.txt'
02:48:20:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 704 -lifeline 28424 -checkpoint 15 -np 4
02:48:20:WU00:FS00:Started FahCore on PID 29484
02:48:20:WU00:FS00:Core PID:29488
02:48:20:WU00:FS00:FahCore 0xa4 started
02:48:20:WU00:FS00:0xa4:
02:48:20:WU00:FS00:0xa4:*------------------------------*
02:48:20:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
02:48:20:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
02:48:20:WU00:FS00:0xa4:
02:48:20:WU00:FS00:0xa4:Preparing to commence simulation
02:48:20:WU00:FS00:0xa4:- Ensuring status. Please wait.
02:48:30:WU00:FS00:0xa4:- Looking at optimizations...
02:48:30:WU00:FS00:0xa4:- Working with standard loops on this execution.
02:48:30:WU00:FS00:0xa4:Examination of work files indicates 8 consecutive improper terminations of core.
02:48:30:WU00:FS00:0xa4:- Expanded 29720 -> 535948 (decompressed 1803.3 percent)
02:48:30:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=29720 data_size=535948, decompressed_data_size=535948 diff=0
02:48:30:WU00:FS00:0xa4:- Digital signature verified
02:48:30:WU00:FS00:0xa4:
02:48:30:WU00:FS00:0xa4:Project: 8649 (Run 238, Clone 0, Gen 39)
02:48:30:WU00:FS00:0xa4:
02:48:30:WU00:FS00:0xa4:Entering M.D.
02:48:36:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
02:49:20:WU00:FS00:Starting
02:49:20:WU00:FS00:Removing old file './work/00/logfile_01-20180115-021820.txt'
02:49:20:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 704 -lifeline 28424 -checkpoint 15 -np 4
02:49:20:WU00:FS00:Started FahCore on PID 29496
02:49:20:WU00:FS00:Core PID:29500
02:49:20:WU00:FS00:FahCore 0xa4 started
02:49:20:WU00:FS00:0xa4:
02:49:20:WU00:FS00:0xa4:*------------------------------*
02:49:20:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
02:49:20:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
02:49:20:WU00:FS00:0xa4:
02:49:20:WU00:FS00:0xa4:Preparing to commence simulation
02:49:20:WU00:FS00:0xa4:- Ensuring status. Please wait.
02:49:30:WU00:FS00:0xa4:- Looking at optimizations...
02:49:30:WU00:FS00:0xa4:- Working with standard loops on this execution.
02:49:30:WU00:FS00:0xa4:Examination of work files indicates 8 consecutive improper terminations of core.
02:49:30:WU00:FS00:0xa4:- Expanded 29720 -> 535948 (decompressed 1803.3 percent)
02:49:30:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=29720 data_size=535948, decompressed_data_size=535948 diff=0
02:49:30:WU00:FS00:0xa4:- Digital signature verified
02:49:30:WU00:FS00:0xa4:
02:49:30:WU00:FS00:0xa4:Project: 8649 (Run 238, Clone 0, Gen 39)
02:49:30:WU00:FS00:0xa4:
02:49:30:WU00:FS00:0xa4:Entering M.D.
02:49:36:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
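
A loop like the one in the log above can run for hours unnoticed, so here is a minimal sketch of a script that counts `FahCore returned:` lines per work unit and slot and flags repeats. It is not part of FAHClient: the function names and the `threshold` value are hypothetical, and the line format is assumed from the excerpt above only.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this thread, e.g.:
# 02:48:36:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
RETURN_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(WU\d+):(FS\d+):"
    r"FahCore returned: (\w+) \(\d+ = 0x[0-9a-fA-F]+\)"
)


def count_core_returns(lines):
    """Count core return statuses per (work unit, slot, status)."""
    counts = Counter()
    for line in lines:
        match = RETURN_RE.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


def stuck_units(counts, status="INTERRUPTED", threshold=5):
    """List (wu, slot, status, count) entries seen at least `threshold` times.

    `threshold` is a hypothetical cut-off; pick whatever suits your setup.
    """
    return sorted(
        (wu, slot, st, n)
        for (wu, slot, st), n in counts.items()
        if st == status and n >= threshold
    )


if __name__ == "__main__":
    # The two return lines from the excerpt above.
    sample = [
        "02:48:36:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
        "02:49:20:WU00:FS00:Starting",
        "02:49:36:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    ]
    for wu, slot, st, n in stuck_units(count_core_returns(sample), threshold=2):
        print(f"{wu}/{slot}: {st} x{n}")
```

To scan a real log, pass an open file object instead of `sample`; the log location varies by installation, so it is left out here.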

Re: 8649 (Run 238, Clone 0, Gen 39) repeated fails, dumped

Posted: Sun Jan 21, 2018 9:42 pm
by toTOW
Yes, something is probably wrong with this trajectory: Gen 38 was completed on January 15th, and I see only one failure report, on the 21st ... at least someone was able to report something to the server.

I marked the WU as bad.