Project: 14524 (Run 967, Clone 1, Gen 4) fatal error

Moderators: Site Moderators, FAHC Science Team

wandang
Posts: 2
Joined: Wed Mar 18, 2020 10:39 pm

Project: 14524 (Run 967, Clone 1, Gen 4) fatal error

Post by wandang »

Project 14524 (Run 967, Clone 1, Gen 4) produces a fatal error and the science code exits unexpectedly, forcing me to retry every few minutes. (I will try changing my ID, as suggested in the troubleshooting guide.)

FAHLog:


22:34:48:WU00:FS00:Starting
22:34:48:WU00:FS00:Removing old file '/opt/fah/work/00/logfile_01-20200318-220352.txt'
22:34:48:WU00:FS00:Running FahCore: /opt/fah/FAHCoreWrapper /opt/fah/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 75902 -checkpoint 15 -np 11
22:34:48:WU00:FS00:Started FahCore on PID 76863
22:34:48:WU00:FS00:Core PID:76867
22:34:48:WU00:FS00:FahCore 0xa7 started
22:34:49:WU00:FS00:0xa7:*********************** Log Started 2020-03-18T22:34:48Z ***********************
22:34:49:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:34:49:WU00:FS00:0xa7:       Type: 0xa7
22:34:49:WU00:FS00:0xa7:       Core: Gromacs
22:34:49:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 76863 -checkpoint 15 -np
22:34:49:WU00:FS00:0xa7:             11
22:34:49:WU00:FS00:0xa7:************************************ CBang *************************************
22:34:49:WU00:FS00:0xa7:       Date: Nov 5 2019
22:34:49:WU00:FS00:0xa7:       Time: 06:06:57
22:34:49:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
22:34:49:WU00:FS00:0xa7:     Branch: master
22:34:49:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
22:34:49:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
22:34:49:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
22:34:49:WU00:FS00:0xa7:       Bits: 64
22:34:49:WU00:FS00:0xa7:       Mode: Release
22:34:49:WU00:FS00:0xa7:************************************ System ************************************
22:34:49:WU00:FS00:0xa7:        CPU: AMD Ryzen 5 2600 Six-Core Processor
22:34:49:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
22:34:49:WU00:FS00:0xa7:       CPUs: 12
22:34:49:WU00:FS00:0xa7:     Memory: 15.64GiB
22:34:49:WU00:FS00:0xa7:Free Memory: 3.40GiB
22:34:49:WU00:FS00:0xa7:    Threads: POSIX_THREADS
22:34:49:WU00:FS00:0xa7: OS Version: 5.5
22:34:49:WU00:FS00:0xa7:Has Battery: false
22:34:49:WU00:FS00:0xa7: On Battery: false
22:34:49:WU00:FS00:0xa7: UTC Offset: 1
22:34:49:WU00:FS00:0xa7:        PID: 76867
22:34:49:WU00:FS00:0xa7:        CWD: /opt/fah/work
22:34:49:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
22:34:49:WU00:FS00:0xa7:    Version: 0.0.18
22:34:49:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:34:49:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
22:34:49:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
22:34:49:WU00:FS00:0xa7:       Date: Nov 5 2019
22:34:49:WU00:FS00:0xa7:       Time: 06:13:26
22:34:49:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
22:34:49:WU00:FS00:0xa7:     Branch: master
22:34:49:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
22:34:49:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
22:34:49:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
22:34:49:WU00:FS00:0xa7:       Bits: 64
22:34:49:WU00:FS00:0xa7:       Mode: Release
22:34:49:WU00:FS00:0xa7:************************************ Build *************************************
22:34:49:WU00:FS00:0xa7:       SIMD: avx_256
22:34:49:WU00:FS00:0xa7:********************************************************************************
22:34:49:WU00:FS00:0xa7:Project: 14524 (Run 967, Clone 1, Gen 4)
22:34:49:WU00:FS00:0xa7:Unit: 0x0000000480fccb0a5e459b8e3093b364
22:34:49:WU00:FS00:0xa7:Reading tar file core.xml
22:34:49:WU00:FS00:0xa7:Reading tar file frame4.tpr
22:34:49:WU00:FS00:0xa7:Digital signatures verified
22:34:49:WU00:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
22:34:49:WU00:FS00:0xa7:Calling: mdrun -s frame4.tpr -o frame4.trr -x frame4.xtc -cpt 15 -nt 10
22:34:49:WU00:FS00:0xa7:Steps: first=1000000 total=250000
22:34:49:WU00:FS00:0xa7:ERROR:
22:34:49:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:34:49:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:34:49:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
22:34:49:WU00:FS00:0xa7:ERROR:
22:34:49:WU00:FS00:0xa7:ERROR:Fatal error:
22:34:49:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:34:49:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:34:49:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:34:49:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:34:49:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:34:49:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:34:54:WU00:FS00:0xa7:WARNING:Unexpected exit() call
22:34:54:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
22:34:54:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
22:34:54:WU00:FS00:0xa7:Saving result file md.log
22:34:54:WU00:FS00:0xa7:Saving result file science.log
22:34:54:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Project: 14524 (Run 967, Clone 1, Gen 4) fatal error

Post by Joe_H »

Try pausing, changing the number of CPU threads from 11 to 9 or 8, and then starting processing again. Your log shows the core reducing the thread count from 11 to 10, and some projects have issues with thread counts that are multiples of 5. Most of these are identified during pre-release testing, but some make it out, and we only find out about the problem after the project has been running for a while.
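For anyone who wants to pick a thread count ahead of time, here is a rough sketch of the rule of thumb above. It is not how the FAH client or the a7 core chooses thread counts; `largest_prime_factor` and `safe_thread_count` are hypothetical helper names, and the assumption is simply that a count whose only prime factors are 2 and 3 is less likely to hit the domain decomposition error.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (returns 1 for n < 2)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever is left after trial division is itself prime.
        largest = n
    return largest


def safe_thread_count(requested: int) -> int:
    """Step down from `requested` until the count has no prime factor above 3.

    Assumption (not taken from the FAH source): counts built only from
    factors 2 and 3 avoid both large primes and multiples of 5.
    """
    count = requested
    while count > 1 and largest_prime_factor(count) > 3:
        count -= 1
    return count


if __name__ == "__main__":
    for requested in (12, 11, 10, 7):
        print(requested, "->", safe_thread_count(requested))
```

With 11 requested threads this steps past 10 (a multiple of 5) and settles on 9, which matches the manual change suggested above.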

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
wandang
Posts: 2
Joined: Wed Mar 18, 2020 10:39 pm

Re: Project: 14524 (Run 967, Clone 1, Gen 4) fatal error

Post by wandang »

Thanks Joe!

I deleted the work folder and got a new WU afterwards.
The next time a similar bug occurs, I will keep the multiples of 5 in mind and change the thread count accordingly.

Have a nice week,
Wandang
caog
Posts: 2
Joined: Thu Mar 19, 2020 10:09 am

Re: Project: 14524 (Run 967, Clone 1, Gen 4) fatal error

Post by caog »

Project 14524 also failed for me with the GROMACS decomposition error for 10 threads. I've changed the CPU slot to 9 threads.