FAHCore seg faulting on pop os

FAH provides a V7 client installer for Debian / Mint / Ubuntu / RedHat / CentOS / Fedora. Installation on other distros may or may not be easy but if you can offer help to others, they would appreciate it.

Moderators: Site Moderators, FAHC Science Team

cujomalainey
Posts: 2
Joined: Thu Mar 26, 2020 2:17 am

FAHCore seg faulting on pop os

Post by cujomalainey »

I had been running jobs for about a week when I noticed my machine hadn't been under load for about a day. I looked in the log, and it seems there is some sort of issue in one of the libraries. Do I have a bad job in my queue, or is there something wrong with the client? Thanks.

Dmesg

[13471.708354] FahCore_a7[15763]: segfault at 50 ip 000000000120aa3d sp 00007ffc5bac4cf0 error 4 in FahCore_a7[406000+10cc000]
[13471.708360] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c
[13531.741271] FahCore_a7[24861]: segfault at 50 ip 000000000120aa3d sp 00007ffce9fff140 error 4 in FahCore_a7[406000+10cc000]
[13531.741277] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c
[13591.778074] FahCore_a7[1683]: segfault at 50 ip 000000000120aa3d sp 00007ffc3463cff0 error 4 in FahCore_a7[406000+10cc000]
[13591.778112] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c

02:20:22:WU01:FS00:0xa7:************************************ Build *************************************
02:20:22:WU01:FS00:0xa7:       SIMD: avx_256 
02:20:22:WU01:FS00:0xa7:********************************************************************************
02:20:22:WU01:FS00:0xa7:Project: 13833 (Run 0, Clone 2650, Gen 2)                                                                                                                              
02:20:22:WU01:FS00:0xa7:Unit: 0x0000000480fccb095e6e56038838e939
02:20:22:WU01:FS00:0xa7:Reading tar file core.xml
02:20:22:WU01:FS00:0xa7:Reading tar file frame2.tpr
02:20:22:WU01:FS00:0xa7:Digital signatures verified
02:20:22:WU01:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 15 -nt 15         
02:20:22:WU01:FS00:0xa7:Steps: first=500000 total=250000                                                                                                                                       
02:20:22:WU01:FS00:0xa7:ERROR:                                                                                                                                                                 
02:20:22:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
02:20:22:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown                                                                                                        
02:20:22:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
02:20:22:WU01:FS00:0xa7:ERROR:                                                                                                                                                                 
02:20:22:WU01:FS00:0xa7:ERROR:Fatal error:                                                     
02:20:22:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
02:20:22:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
02:20:22:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
02:20:22:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
02:20:22:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
02:20:22:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
02:20:27:WU01:FS00:0xa7:WARNING:Unexpected exit() call
02:20:27:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
02:20:27:WU01:FS00:0xa7:Saving result file ../logfile_01.txt                                                                                                                                   
02:20:27:WU01:FS00:0xa7:Saving result file md.log                            
02:20:27:WU01:FS00:0xa7:Saving result file science.log                                                                                                                                         
02:20:27:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)                                  
02:21:22:WU01:FS00:Starting                                                                    
02:21:22:WU01:FS00:Removing old file './work/01/logfile_01-20200326-014921.txt'                
02:21:22:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 1711 -checkpoint 15 -np 15
02:21:22:WU01:FS00:Started FahCore on PID 22010                                                                                                                                                
02:21:22:WU01:FS00:Core PID:22018                                                              
02:21:22:WU01:FS00:FahCore 0xa7 started
02:21:22:WU01:FS00:0xa7:*********************** Log Started 2020-03-26T02:21:22Z ***********************
02:21:22:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
02:21:22:WU01:FS00:0xa7:       Type: 0xa7                                                                                                                                                      
02:21:22:WU01:FS00:0xa7:       Core: Gromacs
02:21:22:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 22010 -checkpoint 15 -np
02:21:22:WU01:FS00:0xa7:             15                                                                                                                                                        
02:21:22:WU01:FS00:0xa7:************************************ CBang *************************************                                                                                       
02:21:22:WU01:FS00:0xa7:       Date: Nov 5 2019                                           
02:21:22:WU01:FS00:0xa7:       Time: 06:06:57                                                                                                                                                  
02:21:22:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9       
02:21:22:WU01:FS00:0xa7:     Branch: master                                                    
02:21:22:WU01:FS00:0xa7:   Compiler: GNU 8.3.0                                                 
02:21:22:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
02:21:22:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64  
02:21:22:WU01:FS00:0xa7:       Bits: 64                                                        
02:21:22:WU01:FS00:0xa7:       Mode: Release                                                   
02:21:22:WU01:FS00:0xa7:************************************ System ************************************
02:21:22:WU01:FS00:0xa7:        CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
02:21:22:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 12
02:21:22:WU01:FS00:0xa7:       CPUs: 16
02:21:22:WU01:FS00:0xa7:     Memory: 31.28GiB
02:21:22:WU01:FS00:0xa7:Free Memory: 13.97GiB
02:21:22:WU01:FS00:0xa7:    Threads: POSIX_THREADS
02:21:22:WU01:FS00:0xa7: OS Version: 5.3
02:21:22:WU01:FS00:0xa7:Has Battery: false
02:21:22:WU01:FS00:0xa7: On Battery: false
02:21:22:WU01:FS00:0xa7: UTC Offset: -7
02:21:22:WU01:FS00:0xa7:        PID: 22018
02:21:22:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
02:21:22:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
02:21:22:WU01:FS00:0xa7:    Version: 0.0.18
02:21:22:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:21:22:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
02:21:22:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
02:21:22:WU01:FS00:0xa7:       Date: Nov 5 2019
02:21:22:WU01:FS00:0xa7:       Time: 06:13:26
02:21:22:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
02:21:22:WU01:FS00:0xa7:     Branch: master
02:21:22:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
02:21:22:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
02:21:22:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
02:21:22:WU01:FS00:0xa7:       Bits: 64
02:21:22:WU01:FS00:0xa7:       Mode: Release
02:21:22:WU01:FS00:0xa7:************************************ Build *************************************
02:21:22:WU01:FS00:0xa7:       SIMD: avx_256
02:21:22:WU01:FS00:0xa7:********************************************************************************
02:21:22:WU01:FS00:0xa7:Project: 13833 (Run 0, Clone 2650, Gen 2)
02:21:22:WU01:FS00:0xa7:Unit: 0x0000000480fccb095e6e56038838e939
02:21:22:WU01:FS00:0xa7:Reading tar file core.xml
02:21:22:WU01:FS00:0xa7:Reading tar file frame2.tpr
02:21:22:WU01:FS00:0xa7:Digital signatures verified
02:21:22:WU01:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 15 -nt 15
02:21:22:WU01:FS00:0xa7:Steps: first=500000 total=250000
02:21:22:WU01:FS00:0xa7:ERROR:
02:21:22:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
02:21:22:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
02:21:22:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
02:21:22:WU01:FS00:0xa7:ERROR:
02:21:22:WU01:FS00:0xa7:ERROR:Fatal error:
02:21:22:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
02:21:22:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
02:21:22:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
02:21:22:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
02:21:22:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
02:21:22:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
02:21:27:WU01:FS00:0xa7:WARNING:Unexpected exit() call
02:21:27:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
02:21:27:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
02:21:27:WU01:FS00:0xa7:Saving result file md.log
02:21:27:WU01:FS00:0xa7:Saving result file science.log
02:21:27:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Papamatti
Posts: 1
Joined: Thu Mar 26, 2020 11:33 am

Re: FAHCore seg faulting on pop os

Post by Papamatti »

The same has been happening on my system since today: Manjaro with kernel 5.5.11, AMD Ryzen 7 2700. :(
cujomalainey
Posts: 2
Joined: Thu Mar 26, 2020 2:17 am

Re: FAHCore seg faulting on pop os

Post by cujomalainey »

No idea if it was a bad job that timed out, but I am folding again without having changed anything myself.
benc
Posts: 8
Joined: Fri Jul 17, 2020 2:28 pm

Re: FAHCore seg faulting on pop os

Post by benc »

I am seeing this on headless Ubuntu as well.

[67858.261564] FahCore_a7[37025]: segfault at 50 ip 000000000120aa3d sp 00007ffe33bafe50 error 4 in FahCore_a7[406000+10cc000]
[67858.261578] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c
[67918.392974] FahCore_a7[37049]: segfault at 50 ip 000000000120aa3d sp 00007fffd9ef0eb0 error 4 in FahCore_a7[406000+10cc000]
[67918.392988] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c
[67978.533974] FahCore_a7[37078]: segfault at 50 ip 000000000120aa3d sp 00007ffdace4aa60 error 4 in FahCore_a7[406000+10cc000]
[67978.533987] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c
Is this a bad piece of work, or something I am doing wrong?

I have been folding on this machine (AMD Ryzen 7 3700X) for about a month.
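
For anyone comparing reports, here is a minimal sketch that decodes one of the kernel segfault lines quoted in this thread. It assumes the x86 Linux message format shown above and the usual x86 page-fault error-code bits (bit 0: page present, bit 1: write, bit 2: user mode). The function name `decode_segfault` and the dictionary keys are made up for illustration and are not part of any FAH tool.

```python
import re

# Matches the x86 Linux kernel's user-space segfault line as it appears in dmesg.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the decoded fields of one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)    # the kernel prints the error code in hex
    ip = int(m.group("ip"), 16)      # instruction pointer at the time of the fault
    base = int(m.group("base"), 16)  # start address of the mapping named after "in"
    return {
        "process": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "offset_in_mapping": ip - base,
        "page_present": bool(err & 1),            # bit 0: 0 means page not present
        "access": "write" if err & 2 else "read",  # bit 1
        "mode": "user" if err & 4 else "kernel",   # bit 2
    }


if __name__ == "__main__":
    sample = ("[67858.261564] FahCore_a7[37025]: segfault at 50 ip 000000000120aa3d "
              "sp 00007ffe33bafe50 error 4 in FahCore_a7[406000+10cc000]")
    print(decode_segfault(sample))
```

Applied to the lines above, this reports a user-mode read of a page that is not present, at address 0x50, from offset 0xe04a3d into the FahCore_a7 mapping. A fault address that small usually points to a read through a null or near-null pointer, and the identical instruction pointer in every report suggests the same code path is failing each time, though that is an inference rather than a diagnosis.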
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: FAHCore seg faulting on pop os

Post by bruce »

benc wrote: I am seeing this on headless Ubuntu as well.

Is this a bad piece of work, or something I am doing wrong?

I have been folding on this machine (AMD Ryzen 7 3700X) for about a month.

It would help if we knew which WU was assigned.
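
One way to pull that information out of a client log is sketched below. It assumes the log uses the "Project: N (Run N, Clone N, Gen N)" and "ERROR:Fatal error:" line formats seen in the logs posted in this thread. The helper names `find_work_units` and `fatal_errors` are hypothetical, not part of FAHClient.

```python
import re

# Project/Run/Clone/Gen line as printed by the core in the FAHClient log.
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def find_work_units(log_text):
    """Return the distinct (project, run, clone, gen) tuples, in order of first appearance."""
    seen = []
    for m in PRCG_RE.finditer(log_text):
        prcg = tuple(int(g) for g in m.groups())
        if prcg not in seen:
            seen.append(prcg)
    return seen


def fatal_errors(log_text):
    """Return the message line that follows each 'Fatal error:' line."""
    lines = [ln.rstrip() for ln in log_text.splitlines()]
    messages = []
    for i, ln in enumerate(lines[:-1]):
        if ln.endswith("ERROR:Fatal error:"):
            # Keep only the text after the 'ERROR:' tag on the next line.
            messages.append(lines[i + 1].split("ERROR:", 1)[-1])
    return messages


if __name__ == "__main__":
    # Usual log location on a packaged Linux install; adjust if yours differs.
    with open("/var/lib/fahclient/log.txt", encoding="utf-8", errors="replace") as f:
        text = f.read()
    print(find_work_units(text))
    print(fatal_errors(text))
```

Posting the resulting project, run, clone and gen numbers together with the fatal error text is usually enough for someone to look up the work unit.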
benc
Posts: 8
Joined: Fri Jul 17, 2020 2:28 pm

Re: FAHCore seg faulting on pop os

Post by benc »

Thanks, Bruce. It might be a GROMACS issue or a configuration problem.

19:04:05:WU00:FS00:Starting
19:04:05:WU00:FS00:Removing old file 'work/00/logfile_01-20200717-140751.txt'
19:04:05:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 1077 -checkpoint 15 -np 16
19:04:05:WU00:FS00:Started FahCore on PID 1818
19:04:05:WU00:FS00:Core PID:1822
19:04:05:WU00:FS00:FahCore 0xa7 started
19:04:06:WU00:FS00:0xa7:*********************** Log Started 2020-07-17T19:04:05Z ***********************
19:04:06:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
19:04:06:WU00:FS00:0xa7:       Type: 0xa7
19:04:06:WU00:FS00:0xa7:       Core: Gromacs
19:04:06:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 1818 -checkpoint 15 -np
19:04:06:WU00:FS00:0xa7:             16
19:04:06:WU00:FS00:0xa7:************************************ CBang *************************************
19:04:06:WU00:FS00:0xa7:       Date: Nov 5 2019
19:04:06:WU00:FS00:0xa7:       Time: 06:06:57
19:04:06:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
19:04:06:WU00:FS00:0xa7:     Branch: master
19:04:06:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
19:04:06:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
19:04:06:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
19:04:06:WU00:FS00:0xa7:       Bits: 64
19:04:06:WU00:FS00:0xa7:       Mode: Release
19:04:06:WU00:FS00:0xa7:************************************ System ************************************
19:04:06:WU00:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
19:04:06:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
19:04:06:WU00:FS00:0xa7:       CPUs: 16
19:04:06:WU00:FS00:0xa7:     Memory: 7.77GiB
19:04:06:WU00:FS00:0xa7:Free Memory: 7.03GiB
19:04:06:WU00:FS00:0xa7:    Threads: POSIX_THREADS
19:04:06:WU00:FS00:0xa7: OS Version: 5.4
19:04:06:WU00:FS00:0xa7:Has Battery: false
19:04:06:WU00:FS00:0xa7: On Battery: false
19:04:06:WU00:FS00:0xa7: UTC Offset: 0
19:04:06:WU00:FS00:0xa7:        PID: 1822
19:04:06:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
19:04:06:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
19:04:06:WU00:FS00:0xa7:    Version: 0.0.18
19:04:06:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:04:06:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
19:04:06:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
19:04:06:WU00:FS00:0xa7:       Date: Nov 5 2019
19:04:06:WU00:FS00:0xa7:       Time: 06:13:26
19:04:06:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
19:04:06:WU00:FS00:0xa7:     Branch: master
19:04:06:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
19:04:06:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
19:04:06:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
19:04:06:WU00:FS00:0xa7:       Bits: 64
19:04:06:WU00:FS00:0xa7:       Mode: Release
19:04:06:WU00:FS00:0xa7:************************************ Build *************************************
19:04:06:WU00:FS00:0xa7:       SIMD: avx_256
19:04:06:WU00:FS00:0xa7:********************************************************************************
19:04:06:WU00:FS00:0xa7:Project: 14246 (Run 0, Clone 102, Gen 19)
19:04:06:WU00:FS00:0xa7:Unit: 0x0000002180fccb0a5efe04946e0a838c
19:04:06:WU00:FS00:0xa7:Reading tar file core.xml
19:04:06:WU00:FS00:0xa7:Reading tar file frame19.tpr
19:04:06:WU00:FS00:0xa7:Digital signatures verified
19:04:06:WU00:FS00:0xa7:Calling: mdrun -s frame19.tpr -o frame19.trr -x frame19.xtc -cpt 15 -nt 16
19:04:06:WU00:FS00:0xa7:Steps: first=4750000 total=250000
19:04:06:WU00:FS00:0xa7:ERROR:
19:04:06:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
19:04:06:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
19:04:06:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
19:04:06:WU00:FS00:0xa7:ERROR:
19:04:06:WU00:FS00:0xa7:ERROR:Fatal error:
19:04:06:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
19:04:06:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
19:04:06:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
19:04:06:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
19:04:06:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
19:04:06:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
19:04:10:WU00:FS00:0xa7:WARNING:Unexpected exit() call
19:04:10:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
19:04:10:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
19:04:10:WU00:FS00:0xa7:Saving result file md.log
19:04:10:WU00:FS00:0xa7:Saving result file science.log
19:04:11:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
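
To illustrate what the "no domain decomposition for N ranks" message is complaining about, here is a deliberately simplified sketch. It is not GROMACS's actual algorithm, which also accounts for PME ranks and other constraints. It only asks whether N can be written as nx * ny * nz such that every cell of a rectangular box is at least the minimum cell size along each axis. The 5.0 nm box edge is a hypothetical value chosen for the example, since the real box size is not in the logs. Only the 1.45733 nm minimum cell size comes from the error message.

```python
from itertools import product


def feasible_grids(n_ranks, box_nm, min_cell_nm):
    """List (nx, ny, nz) grids with nx*ny*nz == n_ranks whose cells are >= min_cell_nm."""
    # Largest number of cells that fits along each box edge.
    max_cells = [int(length // min_cell_nm) for length in box_nm]
    grids = []
    for nx, ny, nz in product(*(range(1, m + 1) for m in max_cells)):
        if nx * ny * nz == n_ranks:
            grids.append((nx, ny, nz))
    return grids


if __name__ == "__main__":
    box = (5.0, 5.0, 5.0)   # hypothetical box edges in nm, NOT taken from the WU
    min_cell = 1.45733      # minimum cell size in nm, from the error message
    for n in (12, 15, 16):
        print(n, feasible_grids(n, box, min_cell))
```

Under these assumed numbers at most three cells fit along each edge, so 15 (which needs a factor of 5) and 16 (which needs a factor of 4 or more) have no valid grid, while 12 does. That is in line with the error's own suggestion to change the number of ranks.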