WU crashing infinitely

Moderators: Site Moderators, PandeGroup

Postby Nuitari » Fri Nov 08, 2019 3:38 am

03:35:38:WU01:FS00:FahCore 0xa7 started
03:35:38:WU01:FS00:0xa7:*********************** Log Started 2019-11-08T03:35:38Z ***********************
03:35:38:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
03:35:38:WU01:FS00:0xa7:       Type: 0xa7
03:35:38:WU01:FS00:0xa7:       Core: Gromacs
03:35:38:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8918 -checkpoint 15 -np
03:35:38:WU01:FS00:0xa7:             15
03:35:38:WU01:FS00:0xa7:************************************ CBang *************************************
03:35:38:WU01:FS00:0xa7:       Date: Nov 5 2019
03:35:38:WU01:FS00:0xa7:       Time: 06:06:57
03:35:38:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
03:35:38:WU01:FS00:0xa7:     Branch: master
03:35:38:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
03:35:38:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
03:35:38:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
03:35:38:WU01:FS00:0xa7:       Bits: 64
03:35:38:WU01:FS00:0xa7:       Mode: Release
03:35:38:WU01:FS00:0xa7:************************************ System ************************************
03:35:38:WU01:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
03:35:38:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
03:35:38:WU01:FS00:0xa7:       CPUs: 16
03:35:38:WU01:FS00:0xa7:     Memory: 31.35GiB
03:35:38:WU01:FS00:0xa7:Free Memory: 8.31GiB
03:35:38:WU01:FS00:0xa7:    Threads: POSIX_THREADS
03:35:38:WU01:FS00:0xa7: OS Version: 5.3
03:35:38:WU01:FS00:0xa7:Has Battery: false
03:35:38:WU01:FS00:0xa7: On Battery: false
03:35:38:WU01:FS00:0xa7: UTC Offset: -5
03:35:38:WU01:FS00:0xa7:        PID: 8922
03:35:38:WU01:FS00:0xa7:        CWD: /opt/foldingathome/work
03:35:38:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
03:35:38:WU01:FS00:0xa7:    Version: 0.0.18
03:35:38:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
03:35:38:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
03:35:38:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
03:35:38:WU01:FS00:0xa7:       Date: Nov 5 2019
03:35:38:WU01:FS00:0xa7:       Time: 06:13:26
03:35:38:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
03:35:38:WU01:FS00:0xa7:     Branch: master
03:35:38:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
03:35:38:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
03:35:38:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
03:35:38:WU01:FS00:0xa7:       Bits: 64
03:35:38:WU01:FS00:0xa7:       Mode: Release
03:35:38:WU01:FS00:0xa7:************************************ Build *************************************
03:35:38:WU01:FS00:0xa7:       SIMD: avx_256
03:35:38:WU01:FS00:0xa7:********************************************************************************
03:35:38:WU01:FS00:0xa7:Project: 14246 (Run 0, Clone 69, Gen 68)
03:35:38:WU01:FS00:0xa7:Unit: 0x0000006380fccb0a5d6fe21f1bfc07a1
03:35:38:WU01:FS00:0xa7:Reading tar file core.xml
03:35:38:WU01:FS00:0xa7:Reading tar file frame68.tpr
03:35:38:WU01:FS00:0xa7:Digital signatures verified
03:35:38:WU01:FS00:0xa7:Calling: mdrun -s frame68.tpr -o frame68.trr -x frame68.xtc -cpt 15 -nt 15
03:35:38:WU01:FS00:0xa7:Steps: first=17000000 total=250000
03:35:38:WU01:FS00:0xa7:ERROR:
03:35:38:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
03:35:38:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
03:35:38:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
03:35:38:WU01:FS00:0xa7:ERROR:
03:35:38:WU01:FS00:0xa7:ERROR:Fatal error:
03:35:38:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
03:35:38:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
03:35:38:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
03:35:38:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
03:35:38:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
03:35:38:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
03:35:43:WU01:FS00:0xa7:WARNING:Unexpected exit() call
03:35:43:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
03:35:43:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
03:35:43:WU01:FS00:0xa7:Saving result file md.log
03:35:43:WU01:FS00:0xa7:Saving result file science.log
03:35:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)


The same thing happened 663 times, wasting 12 hours that could have been used for actual folding.

No overclocking. I've manually removed the WU from the folder, but kept the files in case they are needed.
Nuitari
 
Posts: 9
Joined: Sun Jun 09, 2019 4:03 am

Re: WU crashing infinitely

Postby JimboPalmer » Fri Nov 08, 2019 3:57 am

Your CPU has 16 threads and the slot is using 15 of them. (You didn't include the configuration portion of the log, so I can't be sure, but the latest version, 7.5.1, should avoid this.) Your work unit won't divide 15 ways. You can configure your CPU slot to use 16 or 12 CPUs.
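To see why 15 is a problem while 12 and 16 are not, you can factor the candidate slot sizes with the coreutils `factor` command. This is a minimal sketch assuming GNU coreutils is installed (it is on most Linux distributions); `show_factors` is just an illustrative name:

```shell
#!/usr/bin/env bash
# Print the prime factorization of each candidate CPU-slot size.
# GNU coreutils `factor` prints one line per number, e.g. "15: 3 5".
show_factors() {
  local n
  for n in "$@"; do
    factor "$n"
  done
}

show_factors 12 15 16
# 12: 2 2 3
# 15: 3 5
# 16: 2 2 2 2
```

15 contains a factor of 5, while 12 and 16 only contain 2s and 3s.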
Last edited by JimboPalmer on Fri Nov 08, 2019 2:11 pm, edited 1 time in total.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
JimboPalmer
 
Posts: 959
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: WU crashing infinitely

Postby Nuitari » Fri Nov 08, 2019 5:22 am

AMD Ryzen 7 3700X, so 8 cores / 16 threads with SMT (hyper-threading).
The new version hasn't made it into Gentoo yet, so the update will have to wait until it does.

It might be nice to have the WU handle this itself and automatically drop to the nearest workable number of threads...

Re: WU crashing infinitely

Postby JimboPalmer » Fri Nov 08, 2019 6:09 am

The client is written by a computer programmer; the WUs are written by biochemists.
I am sorry you have not decided to install the latest client. https://packages.gentoo.org/packages/sc ... dingathome
I would set your CPUs to 12 or 16.

Re: WU crashing infinitely

Postby MeeLee » Fri Nov 08, 2019 6:33 am

I would try comparing PPD and power consumption between running with HT disabled (7 threads) and HT enabled (13-14 threads).
You'll need at least 2 cores (if not 3) reserved for non-FAH things.
You'll notice that if you run 14, 15 or 16 threads, your CPU will run at 100% load anyway (13 to 14 threads will probably get you 95-98% load).
Chances are you'll see very little improvement going from 7 cores to 13-14 threads, because on 7 cores the CPU runs cooler and, if it supports it, at higher boost frequencies.
MeeLee
 
Posts: 396
Joined: Tue Feb 19, 2019 10:16 pm

Re: WU crashing infinitely

Postby bollix47 » Fri Nov 08, 2019 10:44 am

@Nuitari

First let me clear up what may be confusing from the above answers:

The current version of the folding client is 7.5.1.
The CPU core will not run when the number of threads contains a large prime factor. Some projects will have trouble with a prime factor as low as 5.

So 7, 13 or 14 (2 × 7) will not work. You can, as Jimbo suggested, use 12 or 16.
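The rule of thumb above (avoid large prime factors, and on some projects even 5) can be turned into the kind of automatic step-down Nuitari asked for. This is a hypothetical sketch: `largest_prime_factor`, `safe_thread_count` and the default limit of 3 are assumptions for illustration, not anything the client does, and the real constraint is GROMACS's domain decomposition, which also depends on each project's box size:

```shell
#!/usr/bin/env bash
# Hypothetical helper: find the largest thread count <= MAX whose prime
# factors are all <= LIMIT. LIMIT defaults to 3 (conservative, since some
# projects fail with 5); pass 5 for a less strict check.

# Print the largest prime factor of a positive integer (1 for n=1).
largest_prime_factor() {
  local n=$1 p=2 largest=1
  while (( n > 1 )); do
    if (( n % p == 0 )); then
      largest=$p
      (( n /= p ))
    else
      (( p++ ))
    fi
  done
  echo "$largest"
}

# Step down from MAX until a count with no prime factor above LIMIT is found.
safe_thread_count() {
  local max=$1 limit=${2:-3} n
  for (( n = max; n >= 1; n-- )); do
    if (( $(largest_prime_factor "$n") <= limit )); then
      echo "$n"
      return 0
    fi
  done
  return 1
}

safe_thread_count 15      # prints 12
safe_thread_count 15 5    # prints 15
```

Stepping down rather than up keeps the slot within the threads you chose to give it.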

Linux has a command called ldd which will list the dependencies needed to run the core. Open your file manager and navigate to the FahCore_a7 location, where you should be able to right-click, select "Open a terminal", and type or copy/paste the following:
ldd FahCore_a7

That will show you what, if anything, is missing and if you post the results here we can help solve any problems.
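If the ldd output is long, you can filter it down to the unresolved libraries, which ldd marks as `=> not found`. A minimal sketch; `missing_libs` is a hypothetical helper name and `libexample.so.1` is a made-up library used only for the demo:

```shell
#!/usr/bin/env bash
# Print only the libraries the dynamic loader could not resolve.
# ldd marks them as "libname.so => not found".
missing_libs() {
  grep 'not found' || true   # no match is not an error here
}

# Typical use (run next to the core binary):
#   ldd ./FahCore_a7 | missing_libs
# No output means every dependency was found.
printf '%s\n' \
  'libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd7d7c49000)' \
  'libexample.so.1 => not found' | missing_libs
```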

See the following post for usage of ldd & strings (your location will be different because you've used a non-default install):
viewtopic.php?p=309712#p309712

Your location may look something like the following but those are only guesses because your log doesn't show enough info:
/opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
OR
/opt/foldingathome/work/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
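Rather than guessing the path, you can search the install tree for every copy of the core with `find`. A sketch assuming the install root is `/opt/foldingathome` (as in the logs in this thread); `find_cores` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# List every FahCore_a7 binary under an install root
# (defaults to /opt/foldingathome; pass another root as $1).
find_cores() {
  local root=${1:-/opt/foldingathome}
  find "$root" -type f -name 'FahCore_a7' 2>/dev/null
}

# Check each copy with ldd:
#   find_cores | while read -r core; do echo "== $core"; ldd "$core"; done
```

There may be more than one copy (for example, builds downloaded from different servers), so it is worth checking each one.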
bollix47
 
Posts: 3509
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: WU crashing infinitely

Postby bruce » Fri Nov 08, 2019 3:08 pm

In this case, the WU is being partitioned into 3x5x1 segments (-nt 15), so the factor 5 is essentially a "large prime" as far as GROMACS is concerned. Having the software use 12 of your 15 threads might be considered, but there are a lot of WUs that successfully use the factor 5. Dumping the WU after this type of failure, as opposed to retrying it, can also be considered.
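The arithmetic behind this: with a 3x5x1 grid, the axis split into 5 cells must be at least 5 × 1.45733 nm = 7.28665 nm long, using the minimum cell size from the error message. The sketch below checks that condition for every nx × ny × nz grid with the given number of ranks. It is a deliberately simplified model (one cell per rank, rectangular box, no separate PME ranks, no `-dds` scaling), not GROMACS's actual algorithm; the 7.0 nm cubic box is a hypothetical value because the box size isn't in the log, and `dd_fits` and `cell_ok` are illustrative names:

```shell
#!/usr/bin/env bash
# Simplified model (assumption): the box is cut into an nx*ny*nz grid,
# one cell per rank, and every cell edge must be >= the minimum cell size.

# Succeeds if edge / cells >= min (floating point via awk).
cell_ok() {
  awk -v edge="$1" -v cells="$2" -v min="$3" \
    'BEGIN { exit !(edge / cells >= min) }'
}

# dd_fits RANKS BOX_X BOX_Y BOX_Z MIN_CELL
# Prints the first grid that fits and succeeds; fails if none fits.
dd_fits() {
  local n=$1 bx=$2 by=$3 bz=$4 min=$5 nx ny nz rest
  for (( nx = 1; nx <= n; nx++ )); do
    (( n % nx == 0 )) || continue
    rest=$(( n / nx ))
    for (( ny = 1; ny <= rest; ny++ )); do
      (( rest % ny == 0 )) || continue
      nz=$(( rest / ny ))
      if cell_ok "$bx" "$nx" "$min" && cell_ok "$by" "$ny" "$min" &&
         cell_ok "$bz" "$nz" "$min"; then
        echo "${nx}x${ny}x${nz}"
        return 0
      fi
    done
  done
  return 1
}

# Hypothetical 7.0 nm cubic box, 1.45733 nm limit from the log:
dd_fits 15 7.0 7.0 7.0 1.45733 || echo "no decomposition for 15 ranks"
dd_fits 16 7.0 7.0 7.0 1.45733    # prints 1x4x4
```

In this toy box, 15 ranks fail because every grid puts 5 or 15 cells along some axis (7.0 / 5 = 1.4 nm, below 1.45733 nm), while 16 ranks fit as 1x4x4.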

Did the software retry Project: 14246 (Run 0, Clone 69, Gen 68) 663 times or did the same failure process that many different WUs? If it was the latter, what other projects were involved?

I recommend that you avoid the problem entirely by manually creating a CPU slot with 12 threads and another with 3 threads (or, if you don't have a GPU, a single slot with 16 CPUs).

I'm not enough of a GROMACS user to know what -rcon, -dds or the LINCS settings can do, but I'll alert the project owner and let him/her research that possibility (and whether changing those settings can be applied to projects that may be assigned an unknown number of threads).
bruce
 
Posts: 22873
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: WU crashing infinitely

Postby Nuitari » Sat Nov 09, 2019 2:45 am

I do run Folding@home 7.5.1 (the original version of your post said 7.5.5.1).

It was the same unit that failed 663 times. The client should probably have dumped it on its own.

I've checked the historical log files: none of this project's WUs ever completed successfully, but in the other cases the client automatically sent them back as FAULTY.
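To see how often a slot's core crashed, and how, you can tally the `FahCore returned:` lines in the client log. In the logs in this thread, `INTERRUPTED (102)` led to an immediate restart of the same unit, while `BAD_WORK_UNIT (114)` got the unit sent back as FAULTY. A sketch assuming the client log is a plain-text file such as `/opt/foldingathome/log.txt` (the path is an assumption); `count_core_results` is a hypothetical helper, and because slot ids like `WU01` are reused, counts are per slot rather than per unit:

```shell
#!/usr/bin/env bash
# Tally FahCore exit results for one work-unit slot in a client log.
# Usage: count_core_results LOGFILE WUxx   ->  lines like "INTERRUPTED 663"
count_core_results() {
  local log=$1 wu=$2
  grep -o "${wu}:FS[[:digit:]]*:FahCore returned: [[:upper:]_]*" "$log" |
    sed 's/.*FahCore returned: //' |
    awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' |
    sort
}

# Example (path is an assumption):
#   count_core_results /opt/foldingathome/log.txt WU01
```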

This is an example with a different WU:
13:00:09:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14246 run:0 clone:88 gen:23 core:0xa7 unit:0x0000002380fccb0a5d6fe21e1415b4d9
13:00:48:WU00:FS00:Starting
13:00:48:WU00:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/Linux/AMD64/AVX/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 7281 -checkpoint 15 -np 15
13:00:48:WU00:FS00:Started FahCore on PID 9350
13:00:48:WU00:FS00:Core PID:9354
13:00:48:WU00:FS00:FahCore 0xa7 started
13:00:48:WU00:FS00:0xa7:*********************** Log Started 2019-10-04T13:00:48Z ***********************
13:00:48:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
13:00:48:WU00:FS00:0xa7:       Type: 0xa7
13:00:48:WU00:FS00:0xa7:       Core: Gromacs
13:00:48:WU00:FS00:0xa7:    Website: https://foldingathome.org/
13:00:48:WU00:FS00:0xa7:  Copyright: (c) 2009-2018 foldingathome.org
13:00:48:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
13:00:48:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 9350 -checkpoint 15 -np
13:00:48:WU00:FS00:0xa7:             15
13:00:48:WU00:FS00:0xa7:     Config: <none>
13:00:48:WU00:FS00:0xa7:************************************ Build *************************************
13:00:48:WU00:FS00:0xa7:    Version: 0.0.17
13:00:48:WU00:FS00:0xa7:       Date: Apr 27 2018
13:00:48:WU00:FS00:0xa7:       Time: 19:09:21
13:00:48:WU00:FS00:0xa7: Repository: Git
13:00:48:WU00:FS00:0xa7:   Revision: 21359963583d09ec2063ef946399441c4df4ccd7
13:00:48:WU00:FS00:0xa7:     Branch: master
13:00:48:WU00:FS00:0xa7:   Compiler: GNU 6.3.0 20170516
13:00:48:WU00:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops
13:00:48:WU00:FS00:0xa7:   Platform: linux2 4.14.0-3-amd64
13:00:48:WU00:FS00:0xa7:       Bits: 64
13:00:48:WU00:FS00:0xa7:       Mode: Release
13:00:48:WU00:FS00:0xa7:       SIMD: avx_256
13:00:48:WU00:FS00:0xa7:************************************ System ************************************
13:00:48:WU00:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
13:00:48:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
13:00:48:WU00:FS00:0xa7:       CPUs: 16
13:00:48:WU00:FS00:0xa7:     Memory: 31.35GiB
13:00:48:WU00:FS00:0xa7:Free Memory: 12.45GiB
13:00:48:WU00:FS00:0xa7:    Threads: POSIX_THREADS
13:00:48:WU00:FS00:0xa7: OS Version: 5.3
13:00:48:WU00:FS00:0xa7:Has Battery: false
13:00:48:WU00:FS00:0xa7: On Battery: false
13:00:48:WU00:FS00:0xa7: UTC Offset: -4
13:00:48:WU00:FS00:0xa7:        PID: 9354
13:00:48:WU00:FS00:0xa7:        CWD: /opt/foldingathome/work
13:00:48:WU00:FS00:0xa7:         OS: Linux 5.3.0-gentoo x86_64
13:00:48:WU00:FS00:0xa7:    OS Arch: AMD64
13:00:48:WU00:FS00:0xa7:********************************************************************************
13:00:48:WU00:FS00:0xa7:Project: 14246 (Run 0, Clone 88, Gen 23)
13:00:48:WU00:FS00:0xa7:Unit: 0x0000002380fccb0a5d6fe21e1415b4d9
13:00:48:WU00:FS00:0xa7:Reading tar file core.xml
13:00:48:WU00:FS00:0xa7:Reading tar file frame23.tpr
13:00:48:WU00:FS00:0xa7:Digital signatures verified
13:00:48:WU00:FS00:0xa7:Calling: mdrun -s frame23.tpr -o frame23.trr -x frame23.xtc -cpt 15 -nt 15
13:00:48:WU00:FS00:0xa7:Steps: first=5750000 total=250000
13:00:48:WU00:FS00:0xa7:ERROR:
13:00:48:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:00:48:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20161122-4846b12ba-unknown
13:00:48:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
13:00:48:WU00:FS00:0xa7:ERROR:
13:00:48:WU00:FS00:0xa7:ERROR:Fatal error:
13:00:48:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
13:00:48:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
13:00:48:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
13:00:48:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
13:00:48:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
13:00:48:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:00:53:WU00:FS00:0xa7:WARNING:Unexpected exit() call
13:00:53:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
13:00:53:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
13:00:53:WU00:FS00:0xa7:Saving result file md.log
13:00:53:WU00:FS00:0xa7:Saving result file science.log
13:00:53:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
13:00:54:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
13:00:54:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14246 run:0 clone:88 gen:23 core:0xa7 unit:0x0000002380fccb0a5d6fe21e1415b4d9
13:00:54:WU00:FS00:Uploading 19.50KiB to 128.252.203.10
13:00:54:WU00:FS00:Connecting to 128.252.203.10:8080
13:00:54:WU00:FS00:Upload complete
13:00:54:WU00:FS00:Server responded WORK_ACK (400)
13:00:54:WU00:FS00:Cleaning up


The WU in question had this log instead:
16:35:18:WU01:FS00:Starting
16:35:18:WU01:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 7442 -checkpoint 15 -np 15
16:35:18:WU01:FS00:Started FahCore on PID 16797
16:35:18:WU01:FS00:Core PID:16802
16:35:18:WU01:FS00:FahCore 0xa7 started
16:35:18:WU01:FS00:0xa7:*********************** Log Started 2019-11-07T16:35:18Z ***********************
16:35:18:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
16:35:18:WU01:FS00:0xa7:       Type: 0xa7
16:35:18:WU01:FS00:0xa7:       Core: Gromacs
16:35:18:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 16797 -checkpoint 15 -np
16:35:18:WU01:FS00:0xa7:             15
16:35:18:WU01:FS00:0xa7:************************************ CBang *************************************
16:35:18:WU01:FS00:0xa7:       Date: Nov 5 2019
16:35:18:WU01:FS00:0xa7:       Time: 06:06:57
16:35:18:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
16:35:18:WU01:FS00:0xa7:     Branch: master
16:35:18:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
16:35:18:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
16:35:18:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
16:35:18:WU01:FS00:0xa7:       Bits: 64
16:35:18:WU01:FS00:0xa7:       Mode: Release
16:35:18:WU01:FS00:0xa7:************************************ System ************************************
16:35:18:WU01:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
16:35:18:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
16:35:18:WU01:FS00:0xa7:       CPUs: 16
16:35:18:WU01:FS00:0xa7:     Memory: 31.35GiB
16:35:18:WU01:FS00:0xa7:Free Memory: 8.39GiB
16:35:18:WU01:FS00:0xa7:    Threads: POSIX_THREADS
16:35:18:WU01:FS00:0xa7: OS Version: 5.3
16:35:18:WU01:FS00:0xa7:Has Battery: false
16:35:18:WU01:FS00:0xa7: On Battery: false
16:35:18:WU01:FS00:0xa7: UTC Offset: -5
16:35:18:WU01:FS00:0xa7:        PID: 16802
16:35:18:WU01:FS00:0xa7:        CWD: /opt/foldingathome/work
16:35:18:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
16:35:18:WU01:FS00:0xa7:    Version: 0.0.18
16:35:18:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
16:35:18:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
16:35:18:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
16:35:18:WU01:FS00:0xa7:       Date: Nov 5 2019
16:35:18:WU01:FS00:0xa7:       Time: 06:13:26
16:35:18:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
16:35:18:WU01:FS00:0xa7:     Branch: master
16:35:18:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
16:35:18:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
16:35:18:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
16:35:18:WU01:FS00:0xa7:       Bits: 64
16:35:18:WU01:FS00:0xa7:       Mode: Release
16:35:18:WU01:FS00:0xa7:************************************ Build *************************************
16:35:18:WU01:FS00:0xa7:       SIMD: avx_256
16:35:18:WU01:FS00:0xa7:********************************************************************************
16:35:18:WU01:FS00:0xa7:Project: 14246 (Run 0, Clone 69, Gen 68)
16:35:18:WU01:FS00:0xa7:Unit: 0x0000006380fccb0a5d6fe21f1bfc07a1
16:35:18:WU01:FS00:0xa7:Reading tar file core.xml
16:35:18:WU01:FS00:0xa7:Reading tar file frame68.tpr
16:35:18:WU01:FS00:0xa7:Digital signatures verified
16:35:18:WU01:FS00:0xa7:Calling: mdrun -s frame68.tpr -o frame68.trr -x frame68.xtc -cpt 15 -nt 15
16:35:18:WU01:FS00:0xa7:Steps: first=17000000 total=250000
16:35:18:WU01:FS00:0xa7:ERROR:
16:35:18:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
16:35:18:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
16:35:18:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
16:35:18:WU01:FS00:0xa7:ERROR:
16:35:18:WU01:FS00:0xa7:ERROR:Fatal error:
16:35:18:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
16:35:18:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
16:35:18:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
16:35:18:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
16:35:18:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
16:35:18:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
16:35:23:WU01:FS00:0xa7:WARNING:Unexpected exit() call
16:35:23:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
16:35:23:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
16:35:23:WU01:FS00:0xa7:Saving result file md.log
16:35:23:WU01:FS00:0xa7:Saving result file science.log
16:35:23:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
16:35:24:WU01:FS00:Starting
16:35:24:WU01:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 7442 -checkpoint 15 -np 15



None of the FahCore_a7 files are missing dependencies.
gandalf /opt/foldingathome # ldd ./cores/cores.foldingathome.org/Linux/AMD64/AVX/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffffc8f8000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd7d7c49000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fd7d7c43000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libstdc++.so.6 (0x00007fd7d79c6000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fd7d787a000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libgcc_s.so.1 (0x00007fd7d7860000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fd7d768e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fd7d9203000)
gandalf /opt/foldingathome # ldd ./cores/cores.foldingathome.org/Linux/AMD64/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffc4b308000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc308f5c000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fc308f56000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libstdc++.so.6 (0x00007fc308cd9000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fc308b8d000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libgcc_s.so.1 (0x00007fc308b73000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fc3089a1000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc30a353000)
gandalf /opt/foldingathome # ldd ./cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffe8cc64000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f37f8136000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f37f8130000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f37f7fe4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f37f7e12000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f37f81bc000)
gandalf /opt/foldingathome # ldd ./cores/fahwebx.stanford.edu/cores/Linux/AMD64/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffcba349000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fbdf152a000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fbdf1524000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libstdc++.so.6 (0x00007fbdf12a7000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fbdf115b000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libgcc_s.so.1 (0x00007fbdf1141000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fbdf0f6f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbdf15b0000)


I will play around with the config to split the CPUs.

Re: WU crashing infinitely

Postby bruce » Sat Nov 09, 2019 3:12 am

Nuitari wrote: I will play around with the config to split the CPUs.


Please report what happens when you try to run those same WUs with fewer CPUs, especially with even numbers of CPUs.

Re: WU crashing infinitely

Postby Nuitari » Fri Nov 22, 2019 3:34 am

Project 14245 is also having the problem at 16 CPUs. It works with 8.
Is there a guide somewhere for the config.xml syntax?

Re: WU crashing infinitely

Postby bruce » Fri Nov 22, 2019 8:35 pm

If you have more than one CPU slot, you can configure each one like this:
<slot type="CPU" id="0">
  <cpus v="6"/>
</slot>

If you want all the slots the same (or you only have one), the <cpus v="N"/> setting can go in the general section before the slots are defined.
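If you are setting up several slots, a tiny shell function can print the XML in the slot format shown above, one `<slot>` per thread count. This is a sketch assuming that syntax (`type="CPU"`, an `id`, and a `<cpus v="..."/>` child); `cpu_slots` is a hypothetical helper, and it is probably safest to edit config.xml with the client stopped. The example uses bruce's earlier 12 + 3 split:

```shell
#!/usr/bin/env bash
# Print one CPU <slot> element per argument, numbering ids from 0.
# Paste the output inside the <config> element of config.xml.
cpu_slots() {
  local id=0 cpus
  for cpus in "$@"; do
    printf '<slot type="CPU" id="%d">\n  <cpus v="%d"/>\n</slot>\n' "$id" "$cpus"
    id=$(( id + 1 ))
  done
}

cpu_slots 12 3
```

Both counts avoid prime factors above 3, in line with the advice earlier in the thread.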

