[Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Moderators: Site Moderators, FAHC Science Team

HSF
Posts: 8
Joined: Tue Mar 17, 2020 9:17 pm

[Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Post by HSF »

Log attached below.

Code:

08:43:05:WU02:FS00:0xa7:Project: 13833 (Run 0, Clone 4937, Gen 10)
08:43:05:WU02:FS00:0xa7:Unit: 0x0000000e80fccb095e6e556528ff8640
08:43:05:WU02:FS00:0xa7:Reading tar file core.xml
08:43:05:WU02:FS00:0xa7:Reading tar file frame10.tpr
08:43:05:WU02:FS00:0xa7:Digital signatures verified
08:43:05:WU02:FS00:0xa7:Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpt 15 -nt 15
08:43:05:WU02:FS00:0xa7:Steps: first=2500000 total=250000
08:43:05:WU02:FS00:0xa7:ERROR:
08:43:05:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:43:05:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:43:05:WU02:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
08:43:05:WU02:FS00:0xa7:ERROR:
08:43:05:WU02:FS00:0xa7:ERROR:Fatal error:
08:43:05:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
08:43:05:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:43:05:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:43:05:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:43:05:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:43:05:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:43:10:WU02:FS00:0xa7:WARNING:Unexpected exit() call
08:43:10:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
08:43:10:WU02:FS00:0xa7:Saving result file ..\logfile_01.txt
08:43:10:WU02:FS00:0xa7:Saving result file md.log
08:43:10:WU02:FS00:0xa7:Saving result file science.log
08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
08:43:10:WU02:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
08:43:10:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
08:43:10:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY project:13833 run:0 clone:4937 gen:10 core:0xa7 unit:0x0000000e80fccb095e6e556528ff8640
Considering I'm running other WUs completely fine, is this possibly a freak crash and/or a bad generation?
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: [Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Post by Neil-B »

I think this has been spotted ... I believe this project may no longer be issued for 15 cores ... someone will confirm, but there is a recent post on this.

Edit ... I checked, and the post I was thinking of was about a different project, but it is possibly the same type of issue, relating to the number of cores you are folding with. A search for "large primes" may throw light on it. Most projects can cope with core counts that are a multiple of 5, but some have been sensitive to this.
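To illustrate why a core count like 15 can trip the "no domain decomposition" error in the log above, here is a rough sketch. It is not the actual GROMACS algorithm (which also considers things like separate PME ranks and the real box shape); it only enumerates ways to split the ranks into an nx x ny x nz grid and keeps those where every cell is at least the minimum cell size. The minimum cell size comes from the error message; the box dimensions are a made-up assumption, since the log does not show them, and the function names are purely illustrative.

```python
def grids(n_ranks):
    """Return every ordered (nx, ny, nz) with nx * ny * nz == n_ranks."""
    result = []
    for nx in range(1, n_ranks + 1):
        if n_ranks % nx:
            continue
        rest = n_ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                result.append((nx, ny, rest // ny))
    return result


def feasible_grids(n_ranks, box_nm, min_cell_nm):
    """Keep only grids whose cells are at least min_cell_nm along each axis."""
    return [
        g for g in grids(n_ranks)
        if all(length / cells >= min_cell_nm for length, cells in zip(box_nm, g))
    ]


MIN_CELL_NM = 1.45733        # taken from the error message in the log
BOX_NM = (6.0, 6.0, 6.0)     # hypothetical box, NOT taken from the log

for n in (12, 15, 16):
    print(n, "ranks ->", len(feasible_grids(n, BOX_NM, MIN_CELL_NM)), "usable grids")
```

With this assumed box, at most 4 cells fit along any axis, so 15 (which needs a factor of 5 or 15 somewhere) has no usable grid, while 12 and 16 do. That is the general reason core counts with larger prime factors are more fragile on small systems.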
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: [Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Post by Joe_H »

Another problem I spot is here:

Code:

08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
Either part of the process had not exited properly and the file was still open when it shouldn't have been, or there is some filesystem problem. I would go with the first explanation, as long as you don't see this kind of error repeating.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3