Work unit for 14587 never starts

Moderators: Site Moderators, FAHC Science Team

texmandie
Posts: 6
Joined: Sun Apr 12, 2020 6:31 pm

Post by texmandie »

This work unit never started. The system completed several other work units at its normal pace today (about 1.4 - 1.7M PPD, CPU-only; it's an Azure F72 spot instance with 72 virtual CPU cores). I tried stopping and starting the FAHClient service.

FAHClient --send-command options:

18:51:19:Connecting to 127.0.0.1:36330


PyON 1 options
{"child": "true", "client-type": "beta", "daemon": "true", "gpu": "false", "passkey": "xxxxxxxxx", "pid-file": "/var/run/fahclient.pid", "power": "full", "run-as": "fahclient", "team": "xxxxx", "user": "xxxxxxxxx"}
---
FAHClient --send-command queue-info:

18:45:46:Connecting to 127.0.0.1:36330


PyON 1 units
[
  {"id": "00", "state": "READY", "error": "NO_ERROR", "project": 14587, "run": 12, "clone": 0, "gen": 23, "core": "0xa7", "unit": "0x0000001b80fccb025e8bcf5a61620012", "percentdone": "0.00%", "eta": "24 hours 00 mins", "ppd": "202", "creditestimate": "202", "waitingon": "FahCore Run", "nextattempt": "23.63 secs", "timeremaining": "6.90 days", "totalframes": 0, "framesdone": 0, "assigned": "2020-04-12T16:19:29Z", "timeout": "2020-04-13T16:19:29Z", "deadline": "2020-04-19T16:19:29Z", "ws": "128.252.203.2", "cs": "52.224.109.74", "attempts": 4, "slot": "00", "tpf": "14 mins 24 secs", "basecredit": "202"}
]
---
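For anyone scripting around this, the queue-info output above can be checked mechanically. Below is a minimal Python sketch, assuming the text after the `PyON 1 units` header is a literal that `ast.literal_eval` can read (true for the sample above). The `stuck_units` helper and its `min_attempts` threshold are hypothetical and not part of FAHClient.

```python
import ast

# Abridged copy of the queue-info payload above (the list after "PyON 1 units").
QUEUE_INFO = '''[
  {"id": "00", "state": "READY", "error": "NO_ERROR", "project": 14587,
   "run": 12, "clone": 0, "gen": 23, "core": "0xa7",
   "percentdone": "0.00%", "waitingon": "FahCore Run", "attempts": 4,
   "slot": "00"}
]'''


def stuck_units(payload, min_attempts=3):
    """Return units that are still at 0% after several start attempts.

    min_attempts is an arbitrary illustrative threshold (an assumption).
    """
    units = ast.literal_eval(payload)
    return [
        u for u in units
        if u["state"] == "READY"
        and u["percentdone"] == "0.00%"
        and u["attempts"] >= min_attempts
    ]


for unit in stuck_units(QUEUE_INFO):
    print(f"project {unit['project']} (run {unit['run']}, clone {unit['clone']}, "
          f"gen {unit['gen']}): {unit['attempts']} attempts, "
          f"waiting on {unit['waitingon']}")
```

A unit that stays in READY at 0.00% while its attempts counter climbs is the pattern seen here: the core is launched repeatedly and exits before doing any work.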

Here is the most recent logfile_01.txt from /var/lib/fahclient/work/00:

*********************** Log Started 2020-04-12T18:44:10Z ***********************
************************** Gromacs Folding@home Core ***************************
       Type: 0xa7
       Core: Gromacs
       Args: -dir 00 -suffix 01 -version 705 -lifeline 3190 -checkpoint 15 -np
             72
************************************ CBang *************************************
       Date: Nov 5 2019
       Time: 06:06:57
   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
     Branch: master
   Compiler: GNU 8.3.0
    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
   Platform: linux2 4.19.0-5-amd64
       Bits: 64
       Mode: Release
************************************ System ************************************
        CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
     CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
       CPUs: 72
     Memory: 141.64GiB
Free Memory: 138.66GiB
    Threads: POSIX_THREADS
 OS Version: 5.0
Has Battery: false
 On Battery: false
 UTC Offset: 0
        PID: 3194
        CWD: /var/lib/fahclient/work
******************************** Build - libFAH ********************************
    Version: 0.0.18
     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
  Copyright: 2019 foldingathome.org
   Homepage: https://foldingathome.org/
       Date: Nov 5 2019
       Time: 06:13:26
   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
     Branch: master
   Compiler: GNU 8.3.0
    Options: -std=c++11 -O3 -funroll-loops -fno-pie
   Platform: linux2 4.19.0-5-amd64
       Bits: 64
       Mode: Release
************************************ Build *************************************
       SIMD: avx_256
********************************************************************************
Project: 14587 (Run 12, Clone 0, Gen 23)
Unit: 0x0000001b80fccb025e8bcf5a61620012
Reading tar file core.xml
Reading tar file frame23.tpr
Digital signatures verified
Calling: mdrun -s frame23.tpr -o frame23.trr -x frame23.xtc -cpt 15 -nt 72
Steps: first=5750000 total=250000
ERROR:
ERROR:-------------------------------------------------------
ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902       
ERROR:
ERROR:Fatal error:
ERROR:There is no domain decomposition for 54 ranks that is compatible with the given box and a minimum cell size of 1.46925 nm
ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
ERROR:Look in the log file for details on the domain decomposition
ERROR:For more information and tips for troubleshooting, please check the GROMACS
ERROR:website at http://www.gromacs.org/Documentation/Errors
ERROR:-------------------------------------------------------
WARNING:Unexpected exit() call
WARNING:Unexpected exit from science code
Saving result file ../logfile_01.txt
Saving result file md.log
Saving result file science.log
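
The fatal error in this log can be illustrated with a simplified feasibility check. The Python sketch below assumes a rectangular box and only asks whether the rank count can be split into an nx * ny * nz grid whose cells are all at least the minimum cell size; real GROMACS domain decomposition considers more than this. The 6.0 nm cubic box is a made-up value (the actual box size is not in the log), and `feasible_grids` is a hypothetical helper. The 54 in the error is presumably the number of ranks left for domain decomposition after GROMACS set some of the 72 threads aside for PME.

```python
def feasible_grids(nranks, box_nm, min_cell_nm):
    """List (nx, ny, nz) grids with nx*ny*nz == nranks whose cells are
    at least min_cell_nm wide along every axis of a rectangular box."""
    grids = []
    for nx in range(1, nranks + 1):
        if nranks % nx:
            continue
        rest = nranks // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            nz = rest // ny
            cells = (nx, ny, nz)
            # Every axis must be long enough for its share of cells.
            if all(length / n >= min_cell_nm
                   for length, n in zip(box_nm, cells)):
                grids.append(cells)
    return grids


MIN_CELL_NM = 1.46925       # taken from the error message
BOX_NM = (6.0, 6.0, 6.0)    # ASSUMPTION: illustrative box, not from the log

for ranks in (54, 48, 36, 32):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

With these assumed numbers, each axis fits at most four cells, so 54 ranks has no valid grid while 48, 36, and 32 do. That is the general reason a lower core count can get such a unit started.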
Joe_H
Site Admin
Posts: 7927
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Post by Joe_H »

Try running the WU on fewer CPU cores; core counts that are multiples of 2 and 3 work best. Sometimes multiples of 5 will work.

I will pass this information on. There are limited chances to test on more than 32 cores before release, so they may not have limits set to keep WUs from being assigned to CPU thread counts that will not work.
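
As a rough illustration of this rule of thumb, the Python sketch below lists thread counts up to a limit whose only prime factors are 2 and 3 (optionally 5). `smooth_counts` is a hypothetical helper, and the rule is a heuristic rather than a guarantee: 72 itself is 2^3 * 3^2 and still failed on this WU, so the list is only a set of candidates to try.

```python
def smooth_counts(max_threads, primes=(2, 3)):
    """Return counts <= max_threads, largest first, whose prime factors
    all come from `primes`."""
    result = []
    for n in range(max_threads, 0, -1):
        m = n
        # Divide out every allowed prime; anything left over is a
        # factor outside the allowed set.
        for p in primes:
            while m % p == 0:
                m //= p
        if m == 1:
            result.append(n)
    return result


print(smooth_counts(72)[:8])
print(smooth_counts(72, primes=(2, 3, 5))[:8])
```

Working down such a list from the machine's full core count is one way to find the largest setting a given WU will accept.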

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3

Post by texmandie »

I already killed that virtual machine, but if I get stuck again, I'll try that to get the WU through.

Post by texmandie »

I had another non-starter, this time Project 16427. Running

FAHClient --send-command "option power light"

got it going, using about half of the available cores. Once this work unit is through, I'll bump it back up to full.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Post by Neil-B »

Are you a registered beta tester (see viewtopic.php?f=16&t=8%20)? If not, don't use the beta client-type flag; no support is offered for beta outside the beta forum. As far as I am aware, 14587 is still in beta, as is 16427.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)

Post by texmandie »

No, I'm not, and I'm taking that option out of my config files now that I know.

Post by Neil-B »

Setting client-type to advanced will give you the nearest equivalent: WUs that are post-beta but pre-full-release. You will get support, and you will help find WUs that struggle with big core counts :)

Post by texmandie »

Excellent! I'm doing all my folding on well-priced Azure spot VMs, so if something crashes or hangs, I'm only disappointed about what it does to F@H progress, not my own work. My rationale for folding in the cloud: I did the math on the electricity consumption and performance of my hardware at the current domestic electricity price, and realized that the best-priced spot VMs were literally cheaper than leaving my own machines on when I'm not using them.

Post by Neil-B »

You may want to have a look at this: viewtopic.php?f=16&t=34137&p=324550&hil ... ai#p324550. Reports are that it is fairly cheap (and GPU). Personally I know zero about it, but thought you might like to read it if nothing else.

Post by texmandie »

*Fascinating* - and as a DevOps engineer mostly focused on helping developers deal with containers and Kubernetes these days, I can see ways to make it even more efficient for F@H :)

I thought I'd gotten things as efficient as possible on Azure spot GPU instances, but I switched focus to optimizing for CPU WUs because so much time was spent waiting for new GPU WUs.