Project: 14531 (Run 0, Clone 1305, Gen 17)

Moderators: Site Moderators, FAHC Science Team

AkrionXxarr
Posts: 4
Joined: Thu Apr 16, 2020 8:21 am

Post by AkrionXxarr »

This WU appears to have stalled out on one of my Linux servers. You have a lot of dead links in your "Troubleshooting WUs" sticky (such as links to a non-existent wiki and to your FAH-specific CPU stressing software), so I'm not exactly sure where to start with regard to troubleshooting on my end.

Here's the log:

08:14:31:WU00:FS00:Starting
08:14:31:WU00:FS00:Removing old file './work/00/logfile_01-20200416-074250.txt'
08:14:31:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 704 -lifeline 513 -checkpoint 15 -np 8
08:14:31:WU00:FS00:Started FahCore on PID 855
08:14:31:WU00:FS00:Core PID:859
08:14:31:WU00:FS00:FahCore 0xa7 started
08:14:31:WU00:FS00:0xa7:*********************** Log Started 2020-04-16T08:14:31Z ***********************
08:14:31:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:14:31:WU00:FS00:0xa7:       Type: 0xa7
08:14:31:WU00:FS00:0xa7:       Core: Gromacs
08:14:31:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 704 -lifeline 855 -checkpoint 15 -np 8
08:14:31:WU00:FS00:0xa7:************************************ CBang *************************************
08:14:31:WU00:FS00:0xa7:       Date: Nov 5 2019
08:14:31:WU00:FS00:0xa7:       Time: 06:06:57
08:14:31:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:14:31:WU00:FS00:0xa7:     Branch: master
08:14:31:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
08:14:31:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:14:31:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:14:31:WU00:FS00:0xa7:       Bits: 64
08:14:31:WU00:FS00:0xa7:       Mode: Release
08:14:31:WU00:FS00:0xa7:************************************ System ************************************
08:14:31:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
08:14:31:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
08:14:31:WU00:FS00:0xa7:       CPUs: 8
08:14:31:WU00:FS00:0xa7:     Memory: 15.63GiB
08:14:31:WU00:FS00:0xa7:Free Memory: 15.40GiB
08:14:31:WU00:FS00:0xa7:    Threads: POSIX_THREADS
08:14:31:WU00:FS00:0xa7: OS Version: 4.9
08:14:31:WU00:FS00:0xa7:Has Battery: false
08:14:31:WU00:FS00:0xa7: On Battery: false
08:14:31:WU00:FS00:0xa7: UTC Offset: -7
08:14:31:WU00:FS00:0xa7:        PID: 859
08:14:31:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
08:14:31:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
08:14:31:WU00:FS00:0xa7:    Version: 0.0.18
08:14:31:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:14:31:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:14:31:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
08:14:31:WU00:FS00:0xa7:       Date: Nov 5 2019
08:14:31:WU00:FS00:0xa7:       Time: 06:13:26
08:14:31:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:14:31:WU00:FS00:0xa7:     Branch: master
08:14:31:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
08:14:31:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:14:31:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:14:31:WU00:FS00:0xa7:       Bits: 64
08:14:31:WU00:FS00:0xa7:       Mode: Release
08:14:31:WU00:FS00:0xa7:************************************ Build *************************************
08:14:31:WU00:FS00:0xa7:       SIMD: avx_256
08:14:31:WU00:FS00:0xa7:********************************************************************************
08:14:31:WU00:FS00:0xa7:Project: 14531 (Run 0, Clone 1305, Gen 17)
08:14:31:WU00:FS00:0xa7:Unit: 0x0000001a80fccb0a5e6978bc26ce4efd
08:14:31:WU00:FS00:0xa7:Digital signatures verified
08:14:31:WU00:FS00:0xa7:Calling: mdrun -s frame17.tpr -o frame17.trr -cpi state.cpt -cpt 15 -nt 8
08:14:31:WU00:FS00:0xa7:Steps: first=4250000 total=250000
08:14:34:WU00:FS00:0xa7:Completed 223571 out of 250000 steps (89%)
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:36:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:14:36:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:Fatal error:
08:14:36:WU00:FS00:0xa7:ERROR:7 particles communicated to PME rank 0 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
08:14:36:WU00:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
08:14:36:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:14:36:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:36:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:14:36:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:Fatal error:
08:14:36:WU00:FS00:0xa7:ERROR:2 particles communicated to PME rank 7 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
08:14:36:WU00:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
08:14:36:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:14:36:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:41:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
This machine has been chewing through plenty of WUs for the past few weeks so either there's an issue with the WU or my machine is giving up the ghost (it's not overclocked). It runs a minimal GUI-less install of Debian 9 and has been running F@H exclusively.
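
For anyone triaging a similar log, here is a minimal Python sketch that pulls the distinct core error messages and the FahCore return status out of lines in the format shown above. The names `summarize`, `LINE_RE` and `RETURN_RE` are hypothetical helpers, not part of the FAH client, and the patterns only assume the line layout seen in this particular log.

```python
import re

# Assumed layout, taken from the log above:
#   HH:MM:SS:WUnn:FSnn:<payload>
LINE_RE = re.compile(r"^(\d\d:\d\d:\d\d):WU(\d+):FS(\d+):(.*)$")
# Matches e.g. "FahCore returned: INTERRUPTED (102 = 0x66)"
RETURN_RE = re.compile(r"FahCore returned: (\w+) \((\d+) = (0x[0-9a-fA-F]+)\)")


def summarize(lines):
    """Return (distinct core error messages, (status name, code) or None)."""
    errors, status = [], None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a work-unit line
        payload = m.group(4)
        ret = RETURN_RE.search(payload)
        if ret:
            status = (ret.group(1), int(ret.group(2)))
            continue
        # Core messages look like "0xa7:ERROR:<text>".
        if ":ERROR:" in payload:
            text = payload.split(":ERROR:", 1)[1].strip()
            # Skip blank lines, dashed rules, and repeats.
            if text and not text.startswith("---") and text not in errors:
                errors.append(text)
    return errors, status
```

Feeding it the lines of the log above would give the GROMACS fatal-error text once per distinct message, plus the `INTERRUPTED` status and its numeric code.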
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by PantherX »

Welcome to the F@H Forum AkrionXxarr,

Apologies for the outdated Sticky; a new version that fixes it will be published by the end of this week.

I have to say that this is a different type of domain decomposition issue that I haven't seen before. You're folding with 8 CPUs which is a stable number. It seems that you did manage to fold until 89% and then something didn't go to plan. Do you know if there were any changes made to the system or was it restarted?

It would be a good idea to get the first ~100 lines from your log file which will include the system configuration and the client settings.
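
One way to grab those lines on a headless Linux box is sketched below in Python. The helper name `first_lines` is hypothetical, and the path in the usage comment is an assumption: it supposes the client writes `log.txt` into its working directory, which the log above reports as `/var/lib/fahclient`.

```python
from itertools import islice


def first_lines(path, n=100):
    """Return the first n lines of a text file without reading the rest."""
    # errors="replace" keeps going if the log contains stray bytes.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return list(islice(fh, n))


# Usage (path is an assumption, adjust to your install):
#   print("".join(first_lines("/var/lib/fahclient/log.txt", 100)))
```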
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
AkrionXxarr
Posts: 4
Joined: Thu Apr 16, 2020 8:21 am

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by AkrionXxarr »

Thanks for the welcome! And no worries about the sticky.

I hadn't made any changes to the machine, no. It stalled overnight. I had reset the server after a pause/unpause failed to work, in case it needed a reboot after running for ~3 weeks, but that didn't fix it. I had the FAH client paused while waiting for a reply to this thread, and when I went to unpause it the WU was immediately sent (presumably at 89%) and I got a new one, so it doesn't look like I can generate any further logs from that particular WU.

The log I pasted in my initial post was what the advanced client control let me copy. I just went into the server and reloaded the client, so hopefully this is what you were looking for:

*********************** Log Started 2020-04-16T10:31:40Z ***********************
10:31:40:************************* Folding@home Client *************************
10:31:40:    Website: http://folding.stanford.edu/
10:31:40:  Copyright: (c) 2009-2014 Stanford University
10:31:40:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:31:40:       Args: --child --lifeline 511 /etc/fahclient/config.xml --run-as fahclient
10:31:40:             --pid-file=/var/run/fahclient.pid --daemon
10:31:40:     Config: /etc/fahclient/config.xml
10:31:40:******************************** Build ********************************
10:31:40:    Version: 7.4.4
10:31:40:       Date: Mar 4 2014
10:31:40:       Time: 12:02:38
10:31:40:    SVN Rev: 4130
10:31:40:     Branch: fah/trunk/client
10:31:40:   Compiler: GNU 4.4.7
10:31:40:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
10:31:40:             -fno-unsafe-math-optimizations -msse2
10:31:40:   Platform: linux2 3.2.0-1-amd64
10:31:40:       Bits: 64
10:31:40:       Mode: Release
10:31:40:******************************* System ********************************
10:31:40:        CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
10:31:40:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
10:31:40:       CPUs: 8
10:31:40:     Memory: 15.63GiB
10:31:40:Free Memory: 15.42GiB
10:31:40:    Threads: POSIX_THREADS
10:31:40: OS Version: 4.9
10:31:40:Has Battery: false
10:31:40: On Battery: false
10:31:40: UTC Offset: -7
10:31:40:        PID: 1046
10:31:40:        CWD: /var/lib/fahclient
10:31:40:         OS: Linux 4.9.0-3-amd64 x86_64
10:31:40:    OS Arch: AMD64
10:31:40:       GPUs: 1
10:31:40:      GPU 0: NVIDIA:2 GF116 [GeForce GTX 550 Ti] 691
10:31:40:       CUDA: Not detected
10:31:40:***********************************************************************
10:31:40:<config>
10:31:40:  <!-- HTTP Server -->
10:31:40:  <allow v='192.168.0.0/24'/>
10:31:40:
10:31:40:  <!-- Network -->
10:31:40:  <proxy v=':8080'/>
10:31:40:
10:31:40:  <!-- Remote Command Server -->
10:31:40:  <command-allow-no-pass v='192.168.0.0/24'/>
10:31:40:
10:31:40:  <!-- Slot Control -->
10:31:40:  <power v='full'/>
10:31:40:
10:31:40:  <!-- User Information -->
10:31:40:  <user v='AkrionXxarr'/>
10:31:40:
10:31:40:  <!-- Folding Slots -->
10:31:40:  <slot id='0' type='CPU'>
10:31:40:    <paused v='true'/>
10:31:40:  </slot>
10:31:40:</config>
10:31:40:Switching to user fahclient
10:31:40:Trying to access database...
10:31:40:Successfully acquired database lock
10:31:40:Enabled folding slot 00: PAUSED cpu:8 (by user)
Edit:
Also, when I did some searching around for this error message prior to posting, the only stuff I could find seemed to indicate it's an issue with the simulation.
See article 4.3 on this page: http://www.gromacs.org/Documentation/Errors
(This is the best I could find with my admittedly limited research)
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by PantherX »

Using 8 CPUs and getting that error is weird. There's a possibility that it could be a bad WU.

I also noticed that you're not using a passkey, which is recommended both for security reasons and for bonus points. You can read more about it here: https://foldingathome.org/support/faq/points/passkey/
AkrionXxarr
Posts: 4
Joined: Thu Apr 16, 2020 8:21 am

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by AkrionXxarr »

Yeah I've got the passkey set up now.

In any case the machine that ran into trouble appears to have comfortably chewed through another handful of WUs (no idea how many), so I'm leaning towards this being a problem with that specific WU.
HendricksSA
Posts: 336
Joined: Fri Jun 26, 2009 4:34 am

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by HendricksSA »

I am fuzzy on the specifics of the AVX-256 core. Does it require just AVX on the processor, or does it need a higher version, like AVX2? Since AkrionXxarr has been completing WUs, I guess original AVX on the i7-2600K is enough. Can someone jog my memory?
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by Joe_H »

Original AVX supports some operations as 256-bit, and also supports SIMD-128 bit operations. AVX2 adds additional 256-bit operations.

On Intel processors, AVX has been supported as 256-bit since the Sandy Bridge Core i-series. Some of the early AMD processors that support AVX do it with 128-bit operations.
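
To check what a given Linux machine reports, a small Python sketch along these lines can read the CPU feature flags. The function name `simd_flags` is hypothetical; it assumes an x86 Linux system where `/proc/cpuinfo` contains a `flags` line, and it only reports what the kernel advertises, not how wide the hardware executes the operations internally.

```python
def simd_flags(cpuinfo_text):
    """Report whether 'avx' and 'avx2' appear in the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            return {name: name in flags for name in ("avx", "avx2")}
    # No flags line found (e.g. non-x86 system or empty input).
    return {"avx": False, "avx2": False}


# Usage on Linux:
#   with open("/proc/cpuinfo") as fh:
#       print(simd_flags(fh.read()))
```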
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
AkrionXxarr
Posts: 4
Joined: Thu Apr 16, 2020 8:21 am

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Post by AkrionXxarr »

To add some extra information, I've got a second, identical Linux machine that's been folding without issue. As of right now I've completed 582 WUs total, and I'd guess that my desktop has completed maybe 40-45% of them (it's far more powerful than the Linux machines CPU-wise and is also running GPU projects, whereas the Linux machines are limited to CPU projects), so I'd say my two Linux machines have completed roughly 160 WUs each. This is why I figured it was either an issue with the WU or a very recent issue with my hardware.
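
The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the helper name `wus_per_linux_machine` is hypothetical, and it assumes the remaining WUs are split evenly between the two Linux machines.

```python
def wus_per_linux_machine(total_wus, desktop_share, linux_machines=2):
    """Split the WUs not done by the desktop evenly across the Linux boxes."""
    return total_wus * (1 - desktop_share) / linux_machines


for share in (0.40, 0.45):
    per_machine = wus_per_linux_machine(582, share)
    print(f"desktop {share:.0%}: about {per_machine:.0f} WUs per Linux machine")
```

With a 45% desktop share this comes to about 160 WUs per machine, and with 40% about 175, so "roughly 160 each" matches the upper end of the guess.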