project:13126 run:77 clone:7 gen:23

Posted: Sat Mar 04, 2017 7:05 pm
by davidcoton
This, then the slot hung...

16:37:18:WU01:FS00:Connecting to 171.67.108.45:8080
16:37:19:WU01:FS00:Assigned to work server 171.67.108.101
16:37:19:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 171.67.108.101
16:37:19:WU01:FS00:Connecting to 171.67.108.101:8080
16:37:21:WU01:FS00:Downloading 2.95MiB
16:37:24:WU01:FS00:Download complete
16:37:25:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13126 run:77 clone:7 gen:23 core:0xa7 unit:0x0000001bab436c655898c998e9514d98

16:39:05:WU01:FS00:Starting
16:39:05:WU00:FS00:Connecting to 171.67.108.101:8080
16:39:05:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\david\AppData\Roaming\FAHClient\cores/fahwebx.stanford.edu/cores/Win32/AMD64/AVX/beta/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 704 -lifeline 6472 -checkpoint 15 -np 4
16:39:06:WU01:FS00:Started FahCore on PID 6664
16:39:06:WU01:FS00:Core PID:3376
16:39:06:WU01:FS00:FahCore 0xa7 started
16:39:07:WU01:FS00:0xa7:*********************** Log Started 2017-03-04T16:39:07Z ***********************
16:39:07:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
16:39:07:WU01:FS00:0xa7:       Type: 0xa7
16:39:07:WU01:FS00:0xa7:       Core: Gromacs
16:39:07:WU01:FS00:0xa7:    Website: http://folding.stanford.edu/
16:39:07:WU01:FS00:0xa7:  Copyright: (c) 2009-2016 Stanford University
16:39:07:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
16:39:07:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 704 -lifeline 6664 -checkpoint 15 -np 4
16:39:07:WU01:FS00:0xa7:     Config: <none>
16:39:07:WU01:FS00:0xa7:************************************ Build *************************************
16:39:07:WU01:FS00:0xa7:    Version: 0.0.11
16:39:07:WU01:FS00:0xa7:       Date: Sep 21 2016
16:39:07:WU01:FS00:0xa7:       Time: 01:43:48
16:39:07:WU01:FS00:0xa7: Repository: Git
16:39:07:WU01:FS00:0xa7:   Revision: 957bd90e68d95ddcf1594dc15ff6c64cc4555146
16:39:07:WU01:FS00:0xa7:     Branch: master
16:39:07:WU01:FS00:0xa7:   Compiler: GNU 4.2.1 Compatible Clang 3.9.0 (trunk 274080)
16:39:07:WU01:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
16:39:07:WU01:FS00:0xa7:             -fno-unsafe-math-optimizations -msse2 -I/mingw64/include
16:39:07:WU01:FS00:0xa7:             -Wno-inconsistent-dllimport -Wno-parentheses-equality
16:39:07:WU01:FS00:0xa7:             -Wno-deprecated-register -Wno-unused-local-typedef
16:39:07:WU01:FS00:0xa7:   Platform: linux2 4.6.0-1-amd64
16:39:07:WU01:FS00:0xa7:       Bits: 64
16:39:07:WU01:FS00:0xa7:       Mode: Release
16:39:07:WU01:FS00:0xa7:       SIMD: avx_256
16:39:07:WU01:FS00:0xa7:************************************ System ************************************
16:39:07:WU01:FS00:0xa7:        CPU: Intel(R) Core(TM) i5-6400 CPU @ 2.70GHz
16:39:07:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
16:39:07:WU01:FS00:0xa7:       CPUs: 4
16:39:07:WU01:FS00:0xa7:     Memory: 15.91GiB
16:39:07:WU01:FS00:0xa7:Free Memory: 12.70GiB
16:39:07:WU01:FS00:0xa7:    Threads: WINDOWS_THREADS
16:39:07:WU01:FS00:0xa7: OS Version: 6.2
16:39:07:WU01:FS00:0xa7:Has Battery: false
16:39:07:WU01:FS00:0xa7: On Battery: false
16:39:07:WU01:FS00:0xa7: UTC Offset: 0
16:39:07:WU01:FS00:0xa7:        PID: 3376
16:39:07:WU01:FS00:0xa7:        CWD: C:\Users\david\AppData\Roaming\FAHClient\work
16:39:07:WU01:FS00:0xa7:         OS: Windows 10 Home
16:39:07:WU01:FS00:0xa7:    OS Arch: AMD64
16:39:07:WU01:FS00:0xa7:********************************************************************************
16:39:07:WU01:FS00:0xa7:Project: 13126 (Run 77, Clone 7, Gen 23)
16:39:07:WU01:FS00:0xa7:Unit: 0x0000001bab436c655898c998e9514d98
16:39:07:WU01:FS00:0xa7:Reading tar file core.xml
16:39:07:WU01:FS00:0xa7:Reading tar file frame23.tpr
16:39:07:WU01:FS00:0xa7:Digital signatures verified
16:39:07:WU01:FS00:0xa7:Calling: mdrun -s frame23.tpr -o frame23.trr -cpt 15 -nt 4
16:39:08:WU01:FS00:0xa7:Steps: first=4600000 total=200000
16:39:10:WU01:FS00:0xa7:Completed 1 out of 200000 steps (0%)

16:40:55:WU01:FS00:0xa7:Completed 2000 out of 200000 steps (1%)

16:42:39:WU01:FS00:0xa7:Completed 4000 out of 200000 steps (2%)
16:44:22:WU01:FS00:0xa7:Completed 6000 out of 200000 steps (3%)
...
17:51:31:WU01:FS00:0xa7:Completed 84000 out of 200000 steps (42%)
17:52:22:Removing old file 'configs/config-20170220-213437.xml'
17:52:22:Saving configuration to config.xml
17:52:22:<config>
17:52:22:  <!-- HTTP Server -->
17:52:22:  <allow v='127.0.0.1 192.168.1.0/24'/>
17:52:22:
17:52:22:  <!-- Network -->
17:52:22:  <proxy v=':8080'/>
17:52:22:
17:52:22:  <!-- Remote Command Server -->
17:52:22:  <password v='*******'/>
17:52:22:
17:52:22:  <!-- Slot Control -->
17:52:22:  <power v='FULL'/>
17:52:22:
17:52:22:  <!-- User Information -->
17:52:22:  <passkey v='********************************'/>
17:52:22:  <user v='davidcoton'/>
17:52:22:
17:52:22:  <!-- Folding Slots -->
17:52:22:  <slot id='0' type='CPU'>
17:52:22:    <client-type v=redacted/>
17:52:22:    <cpus v='4'/>
17:52:22:    <project-key v=redacted/>
17:52:22:  </slot>
17:52:22:  <slot id='1' type='GPU'/>
17:52:22:</config>
17:53:10:Removing old file 'configs/config-20170227-230643.xml'
17:53:10:Saving configuration to config.xml
17:53:10:<config>
17:53:10:  <!-- HTTP Server -->
17:53:10:  <allow v='127.0.0.1 192.168.1.0/24'/>
17:53:10:
17:53:10:  <!-- Network -->
17:53:10:  <proxy v=':8080'/>
17:53:10:
17:53:10:  <!-- Remote Command Server -->
17:53:10:  <password v='*******'/>
17:53:10:
17:53:10:  <!-- Slot Control -->
17:53:10:  <power v='FULL'/>
17:53:10:
17:53:10:  <!-- User Information -->
17:53:10:  <passkey v='********************************'/>
17:53:10:  <user v='davidcoton'/>
17:53:10:
17:53:10:  <!-- Folding Slots -->
17:53:10:  <slot id='0' type='CPU'>
17:53:10:    <client-type v=redacted/>
17:53:10:    <cpus v='4'/>
17:53:10:    <project-key v=redacted/>
17:53:10:  </slot>
17:53:10:  <slot id='1' type='GPU'/>
17:53:10:</config>
17:53:14:WU01:FS00:0xa7:Completed 86000 out of 200000 steps (43%)
17:54:57:WU01:FS00:0xa7:Completed 88000 out of 200000 steps (44%)
17:55:35:WU01:FS00:0xa7:ERROR:
17:55:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:55:35:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20160919-669094a-unknown
17:55:35:WU01:FS00:0xa7:ERROR:Source code file: /host/windows-cross-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
17:55:35:WU01:FS00:0xa7:ERROR:
17:55:35:WU01:FS00:0xa7:ERROR:Fatal error:
17:55:35:WU01:FS00:0xa7:ERROR:46 particles communicated to PME rank 1 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
17:55:35:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
17:55:35:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
17:55:35:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
17:55:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:55:35:WU01:FS00:0xa7:ERROR:
17:55:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:55:35:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20160919-669094a-unknown
17:55:35:WU01:FS00:0xa7:ERROR:Source code file: /host/windows-cross-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
17:55:35:WU01:FS00:0xa7:ERROR:
17:55:35:WU01:FS00:0xa7:ERROR:Fatal error:
17:55:35:WU01:FS00:0xa7:ERROR:24 particles communicated to PME rank 2 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
17:55:35:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
17:55:35:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
17:55:35:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
17:55:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
The reconfiguration should not be relevant; it was preparing for a test WU next. No current settings were changed.

Re: project:13126 run:77 clone:7 gen:23

Posted: Tue Mar 07, 2017 9:17 pm
by nhstanley
Hi David,

This error is likely due to the system becoming unstable for some reason. Sometimes it's a hardware/overclocking issue, sometimes the files get corrupted at some point, or it could be that some setting is a bit too relaxed (such as temperature damping). Usually re-running from the last WU gets around the issue. It shouldn't have hung or caused your system any issues, however. We can take a look if that was the case.

Also, 13126 has been off testing for a few weeks now, so you shouldn't have gotten it if you were using a project key. Let me know if I'm (not) reading your config correctly.

Nate

Re: project:13126 run:77 clone:7 gen:23

Posted: Tue Mar 07, 2017 9:33 pm
by davidcoton
Thanks, Nate.
I suspected it was just a simulation run in trouble. The project key was applied after this WU started, so it should not be relevant. My PC did not hang, only the slot (i.e. it did not upload and then get a new WU, nor did it show an error state). From memory, I think the partial WU uploaded after I restarted the slot (I think it was only the slot, not the whole client). If it happens again I'll try to record the details properly. If it's important I can probably find the log of the restart.
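
For recording the details next time, something like the following might help. It is only a rough sketch in Python, not part of FAHClient: the function name `summarize_core_log` and the returned dictionary layout are made up for illustration, and the line pattern is assumed from the log excerpt in the first post (`HH:MM:SS:WUnn:FSnn:0xa7:message`). It keeps the last "Completed N out of M steps" line for each WU/slot and the message that follows each "Fatal error:" line.

```python
import re

# Line shape assumed from the log posted above: time, WU id, slot id, core, message.
CORE_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):(WU\d+):(FS\d+):(0x[0-9a-fA-F]+):(.*)$"
)
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")


def summarize_core_log(lines):
    """Return last progress and fatal-error messages per (WU, slot).

    Hypothetical helper for illustration; the output layout is an assumption.
    """
    summary = {}
    awaiting_fatal = set()  # (WU, slot) pairs that just logged "Fatal error:"
    for raw in lines:
        match = CORE_LINE.match(raw.strip())
        if not match:
            continue  # client-level lines (config dumps, connections) are skipped
        time, wu, slot, _core, msg = match.groups()
        key = (wu, slot)
        entry = summary.setdefault(
            key, {"last_step": None, "total": None, "fatal": []}
        )
        progress = PROGRESS.search(msg)
        if progress:
            entry["last_step"] = int(progress.group(1))
            entry["total"] = int(progress.group(2))
        elif msg.strip() == "ERROR:Fatal error:":
            awaiting_fatal.add(key)
        elif msg.startswith("ERROR:") and key in awaiting_fatal:
            # The first ERROR line after "Fatal error:" carries the description.
            entry["fatal"].append((time, msg[len("ERROR:"):].strip()))
            awaiting_fatal.discard(key)
    return summary
```

Feeding it the lines of log.txt (for example `summarize_core_log(open("log.txt"))`, with the path adjusted to wherever the client keeps its log) should show how far the WU got before the error and which PME ranks complained.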

Re: project:13126 run:77 clone:7 gen:23

Posted: Tue Mar 07, 2017 10:21 pm
by nhstanley
Ok, that sounds good. I don't think it will help to dig up the log unless we see this issue again. It was likely an unlucky failure, but we'll keep an eye out. Thanks.