13724 2/0/62 fails to get new work

Moderators: Site Moderators, FAHC Science Team

Post Reply
verdeva
Posts: 30
Joined: Mon Dec 03, 2007 1:40 pm
Location: Seattle, WA

13724 2/0/62 fails to get new work

Post by verdeva »

This unit finishes cleanly then folding stops. No warnings​ or messages.
verdeva
Posts: 30
Joined: Mon Dec 03, 2007 1:40 pm
Location: Seattle, WA

Re: 13724 2/0/62 fails to get new work

Post by verdeva »

This reoccurred on a different machine, different work unit. Here is text from the log files.

Code: Select all

00:33:27:WU01:FS00:0xa4:Completed 2400000 out of 2500000 steps  (96%)
00:40:09:WU01:FS00:0xa4:Completed 2425000 out of 2500000 steps  (97%)
00:46:48:WU01:FS00:0xa4:Completed 2450000 out of 2500000 steps  (98%)
00:53:28:WU01:FS00:0xa4:Completed 2475000 out of 2500000 steps  (99%)
00:53:29:WU00:FS00:Connecting to 171.67.108.45:8080
01:00:12:WU01:FS00:0xa4:Completed 2500000 out of 2500000 steps  (100%)
01:00:12:WU01:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
01:00:22:WU01:FS00:0xa4:
01:00:22:WU01:FS00:0xa4:Finished Work Unit:
01:00:22:WU01:FS00:0xa4:- Reading up to 5333472 from "01/wudata_01.trr": Read 5333472
01:00:22:WU01:FS00:0xa4:trr file hash check passed.
01:00:22:WU01:FS00:0xa4:- Reading up to 332388 from "01/wudata_01.xtc": Read 332388
01:00:22:WU01:FS00:0xa4:xtc file hash check passed.
01:00:22:WU01:FS00:0xa4:edr file hash check passed.
01:00:22:WU01:FS00:0xa4:logfile size: 56991
01:00:22:WU01:FS00:0xa4:Leaving Run
01:00:23:WU01:FS00:0xa4:- Writing 5760555 bytes of core data to disk...
01:00:24:WU01:FS00:0xa4:Done: 5760043 -> 4835669 (compressed to 83.9 percent)
01:00:24:WU01:FS00:0xa4:  ... Done.
01:00:24:WU01:FS00:0xa4:- Shutting down core
01:00:24:WU01:FS00:0xa4:
01:00:24:WU01:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
01:00:25:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:00:25:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:8694 run:0 clone:277 gen:30 core:0xa4 unit:0x000000210002894c58235424910db129
01:00:25:WU01:FS00:Uploading 4.61MiB to 155.247.166.220
01:00:25:WU01:FS00:Connecting to 155.247.166.220:8080
01:00:31:WU01:FS00:Upload 14.91%
01:00:37:WU01:FS00:Upload 29.81%
01:00:43:WU01:FS00:Upload 46.07%
01:00:49:WU01:FS00:Upload 60.98%
01:00:55:WU01:FS00:Upload 75.89%
01:01:01:WU01:FS00:Upload 92.15%
01:01:05:WU01:FS00:Upload complete
01:01:05:WU01:FS00:Server responded WORK_ACK (400)
01:01:05:WU01:FS00:Final credit estimate, 9629.00 points
01:01:05:WU01:FS00:Cleaning up
******After slot ID is deleted and restored ****

Code: Select all

16:35:17:Adding folding slot 00: READY cpu:8
16:35:17:Removing old file 'configs/config-20161110-004405.xml'
16:35:17:Saving configuration to config.xml
16:35:17:<config>
16:35:17:  <!-- Folding Slot Configuration -->
16:35:17:  <client-type v='advanced'/>
16:35:17:
16:35:17:  <!-- Network -->
16:35:17:  <proxy v=':8080'/>
16:35:17:
16:35:17:  <!-- Slot Control -->
16:35:17:  <power v='full'/>
16:35:17:
16:35:17:  <!-- User Information -->
16:35:17:  <passkey v='********************************'/>
16:35:17:  <team v='33432'/>
16:35:17:  <user v='verdeva'/>
16:35:17:
16:35:17:  <!-- Folding Slots -->
16:35:17:  <slot id='0' type='CPU'/>
16:35:17:</config>
16:35:18:WU01:FS00:Connecting to 171.67.108.45:8080
16:35:18:WU01:FS00:Assigned to work server 171.67.108.101
16:35:18:WU01:FS00:Requesting new work unit for slot 00: READY cpu:8 from 171.67.108.101
16:35:18:WU01:FS00:Connecting to 171.67.108.101:8080
16:35:19:ERROR:WU01:FS00:Exception: Server did not assign work unit
16:35:19:WU01:FS00:Connecting to 171.67.108.45:8080
16:35:19:WU01:FS00:Assigned to work server 128.252.203.4
16:35:19:WU01:FS00:Requesting new work unit for slot 00: READY cpu:8 from 128.252.203.4
16:35:19:WU01:FS00:Connecting to 128.252.203.4:8080
16:35:20:WU01:FS00:Downloading 7.32MiB
16:35:23:WU01:FS00:Download complete
16:35:23:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13804 run:0 clone:3016 gen:4 core:0xa7 unit:0x0000000580fccb04590f3665a366e0a2
16:35:23:WU01:FS00:Starting
16:35:23:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/I7/AppData/Roaming/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/AVX/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 704 -lifeline 2836 -checkpoint 15 -np 8
16:35:23:WU01:FS00:Started FahCore on PID 2119116
16:35:23:WU01:FS00:Core PID:2118756
16:35:23:WU01:FS00:FahCore 0xa7 started
16:35:24:WU01:FS00:0xa7:*********************** Log Started 2017-05-08T16:35:23Z ***********************
16:35:24:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
16:35:24:WU01:FS00:0xa7:       Type: 0xa7
16:35:24:WU01:FS00:0xa7:       Core: Gromacs
16:35:24:WU01:FS00:0xa7:    Website: http://folding.stanford.edu/
16:35:24:WU01:FS00:0xa7:  Copyright: (c) 2009-2016 Stanford University
16:35:24:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
16:35:24:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 704 -lifeline 2119116 -checkpoint 15
16:35:24:WU01:FS00:0xa7:             -np 8
16:35:24:WU01:FS00:0xa7:     Config: <none>
16:35:24:WU01:FS00:0xa7:************************************ Build *************************************
16:35:24:WU01:FS00:0xa7:    Version: 0.0.11
16:35:24:WU01:FS00:0xa7:       Date: Sep 21 2016
16:35:24:WU01:FS00:0xa7:       Time: 01:43:48
16:35:24:WU01:FS00:0xa7: Repository: Git
16:35:24:WU01:FS00:0xa7:   Revision: 957bd90e68d95ddcf1594dc15ff6c64cc4555146
16:35:24:WU01:FS00:0xa7:     Branch: master
16:35:24:WU01:FS00:0xa7:   Compiler: GNU 4.2.1 Compatible Clang 3.9.0 (trunk 274080)
16:35:24:WU01:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
16:35:24:WU01:FS00:0xa7:             -fno-unsafe-math-optimizations -msse2 -I/mingw64/include
16:35:24:WU01:FS00:0xa7:             -Wno-inconsistent-dllimport -Wno-parentheses-equality
16:35:24:WU01:FS00:0xa7:             -Wno-deprecated-register -Wno-unused-local-typedef
16:35:24:WU01:FS00:0xa7:   Platform: linux2 4.6.0-1-amd64
16:35:24:WU01:FS00:0xa7:       Bits: 64
16:35:24:WU01:FS00:0xa7:       Mode: Release
16:35:24:WU01:FS00:0xa7:       SIMD: avx_256
16:35:24:WU01:FS00:0xa7:************************************ System ************************************
16:35:24:WU01:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
16:35:24:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
16:35:24:WU01:FS00:0xa7:       CPUs: 8
16:35:24:WU01:FS00:0xa7:     Memory: 7.92GiB
16:35:24:WU01:FS00:0xa7:Free Memory: 3.94GiB
16:35:24:WU01:FS00:0xa7:    Threads: WINDOWS_THREADS
16:35:24:WU01:FS00:0xa7: OS Version: 6.1
16:35:24:WU01:FS00:0xa7:Has Battery: false
16:35:24:WU01:FS00:0xa7: On Battery: false
16:35:24:WU01:FS00:0xa7: UTC Offset: -7
16:35:24:WU01:FS00:0xa7:        PID: 2118756
16:35:24:WU01:FS00:0xa7:        CWD: C:\Users\I7\AppData\Roaming\FAHClient\work
16:35:24:WU01:FS00:0xa7:         OS: Windows 7 Home Premium Service Pack 1
16:35:24:WU01:FS00:0xa7:    OS Arch: AMD64
16:35:24:WU01:FS00:0xa7:********************************************************************************
16:35:24:WU01:FS00:0xa7:Project: 13804 (Run 0, Clone 3016, Gen 4)
16:35:24:WU01:FS00:0xa7:Unit: 0x0000000580fccb04590f3665a366e0a2
16:35:24:WU01:FS00:0xa7:Reading tar file core.xml
16:35:24:WU01:FS00:0xa7:Reading tar file frame4.tpr
16:35:24:WU01:FS00:0xa7:Digital signatures verified
16:35:24:WU01:FS00:0xa7:Calling: mdrun -s frame4.tpr -o frame4.trr -x frame4.xtc -cpt 15 -nt 8
16:35:24:WU01:FS00:0xa7:Steps: first=1000000 total=250000
16:35:27:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
Mod edit: added Code tags to log listings - j
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13724 2/0/62 fails to get new work

Post by bruce »

You're having the same problem that was reported here:
Subject: No new WU downloaded
verdeva
Posts: 30
Joined: Mon Dec 03, 2007 1:40 pm
Location: Seattle, WA

Re: 13724 2/0/62 fails to get new work

Post by verdeva »

Okay, with the distinction that this problem was observed on a CPU work unit. Three in the last 24 hours.
Thanks for the great support.
Joe_H
Site Admin
Posts: 7856
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 13724 2/0/62 fails to get new work

Post by Joe_H »

At this time the client connected with an AS:

Code: Select all

00:53:29:WU00:FS00:Connecting to 171.67.108.45:8080
There is no response from that AS shown in the log. This is an example where the client failed to detect that the download attempt failed to proceed, and the slot was stuck. This can happen on any download or upload, it does not matter if it is a CPU or GPU slot. An alternative to deleting and recreating the slot s to restart the client. A reboot will do the same to reset the slot and get it to try the download again.

This is a bug that has been observed for a long time. The client version in beta testing does improve on detecting and retrying the connection, but it still has not reached 100% reliability.
Image

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13724 2/0/62 fails to get new work

Post by bruce »

The log can be confusing. It merges messages from several independent threads. When you view the log using FAHControl, you have the option of filtering the messages based on where the messages originate. In this case, messages regarding WU01 need to be (logically) separated from messages regarding WU00 since progress on one task has nothing to do with progress on the other task.

Here is how to analyze this problem:
Consider the messages from WU00 alone (You may want to gather earlier messages regarding WU00 to observe a more logical sequence of what that task was doing.)

Code: Select all

0:53:29:WU00:FS00:Connecting to 171.67.108.45:8080
(followed my no more responses...)


Also consider the messages regarding WU01 alone

Code: Select all

00:40:09:WU01:FS00:0xa4:Completed 2425000 out of 2500000 steps  (97%)
00:46:48:WU01:FS00:0xa4:Completed 2450000 out of 2500000 steps  (98%)
00:53:28:WU01:FS00:0xa4:Completed 2475000 out of 2500000 steps  (99%)
01:00:12:WU01:FS00:0xa4:Completed 2500000 out of 2500000 steps  (100%)
01:00:12:WU01:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
01:00:22:WU01:FS00:0xa4:
01:00:22:WU01:FS00:0xa4:Finished Work Unit:
01:00:22:WU01:FS00:0xa4:- Reading up to 5333472 from "01/wudata_01.trr": Read 5333472
01:00:22:WU01:FS00:0xa4:trr file hash check passed.
01:00:22:WU01:FS00:0xa4:- Reading up to 332388 from "01/wudata_01.xtc": Read 332388
01:00:22:WU01:FS00:0xa4:xtc file hash check passed.
01:00:22:WU01:FS00:0xa4:edr file hash check passed.
01:00:22:WU01:FS00:0xa4:logfile size: 56991
01:00:22:WU01:FS00:0xa4:Leaving Run
01:00:23:WU01:FS00:0xa4:- Writing 5760555 bytes of core data to disk...
01:00:24:WU01:FS00:0xa4:Done: 5760043 -> 4835669 (compressed to 83.9 percent)
01:00:24:WU01:FS00:0xa4:  ... Done.
01:00:24:WU01:FS00:0xa4:- Shutting down core
01:00:24:WU01:FS00:0xa4:
01:00:24:WU01:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
01:00:25:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:00:25:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:8694 run:0 clone:277 gen:30 core:0xa4 unit:0x000000210002894c58235424910db129
01:00:25:WU01:FS00:Uploading 4.61MiB to 155.247.166.220
01:00:25:WU01:FS00:Connecting to 155.247.166.220:8080
01:00:31:WU01:FS00:Upload 14.91%
01:00:37:WU01:FS00:Upload 29.81%
01:00:43:WU01:FS00:Upload 46.07%
01:00:49:WU01:FS00:Upload 60.98%
01:00:55:WU01:FS00:Upload 75.89%
01:01:01:WU01:FS00:Upload 92.15%
01:01:05:WU01:FS00:Upload complete
01:01:05:WU01:FS00:Server responded WORK_ACK (400)
01:01:05:WU01:FS00:Final credit estimate, 9629.00 points
01:01:05:WU01:FS00:Cleaning up
The problem with WU00 started much earlier and the word "then" in your original statement is not valid, since those two log segments are independent of each other.
verdeva wrote:This unit finishes cleanly then folding stops. No warnings​ or messages.
The processing (downloading) of WU00 was already hung (at or before 0:53:29) and folding could not proceed since it had no WUs ready to process.

The processing of WU01 proceeded to its normal conclusion.
Joe_H
Site Admin
Posts: 7856
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 13724 2/0/62 fails to get new work

Post by Joe_H »

I would add one clarification to bruce's post, at the point that one message about WU00 appears, that may be the only relevant message. WU01 had reached 99% completion, and the client started a download of a new WU at the default setting. A WU number will be used by the client that is not already in use, for a single slot system this may just alternate between WU00 and WU01. So in this case a download request was initiated by contacting an AS, a new WU number was used, and the client did not get a response back.

Under normal circumstances the client would detect that no response was received and retry the request after some some time. You could search the log in the time period after 00:53:29 until you reset things by deleting the slot and recreating it to see if any other messages showed up related to WU00. But if the bug occurred you would not find any.
Image

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Post Reply