Project: 14004 (Run:0, Clone:115, Gen:16)

Postby Spidermaster » Tue Jan 02, 2018 8:09 pm

I have a bad Work Unit 14004 which has been hung for the better part of a day. I've restarted it three times after it hung at 99.99%. While I'll be losing up to two days processing, I thought it important to restart it the third time and track it to see what is going on.

The log reinitializes each time the unit is restarted. Here is the entire log with the unit now at 99.08%:

*********************** Log Started 2018-01-02T17:55:17Z ***********************
17:55:17:************************* Folding@home Client *************************
17:55:17: Website: http://folding.stanford.edu/
17:55:17: Copyright: (c) 2009-2014 Stanford University
17:55:17: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:55:17: Args: --open-web-control
17:55:17: Config: C:/Users/sinbad/AppData/Roaming/FAHClient/config.xml
17:55:17:******************************** Build ********************************
17:55:17: Version: 7.4.4
17:55:17: Date: Mar 4 2014
17:55:17: Time: 20:26:54
17:55:17: SVN Rev: 4130
17:55:17: Branch: fah/trunk/client
17:55:17: Compiler: Intel(R) C++ MSVC 1500 mode 1200
17:55:17: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
17:55:17: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
17:55:17: Platform: win32 XP
17:55:17: Bits: 32
17:55:17: Mode: Release
17:55:17:******************************* System ********************************
17:55:17: CPU: AMD Phenom(tm) 9150e Quad-Core Processor
17:55:17: CPU ID: AuthenticAMD Family 16 Model 2 Stepping 3
17:55:17: CPUs: 4
17:55:17: Memory: 3.75GiB
17:55:17: Free Memory: 959.96MiB
17:55:17: Threads: WINDOWS_THREADS
17:55:17: OS Version: 6.0
17:55:17: Has Battery: false
17:55:17: On Battery: false
17:55:17: UTC Offset: -5
17:55:17: PID: 6504
17:55:17: CWD: C:/Users/sinbad/AppData/Roaming/FAHClient
17:55:17: OS: Windows (TM) Vista Home Premium
17:55:17: OS Arch: AMD64
17:55:17: GPUs: 1
17:55:17: GPU 0: UNSUPPORTED: [Radeon HD 3200]
17:55:17: CUDA: Not detected
17:55:17:Win32 Service: false
17:55:17:***********************************************************************
17:55:17:<config>
17:55:17: <!-- Folding Core -->
17:55:17: <core-priority v='low'/>
17:55:17:
17:55:17: <!-- Network -->
17:55:17: <proxy v=':8080'/>
17:55:17:
17:55:17: <!-- Slot Control -->
17:55:17: <pause-on-battery v='false'/>
17:55:17: <power v='full'/>
17:55:17:
17:55:17: <!-- User Information -->
17:55:17: <passkey v='********************************'/>
17:55:17: <team v='13915'/>
17:55:17: <user v='Spidermaster'/>
17:55:17:
17:55:17: <!-- Folding Slots -->
17:55:17: <slot id='0' type='CPU'/>
17:55:17:</config>
17:55:17:Trying to access database...
17:55:17:Successfully acquired database lock
17:55:17:Enabled folding slot 00: READY cpu:4
17:55:17:WU01:FS00:Starting
17:55:17:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/sinbad/AppData/Roaming/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 01 -suffix 01 -version 704 -lifeline 6504 -checkpoint 15 -np 4
17:55:17:WU01:FS00:Started FahCore on PID 6856
17:55:17:WU01:FS00:Core PID:6080
17:55:17:WU01:FS00:FahCore 0xa4 started
17:55:17:WU01:FS00:0xa4:
17:55:17:WU01:FS00:0xa4:*------------------------------*
17:55:17:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
17:55:17:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
17:55:17:WU01:FS00:0xa4:
17:55:17:WU01:FS00:0xa4:Preparing to commence simulation
17:55:17:WU01:FS00:0xa4:- Looking at optimizations...
17:55:17:WU01:FS00:0xa4:- Files status OK
17:55:17:WU01:FS00:0xa4:- Expanded 740484 -> 1939364 (decompressed 261.9 percent)
17:55:17:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=740484 data_size=1939364, decompressed_data_size=1939364 diff=0
17:55:17:WU01:FS00:0xa4:- Digital signature verified
17:55:17:WU01:FS00:0xa4:
17:55:17:WU01:FS00:0xa4:Project: 14004 (Run 0, Clone 115, Gen 16)
17:55:17:WU01:FS00:0xa4:
17:55:18:WU01:FS00:0xa4:Assembly optimizations on if available.
17:55:18:WU01:FS00:0xa4:Entering M.D.
17:55:24:WU01:FS00:0xa4:Using Gromacs checkpoints
17:55:24:WU01:FS00:0xa4:Mapping NT from 4 to 4
17:55:25:15:127.0.0.1:New Web connection
17:59:39:WU01:FS00:0xa4:Resuming from checkpoint
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.log
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.trr
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.xtc
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.edr
18:00:26:WU01:FS00:0xa4:Completed 2358960 out of 2500000 steps (94%)
18:00:26:WARNING:WU01:FS00:Detected clock skew (5 mins 09 secs), adjusting time estimates

Please note that following the clock skew (origin unknown, as my computer clock has not been adjusted), there are NO MORE CHECKPOINTS.

If the unit hangs again, I will be forced to dump it in order to resume proper processing, in which case this log will be lost. I therefore hope this information is of use in diagnosing the problem.

The Spidermaster

P.S. I've emptied the work unit queue, but I saved all the associated files rather than deleting them. If you require these files for further analysis, please let me know.
Spidermaster
 
Posts: 7
Joined: Mon Oct 23, 2017 5:16 am

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby bruce » Wed Jan 03, 2018 3:52 am

"The log reinitializes each time the unit is restarted."
When you restart FAHClient, the old log is moved to the logs subdirectory of FAH's data directory and a new log is started. Have you looked there? Or does your method of restarting the WU wipe out the contents of that directory, too? Is that wiping out the checkpoints, too?

Apparently you're not restarting only the WU. There is no documented way to restart a WU other than to pause it ... wait ... and then resume folding it. I see no indication that you paused the WU.

Please post segments of runs from those previous logs.

When Project: 14004 (Run:0, Clone:115, Gen:16) either expires or your client reports an error, it will be reassigned to someone else. There's no error being reported.
bruce
 
Posts: 21316
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby Spidermaster » Wed Jan 03, 2018 5:21 am

Bruce,

When I said I restarted the work unit, I actually quit the Folding program and restarted it. I have found the prior logs. At one point I had reset my system clock, but that was AFTER I quit Folding, and the clock was reset to the current time BEFORE Folding was restarted. Apparently, something went wrong during that process. I completely dumped the work unit files and obtained a new work unit, but that one was not checkpointing after about six hours, so I dumped that one too. The work unit I have running now appears to be running optimally.

Sorry for the screw up. It was most likely due in some way to my resetting the clock, even though the Folding program was not running during that process.

The Spidermaster
Spidermaster
 
Posts: 7
Joined: Mon Oct 23, 2017 5:16 am

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby bruce » Wed Jan 03, 2018 5:53 pm

So in the previous logs, did you find error reports, and were any of those error reports uploaded?

Resetting the clock does affect the current WU. The download time is noted both by the server, using its clock, and by the local system, using your local clock. The time the WU spends on your system, from the download to the eventual upload, is used to calculate the bonus points, and that calculation is based on the server's clock. Whenever you reset the local clock, the client recognizes that the duration based on your local clock no longer matches the duration based on the server clock. The message reports that discrepancy and adjusts accordingly. (It doesn't matter whether a WU was actively being computed or was sitting idle on your system at the time.)
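To make the comparison described above concrete, here is a minimal sketch. It only illustrates the idea of comparing the two durations; it is not taken from the FAHClient source, and the function names and timestamps are hypothetical.

```python
from datetime import datetime


def clock_skew(server_download, server_now, local_download, local_now):
    """Return the local elapsed time minus the server elapsed time."""
    server_elapsed = server_now - server_download
    local_elapsed = local_now - local_download
    return local_elapsed - server_elapsed


def describe(skew):
    """Format a skew in the 'N mins SS secs' style seen in the log."""
    total = int(abs(skew.total_seconds()))
    return f"{total // 60} mins {total % 60:02d} secs"


# Hypothetical timestamps: the local clock was moved forward by 5 min 9 s
# while the WU was on the machine; the server clock was not.
server_download = datetime(2018, 1, 2, 12, 0, 0)
server_now = datetime(2018, 1, 2, 18, 0, 0)
local_download = datetime(2018, 1, 2, 12, 0, 0)
local_now = datetime(2018, 1, 2, 18, 5, 9)

skew = clock_skew(server_download, server_now, local_download, local_now)
print("Detected clock skew:", describe(skew))
```

With these made-up values the two durations differ by 309 seconds, which is the size of the discrepancy reported in the posted log.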

You say that no checkpoints were being recorded after 6 hours. That's possible but unlikely. In the one log you posted, I see that FAHCore_a4 reports that it had completed 2358960 out of 2500000 steps (94%) so it was able to restart from the most recent checkpoint. FAH does not issue messages saying a checkpoint was recorded. How are you determining when a checkpoint is written? Have you adjusted the default setting of 15 minutes for the checkpoint interval?
bruce
 
Posts: 21316
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby Spidermaster » Thu Jan 04, 2018 3:51 am

Bruce,

I saw no error logs (or I didn't know what I was looking for). Some work units do interfere with operation of my (old) computer due to its low physical memory and my heavy usage, so I periodically have to shut Folding down. I haven't had any problems with my Windows 10 computer (12 GB). Here are shutdown excerpts from previous logs relating to this WU. Note that in the log excerpt that starts with the 1 Jan date, no step messages (those are what I call "checkpoints" -- semantics) appear for eight hours after the 94% one, at which point I shut Folding down. At that time the console was reading 99% with 15 seconds remaining, but it never completed. So ... this actually occurred BEFORE the clock skew.

Please let me know if there is anything else I can provide. My suspicion is that the unit may have run out of memory and hung due to excessive disc swapping. That is just conjecture; I have occasionally experienced problems with other programs. I wish I had more memory on this machine (running Windows Vista), but we don't always get what we want.

The Spidermaster

******************************* Date: 2017-12-30 *******************************
15:40:57:WU01:FS00:0xa4:Completed 700000 out of 2500000 steps (28%)
16:09:35:WU01:FS00:0xa4:Completed 725000 out of 2500000 steps (29%)
16:35:45:WU01:FS00:0xa4:Completed 750000 out of 2500000 steps (30%)
17:01:04:WU01:FS00:0xa4:Completed 775000 out of 2500000 steps (31%)
17:14:25:FS00:Shutting core down
17:14:31:WU01:FS00:0xa4:Client no longer detected. Shutting down core
17:14:31:WU01:FS00:0xa4:
17:14:31:WU01:FS00:0xa4:Folding@home Core Shutdown: CLIENT_DIED
17:14:31:Clean exit


******************************* Date: 2017-12-31 *******************************
03:28:49:WU01:FS00:0xa4:Completed 1075000 out of 2500000 steps (43%)
03:54:26:WU01:FS00:0xa4:Completed 1100000 out of 2500000 steps (44%)
04:08:40:FS00:Shutting core down
04:08:49:WU01:FS00:0xa4:Client no longer detected. Shutting down core
04:08:49:WU01:FS00:0xa4:
04:08:49:WU01:FS00:0xa4:Folding@home Core Shutdown: CLIENT_DIED
04:08:50:Clean exit


******************************* Date: 2018-01-01 *******************************
18:50:27:WU01:FS00:0xa4:Completed 2100000 out of 2500000 steps (84%)
19:17:29:WU01:FS00:0xa4:Completed 2125000 out of 2500000 steps (85%)
19:43:36:WU01:FS00:0xa4:Completed 2150000 out of 2500000 steps (86%)
20:09:12:WU01:FS00:0xa4:Completed 2175000 out of 2500000 steps (87%)
20:33:55:WU01:FS00:0xa4:Completed 2200000 out of 2500000 steps (88%)
20:59:29:WU01:FS00:0xa4:Completed 2225000 out of 2500000 steps (89%)
21:24:38:WU01:FS00:0xa4:Completed 2250000 out of 2500000 steps (90%)
21:50:42:WU01:FS00:0xa4:Completed 2275000 out of 2500000 steps (91%)
22:16:44:WU01:FS00:0xa4:Completed 2300000 out of 2500000 steps (92%)
00:02:51:WU01:FS00:0xa4:Completed 2325000 out of 2500000 steps (93%)
******************************* Date: 2018-01-02 *******************************
05:47:20:FS00:Shutting core down
05:47:31:Clean exit


06:10:34:WU01:FS00:0xa4:Using Gromacs checkpoints
06:10:34:WU01:FS00:0xa4:Mapping NT from 4 to 4
07:10:53:WU01:FS00:0xa4:Resuming from checkpoint
07:11:01:WU01:FS00:0xa4:Verified 01/wudata_01.log
07:11:33:WU01:FS00:0xa4:Verified 01/wudata_01.trr
07:11:33:WU01:FS00:0xa4:Verified 01/wudata_01.xtc
07:11:33:WU01:FS00:0xa4:Verified 01/wudata_01.edr
07:11:56:WU01:FS00:0xa4:Completed 2338130 out of 2500000 steps (93%)
09:54:55:WU01:FS00:0xa4:Completed 2350000 out of 2500000 steps (94%)
******************************* Date: 2018-01-02 *******************************
17:53:25:FS00:Finishing
17:54:14:FS00:Shutting core down
17:54:20:Clean exit


17:55:24:WU01:FS00:0xa4:Using Gromacs checkpoints
17:55:24:WU01:FS00:0xa4:Mapping NT from 4 to 4
17:55:25:15:127.0.0.1:New Web connection
17:59:39:WU01:FS00:0xa4:Resuming from checkpoint
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.log
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.trr
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.xtc
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.edr
18:00:26:WU01:FS00:0xa4:Completed 2358960 out of 2500000 steps (94%)
18:00:26:WARNING:WU01:FS00:Detected clock skew (5 mins 09 secs), adjusting time estimates
21:04:27:FS00:Shutting core down
21:04:35:WU01:FS00:0xa4:Client no longer detected. Shutting down core
21:04:35:WU01:FS00:0xa4:
21:04:35:WU01:FS00:0xa4:Folding@home Core Shutdown: CLIENT_DIED
21:04:35:Clean exit
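
Long gaps between progress messages in excerpts like the ones above can be picked out with a short script. This is a minimal sketch: the regular expression is an assumption based only on the line format shown in this thread, and the function name and one-hour threshold are hypothetical. Because log lines carry only a time of day, a backwards jump is treated as crossing midnight, and time spent shut down between restarts is counted as part of a gap.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 22:16:44:WU01:FS00:0xa4:Completed 2300000 out of 2500000 steps (92%)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:0x\w+:Completed (\d+) out of (\d+) steps"
)


def progress_gaps(lines, threshold=timedelta(hours=1)):
    """Yield (previous_time, current_time, gap) for consecutive progress
    messages that are further apart than `threshold`."""
    previous = None          # datetime of the previous progress message
    previous_label = None    # its HH:MM:SS text
    last_clock = None
    day_offset = timedelta(0)
    for line in lines:
        match = PROGRESS.match(line.strip())
        if not match:
            continue
        clock = datetime.strptime(match.group(1), "%H:%M:%S")
        # A time earlier than the previous one means the log crossed midnight.
        if last_clock is not None and clock < last_clock:
            day_offset += timedelta(days=1)
        last_clock = clock
        current = clock + day_offset
        if previous is not None and current - previous > threshold:
            yield previous_label, match.group(1), current - previous
        previous, previous_label = current, match.group(1)


sample = [
    "21:50:42:WU01:FS00:0xa4:Completed 2275000 out of 2500000 steps (91%)",
    "22:16:44:WU01:FS00:0xa4:Completed 2300000 out of 2500000 steps (92%)",
    "00:02:51:WU01:FS00:0xa4:Completed 2325000 out of 2500000 steps (93%)",
]
for start, end, gap in progress_gaps(sample):
    print(f"No progress between {start} and {end} ({gap})")
```

On the three sample lines, copied from the 1 Jan excerpt, only the step from 92% to 93% exceeds the threshold.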
Spidermaster
 
Posts: 7
Joined: Mon Oct 23, 2017 5:16 am

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby bruce » Thu Jan 04, 2018 5:24 am

I think the key here is that you said "due to low physical memory and my heavy usage".

If you pretend that FAH is not running, how many hours per day does your "heavy usage" amount to? ... and what percentage of your CPU resources are unused during that time?

Suppose I assume "heavy" means 75% busy and 25% idle. Then when you add FAH's processing, it will only get an average of one of your 4 CPU threads.

I notice 1) you're running a 32-bit OS, and 2) you have 4 CPUs and you've allocated ALL of them to FAH. You'd actually be better off to allocate only 1 of your CPUs to FAH since that's all that's going to be available anyway.

Now maybe "heavy" means that 99% of your CPU resources are busy with non-FAH activities. Then you'd be better off not trying to run FAH at all because there'd only be 1% available.
bruce
 
Posts: 21316
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby Spidermaster » Thu Jan 04, 2018 7:09 am

Bruce,

My "heavy usage" is memory-intensive rather than CPU-intensive -- just making that distinction. It's webmaster stuff rather than number-crunching stuff. I'll try reducing the number of CPUs in use for FAH and see whether that helps. Thanks!

The Spidermaster
Spidermaster
 
Posts: 7
Joined: Mon Oct 23, 2017 5:16 am

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby bruce » Thu Jan 04, 2018 7:38 pm

Memory-intensive work with 4 GB of real RAM is going to push things into virtual memory, increasing the load on your paging device. For most people, that's not too much of a problem, but you might monitor it and see how your system performs. I generally don't recommend adding RAM or moving the paging file to a faster device, but in your self-described situation, it MIGHT help. A 64-bit OS might help, too, as I expect 32-bit OSs will continue to receive less and less support. You're trying to do a lot with that system.
bruce
 
Posts: 21316
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby Spidermaster » Fri Jan 05, 2018 6:42 am

Bruce,

Roger that! You're absolutely right. Fortunately, I am also folding on my 12 gig Windows 10 system, which works flawlessly (and much faster, with an Intel i5).

A follow-up to the previous conversation: I had another WU failure, after which I rebooted the machine and have had no further problems thus far. Here is the pertinent portion of the log:

00:08:51:WU00:FS00:0xa4:Completed 237500 out of 250000 steps (95%)
00:43:00:WU00:FS00:0xa4:Completed 240000 out of 250000 steps (96%)
01:09:48:WU00:FS00:0xa4:Completed 242500 out of 250000 steps (97%)
01:18:00:WU00:FS00:0xa4:Completed 245000 out of 250000 steps (98%)
01:26:20:WU00:FS00:0xa4:Completed 247500 out of 250000 steps (99%)
01:26:22:WU01:FS00:Connecting to 171.67.108.45:8080
01:26:24:WU01:FS00:Assigned to work server 155.247.166.220
01:26:24:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 155.247.166.220
01:26:24:WU01:FS00:Connecting to 155.247.166.220:8080
01:26:25:WU01:FS00:Downloading 344.20KiB
01:26:25:WU01:FS00:Download complete
01:26:25:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8613 run:9695 clone:3 gen:25 core:0xa4 unit:0x0000001e0002894c5a011011184f3c56
01:35:09:WU00:FS00:0xa4:Completed 250000 out of 250000 steps (100%)
01:35:26:WU00:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
01:35:36:WU00:FS00:0xa4:
01:35:36:WU00:FS00:0xa4:Finished Work Unit:
01:35:36:WU00:FS00:0xa4:- Reading up to 811536 from "00/wudata_01.trr": Read 811536
01:35:36:WU00:FS00:0xa4:trr file hash check passed.
01:35:36:WU00:FS00:0xa4:- Reading up to 746060 from "00/wudata_01.xtc": Read 746060
01:35:36:WU00:FS00:0xa4:xtc file hash check passed.
01:35:36:WU00:FS00:0xa4:edr file hash check passed.
01:35:36:WU00:FS00:0xa4:logfile size: 28730
01:35:36:WU00:FS00:0xa4:Leaving Run
01:35:37:WU00:FS00:0xa4:- Writing 1588814 bytes of core data to disk...
01:35:38:WU00:FS00:0xa4:Done: 1588302 -> 1538819 (compressed to 96.8 percent)
01:35:38:WU00:FS00:0xa4: ... Done.
01:35:39:WU00:FS00:0xa4:- Shutting down core
01:35:39:WU00:FS00:0xa4:
01:35:39:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
01:35:42:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:35:42:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:9032 run:895 clone:0 gen:848 core:0xa4 unit:0x000003a1ab436c9e569832387f65831f
01:35:42:WU00:FS00:Uploading 1.47MiB to 171.67.108.158
01:35:42:WU00:FS00:Connecting to 171.67.108.158:8080
01:35:42:WU01:FS00:Starting
01:35:42:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/sinbad/AppData/Roaming/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 01 -suffix 01 -version 704 -lifeline 4772 -checkpoint 15 -np 4
01:35:43:WU01:FS00:Started FahCore on PID 7468
01:35:43:WU01:FS00:Core PID:2700
01:35:43:WU01:FS00:FahCore 0xa4 started
01:35:44:WU01:FS00:0xa4:
01:35:44:WU01:FS00:0xa4:*------------------------------*
01:35:44:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
01:35:44:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
01:35:44:WU01:FS00:0xa4:
01:35:44:WU01:FS00:0xa4:Preparing to commence simulation
01:35:44:WU01:FS00:0xa4:- Looking at optimizations...
01:35:44:WU01:FS00:0xa4:- Created dyn
01:35:44:WU01:FS00:0xa4:- Files status OK
01:35:44:WU01:FS00:0xa4:- Expanded 351949 -> 596960 (decompressed 169.6 percent)
01:35:44:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=351949 data_size=596960, decompressed_data_size=596960 diff=0
01:35:44:WU01:FS00:0xa4:- Digital signature verified
01:35:44:WU01:FS00:0xa4:
01:35:44:WU01:FS00:0xa4:Project: 8613 (Run 9695, Clone 3, Gen 25)
01:35:44:WU01:FS00:0xa4:
01:35:44:WU01:FS00:0xa4:Assembly optimizations on if available.
01:35:44:WU01:FS00:0xa4:Entering M.D.
01:35:46:WU00:FS00:Upload complete
01:35:46:WU00:FS00:Server responded WORK_ACK (400)
01:35:46:WU00:FS00:Final credit estimate, 530.00 points
01:35:47:WU00:FS00:Cleaning up
01:35:49:WU01:FS00:0xa4:Mapping NT from 4 to 4
01:37:03:WU01:FS00:0xa4:Completed 0 out of 2500000 steps (0%)
******************************* Date: 2018-01-04 *******************************
17:12:55:FS00:Shutting core down
17:13:01:WU01:FS00:0xa4:Client no longer detected. Shutting down core
17:13:02:WU01:FS00:0xa4:
17:13:02:WU01:FS00:0xa4:Folding@home Core Shutdown: CLIENT_DIED
17:13:02:Clean exit

The item I would like you to note (near the end of the log snippet) is that the work unit ran nearly 16 hours without a step increment. The Advanced Control application listed the WU as 99.99% complete with 5 seconds remaining, but of course it never finished. It was clear at this point that something was wrong in the machine, so I rebooted, and I am pretty sure it is the same work unit that started back up and resumed normal processing.

You might want to investigate whether there is something in the Folding code that allows (some sort of) processing to continue without stepping. As to what might cause this on my machine, apart from low memory I can think of one other thing -- Adobe Flash. I have had significant problems with Flash and try to keep it off my machine, but of course that is not always possible, as I need it to display certain info. I have found that Flash has a nasty habit of hanging -- especially if multiple instances are running simultaneously. When this happens, it wreaks havoc with my system, sometimes slowing it to a dead crawl. (If I remember to check for it, I can kill it and things will return to normal.)

I was having noticeable slowing of processes in the machine during the time that the WU was not functioning nominally; sadly, I did not think to check whether Flash was active before I rebooted. Should this happen again, I will try to remember to investigate that. In the meantime, the current WU is at 78% and stepping properly in the log, so I am confident that things have returned to normal. Sorry for all the hassle, and I appreciate your responses. Take care.

The Spidermaster
Spidermaster
 
Posts: 7
Joined: Mon Oct 23, 2017 5:16 am

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby bruce » Fri Jan 05, 2018 10:31 pm

There have been reports similar to yours but nobody has come up with an explanation -- nor has anybody been able to come up with a method that can reproduce it.

The apparent hang at 99.99% is a meaningless piece of information. Suppose a WU writes a 1% progress message every 5 minutes. After 50 minutes of uniform production, we can presume there's a message reporting 10% complete and we can project that it will be finished after 500 minutes. Now suppose after 450 minutes we see the last progress message saying it is 90% complete. Now, assume that for some unknown reason the WU hangs. The Advanced Control program will still assume the projected rate of 1% per 5 minutes and it will continue right up to 99.99% even though no progress is being made. In this scenario, there MIGHT have been an error message after 450 minutes or there might have been no message -- except for the missing progress messages. The only other clue is that if the WU is paused, then work is resumed, it will start from about 90% -- whenever the actual hang occurred.
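
The extrapolation in that scenario can be written out as a small sketch. This is only an illustration of the arithmetic above, not the Advanced Control program's actual code; the function name, its parameters, and the 99.99 cap are assumptions chosen to match the behavior described in this thread.

```python
def displayed_percent(last_percent, last_report_min, now_min,
                      minutes_per_percent=5.0, cap=99.99):
    """Project the displayed progress from the last real progress message,
    assuming the WU keeps advancing at the previously observed rate."""
    elapsed = now_min - last_report_min
    projected = last_percent + elapsed / minutes_per_percent
    # The display never reaches 100% on its own (assumed cap).
    return min(projected, cap)


# Last real message: 90% at minute 450, after which the WU hangs.
for minute in (450, 475, 500, 1000):
    print(minute, displayed_percent(90, 450, minute))
```

The displayed value keeps climbing after minute 450 and then sits at the cap indefinitely, even though the last real progress message said 90%.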

You're reporting hangs with Flash. It's not really far-fetched to assume your system is unstable and whatever is causing Flash to hang is also causing FAH to hang. The point is that we don't have any kind of explanation for what caused the hangs.
bruce
 
Posts: 21316
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 14004 (Run:0, Clone:115, Gen:16)

Postby SteveWillis » Fri Jan 05, 2018 10:43 pm

I ran into the same issue of the 99.99% hang on my Linux machine, and it was also resolved by a reboot. Unfortunately I didn't see it for about 12 hours. So I added code to my auto-reboot script to reboot if an FS (folding slot) goes over 10 minutes without incrementing. It hasn't happened again though.
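
A check of this kind might look like the following minimal sketch. It is not the actual script mentioned above; the class name and interface are hypothetical, and how the progress value is obtained from the client and what is done when a stall is detected are left to the caller.

```python
class StallWatchdog:
    """Flag a folding slot whose progress value has stopped changing."""

    def __init__(self, limit_seconds=600):
        self.limit_seconds = limit_seconds
        self._last_progress = None
        self._last_change = None

    def observe(self, progress, now):
        """Record a progress reading taken at time `now` (in seconds).
        Return True when the reading has not changed for longer than
        the limit, False otherwise."""
        if progress != self._last_progress:
            self._last_progress = progress
            self._last_change = now
            return False
        return now - self._last_change > self.limit_seconds


watchdog = StallWatchdog(limit_seconds=600)
readings = [(94, 0), (94, 300), (94, 601), (95, 700)]  # (percent, seconds)
for percent, seconds in readings:
    print(seconds, percent, watchdog.observe(percent, seconds))
```

The limit has to be longer than the normal interval between progress changes. On the CPU slot logged earlier in this thread, the 1% messages arrive roughly 25 minutes apart, so a 10-minute limit applied to those whole-percent values would trigger on a healthy WU.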
folding as DarthMouse_ALL_1GD5nCZbh7gNo1SESPLT24xEd2Jsu4rTP9
Currently folding on 14 GPUs on Linux Mint
If you wish to discuss a comment or post I've made feel free to PM me.
SteveWillis
 
Posts: 347
Joined: Fri Apr 15, 2016 12:42 am

