10127 (Run 48, Clone 1, Gen 121)

woschl
Posts: 3
Joined: Mon Mar 04, 2013 10:27 am

10127 (Run 48, Clone 1, Gen 121)

Post by woschl »

This WU seems to have problems communicating at the 50% point. The folding thread seems to either continue or hang, using 50% of the idle CPU cycles; the other 50% seems to go to the communication attempts, which bogs down the computer.

08:32:57:WU01:FS00:0xa3:Completed 920000 out of 2000000 steps  (46%)
10:35:14:WU01:FS00:0xa3:Completed 940000 out of 2000000 steps  (47%)
13:27:56:WU01:FS00:0xa3:Completed 960000 out of 2000000 steps  (48%)
******************************** Date: 03/03/13 ********************************
15:43:44:WU01:FS00:0xa3:Completed 980000 out of 2000000 steps  (49%)
17:44:36:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
17:44:37:Server connection id=1 ended
18:58:45:WU01:FS00:0xa3:Completed 1000000 out of 2000000 steps  (50%)
19:00:17:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1
19:00:27:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1
19:00:38:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1
19:00:49:Server connection id=6 on 0.0.0.0:36330 from 127.0.0.1
19:01:00:Server connection id=7 on 0.0.0.0:36330 from 127.0.0.1
19:01:11:Server connection id=8 on 0.0.0.0:36330 from 127.0.0.1
19:01:22:Server connection id=9 on 0.0.0.0:36330 from 127.0.0.1
19:01:33:Server connection id=10 on 0.0.0.0:36330 from 127.0.0.1
19:01:44:Server connection id=11 on 0.0.0.0:36330 from 127.0.0.1

There is nothing unusual going on on this client. I did a reboot.

I did look at the status of the servers, but all seem to be OK.

Eventually the log switches from connection messages to an error:

07:27:35:Server connection id=1505 on 0.0.0.0:36330 from 127.0.0.1
07:27:46:Server connection id=1506 on 0.0.0.0:36330 from 127.0.0.1
07:27:57:Server connection id=1507 on 0.0.0.0:36330 from 127.0.0.1
07:28:08:Server connection id=1508 on 0.0.0.0:36330 from 127.0.0.1
07:28:19:ERROR:Exception: Error creating thread
07:28:24:ERROR:Exception: Error creating thread
07:28:29:ERROR:Exception: Error creating thread
07:28:34:ERROR:Exception: Error creating thread
07:28:39:ERROR:Exception: Error creating thread

I've only completed about 60 WUs and this is my first problem, so if you need more information or I'm in the wrong place, please let me know.

woschl

Configuration:

*********************** Log Started 2013-03-04T02:54:44Z ***********************
02:54:44:************************* Folding@home Client *************************
02:54:44:      Website: http://folding.stanford.edu/
02:54:44:    Copyright: (c) 2009-2012 Stanford University
02:54:44:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:54:44:         Args: --lifeline 3380 --command-port=36330
02:54:44:       Config: C:/Users/woschl/AppData/Roaming/FAHClient/config.xml
02:54:44:******************************** Build ********************************
02:54:44:      Version: 7.2.9
02:54:44:         Date: Oct 3 2012
02:54:44:         Time: 18:05:48
02:54:44:      SVN Rev: 3578
02:54:44:       Branch: fah/trunk/client
02:54:44:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
02:54:44:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
02:54:44:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
02:54:44:     Platform: win32 XP
02:54:44:         Bits: 32
02:54:44:         Mode: Release
02:54:44:******************************* System ********************************
02:54:44:          CPU: AMD Athlon(tm) II X2 240e Processor
02:54:44:       CPU ID: AuthenticAMD Family 16 Model 6 Stepping 2
02:54:44:         CPUs: 2
02:54:44:       Memory: 3.75GiB
02:54:44:  Free Memory: 2.10GiB
02:54:44:      Threads: WINDOWS_THREADS
02:54:44:   On Battery: false
02:54:44:   UTC offset: 1
02:54:44:          PID: 3628
02:54:44:          CWD: C:/Users/woschl/AppData/Roaming/FAHClient
02:54:44:           OS: Windows 7 Home Premium
02:54:44:      OS Arch: AMD64
02:54:44:         GPUs: 1
02:54:44:        GPU 0: UNSUPPORTED: RS880 [Radeon HD 4200]
02:54:44:         CUDA: Not detected
02:54:44:Win32 Service: false
02:54:44:***********************************************************************
02:54:45:<config>
02:54:45:  <!-- Folding Slot Configuration -->
02:54:45:  <gpu v='true'/>
02:54:45:
02:54:45:  <!-- Network -->
02:54:45:  <proxy v=':8080'/>
02:54:45:
02:54:45:  <!-- User Information -->
02:54:45:  <user v='SemiAnonymousWolf'/>
02:54:45:
02:54:45:  <!-- Folding Slots -->
02:54:45:  <slot id='0' type='SMP'/>
02:54:45:</config>
02:54:45:Trying to access database...
02:54:45:Successfully acquired database lock
02:54:45:Enabled folding slot 00: READY smp:2
02:54:45:WU01:FS00:Starting
02:54:45:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/woschl/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 01 -suffix 01 -version 702 -lifeline 3628 -checkpoint 15 -np 2
02:54:45:WU01:FS00:Started FahCore on PID 3780
02:54:45:WU01:FS00:Core PID:3348
02:54:45:WU01:FS00:FahCore 0xa3 started
02:54:47:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1
02:54:47:WU01:FS00:0xa3:
02:54:47:WU01:FS00:0xa3:*------------------------------*
02:54:47:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
02:54:47:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
02:54:47:WU01:FS00:0xa3:
02:54:47:WU01:FS00:0xa3:Preparing to commence simulation
02:54:47:WU01:FS00:0xa3:- Ensuring status. Please wait.
02:54:56:WU01:FS00:0xa3:- Looking at optimizations...
02:54:56:WU01:FS00:0xa3:- Working with standard loops on this execution.
02:54:56:WU01:FS00:0xa3:- Previous termination of core was improper.
02:54:56:WU01:FS00:0xa3:- Files status OK
02:54:57:WU01:FS00:0xa3:- Expanded 2039537 -> 3061060 (decompressed 150.0 percent)
02:54:57:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=2039537 data_size=3061060, decompressed_data_size=3061060 diff=0
02:54:57:WU01:FS00:0xa3:- Digital signature verified
02:54:57:WU01:FS00:0xa3:
02:54:57:WU01:FS00:0xa3:Project: 10127 (Run 48, Clone 1, Gen 121)
02:54:57:WU01:FS00:0xa3:
02:54:57:WU01:FS00:0xa3:Entering M.D.
02:55:03:WU01:FS00:0xa3:Using Gromacs checkpoints
02:55:03:WU01:FS00:0xa3:Mapping NT from 2 to 2 
02:55:07:WU01:FS00:0xa3:Resuming from checkpoint
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.log
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.trr
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.xtc
02:55:07:WU01:FS00:0xa3:Verified 01/wudata_01.edr
02:55:08:WU01:FS00:0xa3:Completed 1018212 out of 2000000 steps  (50%)
02:55:20:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
02:55:31:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by PantherX »

Welcome to the F@H Forum woschl,

Please note that this isn't a bad WU, since it has already been completed by another donor. Since this is the first problem you have encountered on your system, it may just be a random event, so you can ignore it. However, if this happens frequently, it might be an indication that something is wrong with your system.
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by Joe_H »

I don't know if this is related to the problem you are seeing, but I have seen a similar connection issue when using the V7 client on OS X. The software has a limit on how many connections can be open simultaneously; once that limit is reached, it will not accept any more. In your selection of the log messages I am only seeing connections being made, none being closed. The cause on my system was a bug in a third-party monitoring app: it kept opening new connections and eventually used up all that were available. The only cure I found in my case was to completely shut down folding and restart it. In some cases it took a reboot. So that is what I can suggest to clear this up.
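One way to check the "connections being made, none being closed" observation against a log excerpt is to tally the `Server connection id=N` lines. The sketch below is only an illustration: it assumes the two message shapes seen in the logs above (`... on 0.0.0.0:36330 from 127.0.0.1` for an open and `... ended` for a close), the function name `summarize` is made up for this example, and it does not handle timestamps that wrap past midnight.

```python
import re
from datetime import datetime

# Message shapes assumed from the log excerpts in this thread.
OPENED = re.compile(r"^(\d\d:\d\d:\d\d):Server connection id=(\d+) on ")
ENDED = re.compile(r"^(\d\d:\d\d:\d\d):Server connection id=(\d+) ended")


def summarize(lines):
    """Return (ids opened but never reported ended, seconds between opens)."""
    opened, ended, open_times = set(), set(), []
    for line in lines:
        m = OPENED.match(line)
        if m:
            opened.add(int(m.group(2)))
            open_times.append(datetime.strptime(m.group(1), "%H:%M:%S"))
            continue
        m = ENDED.match(line)
        if m:
            ended.add(int(m.group(2)))
    # Gap between each open and the next one, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(open_times, open_times[1:])]
    return sorted(opened - ended), gaps


# Lines copied from the first excerpt in the original post.
SAMPLE = [
    "17:44:37:Server connection id=1 ended",
    "18:58:45:WU01:FS00:0xa3:Completed 1000000 out of 2000000 steps  (50%)",
    "19:00:17:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:27:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:38:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1",
    "19:00:49:Server connection id=6 on 0.0.0.0:36330 from 127.0.0.1",
]

still_open, gaps = summarize(SAMPLE)
print("opened but never ended:", still_open)
print("seconds between opens:", gaps)
```

On these sample lines the gaps come out at 10 to 11 seconds with no matching `ended` lines, which fits the picture of something opening connections on a timer and never closing them.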

As for the source of the problem on your Windows system, there have been some reports in the past of connections being lost between the active components of the F@H software. It does not happen often, so the investigation has had few examples from which to determine where the bug is and how to fix it. There was a ticket on that; if I can find it, I will add it to this post.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
woschl
Posts: 3
Joined: Mon Mar 04, 2013 10:27 am

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by woschl »

Joe_H wrote: The cause on my system was a bug in a third-party monitoring app: it kept opening new connections and eventually used up all that were available. The only cure I found in my case was to completely shut down folding and restart it.
Good lead. I had to kill the FAH Client and close Microsoft's Process Explorer. After restarting FAH, everything runs as expected, for now. Strange that rebooting didn't clear the problem, but maybe the monitoring app starting up before FAH prevented something from resetting.

I will keep an eye on it with regard to the monitoring app and communications.

woschl
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by bruce »

You're running V7.2.9 rather than the latest, which is V7.3.6. I have no idea if that matters, but it is something you can try if the problem comes back.

It does resemble Joe_H's problem on OS-X in that, for some unknown reason, connections are being created at a rapid rate. Based on my experience with older versions, a new connection is created when FAHControl establishes communications with FAHClient. If I connect from both a local FAHControl and a remote FAHControl, I get two connections, and each gets closed when I stop the applicable copy of FAHControl. I presume the same is true with WebControl.

If we assume that some 3rd party monitoring tool is also running, it may be logging on to FAHClient and not logging off, thereby eating up all the connections. Unfortunately I don't know any way to prove or disprove that guess. The messages don't provide enough detail to tell.

Kevin: Do stale connections time out eventually?
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by P5-133XL »

Just as an independent question: he has a Core_A3 and is running Windows. How is that possible with V7.x? I thought A3s on Windows needed MPI, which was abandoned after v6. I know that Linux uses Core_A3s, but he's running XP.

Just a peculiarity that I noticed.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by bruce »

The Windows MPI code was used initially but it was unreliable. The A3 core was rewritten to use the functionality of Windows Threads.

I don't remember the exact sequence of events nor can I state categorically what happened in Linux/OS-X but I think it probably still uses MPI since that's a fundamental capability of most Distros. That may also be related to the outcome of core A5, but I know even less about it.

The code did get adopted by a later version of Gromacs which then got used in Core A4.

I have no reason to believe this has anything at all to do with the telnet connections being established between different software components.
calxalot
Site Moderator
Posts: 894
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by calxalot »

To the best of my knowledge, the client does not clean up stale connections unless there is a heartbeat/update scheduled. The client doesn't know the connection is gone until it tries to send data on the socket and fails.

If a third-party app disconnects and does not send a 'quit' command or have updates scheduled, then it can use up the available socket connections in the client.
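For authors of monitoring tools, one way to avoid this is to make the 'quit' unconditional, so it is sent even when the tool's own code fails partway through. The sketch below is a hedged illustration, not the client's documented protocol: only the 'quit' command and the local command port 36330 come from this thread; the newline terminator, the helper name `polite_session`, and the placeholder command in the usage comment are assumptions.

```python
import socket
from contextlib import contextmanager


@contextmanager
def polite_session(sock):
    """Yield a connected socket and always say goodbye before closing it."""
    try:
        yield sock
    finally:
        try:
            # 'quit' is the command named in the post above; the trailing
            # newline is an assumption about how commands are terminated.
            sock.sendall(b"quit\n")
        except OSError:
            pass  # The peer is already gone; nothing left to tell it.
        finally:
            sock.close()


# Hypothetical usage against a local client (port taken from the log above):
#
#     with polite_session(socket.create_connection(("127.0.0.1", 36330))) as s:
#         s.sendall(b"some-command\n")  # placeholder, not a real command name
```

Wrapping the session in a context manager means the goodbye and the close happen on every exit path, including exceptions raised inside the `with` body.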

It may be possible that FAHControl opens a connection and an error occurs before it can set up the heartbeat. Not sure.

There should probably be a ticket for the client to always close inactive connections. A 1-minute timeout might be reasonable, since apps should typically set up a 5-second heartbeat if they want to keep the connection open.
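The timeout idea can be sketched as a small bookkeeping table. This is only an illustration of the suggestion above, not how the client is implemented: the class `ConnectionTable` and its method names are hypothetical, and the 60-second limit is the figure proposed in this post.

```python
import time


class ConnectionTable:
    """Track the last activity per connection and drop the idle ones."""

    def __init__(self, idle_timeout=60.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock  # Injectable so the logic can be tested without waiting.
        self.last_seen = {}

    def opened(self, conn_id):
        self.last_seen[conn_id] = self.clock()

    def activity(self, conn_id):
        # Called whenever data (for example a heartbeat update) moves on the socket.
        if conn_id in self.last_seen:
            self.last_seen[conn_id] = self.clock()

    def closed(self, conn_id):
        self.last_seen.pop(conn_id, None)

    def reap(self):
        """Forget connections idle for longer than the timeout; return their ids."""
        now = self.clock()
        stale = [c for c, t in self.last_seen.items() if now - t > self.idle_timeout]
        for c in stale:
            del self.last_seen[c]
        return stale
```

With a 5-second heartbeat a healthy connection refreshes its timestamp about twelve times per timeout window, so only connections that have gone silent would be returned by `reap()`. A real server would also close the underlying socket for each returned id.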

Edit: a relevant ticket is 932

It's important to know if a third-party monitoring app is being used, or if it is FAHControl that somehow causes this.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by bruce »

In ticket #932, it's apparent that WUdget is being used. The website for WUdget says it is no longer being developed. Nevertheless, WUdget 1.5.7k contains the problem and beta 1.5.7L does not, so the best guess is to have everybody use revision L. Unfortunately, Stanford cannot be responsible for 3rd-party apps. Does anybody know how to contact the author and politely ask him to stop distributing revision K?
calxalot
Site Moderator
Posts: 894
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by calxalot »

F@H WUdget is for OSX, so not relevant in this case.

The question is if there are any other apps that might have the same issue, or if it is FAHControl that is causing this somehow.
woschl
Posts: 3
Joined: Mon Mar 04, 2013 10:27 am

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by woschl »

It happened again, around the 66% mark, but this time without any monitoring app running that I'm aware of (except AV). Next I will update to 7.3.6, unless switching versions during an incomplete WU is not recommended.

woschl
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by Joe_H »

In this case it looks like the problem is within one of the components of the F@H client, and that is triggering the bug that runs it out of connections. Updating may help. Most people have not had any problems with switching versions in the middle of a WU, so you should be okay. There are some changes between 7.2.9 and 7.3.6 due to the change in focus on default system usage, so I would recommend reading the release notes and the forum posts on the newer version if you haven't already.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Jesse_V
Site Moderator
Posts: 2851
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: 10127 (Run 48, Clone 1, Gen 121)

Post by Jesse_V »

Joe_H wrote:There are some changes between 7.2.9 and 7.3.6 due to the change in focus on default system usage, I would recommend reading the release notes and the posts on the forum on the newer version if you haven't already.
The software FAQs at http://folding.stanford.edu/English/FAQ would be an excellent place to start too.