7004 (Run 3, Clone 303, Gen 76) Linux client/core reconnects

Moderators: Site Moderators, FAHC Science Team

Ken_g6
Posts: 17
Joined: Sat Dec 26, 2009 9:23 pm

7004 (Run 3, Clone 303, Gen 76) Linux client/core reconnects

Post by Ken_g6 »

That topic line is way too short.

Anyway, I'm running the v7 client on Linux (and the v6 GPU client separately, if that matters). Since this WU started, my client has continually tried to reconnect to my core, using 100% of one CPU core, and it fails most of the time:

Code: Select all

13:09:15:WU01:FS00:Starting
13:09:15:WU00:FS00:Connecting to 171.67.108.59:8080
13:09:15:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /media/hdd/home/ken/.FAHClient/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 01 -suffix 01 -version 702 -lifeline 17065 -checkpoint 15 -np 4
13:09:15:WU01:FS00:Started FahCore on PID 14083
13:09:15:WU01:FS00:Core PID:14087
13:09:15:WU01:FS00:FahCore 0xa4 started
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:*------------------------------*
13:09:16:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
13:09:16:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:Preparing to commence simulation
13:09:16:WU01:FS00:0xa4:- Looking at optimizations...
13:09:16:WU01:FS00:0xa4:- Created dyn
13:09:16:WU01:FS00:0xa4:- Files status OK
13:09:16:WU01:FS00:0xa4:- Expanded 39691 -> 204900 (decompressed 516.2 percent)
13:09:16:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=39691 data_size=204900, decompressed_data_size=204900 diff=0
13:09:16:WU01:FS00:0xa4:- Digital signature verified
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:Project: 7004 (Run 3, Clone 303, Gen 76)
13:09:16:WU01:FS00:0xa4:
13:09:16:WU01:FS00:0xa4:Assembly optimizations on if available.
13:09:16:WU01:FS00:0xa4:Entering M.D.
13:09:21:WU00:FS00:Upload 32.61%
13:09:21:WU01:FS00:0xa4:Completed 0 out of 10000000 steps  (0%)
13:09:27:WU00:FS00:Upload 65.22%
13:09:33:WU00:FS00:Upload 97.83%
13:09:35:WU00:FS00:Upload complete
13:09:35:WU00:FS00:Server responded WORK_ACK (400)
13:09:35:WU00:FS00:Final credit estimate, 975.00 points
13:09:35:WU00:FS00:Cleaning up
13:09:46:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
13:09:56:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1
13:10:07:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1
13:10:18:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1
13:10:28:Server connection id=6 on 0.0.0.0:36330 from 127.0.0.1
13:10:39:Server connection id=7 on 0.0.0.0:36330 from 127.0.0.1
13:10:50:Server connection id=8 on 0.0.0.0:36330 from 127.0.0.1
13:11:00:Server connection id=9 on 0.0.0.0:36330 from 127.0.0.1
13:11:11:Server connection id=10 on 0.0.0.0:36330 from 127.0.0.1
13:11:21:Server connection id=11 on 0.0.0.0:36330 from 127.0.0.1
13:11:32:Server connection id=12 on 0.0.0.0:36330 from 127.0.0.1
13:11:43:Server connection id=13 on 0.0.0.0:36330 from 127.0.0.1
13:11:53:Server connection id=14 on 0.0.0.0:36330 from 127.0.0.1
13:12:04:Server connection id=15 on 0.0.0.0:36330 from 127.0.0.1
13:12:15:Server connection id=16 on 0.0.0.0:36330 from 127.0.0.1
13:12:25:Server connection id=17 on 0.0.0.0:36330 from 127.0.0.1
13:12:36:Server connection id=18 on 0.0.0.0:36330 from 127.0.0.1
13:12:46:Server connection id=19 on 0.0.0.0:36330 from 127.0.0.1
13:12:57:Server connection id=20 on 0.0.0.0:36330 from 127.0.0.1
13:13:08:Server connection id=21 on 0.0.0.0:36330 from 127.0.0.1
13:13:18:Server connection id=22 on 0.0.0.0:36330 from 127.0.0.1
13:13:29:Server connection id=23 on 0.0.0.0:36330 from 127.0.0.1
13:13:40:Server connection id=24 on 0.0.0.0:36330 from 127.0.0.1
13:13:50:Server connection id=25 on 0.0.0.0:36330 from 127.0.0.1
13:14:01:Server connection id=26 on 0.0.0.0:36330 from 127.0.0.1
13:14:11:Server connection id=27 on 0.0.0.0:36330 from 127.0.0.1
13:14:22:Server connection id=28 on 0.0.0.0:36330 from 127.0.0.1
13:14:33:Server connection id=29 on 0.0.0.0:36330 from 127.0.0.1
13:14:43:Server connection id=30 on 0.0.0.0:36330 from 127.0.0.1
13:14:54:Server connection id=31 on 0.0.0.0:36330 from 127.0.0.1
13:15:05:Server connection id=32 on 0.0.0.0:36330 from 127.0.0.1
13:15:15:Server connection id=33 on 0.0.0.0:36330 from 127.0.0.1
13:15:26:Server connection id=34 on 0.0.0.0:36330 from 127.0.0.1
13:15:37:Server connection id=35 on 0.0.0.0:36330 from 127.0.0.1
13:15:47:Server connection id=36 on 0.0.0.0:36330 from 127.0.0.1
13:15:58:Server connection id=37 on 0.0.0.0:36330 from 127.0.0.1
13:16:08:Server connection id=38 on 0.0.0.0:36330 from 127.0.0.1
13:16:08:WU01:FS00:0xa4:Completed 100000 out of 10000000 steps  (1%)
13:16:19:Server connection id=39 on 0.0.0.0:36330 from 127.0.0.1
13:16:30:Server connection id=40 on 0.0.0.0:36330 from 127.0.0.1
13:16:40:Server connection id=41 on 0.0.0.0:36330 from 127.0.0.1
13:16:51:Server connection id=42 on 0.0.0.0:36330 from 127.0.0.1
13:17:02:Server connection id=43 on 0.0.0.0:36330 from 127.0.0.1
13:17:12:Server connection id=44 on 0.0.0.0:36330 from 127.0.0.1
13:17:23:Server connection id=45 on 0.0.0.0:36330 from 127.0.0.1
13:17:33:Server connection id=46 on 0.0.0.0:36330 from 127.0.0.1
13:17:44:Server connection id=47 on 0.0.0.0:36330 from 127.0.0.1
13:17:55:Server connection id=48 on 0.0.0.0:36330 from 127.0.0.1
13:18:05:Server connection id=49 on 0.0.0.0:36330 from 127.0.0.1
13:18:16:Server connection id=50 on 0.0.0.0:36330 from 127.0.0.1
13:18:27:Server connection id=51 on 0.0.0.0:36330 from 127.0.0.1
13:18:38:Server connection id=52 on 0.0.0.0:36330 from 127.0.0.1
13:18:48:Server connection id=53 on 0.0.0.0:36330 from 127.0.0.1
13:18:59:Server connection id=54 on 0.0.0.0:36330 from 127.0.0.1
13:19:10:Server connection id=55 on 0.0.0.0:36330 from 127.0.0.1
13:19:20:Server connection id=56 on 0.0.0.0:36330 from 127.0.0.1
13:19:31:Server connection id=57 on 0.0.0.0:36330 from 127.0.0.1
13:19:42:Server connection id=58 on 0.0.0.0:36330 from 127.0.0.1
13:19:52:Server connection id=59 on 0.0.0.0:36330 from 127.0.0.1
13:20:03:Server connection id=60 on 0.0.0.0:36330 from 127.0.0.1
13:20:14:Server connection id=61 on 0.0.0.0:36330 from 127.0.0.1
13:20:24:Server connection id=62 on 0.0.0.0:36330 from 127.0.0.1
13:20:35:Server connection id=63 on 0.0.0.0:36330 from 127.0.0.1
13:20:46:Server connection id=64 on 0.0.0.0:36330 from 127.0.0.1
13:20:56:Server connection id=65 on 0.0.0.0:36330 from 127.0.0.1
13:21:07:Server connection id=66 on 0.0.0.0:36330 from 127.0.0.1
13:21:17:Server connection id=67 on 0.0.0.0:36330 from 127.0.0.1
13:21:28:Server connection id=68 on 0.0.0.0:36330 from 127.0.0.1
13:21:39:Server connection id=69 on 0.0.0.0:36330 from 127.0.0.1
13:21:49:Server connection id=70 on 0.0.0.0:36330 from 127.0.0.1
13:22:00:Server connection id=71 on 0.0.0.0:36330 from 127.0.0.1
13:22:11:Server connection id=72 on 0.0.0.0:36330 from 127.0.0.1
13:22:21:Server connection id=73 on 0.0.0.0:36330 from 127.0.0.1
13:22:32:Server connection id=74 on 0.0.0.0:36330 from 127.0.0.1
13:22:37:WU01:FS00:0xa4:Completed 200000 out of 10000000 steps  (2%)
13:22:42:Server connection id=75 on 0.0.0.0:36330 from 127.0.0.1
13:22:53:Server connection id=76 on 0.0.0.0:36330 from 127.0.0.1
And so on. While this is happening in the log, the client just says "Updating" all the time.
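To put a number on the reconnect cadence in the log above, here is a minimal Python sketch that measures the gap between successive "Server connection" lines. The function name `connection_intervals` and the regular expression are illustrative assumptions based only on the lines pasted here, not part of any FAH tool, and the sketch does not handle timestamps that wrap past midnight.

```python
import re
from datetime import datetime

# Matches lines such as:
# 13:09:46:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1
CONNECTION_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):Server connection id=(\d+) on ")


def connection_intervals(log_lines):
    """Return the gaps in seconds between successive 'Server connection' lines."""
    stamps = []
    for line in log_lines:
        match = CONNECTION_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    # Pair each timestamp with the next one and take the difference.
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# A few lines copied from the log above.
sample = [
    "13:09:35:WU00:FS00:Cleaning up",
    "13:09:46:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1",
    "13:09:56:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1",
    "13:10:07:Server connection id=4 on 0.0.0.0:36330 from 127.0.0.1",
    "13:10:18:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1",
]
print(connection_intervals(sample))  # [10, 11, 11]
```

On the excerpt above, a new local connection (from 127.0.0.1) is logged roughly every 10 to 11 seconds.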

For now I've suspended the client with kill -STOP to give the core more room to run. But I'd like a better solution.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512MB Ram, Video card GTX 460, Windows 7 X32

I am currently folding on just the 5x GTX 460s for approx. 70K PPD
Location: Salem. OR USA

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by P5-133XL »

Something is interfering with port 36330 and preventing the client from properly communicating with the cores. What that is, I don't know. I cannot see how a specific WU would cause this.
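One rough way to check whether anything is accepting connections on that port is a plain TCP probe. This is only a sketch: the helper name `port_accepts_connections` is hypothetical, and the defaults 127.0.0.1 and 36330 are taken from the log lines in this thread. It shows only whether something is listening, not what is interfering, and each probe would presumably show up in the client log as one more "Server connection" line.

```python
import socket


def port_accepts_connections(host="127.0.0.1", port=36330, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("port 36330 reachable:", port_accepts_connections())
```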
Ken_g6
Posts: 17
Joined: Sat Dec 26, 2009 9:23 pm

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by Ken_g6 »

Well, it ran the following projects before this one and didn't have connection issues:

Code: Select all

20:00:35:WU00:FS00:0xa4:Project: 8056 (Run 82, Clone 61, Gen 0)
02:47:30:WU01:FS00:0xa4:Project: 8027 (Run 1609, Clone 1, Gen 8)
08:05:42:WU00:FS00:0xa4:Project: 8056 (Run 76, Clone 31, Gen 57)
13:50:56:WU01:FS00:0xa4:Project: 7611 (Run 4, Clone 76, Gen 192)
15:15:19:WU01:FS00:0xa4:Project: 7611 (Run 4, Clone 76, Gen 192)
15:24:02:WU01:FS00:0xa4:Project: 7611 (Run 4, Clone 76, Gen 192)
17:59:01:WU00:FS00:0xa4:Project: 8069 (Run 0, Clone 94, Gen 26)
18:00:35:WU00:FS00:0xa4:Project: 8069 (Run 0, Clone 94, Gen 26)
18:12:04:WU00:FS00:0xa4:Project: 8069 (Run 0, Clone 94, Gen 26)
01:39:11:WU01:FS00:0xa4:Project: 8056 (Run 16, Clone 38, Gen 62)
07:30:51:WU00:FS00:0xa4:Project: 8056 (Run 16, Clone 43, Gen 63)
Hopefully it will go away after this WU. But I'd like to avoid it happening again.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by P5-133XL »

Not recommended, but you could dump the WU and see if there is a change...

If you choose to do that, document the before and after by supplying the logs, so we can file a bug report if that turns out to be the cause.

It would also be reasonable to allow others to comment before dumping. Maybe someone else has a solution.
Joe_H
Site Admin
Posts: 7867
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by Joe_H »

First, a couple of questions: are you running any third-party monitoring tools, and how long has your Linux system been running since its last restart? The reason I ask is that I came across a similar issue with the OS X client a few months ago. There, a third-party utility would open network connections to FAHClient and, under some circumstances, eventually use up the limit on open connections. The only cure was to reboot the Mac. The author of that utility did issue an updated version that fixed the problem.

If you are not using another monitoring utility it is possible that FAHControl may have got into a similar state trying to open a connection to FAHClient after a period of uptime. Closing FAHControl will not stop the FAHCore from processing, but killing the FAHClient process should. If you can issue a Restart to the FAHClient, that is what I would recommend as a minimum. Rebooting your system and restarting F@H would also do the same.
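If connections are piling up toward a limit, one thing that can be inspected on Linux is how many file descriptors a process holds. The sketch below is an assumption-laden illustration, not an FAH feature: it relies on the Linux `/proc` filesystem, the helper name `open_fd_count` is hypothetical, and it inspects its own process only because the FAHClient PID is not known here. Reading another user's process this way may need elevated permissions.

```python
import os
import resource


def open_fd_count(pid):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


# Limits of the current process; another process may have different limits.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
own_pid = os.getpid()
print(f"PID {own_pid}: {open_fd_count(own_pid)} descriptors open, "
      f"soft limit {soft_limit}")
```

Substituting the FAHClient PID for `own_pid` and repeating the count over time would show whether the number keeps climbing.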

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by bruce »

See another report of the same thing here: viewtopic.php?f=85&t=23134
Ken_g6
Posts: 17
Joined: Sat Dec 26, 2009 9:23 pm

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by Ken_g6 »

Joe_H wrote: First a couple questions, are you running any third party monitoring tools and how long has your Linux system been running since its last restart?
No third-party monitoring tools. Uptime is just under 7 days.
Joe_H wrote: If you are not using another monitoring utility it is possible that FAHControl may have got into a similar state trying to open a connection to FAHClient after a period of uptime. Closing FAHControl will not stop the FAHCore from processing, but killing the FAHClient process should.
Well, I did that. First time I did a kill -9; second time I did kill three times. Each time the client stopped, of course, as did the cores. Each time I started over with FAHControl, and launched FAHClient from there. It showed some progress, as did the log file - 25% the first time and 28% the second. But the connection problem came back within a minute.

I'm currently running the core with FAHClient stopped (with "kill -STOP"). I set a timer to run kill -CONT on it near the time when the WU should be done, and then I plan to run normally from there. Hopefully, if I get through this WU, the next one won't act like this.
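The stop-now, continue-later workaround described above can be sketched in Python with the same two signals that `kill -STOP` and `kill -CONT` send. The function name `pause_then_resume` is hypothetical, and the demonstration runs against a throwaway child process rather than a real FAHClient PID.

```python
import os
import signal
import subprocess
import sys
import time


def pause_then_resume(pid, pause_seconds):
    """Suspend a process, wait, then let it continue (kill -STOP / kill -CONT)."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(pause_seconds)
    finally:
        # Always resume, even if the wait is interrupted.
        os.kill(pid, signal.SIGCONT)


# Demonstration on a short-lived child process (an assumption for illustration).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
pause_then_resume(child.pid, 0.2)
still_running = child.poll() is None  # the child survives being paused
child.terminate()
child.wait()
print("child survived pause:", still_running)
```

For the real case, `pid` would be the FAHClient process ID and `pause_seconds` the estimated time remaining on the WU.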
Ken_g6
Posts: 17
Joined: Sat Dec 26, 2009 9:23 pm

Re: 7004 (Run 3, Clone 303, Gen 76) Linux client/core reconn

Post by Ken_g6 »

Well, the darn thing worked itself out eventually, although I had to restart FAHClient after the WU finished.

Thanks, all!