FAH504-Linux results upload may swamp inet i/f (revised subj

Post here if you have issues with the Linux 5.0x client.

Postby lc99 » Wed Dec 05, 2007 5:20 am

Hi all, I've been running the FAH client on Linux since the end of 2004, with no problems. Currently I have FAH504-Linux running on a 2.6.22-3-k7 system (debian) ...
processor is AMD Athlon XP (2+ GHz), 971348k mem, 2996080k "swap".

I log in to the little system almost every day, and today I noticed that the system was really dragged down by something. I figured out that if I shut down the F@H client, everything goes back to normal. The weird part is that FAH (even though I run it at "nice 15") is normally the top task (as seen in "top" or "ps"), but now it looked like it wasn't really doing anything while it was running.

I have "bigpackets=yes" in my client.cfg file. Current/latest project is number 3903 (Run 2924, Clone 4, Gen 1) ("Protein: IBX in water"), core is FahCore_79. Logfile says "Folding@Home Double Gromacs Core [04:32:37] Version 1.72 (February 4, 2005)".

Sorry for being long-winded. At first I was thinking that FAH was dominating my network interface, but that isn't so likely. No syslog messages, and nothing funny in the f@h work directory. So, I'm wondering whether the resident segment size or VM size of the process became really large.

I'm gonna start it up again, try to measure the process, page i/o's, etc. and will try to post something interesting here to see why my system was getting messed up. Thanks for any tips you might have; Larry.
Last edited by lc99 on Mon Dec 31, 2007 8:12 am, edited 2 times in total.
lc99
 
Posts: 12
Joined: Wed Dec 05, 2007 4:56 am
Location: San Jose, CA

Postby v00d00 » Wed Dec 05, 2007 5:35 am

Try running:

Code: Select all
top


And see what it throws up for FAH and whether FAH is using the most CPU time.
v00d00
 
Posts: 421
Joined: Sun Dec 02, 2007 4:53 am
Location: In the UK

Postby lc99 » Wed Dec 05, 2007 5:48 am

Collected some vital statistics, am staring at this now ... ps output:
Code: Select all
  PID PRI  NI    VSZ   RSS WCHAN  S TTY          TIME COMMAND
 4274   0  15  43844 17972 stext  S ?        00:00:04 /home/lc/f@h/FAH504-Linux.exe
 4279   9  15 164048 132556 -     S ?        00:00:03 ./FahCore_79.exe -dir work/ -suffix 06 -priority 96 -checkpoint 15 -lifeline 4274 -version 504
 4281   9  15 164048 132556 -     S ?        00:00:00 ./FahCore_79.exe -dir work/ -suffix 06 -priority 96 -checkpoint 15 -lifeline 4274 -version 504
 4282   0  19 164048 132556 -     R ?        00:05:02 ./FahCore_79.exe -dir work/ -suffix 06 -priority 96 -checkpoint 15 -lifeline 4274 -version 504
 4283   9  15 164048 132556 -     S ?        00:00:00 ./FahCore_79.exe -dir work/ -suffix 06 -priority 96 -checkpoint 15 -lifeline 4274 -version 504

I started vmstat on a 60-second interval, ran f@h for a couple of minutes, then terminated it. In the following sample, the rows where the cpu "user time" (us) column reads 100 are from while f@h was running:
Code: Select all
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 671396  10196 147640    0    0     0     3   51   63  0  0 100  0
 0  0      0 671272  10224 147640    0    0     0     3   52   66  0  0 100  0
 0  0      0 671272  10232 147640    0    0     0     1   48   53  0  0 100  0
 0  0      0 671212  10248 147644    0    0     0     2   52   59  0  0 100  0
 1  0      0 521844  10396 147660    0    0     0   487   48   58 23  1 75  1
 1  0      0 520120  10516 148056    0    0     6    36   26   50 100  0  0  0
 1  0      0 519996  10548 148056    0    0     0     2   19   46 100  0  0  0
 1  0      0 519748  10596 148056    0    0     0     4   16   45 100  0  0  0
 1  0      0 520004  10608 148056    0    0     0     2   16   44 100  0  0  0
 1  0      0 457892  10640 148056    0    0     0     2   16   44 100  0  0  0
 0  0      0 670172  11244 148380    0    0    10   243   57   72 42  1 57  1
 0  0      0 670304  11284 148380    0    0     0     8   49   52  0  0 100  0
 0  0      0 670304  11304 148380    0    0     0     1   48   53  0  0 100  0

Logfile_06.txt :
Code: Select all
*------------------------------*
Folding@Home Double Gromacs Core
Version 1.72 (February 4, 2005)

Preparing to commence simulation
- Looking at optimizations...
- Files status OK
- Expanded 9485235 -> 34524104 (decompressed 363.9 percent)

Project: 3903 (Run 2924, Clone 4, Gen 1)

Assembly optimizations on if available.
Entering M.D.
(Starting from checkpoint)
Protein: IBX in water

Writing local files
Writing local files
Completed 0 out of 25000 steps  (0%)

Folding@home Core Shutdown: INTERRUPTED

Will look into this further, to see why my system gets really bogged down. It never did before ... I wonder whether this has to do with the particular project/WU ... The little machine has worked so nicely, until now! Larry.

Postby v00d00 » Wed Dec 05, 2007 5:57 am

Why not create a script containing the commands you used, have them append their output to a file (cmd >> file), and then cron the script to run every hour?

Then after running FAH for a day, analyse the results.
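
A rough sketch of what such a script could look like, assuming bash and the procps tools already used in this thread (ps, vmstat). The script name wu-snapshot.sh, the log path, and the process-name pattern are made up for illustration and would need adjusting:

```shell
#!/bin/bash
# wu-snapshot.sh (hypothetical name): print one timestamped snapshot of the
# load average, a vmstat sample, and the FAH client/core processes.

snapshot() {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
    # 1-, 5- and 15-minute load averages (Linux-specific file)
    [ -r /proc/loadavg ] && cat /proc/loadavg
    # One vmstat sample, if the tool is installed
    command -v vmstat >/dev/null 2>&1 && vmstat
    # Client and core processes; the [F] keeps grep from matching itself
    ps -eo pid,pri,ni,vsz,rss,stat,time,args \
        | grep -E '[F]AH[0-9]+-Linux|[F]ahCore_' \
        || echo "no FAH processes found"
}

# With a file argument, append to it (the "cmd >> file" idea);
# without one, just print the snapshot to the terminal.
if [ -n "${1:-}" ]; then
    snapshot >> "$1"
else
    snapshot
fi
```

A crontab entry along the lines of `0 * * * * /home/lc/wu-snapshot.sh /home/lc/fah-monitor.log` (paths are placeholders) would run it hourly; `*/5 * * * *` would sample every five minutes and catch shorter events.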

Postby lc99 » Wed Dec 05, 2007 6:49 am

I tried running it again, using top in batch mode. While FAH was running, the overall summary was very consistent for the few minutes that I let it churn --
Code: Select all
top - 22:23:24 up  2:08,  1 user,  load average: 0.99, 0.69, 0.31
Tasks:  95 total,   2 running,  93 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy, 99.9%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:    971348k total,   452748k used,   518600k free,    13192k buffers
Swap:  2996080k total,        0k used,  2996080k free,   149996k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4544 lc        39  19  160m 129m  680 R 99.9 13.6   3:52.07 FahCore_79.exe

After about 4 minutes active, the memory size grew somewhat, to this --
Code: Select all
Mem:    971348k total,   557412k used,   413936k free,    13300k buffers
Swap:  2996080k total,        0k used,  2996080k free,   150004k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4544 lc        39  19  270m 231m 1184 R 99.9 24.5   5:52.04 FahCore_79.exe

Thanks for your suggestion; I'm going to play with it some more, when I get a chance. It's also possible that f@h, the work unit, and the core are all ok and my system is what's misbehaving.

Forgot to mention -- I'm seeing horrendously slow access via an ssh connection over the network (only while FAH is active); so, tomorrow (hopefully) I'll log in via a direct serial line (console) and see whether I have the exact same problem there. Regards, Larry.

config options

Postby lc99 » Wed Dec 05, 2007 6:56 am

My configuration options (which haven't changed in years ... values in square brackets are current) --
Code: Select all
Ask before fetching/sending work (no/yes) [no]?
Use proxy (yes/no) [no]?
Allow receipt of work assignments and return of work results greater than
 5MB in size (such work units may have large memory demands) (no/yes) [yes]?
Change advanced options (yes/no) [no]? yes
Core Priority (idle/low) [low]?
Disable highly optimized assembly code (no/yes) [no]?
Interval, in minutes, between checkpoints (3-30) [15]?
Memory, in MB, to indicate (948 available) [948]?
Request work units without deadlines (no/yes) [no]?
Set -advmethods flag always, requesting new advanced
  scientific cores and/or work units if available (no/yes) [no]?
Ignore any deadline information (mainly useful if
 system clock frequently has errors) (no/yes) [no]?
Machine ID (1-8) [1]?

And in the file "client.cfg" --
Code: Select all
(...)
asknet=no
bigpackets=yes
machineid=1
local=533

[http]
active=no
host=localhost
port=8080

[core]
priority=96

Larry.

Weird system behavior

Postby lc99 » Thu Dec 06, 2007 6:57 am

Good news, today I saw the FAH job running normally. Same work unit, same logfile, everything the same. As usual, my little system maintains a load average of 1 while the job is running (24hrs/7days). No funny warnings or log messages that I could find.

I also tried logging in via local serial port, local ssh, and internet ssh, and command-line operation was great. (Yesterday, command-line access via internet ssh was really terrible ... except when I would shut down f@h.)

The one thing I can say for sure is that when the system was acting weird, the load average was near zero (0.0) although f@h was running. I think I'm going to do something similar to what v00d00 suggested: I'll run some continuous process to monitor what's going on with the job for some number of days ... then I may see when the FAH job is idling or stuck, and for how long. Will post any info back here in this thread. Thanks; Larry C.

Running o.k. ... maybe it was simply an upload condition;

Postby lc99 » Wed Dec 12, 2007 6:51 am

I've been watching my system run for a few days and I can say that it is doing work as it should.

After that project 3903 that I observed above finally completed, it's now working on project 2416 which should progress much more quickly (I think I spent about 5 days on the 3903).

I looked at my load average samples, and there have been a couple of periods (while I've been sampling) when the load average was near zero for something like 8 to 10 minutes at a time. And I saw that the core process changed (ended) around that time.
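
For anyone wanting to pull such idle stretches out of their own samples, here is a small sketch. It assumes a log with one "EPOCH_SECONDS LOAD1" pair per line (for example written by `echo "$(date +%s) $(cut -d' ' -f1 /proc/loadavg)"`), which is not necessarily the format of the samples described above; the function name idle_runs is made up for this example.

```shell
#!/bin/bash
# idle_runs THRESHOLD: read "EPOCH_SECONDS LOAD1" lines on stdin and print
# "START END DURATION" for each stretch of consecutive samples whose load
# stays below THRESHOLD.
idle_runs() {
    awk -v t="$1" '
        # Sample below the threshold: open a run if needed, extend it
        $2 < t  { if (!start) start = $1; last = $1; next }
        # First busy sample after a run: report the run and close it
        start   { print start, last, last - start; start = 0 }
        # Log ended while still idle: report the open run
        END     { if (start) print start, last, last - start }
    '
}
```

Usage would be something like `idle_runs 0.2 < loadavg.log`, where both the 0.2 threshold and the file name are placeholders. The reported duration spans first to last idle sample, so the real idle period is up to one sampling interval longer on each side.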

My current suspicion is simply that uploading the work results at the end of the job takes several minutes, since my Internet link tops out at something like 128 Kbps upstream, and that I happened to be trying to ssh in during one of those uploads when I saw what I thought was weirdness.

I'll correlate my files and time points, when I get a chance, to confirm. Larry

Postby John_Weatherman » Wed Dec 12, 2007 7:08 am

I've had a few WUs from the same project, 3903, and they're big, so that might explain something.
John_Weatherman
 
Posts: 509
Joined: Sun Dec 02, 2007 4:31 am
Location: Carrizo Plain National Monument, California

Work results upload, via 128 Kbps max connection

Postby lc99 » Sat Dec 29, 2007 11:22 pm

I'd just like to follow up on this topic and tie off the loose end that I left hanging.
The entire "problem" that I had posted turned out to be that the work results upload
(sending) fully occupied my server's network connection ... which got in the way when I
tried to access my system via ssh over the same network (internet) connection.

Because I'd saved some basic stats and had the logfiles from a few days of work, later on I went
back and tried to correlate. Here's what I found (in a nutshell) --
Code: Select all
Project: 3903 (Run 2924, Clone 4, Gen 1)
Done: 8217234 -> 7814041 (compressed to 95.0 percent)
(* It took 584 seconds to send the results -- about 104.5 Kbps Upload rate *)
Code: Select all
Project: 3903 (Run 3478, Clone 4, Gen 2)
Done: 8107203 -> 7753440 (compressed to 95.6 percent)
(* It took 578 seconds to send the results -- about 104.8 Kbps Upload rate *)

The "correlation" is that, looking at vmstat, ps, and top across the FAH work sessions,
I saw that during the upload times the FahCore processes weren't running (only
FAH504-Linux.exe was), the system's load average was 0.0, the cpu was idle
(0 runnable processes), and more free memory was available.
My (home) network setup has a max. 128 Kbps Internet upload. What all of this tells me
is that the FAH system has some significant chunks of data that take on the order of 10
minutes to upload from my system, and during that time it's very difficult to perform
any interactive task over the same network connection (like an ssh terminal session).

Am I the only person who's encountered this? Perhaps we could create a FAQ entry for some
other poor soul to read. I wonder whether the results upload could be throttled or chunked
a bit differently, to make this less of a full-bore upload task. Thanks; Larry.
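
For anyone checking the arithmetic, here is a small sketch that recomputes the rates from the byte counts and durations in the excerpts above. The function names kbps and upload_secs are made up for this example, and it assumes 1 Kb = 1024 bits, which is the convention that reproduces the 104.5 and 104.8 figures (with 1 Kb = 1000 bits the same uploads come out near 107 Kbps).

```shell
#!/bin/bash
# kbps BYTES SECONDS: average rate in Kbps, taking 1 Kb = 1024 bits
kbps() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b * 8 / s / 1024 }'
}

# upload_secs BYTES KBPS: seconds needed to send BYTES at a steady KBPS
upload_secs() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.0f\n", b * 8 / (r * 1024) }'
}

kbps 7814041 584          # Run 2924 results, prints 104.5
kbps 7753440 578          # Run 3478 results, prints 104.8
upload_secs 7814041 128   # best case on a full 128 Kbps uplink, prints 477
```

So even with the link entirely to itself, a results file of this size would need about 477 seconds (roughly 8 minutes) at 128 Kbps; the observed 584 and 578 seconds work out to about 82% of the nominal rate.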

Re: Thought it was an issue with FAH504-Linux; see latest entry:

Postby v00d00 » Sun Dec 30, 2007 7:05 am

Welcome to BigWUs.

Some seriously large results files can be generated at the end of the fold, and they can take a long time to send on a slow(ish) connection. That's why BigWU is not recommended for modem users (or people without much upload bandwidth).

If you want my advice (and feel free to decline it), switch off BigWU and the results should be sent in seconds rather than minutes, but they will probably be worth less, points-wise.

Re: Thought it was an issue with FAH504-Linux; see latest entry:

Postby lc99 » Sun Dec 30, 2007 8:23 am

I'm not ready to go to smaller WUs just because of this quirk. Actually, if my system's
upward stream is swamped for 10 minutes, it's not a big deal since I mostly use it for FAH,
email, and infrequently-accessed file storage. The problem was only that initially, I had no
idea what was going on.

Perhaps I can do something with netfilter, or some MTU runtime tweaks, to help
allow remote/interactive shell access during upload. Will look into the possibilities, when I can; Larry.
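
One possibility along those lines is Linux traffic control (tc with an HTB qdisc) rather than netfilter or MTU changes. The sketch below is untested on this setup and rests on several assumptions: the interface is eth0, this machine is the only heavy sender on the uplink, and a cap of 100kbit sits just under what the link really delivers, so that the queue builds up here rather than in the modem. Note that it shapes everything leaving eth0, LAN traffic included. By default it only prints the commands; APPLY=1 (as root) would run them.

```shell
#!/bin/bash
# Sketch only: give outgoing ssh packets priority over bulk uploads.
# DEV and RATE are assumptions; adjust for the real interface and uplink.
DEV="${DEV:-eth0}"
RATE="${RATE:-100kbit}"   # a little below the rate the uploads achieved

# Print the commands by default; APPLY=1 executes them instead
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

shape_uplink() {
    # Root HTB qdisc; unclassified traffic falls into class 1:20
    run tc qdisc add dev "$DEV" root handle 1: htb default 20
    # Parent class caps all outgoing traffic at RATE
    run tc class add dev "$DEV" parent 1: classid 1:1 htb rate "$RATE"
    # Interactive class: small guaranteed share, may borrow up to the cap
    run tc class add dev "$DEV" parent 1:1 classid 1:10 htb rate 30kbit ceil "$RATE" prio 0
    # Bulk class (FAH upload, mail, ...): served after the interactive class
    run tc class add dev "$DEV" parent 1:1 classid 1:20 htb rate 70kbit ceil "$RATE" prio 1
    # Replies from the local sshd carry source port 22
    run tc filter add dev "$DEV" parent 1: protocol ip prio 1 u32 \
        match ip sport 22 0xffff flowid 1:10
}

shape_uplink
```

If it misbehaves, `tc qdisc del dev eth0 root` removes the shaping again. The 30kbit/70kbit split is a guess, not a tuned value.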

Re: Thought it was an issue with FAH504-Linux; see latest entry:

Postby v00d00 » Sun Dec 30, 2007 5:20 pm

If you have access to a second network card, wire it into your current network architecture and run it on a different subnet. As long as you don't bridge the two interfaces, you should still be able to ssh into that machine through the second interface. It's what I do, though in my case it's because I have Gigabit LAN cards and only 100Mb switches. So I run a crossover Cat6 cable between the two machines at Gb speed, which is highly useful when I need to ftp things between machines or use ssh when the main interfaces are swamped. The other interface is used for primary network access.

Just be sure to initialise the second adapter first in your rc files, or programs tend to favour the fast interface over the one that gives you access to the WAN.

Re: Thought it was an issue with FAH504-Linux; see latest entry:

Postby lc99 » Mon Dec 31, 2007 8:10 am

Thanks, and I've done multiple network interfaces before (on Unix systems), but for this one,
I simply have both actions (FAH results upload and remote access ssh) happening through the
same internet uplink. Anyway, I'll think of something ... Larry.

Re: Thought it was an issue with FAH504-Linux; see latest entry:

Postby bruce » Tue Jan 01, 2008 8:43 pm

lc99 wrote:Thanks, and I've done multiple network interfaces before (on Unix systems), but for this one,
I simply have both actions (FAH results upload and remote access ssh) happening through the
same internet uplink. Anyway, I'll think of something ... Larry.


Being somewhat of a Linux newbie, I'm not sure what your options are (if any) for setting packet priorities. Certainly some MTU tweaks may help, but not if they just make the FAH upload more efficient. As you might expect, FAH does not have any built-in throttling. It simply assumes that TCP/IP will do its job and transfer the data to the other end of the pipe. If you do find some MTU tweaks that help, let us know.

During the transfer do you see many packet errors or is it just working "efficiently" given the bandwidth limitation?
bruce
 
Posts: 23060
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.
