scheduling uploads in Linux SMP?

Re: scheduling uploads in Linux SMP?

Postby tear » Tue Jul 14, 2009 2:12 pm

All righty. I've cooked up a solution and it's looking very promising. Being tested atm.
(PM if you want the code)

There are a couple of approaches I considered:
a) overriding networking library calls (socket calls, gethostby*, etc.) via LD_PRELOAD -- scrapped, as most client binaries out there are static
b) transparent proxy (iptables -j REDIRECT) -- made a proof-of-concept application but scrapped it due to *very* weird performance issues
c) regular proxy (requires client reconfiguration), modifying/tuning an existing code base -- scrapped (squid is very functional but it seemed like overkill to me)
d) regular proxy -- small, dedicated solution

And yes, I picked (d) :mrgreen:

Current algorithm is (this works on a per-client basis):
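Code:
/* Runs for every connection the client makes through the proxy. */
if (client_is_contacting_main_assignment_server()) {
        /* Client is asking for new work: remember when, and re-arm the block. */
        assign_contact = time(NULL);
        terminated = 0;
} else {
        /* Contact with any other server (e.g. to return results). */
        collect_contact = time(NULL);
        if (!terminated && collect_contact - assign_contact > 10 * 60) { /* currently 10 minutes */
                /* Cut the connection once per assignment contact. */
                terminate_connection();
                terminated = 1;
        }
}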
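
A self-contained variant of the excerpt above, for anyone who wants to poke at the logic without running the proxy. This is only a sketch: the struct, the function name should_terminate and the UPLOAD_WINDOW_SECS constant are made up for illustration and are not taken from langouste, and the clock is passed in as a parameter so the behaviour can be exercised without waiting ten minutes.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-client state; field names mirror the excerpt above. */
struct client_state {
    time_t assign_contact;  /* last contact with the assignment server */
    bool   terminated;      /* a connection was already cut since then */
};

/* Window after an assignment contact (10 minutes, as in the excerpt). */
#define UPLOAD_WINDOW_SECS (10 * 60)

/*
 * Decide what to do with one outgoing connection.
 * to_assignment_server: true if the client is contacting the assignment server.
 * now: current time, supplied by the caller (normally time(NULL)).
 * Returns true if the proxy should terminate the connection.
 */
bool should_terminate(struct client_state *st, bool to_assignment_server, time_t now)
{
    assert(st != NULL);

    if (to_assignment_server) {
        st->assign_contact = now;
        st->terminated = false;
        return false;
    }
    if (!st->terminated && now - st->assign_contact > UPLOAD_WINDOW_SECS) {
        st->terminated = true;
        return true;
    }
    return false;
}
```

With this, a contact to a non-assignment server more than ten minutes after the last assignment contact is cut once, and the next attempt is let through until the client talks to the assignment server again.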



tear
One man's ceiling is another man's floor.
tear
 
Posts: 918
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: scheduling uploads in Linux SMP?

Postby tear » Wed Jul 15, 2009 2:59 am

Gentlemen, all I'm asking you to do now is to witness a demonstration of the possibility of movement within the fourth dimension.

In plain English: yes, it works!

Had to alter algo a little bit but I'd bet you folks are more interested in actual results.

Here (pardon trailing spaces -- captured "screen" terminal):
Code:
[00:06:43] Completed 227500 out of 250000 steps  (91%)                                                                                   
[00:14:17] Completed 230000 out of 250000 steps  (92%)                                                                                   
[00:21:51] Completed 232500 out of 250000 steps  (93%)                                                                                   
[00:29:25] Completed 235000 out of 250000 steps  (94%)                                                                                   
[00:36:59] Completed 237500 out of 250000 steps  (95%)                                                                                   
[00:37:02] - Autosending finished units... [July 15 00:37:02 UTC]                                                                       
[00:37:02] Trying to send all finished work units                                                                                       
[00:37:02] + No unsent completed units remaining.                                                                                       
[00:37:02] - Autosend completed                                                                                                         
[00:44:33] Completed 240000 out of 250000 steps  (96%)                                                                                   
[00:52:08] Completed 242500 out of 250000 steps  (97%)                                                                                   
[00:59:42] Completed 245000 out of 250000 steps  (98%)                                                                                   
[01:07:15] Completed 247500 out of 250000 steps  (99%)                                                                                   
[01:14:50] Completed 250000 out of 250000 steps  (100%)                                                                                 
                                                                                                                                         
Writing final coordinates.                                                                                                               
                                                                                                                                         
 Average load imbalance: 0.2 %                                                                                                           
 Part of the total run time spent waiting due to load imbalance: 0.2 %                                                                   
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %                                                           
                                                                                                                                         
                                                                                                                                         
        Parallel run - timing based on wallclock.                                                                                       
                                                                                                                                         
               NODE (s)   Real (s)      (%)                                                                                             
       Time:  45422.000  45422.000    100.0                                                                                             
                       12h37:02                                                                                                         
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)                                                                                 
Performance:    286.351     12.024      0.951     25.234                                                                                 
                                                                                                                                         
gcq#0: Thanx for Using GROMACS - Have a Nice Day                                                                                         

[01:14:51] DynamicWrapper: Finished Work Unit: sleep=10000                                                                               
[01:15:01]                                                                                                                               
[01:15:01] Finished Work Unit:                                                                                                           
[01:15:01] - Reading up to 21200256 from "work/wudata_08.trr": Read 21200256                                                             
[01:15:01] trr file hash check passed.                                                                                                   
[01:15:01] - Reading up to 27692016 from "work/wudata_08.xtc": Read 27692016                                                             
[01:15:02] xtc file hash check passed.                                                                                                   
[01:15:02] edr file hash check passed.                                                                                                   
[01:15:02] logfile size: 180874                                                                                                         
[01:15:02] Leaving Run                                                                                                                   
[01:15:04] - Writing 49217898 bytes of core data to disk...                                                                             
[01:15:05]   ... Done.                                                                                                                   
[01:15:06] - Shutting down core                                                                                                         
[01:15:06]                                                                                                                               
[01:15:06] Folding@home Core Shutdown: FINISHED_UNIT                                                                                     
Error encountered before initializing MPICH                                                                                             
[01:18:26] CoreStatus = 64 (100)                                                                                                         
[01:18:26] Unit 8 finished with 82 percent of time to deadline remaining.                                                               
[01:18:26] Updated performance fraction: 0.821276                                                                                       
[01:18:26] Sending work to server                                                                                                       
[01:18:26] Project: 2677 (Run 11, Clone 81, Gen 23)                                                                                     
                                                                                                                                         
                                                                                                                                         
[01:18:26] + Attempting to send results [July 15 01:18:26 UTC]                                                                           
[01:18:26] - Reading file work/wuresults_08.dat from core                                                                               
[01:18:26]   (Read 49217898 bytes from disk)                                                                                             
[01:18:26] Connecting to http://171.64.65.56:8080/                                                                                       
[01:18:26] - Couldn't send HTTP request to server                                                                                       
[01:18:26] + Could not connect to Work Server (results)                                                                                 
[01:18:26]     (171.64.65.56:8080)                                                                                                       
[01:18:26] + Retrying using alternative port                                                                                             
[01:18:26] Connecting to http://171.64.65.56:80/                                                                                         
[01:18:26] - Couldn't send HTTP request to server                                                                                       
[01:18:26] + Could not connect to Work Server (results)                                                                                 
[01:18:26]     (171.64.65.56:80)                                                                                                         
[01:18:26] - Error: Could not transmit unit 08 (completed July 15) to work server.                                                       
[01:18:26] - 1 failed uploads of this unit.                                                                                             
[01:18:26]   Keeping unit 08 in queue.                                                                                                   
[01:18:26] Trying to send all finished work units                                                                                       
[01:18:26] Project: 2677 (Run 11, Clone 81, Gen 23)                                                                                     
                                                                                                                                         
                                                                                                                                         
[01:18:26] + Attempting to send results [July 15 01:18:26 UTC]                                                                           
[01:18:26] - Reading file work/wuresults_08.dat from core                                                                               
[01:18:26]   (Read 49217898 bytes from disk)                                                                                             
[01:18:26] Connecting to http://171.64.65.56:8080/                                                                                       
[01:18:26] - Couldn't send HTTP request to server                                                                                       
[01:18:26] + Could not connect to Work Server (results)                                                                                 
[01:18:26]     (171.64.65.56:8080)                                                                                                       
[01:18:26] + Retrying using alternative port                                                                                             
[01:18:26] Connecting to http://171.64.65.56:80/                                                                                         
[01:18:26] - Couldn't send HTTP request to server                                                                                       
[01:18:26] + Could not connect to Work Server (results)                                                                                 
[01:18:26]     (171.64.65.56:80)                                                                                                         
[01:18:26] - Error: Could not transmit unit 08 (completed July 15) to work server.                                                       
[01:18:26] - 2 failed uploads of this unit.                                                                                             
                                                                                                                                         
                                                                                                                                         
[01:18:26] + Attempting to send results [July 15 01:18:26 UTC]                                                                           
[01:18:26] - Reading file work/wuresults_08.dat from core                                                                               
[01:18:26]   (Read 49217898 bytes from disk)                                                                                             
[01:18:26] Connecting to http://171.67.108.25:8080/                                                                                     
[01:18:26] - Couldn't send HTTP request to server                                                                                       
[01:18:26] + Could not connect to Work Server (results)                                                                                 
[01:18:26]     (171.67.108.25:8080)                                                                                                     
[01:18:26] + Retrying using alternative port                                                                                             
[01:18:26] Connecting to http://171.67.108.25:80/                                                                                       
[01:18:26] - Couldn't send HTTP request to server                                                                                       
[01:18:26] + Could not connect to Work Server (results)                                                                                 
[01:18:26]     (171.67.108.25:80)                                                                                                       
[01:18:26]   Could not transmit unit 08 to Collection server; keeping in queue.                                                         
[01:18:26] + Sent 0 of 1 completed units to the server                                                                                   
[01:18:26] - Preparing to get new work unit...                                                                                           
[01:18:26] + Attempting to get work packet                                                                                               
[01:18:26] - Will indicate memory of 3013 MB                                                                                             
[01:18:26] - Connecting to assignment server                                                                                             
[01:18:26] Connecting to http://assign.stanford.edu:8080/                                                                               
[01:18:27] Posted data.                                                                                                                 
[01:18:27] Initial: 43AB; - Successful: assigned to (171.67.108.24).                                                                     
[01:18:27] + News From Folding@Home: Welcome to Folding@Home                                                                             
[01:18:27] Loaded queue successfully.                                                                                                   
[01:18:27] Connecting to http://171.67.108.24:8080/                                                                                     
[01:18:28] Posted data.                                                                                                                 
[01:18:28] Initial: 0000; + Could not connect to Work Server                                                                             
[01:18:28] - Attempt #1  to get work failed, and no other work to do.                                                                   
Waiting before retry.                                                                                                                   
[01:18:36] + Attempting to get work packet                                                                                               
[01:18:36] - Will indicate memory of 3013 MB                                                                                             
[01:18:36] - Connecting to assignment server                                                                                             
[01:18:36] Connecting to http://assign.stanford.edu:8080/                                                                               
[01:18:36] Posted data.                                                                                                                 
[01:18:36] Initial: 43AB; - Successful: assigned to (171.67.108.24).                                                                     
[01:18:36] + News From Folding@Home: Welcome to Folding@Home                                                                             
[01:18:37] Loaded queue successfully.                                                                                                   
[01:18:37] Connecting to http://171.67.108.24:8080/                                                                                     
[01:18:42] Posted data.                                                                                                                 
[01:18:42] Initial: 0000; - Receiving payload (expected size: 4839916)                                                                   
[01:19:02] - Downloaded at ~236 kB/s                                                                                                     
[01:19:02] - Averaged speed for that direction ~200 kB/s                                                                                 
[01:19:02] + Received work.                                                                                                             
[01:19:02] Trying to send all finished work units                                                                                       
[01:19:02] Project: 2677 (Run 11, Clone 81, Gen 23)                                                                                     
                                                                                                                                         
                                                                                                                                         
[01:19:02] + Attempting to send results [July 15 01:19:02 UTC]                                                                           
[01:19:02] - Reading file work/wuresults_08.dat from core                                                                               
[01:19:02]   (Read 49217898 bytes from disk)                                                                                             
[01:19:02] Connecting to http://171.64.65.56:8080/                                                                                       
[01:19:16] Posted data.                                                                                                                 
[01:19:17] Initial: 0000; - Uploaded at ~2089 kB/s                                                                                       
[01:19:25] - Averaged speed for that direction ~2512 kB/s                                                                               
[01:19:25] + Results successfully sent                                                                                                   
[01:19:25] Thank you for your contribution to Folding@Home.                                                                             
[01:19:25] + Number of Units Completed: 329                                                                                             
                                                                                                                                         
[01:19:25] + Sent 1 of 1 completed units to the server                                                                                   
[01:19:25] + Closed connections                                                                                                         
[01:19:25]                                                                                                                               
[01:19:25] + Processing work unit                                                                                                       
[01:19:25] At least 4 processors must be requested.Core required: FahCore_a2.exe                                                         
[01:19:25] Core found.                                                                                                                   
[01:19:25] Working on queue slot 09 [July 15 01:19:25 UTC]                                                                               
[01:19:25] + Working ...                                                                                                                 
[01:19:25] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 11087 -version 624'                                                                                                           
                                                                                                                                         
Warning: Ignoring unknown arg                                                                                                           
Warning: Ignoring unknown arg                                                                                                           
Warning: Ignoring unknown arg                                                                                                           
Warning: Ignoring unknown arg                                                                                                           
[01:19:25]                                                                                                                               
[01:19:25] *------------------------------*                                                                                             
[01:19:25] Folding@Home Gromacs SMP Core                                                                                                 
[01:19:25] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)                                                                                   
[01:19:25]                                                                                                                               
[01:19:25] Preparing to commence simulation                                                                                             
[01:19:25] - Ensuring status. Please wait.                                                                                               
[01:19:25] y forced on.                                                                                                                 
[01:19:25] - Not checking prior termination.                                                                                             
[01:19:26] - Expanded 4839404 -> 24038325 (decompressed 496.7 percent)                                                                   
[01:19:26] Called DecompressByteArray: compressed_data_size=4839404 data_size=24038325, decompressed_data_size=24038325 diff=0           
[01:19:26] - Digital signature verified                                                                                                 
[01:19:26]                                                                                                                               
[01:19:26] Project: 2671 (Run 42, Clone 77, Gen 66)

(you're supposed to notice that the download happens before the upload)

If there's anyone (with gcc and basic Linux skills) interested in trying it out, setup instructions are:
1. Download the code: http://darkswarm.org/langouste2-0.6.c
2. Compile the code: gcc -Wall -O2 -o langouste2-0.6 langouste2-0.6.c
3. Pick an unused port for the proxy to listen on (I use 8880)
4. Start langouste*: ./langouste2-0.6 -l port-from-step-3
5. Reconfigure the client (-config or -configonly) so it uses a proxy located at localhost and the port from step 3**
6. Enjoy

*) currently, langouste doesn't go into the background (it doesn't detach from the controlling terminal/process group), so keeping the terminal window open is strongly recommended; I haven't exercised it with nohup or with putting it into the background (&) yet, so YMMV

**) langouste is designed to handle multiple clients -- running a single instance is good enough

tear

EDIT: it is critical that both langouste and client are run by the same user; otherwise nothing will work (connection/pid tracking)

Re: scheduling uploads in Linux SMP?

Postby tear » Wed Jul 15, 2009 4:53 pm

Update:

I noticed that even though download happens before the upload (with langouste), the whole process is synchronous -- i.e.

1. Download new WU
*then*
2. Upload old WUs
*then*
3. Start processing new WU

If step #2 blocks.... <insert conclusion>

I've worked around it (and made a couple of other improvements as well) and am testing atm.

discouragement++; :roll:


tear

Re: scheduling uploads in Linux SMP?

Postby tear » Thu Jul 16, 2009 2:01 am

Another tiny observation: when the client is started with an *unsent* WU, this process *is* asynchronous ...


tear

Re: scheduling uploads in Linux SMP?

Postby Anachron » Thu Jul 16, 2009 12:58 pm

well, there's obviously not much point to it if the process is synchronous.

Have you updated the code at darkswarm.org yet?
Time flies like an arrow; fruit flies like a banana
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: scheduling uploads in Linux SMP?

Postby tear » Thu Jul 16, 2009 1:20 pm

Anachron wrote: well, there's obviously not much point to it if the process is synchronous.

You're absolutely right -- there's not much you can do with this kind of approach. However, when you start the client and there *already* is an unsent WU, the process is asynchronous (see my post above); thus, a scenario where:

1) Client is launched with -oneunit
2) Client completes a WU and is NOT allowed to send results
3) Client bails out (hopefully)
4) Client is restarted with -oneunit (though this time is allowed to send results)

...might be feasible.


I also did the following experiment: I kept blocking connections to the collection server all the way through the start of the new WU. Results were not sent and were kept in the queue until the next "scheduled" autosend (many hours later). Thing is, I don't really like this kind of delay.

Anachron wrote: Have you updated the code at darkswarm.org yet?

I'm testing yet another approach -- will let you know in a couple of hours.


Thanks for interest!
tear

Re: scheduling uploads in Linux SMP?

Postby tear » Thu Jul 16, 2009 7:30 pm

tear wrote:
Anachron wrote: Have you updated the code at darkswarm.org yet?

I'm testing yet another approach -- will let you know in a couple of hours.

Note: It's a kludge but works well (I still gotta venture the path described here).

New code: http://darkswarm.org/langouste3-0.3.c
Setup instructions are basically the same (note different file name though).

Additionally, download http://darkswarm.org/langouste-helper.sh
Place it in the client's directory and set the executable bit: from within the client's directory, run chmod +x ./langouste-helper.sh

What all this does: upon an attempt to contact the collection server (with a WU to return), it forks a copy of the client (in the /tmp/langouste directory) with the -send parameter and terminates the connection the "original" client was attempting to make. If the forked client is able to send the data back, the corresponding wuresults_*.dat file is wiped from the original client's work directory (so autosend bails out and removes it from the queue).
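
For illustration, the "run a sender, wipe the results file only on success" flow described above could be sketched roughly as follows. This is not the langouste3 source. The function names run_sender and send_then_wipe are invented for the sketch, the command to run is supplied by the caller, and it is an assumption that the sender reports success through exit status 0. The sketch also waits for the child synchronously to keep things short, which a real proxy serving other connections would probably not want to do.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run a sender command in a child process and wait for it to finish.
 * Returns the child's exit status (0..255), or -1 if it could not be
 * started or did not exit normally.
 */
int run_sender(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv);   /* only returns on failure */
        perror("execv");
        _exit(127);
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {   /* retry if interrupted by a signal */
            perror("waitpid");
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/*
 * ASSUMPTION: exit status 0 from the sender means "results delivered".
 * Only then is the results file removed; otherwise it is left in place
 * so the original client can retry later.
 * Returns 0 if the file was removed, -1 otherwise.
 */
int send_then_wipe(char *const argv[], const char *results_path)
{
    if (run_sender(argv) != 0)
        return -1;
    if (unlink(results_path) != 0) {
        perror("unlink");
        return -1;
    }
    return 0;
}
```

Leaving the file alone on any failure is the conservative choice here: the worst case is that the original client uploads the unit itself at the next autosend.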

As this is just a test version, it does not delete the forked copy in /tmp/langouste after sending results.

At my end it works like this:

Original client's log:
Code:
[14:35:34] Completed 225000 out of 250000 steps  (90%)
[14:43:10] Completed 227500 out of 250000 steps  (91%)
[14:50:46] Completed 230000 out of 250000 steps  (92%)
[14:58:22] Completed 232500 out of 250000 steps  (93%)
[15:05:58] Completed 235000 out of 250000 steps  (94%)
[15:13:34] Completed 237500 out of 250000 steps  (95%)
[15:21:10] Completed 240000 out of 250000 steps  (96%)
[15:28:46] Completed 242500 out of 250000 steps  (97%)
[15:36:22] Completed 245000 out of 250000 steps  (98%)
[15:43:57] Completed 247500 out of 250000 steps  (99%)
[15:51:33] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 0.2 %
 Part of the total run time spent waiting due to load imbalance: 0.2 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %


        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:  45656.000  45656.000    100.0
                       12h40:56
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    284.229     11.944      0.946     25.364

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[15:51:35] DynamicWrapper: Finished Work Unit: sleep=10000
[15:51:45]
[15:51:45] Finished Work Unit:
[15:51:45] - Reading up to 21173904 from "work/wudata_01.trr": Read 21173904
[15:51:45] trr file hash check passed.
[15:51:45] - Reading up to 27133244 from "work/wudata_01.xtc": Read 27133244
[15:51:46] xtc file hash check passed.
[15:51:46] edr file hash check passed.
[15:51:46] logfile size: 181796
[15:51:46] Leaving Run
[15:51:49] - Writing 48633696 bytes of core data to disk...
[15:51:50]   ... Done.
Error encountered before initializing MPICH
[15:51:51] - Shutting down core
[15:51:51]
[15:51:51] Folding@home Core Shutdown: FINISHED_UNIT
[15:55:09] CoreStatus = 64 (100)
[15:55:09] Unit 1 finished with 82 percent of time to deadline remaining.
[15:55:09] Updated performance fraction: 0.821552
[15:55:09] Sending work to server
[15:55:09] Project: 2671 (Run 26, Clone 24, Gen 67)


[15:55:09] + Attempting to send results [July 16 15:55:09 UTC]
[15:55:09] - Reading file work/wuresults_01.dat from core
[15:55:09]   (Read 48633696 bytes from disk)
[15:55:09] Connecting to http://171.67.108.24:8080/
[15:55:10] - Couldn't send HTTP request to server
[15:55:10] + Could not connect to Work Server (results)
[15:55:10]     (171.67.108.24:8080)
[15:55:10] + Retrying using alternative port
[15:55:10] Connecting to http://171.67.108.24:80/
[15:55:10] - Couldn't send HTTP request to server
[15:55:10] + Could not connect to Work Server (results)
[15:55:10]     (171.67.108.24:80)
[15:55:10] - Error: Could not transmit unit 01 (completed July 16) to work server.
[15:55:10] - 1 failed uploads of this unit.
[15:55:10]   Keeping unit 01 in queue.
[15:55:10] Trying to send all finished work units
[15:55:10] Project: 2671 (Run 26, Clone 24, Gen 67)


[15:55:10] + Attempting to send results [July 16 15:55:10 UTC]
[15:55:10] - Reading file work/wuresults_01.dat from core
[15:55:10]   (Read 48633696 bytes from disk)
[15:55:10] Connecting to http://171.67.108.24:8080/
[15:55:10] - Couldn't send HTTP request to server
[15:55:10] + Could not connect to Work Server (results)
[15:55:10]     (171.67.108.24:8080)
[15:55:10] + Retrying using alternative port
[15:55:10] Connecting to http://171.67.108.24:80/
[15:55:10] - Couldn't send HTTP request to server
[15:55:10] + Could not connect to Work Server (results)
[15:55:10]     (171.67.108.24:80)
[15:55:10] - Error: Could not transmit unit 01 (completed July 16) to work server.
[15:55:10] - 2 failed uploads of this unit.


[15:55:10] + Attempting to send results [July 16 15:55:10 UTC]
[15:55:10] - Reading file work/wuresults_01.dat from core
[15:55:10]   (Read 48633696 bytes from disk)
[15:55:10] Connecting to http://171.67.108.25:8080/
[15:55:10] - Couldn't send HTTP request to server
[15:55:10] + Could not connect to Work Server (results)
[15:55:10]     (171.67.108.25:8080)
[15:55:10] + Retrying using alternative port
[15:55:10] Connecting to http://171.67.108.25:80/
[15:55:10] - Couldn't send HTTP request to server
[15:55:10] + Could not connect to Work Server (results)
[15:55:10]     (171.67.108.25:80)
[15:55:10]   Could not transmit unit 01 to Collection server; keeping in queue.
[15:55:10] + Sent 0 of 1 completed units to the server
[15:55:10] - Preparing to get new work unit...
[15:55:10] + Attempting to get work packet
[15:55:10] - Will indicate memory of 3013 MB
[15:55:10] - Connecting to assignment server
[15:55:10] Connecting to http://assign.stanford.edu:8080/
[15:55:11] Posted data.
[15:55:11] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[15:55:11] + News From Folding@Home: Welcome to Folding@Home
[15:55:11] Loaded queue successfully.
[15:55:11] Connecting to http://171.67.108.24:8080/
[15:55:17] Posted data.
[15:55:17] Initial: 0000; - Receiving payload (expected size: 4837624)
[15:55:30] - Downloaded at ~363 kB/s
[15:55:30] - Averaged speed for that direction ~241 kB/s
[15:55:30] + Received work.
[15:55:31] Trying to send all finished work units
[15:55:31] Project: 2671 (Run 26, Clone 24, Gen 67)


[15:55:31] + Attempting to send results [July 16 15:55:31 UTC]
[15:55:31] - Reading file work/wuresults_01.dat from core
[15:55:31]   (Read 48633696 bytes from disk)
[15:55:31] Connecting to http://171.67.108.24:8080/
[15:55:31] - Couldn't send HTTP request to server
[15:55:31] + Could not connect to Work Server (results)
[15:55:31]     (171.67.108.24:8080)
[15:55:31] + Retrying using alternative port
[15:55:31] Connecting to http://171.67.108.24:80/
[15:55:31] - Couldn't send HTTP request to server
[15:55:31] + Could not connect to Work Server (results)
[15:55:31]     (171.67.108.24:80)
[15:55:31] - Error: Could not transmit unit 01 (completed July 16) to work server.
[15:55:31] - 3 failed uploads of this unit.


[15:55:31] + Attempting to send results [July 16 15:55:31 UTC]
[15:55:31] - Reading file work/wuresults_01.dat from core
[15:55:31]   (Read 48633696 bytes from disk)
[15:55:31] Connecting to http://171.67.108.25:8080/
[15:55:31] - Couldn't send HTTP request to server
[15:55:31] + Could not connect to Work Server (results)
[15:55:31]     (171.67.108.25:8080)
[15:55:31] + Retrying using alternative port
[15:55:31] Connecting to http://171.67.108.25:80/
[15:55:31] - Couldn't send HTTP request to server
[15:55:31] + Could not connect to Work Server (results)
[15:55:31]     (171.67.108.25:80)
[15:55:31]   Could not transmit unit 01 to Collection server; keeping in queue.
[15:55:31] + Sent 0 of 1 completed units to the server
[15:55:31] + Closed connections
[15:55:31]
[15:55:31] + Processing work unit
[15:55:31] At least 4 processors must be requested.Core required: FahCore_a2.exe
[15:55:31] Core found.
[15:55:31] Working on queue slot 02 [July 16 15:55:31 UTC]
[15:55:31] + Working ...
[15:55:31] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 11087 -version 624'

Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
Warning: Ignoring unknown arg
[15:55:31]
[15:55:31] *------------------------------*
[15:55:31] Folding@Home Gromacs SMP Core
[15:55:31] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[15:55:31]
[15:55:31] Preparing to commence simulation
[15:55:31] - Ensuring status. Please wait.
[15:55:32] Called DecompressByteArray: compressed_data_size=4837112 data_size=24036289, decompressed_data_size=24036289 diff=0
[15:55:32] - Digital signature verified
[15:55:32]
[15:55:32] Project: 2671 (Run 10, Clone 94, Gen 67)
[15:55:32]
[15:55:32] Assembly optimizations on if available.
[15:55:32] Entering M.D.
[15:55:42]  on if available.
[15:55:42] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=iceberg
NNODES=4, MYRANK=3, HOSTNAME=iceberg
NNODES=4, MYRANK=2, HOSTNAME=iceberg
NNODES=4, MYRANK=1, HOSTNAME=iceberg
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22884 system in water'
17000001 steps,  34000.0 ps (continuing from step 16750001,  33500.0 ps).
[15:55:51]  (0%)
[16:03:41] Completed 2500 out of 250000 steps  (1%)
[16:11:28] Completed 5000 out of 250000 steps  (2%)
[16:19:13] Completed 7500 out of 250000 steps  (3%)
[16:26:58] Completed 10000 out of 250000 steps  (4%)
[16:34:43] Completed 12500 out of 250000 steps  (5%)
[16:42:28] Completed 15000 out of 250000 steps  (6%)
[16:50:13] Completed 17500 out of 250000 steps  (7%)
[16:57:58] Completed 20000 out of 250000 steps  (8%)
[17:05:43] Completed 22500 out of 250000 steps  (9%)
[17:13:28] Completed 25000 out of 250000 steps  (10%)
[17:21:13] Completed 27500 out of 250000 steps  (11%)
[17:28:58] Completed 30000 out of 250000 steps  (12%)
[17:36:43] Completed 32500 out of 250000 steps  (13%)
[17:44:28] Completed 35000 out of 250000 steps  (14%)
[17:52:13] Completed 37500 out of 250000 steps  (15%)
[17:59:59] Completed 40000 out of 250000 steps  (16%)
[18:07:44] Completed 42500 out of 250000 steps  (17%)
[18:15:29] Completed 45000 out of 250000 steps  (18%)
[18:23:14] Completed 47500 out of 250000 steps  (19%)
[18:31:00] Completed 50000 out of 250000 steps  (20%)
[18:37:26] - Autosending finished units... [July 16 18:37:26 UTC]
[18:37:26] Trying to send all finished work units
[18:37:26] Project: 2671 (Run 26, Clone 24, Gen 67)
[18:37:26] - Error: Could not get length of results file work/wuresults_01.dat
[18:37:26] - Error: Could not read unit 01 file. Removing from queue.
[18:37:26] + Sent 0 of 1 completed units to the server
[18:37:26] - Autosend completed
[18:38:45] Completed 52500 out of 250000 steps  (21%)
[18:46:30] Completed 55000 out of 250000 steps  (22%)


Forked client's log (available in /tmp/langouste/<pid>/out):
Code: Select all
/tmp/langouste/11585/clientdir
processing work/wuresults_01.dat
unit number '01'
launching fah...

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

4 cores detected


--- Opening Log file [July 16 15:55:12 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /tmp/langouste/11585/clientdir
Executable: ./fah6
Arguments: -send 01 -verbosity 9 -forceasm -smp

[15:55:12] - Ask before connecting: No
[15:55:12] - Proxy: localhost:8880
[15:55:12] - User name: tear (Team 100259)
[15:55:12] - User ID: 5EF85BDF1A3A1EF1
[15:55:12] - Machine ID: 6
[15:55:12]

A potential conflict was detected:

Process 11087 is currently running and may also be a client with Mach. ID 6.
The program will now exit. Upon restart, this check will not be done --
You may wish to check that no client is currently running in
/fah/clients/fah before restarting.

Please press any key to exit.
work/wuresults_01.dat
re-launching fah...

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

4 cores detected


--- Opening Log file [July 16 15:55:13 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /tmp/langouste/12153/clientdir
Executable: ./fah6
Arguments: -send 01 -verbosity 9 -forceasm -smp

[15:55:13] - Ask before connecting: No
[15:55:13] - Proxy: localhost:8880
[15:55:13] - User name: tear (Team 100259)
[15:55:13] - User ID: 5EF85BDF1A3A1EF1
[15:55:13] - Machine ID: 6
[15:55:13]
[15:55:13] Loaded queue successfully.
[15:55:13] Attempting to return result(s) to server...
[15:55:13] Project: 2671 (Run 26, Clone 24, Gen 67)


[15:55:13] + Attempting to send results [July 16 15:55:13 UTC]
[15:55:13] - Reading file work/wuresults_01.dat from core
[15:55:13]   (Read 48633696 bytes from disk)
[15:55:13] Connecting to http://171.67.108.24:8080/
[15:55:27] Posted data.
[15:55:27] Initial: 0000; - Uploaded at ~3392 kB/s
[15:55:27] - Averaged speed for that direction ~2464 kB/s
[15:55:27] + Results successfully sent
[15:55:27] Thank you for your contribution to Folding@Home.
[15:55:27] + Number of Units Completed: 332

[15:55:28] ***** Got a SIGTERM signal (15)
[15:55:28] Killing all core threads

Folding@Home Client Shutdown.
ls: work/wuresults_01.dat: No such file or directory
unit 01 sent!
all done, unsent: 0
ls: work/wures*: No such file or directory


If you need any assistance with the process you know where to find me...


tear

Re: scheduling uploads in Linux SMP?

Postby tear » Fri Jul 17, 2009 1:12 pm

Updated helper script is available.

Major changes:
-- upon successful return it removes ALL of the unit's WU files in the original client's dir
  (not only wuresults) so stale files don't confuse the client when it decides to re-use
  the queue slot (if you returned at least one unit with the previous helper, be sure to
  manually remove all files of the previously used queue slot(s) from the client's work/ directory)

Minor changes:
-- removes forked client's dir upon successful WU return
-- saves log to langouste-helper-<pid>.log (not "out")
-- waits one minute before returning results so download (done by original client)
  can happen quicker

Note: remember to perform chmod +x ./langouste-helper.sh magic after the download.
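The "remove ALL respective WU files" step can be pictured roughly like this. It is a sketch under assumptions, not the helper's actual code: remove_unit_files is a made-up name, and the *_NN.* pattern is inferred from the file names visible in the logs (wudata_01.trr, wudata_01.xtc, wuresults_01.dat).

```shell
#!/bin/sh
# Hypothetical sketch: remove every file that belongs to one queue slot
# from the client's work/ directory, e.g. unit 03 -> work/*_03.*
# $1 = work directory, $2 = two-digit unit number
remove_unit_files() {
    workdir=$1 unit=$2
    for f in "$workdir"/*_"$unit".*; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] && rm -f -- "$f"
    done
    return 0
}
```

Files of other queue slots are left alone, so a unit that is still being processed is not touched.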

tear

Re: scheduling uploads in Linux SMP?

Postby Anachron » Fri Jul 17, 2009 1:19 pm

It seems to have worked. The log from the original client shows it tried to upload a few times but couldn't. Then it happily went on to download and start processing a new WU.
The forked client's log shows it sent the WU, taking 22 min.

That's 22 minutes saved!

At the end of the log I found this:

Code: Select all
Folding@Home Client Shutdown.
ls: cannot access work/wuresults_03.dat: No such file or directory
unit 03 sent!
processing work/wuresults_05.dat
unit number '05'
launching fah...

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

4 cores detected


--- Opening Log file [July 17 12:51:31 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /tmp/langouste/8501/clientdir
Executable: ./fah6
Arguments: -send 05 -smp -advmethods

[12:51:31] - Ask before connecting: No
[12:51:31] - Proxy: localhost:8889
[12:51:31] - User name: Bjornar (Team 31574)
[12:51:31] - User ID: 4B02ACFF73DF9FC9
[12:51:31] - Machine ID: 1
[12:51:31]
[12:51:32] Loaded queue successfully.
[12:51:32] Attempting to return result(s) to server...
[12:51:32] Project: 2671 (Run 47, Clone 39, Gen 64)
[12:51:32] - Failed to send unit 05 to server

Folding@Home Client Shutdown.
work/wuresults_05.dat
re-launching fah...

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

4 cores detected


--- Opening Log file [July 17 12:51:32 UTC]


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /tmp/langouste/8501/clientdir
Executable: ./fah6
Arguments: -send 05 -smp -advmethods

[12:51:32] - Ask before connecting: No
[12:51:32] - Proxy: localhost:8889
[12:51:32] - User name: Bjornar (Team 31574)
[12:51:32] - User ID: 4B02ACFF73DF9FC9
[12:51:32] - Machine ID: 1
[12:51:32]
[12:51:32] Loaded queue successfully.
[12:51:32] Attempting to return result(s) to server...
[12:51:32] Project: 2671 (Run 47, Clone 39, Gen 64)
[12:51:32] - Failed to send unit 05 to server

Folding@Home Client Shutdown.
work/wuresults_05.dat
all done, unsent: 1
work/wuresults_05.dat


The file wuresults_05.dat was last changed last Sunday. I don't know if it has been uploaded previously, or why it is still there. It has nothing to do with your program, I guess.
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: scheduling uploads in Linux SMP?

Postby tear » Fri Jul 17, 2009 1:53 pm

Anachron wrote:It seems to have worked. The log from the original client shows it tried to upload a few times but couldn't. Then it happily went on to download and start processing a new WU.
The forked client's log shows it sent the WU, taking 22 min.

That's 22 minutes saved!

Happy to hear it's worked for you :D

Anachron wrote:At the end of the log I found this:

[...]

The file wuresults_05.dat was last changed last Sunday. I don't know if it has been uploaded previously, or why it is still there. It has nothing to do with your program, I guess.


Mmkay, I see a couple of things here:
-- the unit you completed is unit 03, is this correct?
-- the forked client attempts to send all apparently unsent units (all wuresults_* files),
  hence you saw both unit 03 *and* unit 05 in the langouste log
-- I'll add "-verbosity 9" to the helper script so we have more info on failed uploads
-- (just a thought) technically, the forked client's directory can be removed no matter
  whether the upload succeeded or failed (if it fails, the wuresults_* file remains in the original
  client's directory anyway)
-- yeah, I've no idea where unit 05 came from either and I don't think langouste
  did anything there...
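To illustrate why unit 05 showed up: anything matching wuresults_* counts as unsent, no matter how old it is. Here is a rough sketch of that final tally. It is not the helper's real code; count_unsent is a made-up name, and the output format simply mimics the "all done, unsent: N" line from the log.

```shell
#!/bin/sh
# Hypothetical sketch: count results files still present in a work directory
# and report them the way the helper's log does.
# $1 = work directory
count_unsent() {
    n=0
    for f in "$1"/wuresults_*.dat; do
        # An unmatched pattern stays literal and does not exist; skip it.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "all done, unsent: $n"
}
```

A stale wuresults_05.dat left over from an earlier run is counted (and retried) just like a freshly finished unit.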


Thanks,
tear

Re: scheduling uploads in Linux SMP?

Postby Anachron » Fri Jul 17, 2009 2:05 pm

tear wrote:-- the unit you completed is unit 03, is this correct?
...
-- I'll add "-verbosity 9" to the helper script so we have more info on failed uploads


Yes, unit 03 is correct.

Hardcoding -verbosity 9 in may not be a good idea, because it sometimes causes my (and others') clients to segfault (http://foldingforum.org/viewtopic.php?f=44&t=8800).

Thank you for taking time to make this, tear. I do think it should be an integrated feature in the client though. Maybe in the future it will be.

Now, to run as a daemon...

Re: scheduling uploads in Linux SMP?

Postby tear » Fri Jul 17, 2009 2:17 pm

Anachron wrote:
tear wrote:-- the unit you completed is unit 03, is this correct?
...
-- I'll add "-verbosity 9" to the helper script so we have more info on failed uploads


Yes, unit 03 is correct.

Just to be on the safe side (I see you've already downloaded the latest helper script), can you please
remove all *_03.* files from the original client's work dir? (the core doesn't really like stale files)

Anachron wrote:Hardcoding -verbosity 9 in may not be a good idea, because it sometimes causes my (and others') clients to segfault (http://foldingforum.org/viewtopic.php?f=44&t=8800).

Point taken. I won't add it then.

Anachron wrote:Thank you for taking time to make this, tear.

You are welcome.

EDIT: missed this one out
Anachron wrote:Now, to run as a daemon...

I'll see what I can do.

tear
Last edited by tear on Fri Jul 17, 2009 2:39 pm, edited 1 time in total.

Re: scheduling uploads in Linux SMP?

Postby tear » Fri Jul 17, 2009 2:32 pm

Updated helper script is available.

Minor changes:
-- the script unconditionally removes the forked client's dir so as not to pollute
  the /tmp directory (any unsent units are NOT removed from the original client's
  directory, so the next forked client will re-attempt to send them anyway)

Note: remember to perform chmod +x ./langouste-helper.sh magic after the download.
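The unconditional cleanup mentioned above boils down to something like the following. Again this is a sketch, not the helper's actual code: run_in_scratch is a made-up name, and it uses plain explicit cleanup after the command returns (the real script may do it differently).

```shell
#!/bin/sh
# Hypothetical sketch: run a command inside a scratch directory and always
# delete that directory afterwards, whether the command succeeded or not.
# $1 = scratch directory, remaining arguments = command to run
run_in_scratch() {
    scratch=$1; shift
    mkdir -p "$scratch" || return 1
    status=0
    ( cd "$scratch" && "$@" ) || status=$?
    rm -rf -- "$scratch"
    return "$status"
}
```

The command's exit status is passed through, so the caller can still tell a failed upload from a successful one even though the directory is gone either way.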

tear

Re: scheduling uploads in Linux SMP?

Postby codysluder » Fri Jul 17, 2009 6:45 pm

Anachron wrote:Hardcoding -verbosity 9 in may not be a good idea, because it sometimes causes my (and others') clients to segfault.


Ah, but the segfaults all happen in the WU that's being computed. I'll bet you never had a segfault during the upload that has just been eliminated. The forked client will never load a FahCore.

I think -verbosity 9 is a good idea.
Last edited by codysluder on Fri Jul 17, 2009 6:53 pm, edited 1 time in total.
codysluder
 
Posts: 2238
Joined: Sun Dec 02, 2007 12:43 pm

Re: scheduling uploads in Linux SMP?

Postby codysluder » Fri Jul 17, 2009 6:52 pm

tear wrote:Just to be on the safe side (I see you've already downloaded the latest helper script), can you please
remove all *_03.* files from the original client's work dir? (the core doesn't really like stale files)


This is a known issue which has been fixed in some cores and not in others. I'm pretty sure it has been fixed in both SMP cores. The problem has to do with stale checkpoint files, and when a WU finishes there won't be any.

Have you considered what differences we may see when a WU has an EUE rather than finishing normally? There are all sorts of combinations that might come up after an EUE, and the script may not handle them all equally well.
