A 55 minute delay in posting results and getting new WU???


Re: A 55 minute delay in posting results and getting new WU???

Postby tear » Tue Nov 24, 2009 10:58 pm

Oh come on; admit it, being a reseller makes it (== SSD stockpiling) a bit easier for you, no? :)

tear
One man's ceiling is another man's floor.

Re: A 55 minute delay in posting results and getting new WU???

Postby brentpresley » Tue Nov 24, 2009 11:02 pm

I'm not a reseller.

FYI.

Re: A 55 minute delay in posting results and getting new WU???

Postby tear » Tue Nov 24, 2009 11:07 pm

OK, last word is yours :)


tear

Re: A 55 minute delay in posting results and getting new WU???

Postby toTOW » Tue Nov 24, 2009 11:31 pm

What should I say, then, with my SAN drives? :(
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict: latest news, tests and reviews about the Folding@Home project.


Re: A 55 minute delay in posting results and getting new WU???

Postby mephistopheles » Thu Nov 26, 2009 2:57 am

It takes just a few minutes here:
Code: Select all
[20:29:41] Completed 250000 out of 250000 steps  (100%)
...
[20:30:09] - Writing 100146913 bytes of core data to disk...
[20:30:15]   ... Done.
[20:30:36] - Shutting down core
[20:30:36] Folding@home Core Shutdown: FINISHED_UNIT
...
[20:33:30] CoreStatus = 64 (100)
[20:33:30] Connecting to http://171.67.108.22:8080/
  (13 min upload)
[20:46:18] Posted data.
[20:46:18] Initial: 0000; - Uploaded at ~89 kB/s
[20:51:38] - Averaged speed for that direction ~102 kB/s
[20:51:38] + Results successfully sent
...
[20:51:45] + News From Folding@Home: Welcome to Folding@Home
[20:51:45] Loaded queue successfully.
[20:51:45] Connecting to http://171.67.108.22:8080/
[20:52:25] Posted data.
[20:52:25] Initial: 0000; - Receiving payload (expected size: 30234375)
  (3 min download)
[20:55:31] - Downloaded at ~159 kB/s
...
[20:55:49] Entering M.D.

So it's 26 minutes from the end of one WU to the start of the next, minus 16 minutes up- and download = 10 minutes for everything else.

Compared to ParrLeyne - 85 min from WU to WU minus 18 min for up/download = 67 min for the rest - my WD Scorpio Blue 2.5" laptop drive appears to be a regular speed demon :shock:

I don't know why there is such a difference. I'm not using a RAM drive, but I do use Linux laptop mode on the server to save power and reduce wear and tear on the hard drive (syncing to disk every few hours).

Laptop mode works as an implicit RAM drive. It's still a bit strange since the problem is excessive syncing and the documentation says that laptop mode does not cache syncs.
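For reference, these are the standard kernel knobs involved -- just a sketch of how one can check them; the actual values depend on the laptop-mode-tools configuration:
Code: Select all
cat /proc/sys/vm/laptop_mode              # non-zero means laptop mode is active
sysctl vm.dirty_writeback_centisecs       # how often dirty data is flushed to disk
sysctl vm.dirty_expire_centisecs          # how long dirty data may sit in RAM before it must be written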

Apart from the laptop drive and laptop mode, it's a fairly standard i7 920 home server running Ubuntu 9.04 with an ext3 file system mounted with relatime for less disk access - the default atime behaviour writes to disk on every file read.

Quote from the relatime link:
It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'


Could it be related to atime?

Re: A 55 minute delay in posting results and getting new WU???

Postby tear » Thu Nov 26, 2009 3:48 am

mephistopheles wrote:It takes just a few minutes here:
...
So it's 26 minutes from the end of one WU to the start of the next, minus 16 minutes up- and download = 10 minutes for everything else.

And 4 minutes from unit completion to start of upload. That's pretty good.

mephistopheles wrote:I don't know why there is such a difference. I'm not using a RAM drive, but I do use Linux laptop mode on the server to save power and reduce wear and tear on the hard drive (syncing to disk every few hours).

Laptop mode works as an implicit RAM drive.

To some extent, yes.

mephistopheles wrote:It's still a bit strange since the problem is excessive syncing and the documentation says that laptop mode does not cache syncs.

Yes, an explicit sync (on a file) works the same way whether laptop mode is enabled or not ("caching" it would violate the semantics of fsync).
Well, I expect it to. Laptop mode should not matter but hell, Linux ain't perfect, something might be b0rked there.

mephistopheles wrote:Apart from the laptop drive and laptop mode, it's a fairly standard i7 920 home server running Ubuntu 9.04 with an ext3 file system mounted with relatime for less disk access - the default atime behaviour writes to disk on every file read.

You bring up a very good point. File data and metadata may not be in adjacent areas of the disk (causing an additional seek on every fsync).
Performing tests with relatime or noatime would be a very interesting thing to do.
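A quick way to run such a test without rebooting would be a remount (a sketch only; /home is just an example -- use whatever filesystem holds the client):
Code: Select all
mount /home -o remount,noatime       # drop atime updates entirely
mount /home -o remount,relatime      # or the milder relatime variant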

Still, it seems that the biggest contributor* to the "slowdown" is the filesystem. Ext3 seems to be the least affected (of the common filesystems), in contrast to ext4 or XFS.


tear


*) don't forget, it's the client (core too) that's at fault here and we're just considering possible workarounds

Re: A 55 minute delay in posting results and getting new WU???

Postby ParrLeyne » Thu Nov 26, 2009 7:38 pm

tear wrote:Still, it seems that the biggest contributor* to the "slowdown" is the filesystem. Ext3 seems to be the least affected (of the common filesystems), in contrast to ext4 or XFS.


FYI, my Ubuntu system is using ext4. :(

Re: A 55 minute delay in posting results and getting new WU???

Postby k1wi » Thu Nov 26, 2009 7:43 pm

I posted it in another thread, but using a ramfs as my working directory (with the bigadv VM) I managed to get my delay down to 20 seconds. For those without a UPS: so long as you set up your cron jobs to properly back up the data (i.e. copy to the HDD, and then make a backup of that copy to protect against power loss while overwriting), it should be possible to do this with the maximum loss being the time since the last backup (5-10 minutes, perhaps).
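To make that concrete, a rough sketch of the idea (the paths, tmpfs size and 5-minute interval are placeholders; k1wi used ramfs, tmpfs is shown here because it behaves similarly but accepts a size limit):
Code: Select all
# put the client's work directory on a RAM-backed filesystem (as root)
mount -t tmpfs -o size=2g tmpfs /opt/fah/work

# crontab entry: every 5 minutes copy the RAM copy to disk, writing to a temporary
# directory first so a power cut mid-copy cannot clobber the only good backup
*/5 * * * * rsync -a --delete /opt/fah/work/ /opt/fah/backup.tmp/ && rm -rf /opt/fah/backup && mv /opt/fah/backup.tmp /opt/fah/backup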

k1wi.

Re: A 55 minute delay in posting results and getting new WU???

Postby tear » Thu Nov 26, 2009 8:34 pm

Hmm, perhaps I was too quick to draw conclusions based on the filesystem itself.

Can anyone with ext4:
a) check the status of barrier usage: run "mount" (as root) and see if the filesystem of interest shows its barrier status (barrier=0|1) -- a quick sketch follows below
b) try turning barriers off and see if it makes any difference?
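For a), a quick way to see the active mount options (just a sketch; the exact output format varies with kernel version and filesystem):
Code: Select all
mount | grep -E 'ext4|xfs'        # look for barrier=0 or barrier=1 among the options
cat /proc/mounts                  # alternative view of the active mount options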

ad b)
It can be done at run time and requires a remount of the filesystem; e.g.
for ext4, something along the following lines (as root) should work:
Code: Select all
mount / -o barrier=0,remount

If necessary, replace / with the filesystem that stores the client (e.g. /home)

This change is *not* permanent, so a reboot will void it. Also, running
without barriers may bite you pretty hard in the case of a hard reset/power outage
-- you've been warned.


tear

Re: A 55 minute delay in posting results and getting new WU???

Postby tear » Fri Nov 27, 2009 1:05 am

Yup. That does the trick here (with XFS) -- WU write/clear times are back to ext3's values.

The reason ext3 does not degrade is that I/O barriers are disabled by default (also see http://en.wikipedia.org/wiki/Ext3#No_ch ... in_journal).
XFS and ext4 have had barriers enabled by default for some time now.

Just so we're clear -- I do not recommend disabling barriers.
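If anyone turned barriers off for the test and wants them back without waiting for a reboot, a remount mirroring the earlier command should do it (sketch only; ext4 shown, / as an example):
Code: Select all
mount / -o barrier=1,remount       # ext4: re-enable write barriers
# for XFS the options are barrier / nobarrier rather than barrier=1 / barrier=0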


tear

Re: A 55 minute delay in posting results and getting new WU???

Postby ParrLeyne » Wed Dec 02, 2009 2:33 am

An update...

I decided to create a new ext3 volume and moved my FAH install onto it, and the log of my most recent WU is as follows:

Code: Select all
[01:23:42] Completed 247500 out of 250000 steps  (99%)
[01:51:53] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 0.1 %
 Part of the total run time spent waiting due to load imbalance: 0.1 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %


   Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time: 165177.353 165177.353    100.0
                       1d21h52:57
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    535.040     29.182      0.510     47.065

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[01:52:00] DynamicWrapper: Finished Work Unit: sleep=10000
[01:52:10]
[01:52:10] Finished Work Unit:
[01:52:10] - Reading up to 52544928 from "work/wudata_01.trr": Read 52544928
[01:52:10] trr file hash check passed.
[01:52:10] - Reading up to 46176200 from "work/wudata_01.xtc": Read 46176200
[01:52:10] xtc file hash check passed.
[01:52:10] edr file hash check passed.
[01:52:10] logfile size: 251759
[01:52:10] Leaving Run
[01:52:10] - Writing 99137803 bytes of core data to disk...
[01:52:13]   ... Done.
[01:52:29] - Shutting down core
[01:52:29]
[01:52:29] Folding@home Core Shutdown: FINISHED_UNIT
Attempting to use an MPI routine after finalizing MPICH
[01:55:34] CoreStatus = 64 (100)
[01:55:34] Unit 1 finished with 67 percent of time to deadline remaining.
[01:55:34] Updated performance fraction: 0.671155
[01:55:34] Sending work to server
[01:55:34] Project: 2683 (Run 4, Clone 0, Gen 23)


[01:55:34] + Attempting to send results [December 2 01:55:34 UTC]
[01:55:34] - Reading file work/wuresults_01.dat from core
[01:55:34]   (Read 99137803 bytes from disk)
[01:55:34] Connecting to http://171.67.108.22:8080/
[02:09:26] Posted data.
[02:09:26] Initial: 0000; - Uploaded at ~116 kB/s
[02:09:27] - Averaged speed for that direction ~116 kB/s
[02:09:27] + Results successfully sent
[02:09:27] Thank you for your contribution to Folding@Home.
[02:09:27] + Number of Units Completed: 6

[02:09:30] - Warning: Could not delete all work unit files (1): Core file absent
[02:09:30] Trying to send all finished work units
[02:09:30] + No unsent completed units remaining.
[02:09:30] - Preparing to get new work unit...
[02:09:30] Cleaning up work directory
[02:09:30] + Attempting to get work packet
[02:09:30] - Will indicate memory of 5974 MB
[02:09:30] - Connecting to assignment server
[02:09:30] Connecting to http://assign.stanford.edu:8080/
[02:09:31] Posted data.
[02:09:31] Initial: 43AB; - Successful: assigned to (171.67.108.22).
[02:09:31] + News From Folding@Home: Welcome to Folding@Home
[02:09:31] Loaded queue successfully.
[02:09:31] Connecting to http://171.67.108.22:8080/
[02:10:11] Posted data.
[02:10:11] Initial: 0000; - Receiving payload (expected size: 30236207)
[02:11:29] - Downloaded at ~378 kB/s
[02:11:29] - Averaged speed for that direction ~757 kB/s
[02:11:29] + Received work.
[02:11:29] Trying to send all finished work units
[02:11:29] + No unsent completed units remaining.
[02:11:29] + Closed connections
[02:11:29]
[02:11:29] + Processing work unit
[02:11:29] Core required: FahCore_a2.exe
[02:11:29] Core found.
[02:11:29] Working on queue slot 02 [December 2 02:11:29 UTC]
[02:11:29] + Working ...
[02:11:29] - Calling './mpiexec -np 8 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 02 -checkpoint 3 -verbose -lifeline 1902 -version 624'

[02:11:29]
[02:11:29] *------------------------------*
[02:11:29] Folding@Home Gromacs SMP Core
[02:11:29] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[02:11:29]
[02:11:29] Preparing to commence simulation
[02:11:29] - Ensuring status. Please wait.
[02:11:32] Called DecompressByteArray: compressed_data_size=30235695 data_size=159270593, decompressed_data_size=159270593 diff=0
[02:11:32] - Digital signature verified
[02:11:32]
[02:11:32] Project: 2683 (Run 14, Clone 14, Gen 8)
[02:11:32]
[02:11:32] Assembly optimizations on if available.
[02:11:32] Entering M.D.
[02:11:43] (Run 14, Clone 14, Gen 8)
[02:11:43]
[02:11:43] Entering M.D.
NNODES=8, MYRANK=2, HOSTNAME=linuxfah-desktop
NNODES=8, MYRANK=4, HOSTNAME=linuxfah-desktop
NODEID=4 argc=20
NNODES=8, MYRANK=0, HOSTNAME=linuxfah-desktop
NODEID=0 argc=20
NNODES=8, MYRANK=7, HOSTNAME=linuxfah-desktop
NODEID=2 argc=20
NNODES=8, MYRANK=3, HOSTNAME=linuxfah-desktop
NODEID=3 argc=20
NNODES=8, MYRANK=6, HOSTNAME=linuxfah-desktop
NODEID=6 argc=20
NODEID=7 argc=20
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
NNODES=8, MYRANK=1, HOSTNAME=linuxfah-desktop
NODEID=1 argc=20
NNODES=8, MYRANK=5, HOSTNAME=linuxfah-desktop
NODEID=5 argc=20
Note: tpx file_version 48, software version 68

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 8 x 1 x 1
starting mdrun 'SINGLE VESICLE in water'
2250000 steps,   9000.0 ps (continuing from step 2000000,   8000.0 ps).

So, it seems that the "delay" problem is related to FAH disk IO patterns on ext4 volumes.
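For anyone who wants to repeat the same move, the rough steps look like this (a sketch only -- the device name and paths are placeholders, and the client must be stopped before copying):
Code: Select all
mkfs.ext3 /dev/sdb1                      # format a spare partition as ext3
mkdir -p /mnt/fah-ext3
mount /dev/sdb1 /mnt/fah-ext3
rsync -a /opt/fah/ /mnt/fah-ext3/fah/    # copy the FAH install to the new volume
# then start the client from the new location (or point a symlink at it)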

Re: A 55 minute delay in posting results and getting new WU???

Postby theo343 » Thu Dec 03, 2009 9:09 pm

My data is backed up in real time to an adjacent server. The time lost replacing a drive would be about 30 min.

The performance is definitely worth it.


Besides, your page file is more abusive to one of these drives than writing these data files every other day.

Even with a backup of the SSD it's still abuse of the SSD, and if you have to replace it, that's time spent on it. A RAM drive, if you can use one, is much better for that issue.

Re: A 55 minute delay in posting results and getting new WU???

Postby Viper666 » Tue Dec 08, 2009 7:12 am

Many said a RAM drive is the fix. I found this, maybe it will help someone: http://www.thegeekstuff.com/2008/11/ove ... -on-linux/
Powered by AMD and openSUSE 11.2 ---- The new servers have nothing to do with fixing the stats problem!

Re: A 55 minute delay in posting results and getting new WU???

Postby shdbcamping » Thu Jan 07, 2010 9:33 am

How fast we go with HW is up to our choice of HW.

Can we quit quibbling about esoteric stuff and get to -bigadv solutions?

Re: A 55 minute delay in posting results and getting new WU???

Postby Kougar » Fri Jan 08, 2010 8:21 am

shdbcamping wrote:Can we quit quibbling about esoteric stuff and get to -bigadv solutions?


That's a bit rude, don't you think? And the solution was posted earlier in the thread already. ;)
