
Re: Core hangs: jbd2_log_wait_commit

Posted: Sun Jul 26, 2009 5:53 pm
by tear
I came up with yet another workaround for this issue -- briefly explained here

It's quite feasible if you fold 24x7 and AC power is reliable. On the other hand, extra work is needed to restart
the client after a power interruption (or an operator-initiated reboot, for that matter).

Let me know if you're interested in giving it a shot -- thanks.


tear

Re: Core hangs: jbd2_log_wait_commit

Posted: Wed Jul 29, 2009 6:56 pm
by Anachron
I made use of your hints, tear. I now have a tmpfs mounted in a directory alongside the regular folding client. I modified the init script from fahwiki, so it now runs two rather crude scripts I have made (before the client starts and after it stops, respectively).

The first one simply copies the required folding files from the permanent directory to the tmpfs.
The second script copies work results from the temporary mount back to the permanent storage.

The second script is also set up as a cron job, executing every hour, for backup.
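
The setup described above could be sketched roughly as follows. This is only a minimal sketch, not the actual scripts: the directory names (PERM_DIR, TMP_DIR) and the function names are assumptions made for illustration, and the tmpfs is assumed to be mounted at TMP_DIR already.

```shell
#!/bin/bash
# Hypothetical locations; adjust to your own layout.
PERM_DIR="${PERM_DIR:-$HOME/.folding/smp}"       # permanent (on-disk) client directory
TMP_DIR="${TMP_DIR:-$HOME/.folding/smp-temp}"    # tmpfs mount point

# Run before the client starts: populate the tmpfs from permanent storage.
stage_in() {
    mkdir -p "$TMP_DIR" && cp -a "$PERM_DIR/." "$TMP_DIR/"
}

# Run after the client stops (and hourly from cron): copy results back.
stage_out() {
    mkdir -p "$PERM_DIR" && cp -a "$TMP_DIR/." "$PERM_DIR/"
}

# Usage: <script> in   (before client start)
#        <script> out  (after client stop, or from cron)
case "${1:-}" in
    in)  stage_in ;;
    out) stage_out ;;
esac
```

The init script would call the "in" action before launching the client and the "out" action after stopping it; an hourly cron entry would call "out" as well.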

Re: Core hangs: jbd2_log_wait_commit

Posted: Wed Jul 29, 2009 7:24 pm
by tear
Right on! Saving both time and hard drive :-)

Out of curiosity -- how many "past copies" do you keep? Is it just one or more?


tear

Re: Core hangs: jbd2_log_wait_commit

Posted: Thu Jul 30, 2009 5:35 pm
by Anachron
I actually keep two past copies, but it was mostly for testing purposes. Now that it seems like the system works as intended I guess I could update the script to only keep one copy.

Re: Core hangs: jbd2_log_wait_commit

Posted: Thu Jul 30, 2009 5:46 pm
by tear
Anachron wrote:I actually keep two past copies, but it was mostly for testing purposes. Now that it seems like the system works as intended I guess I could update the script to only keep one copy.

I think two copies are a good idea in case power goes down the moment you're making the "current" copy...

Just a thought from me,
tear
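
One way to keep two past copies so that an interrupted copy never touches the last good backup is sketched below. It is an illustration under assumptions, not either poster's actual script: the function name and the directory names (incoming, current, prev.1, prev.2) are made up for the example.

```shell
#!/bin/bash
# rotate_backup SRC DEST: keep DEST/current plus two older copies.
# Directory names (incoming, current, prev.1, prev.2) are illustrative only.
rotate_backup() {
    local src=$1 dest=$2
    mkdir -p "$dest" || return 1
    # Discard any half-finished copy left over from an interrupted run.
    rm -rf "$dest/incoming"
    # Copy into a scratch directory first; if power fails during the copy,
    # current, prev.1 and prev.2 are all still intact.
    cp -a "$src" "$dest/incoming" || return 1
    # Only after the copy has completed, shift the older generations down.
    rm -rf "$dest/prev.2"
    [ -d "$dest/prev.1" ] && mv "$dest/prev.1" "$dest/prev.2"
    [ -d "$dest/current" ] && mv "$dest/current" "$dest/prev.1"
    mv "$dest/incoming" "$dest/current"
}
```

The renames at the end are quick compared to the copy itself, so the window in which a power cut can leave the rotation half done is small, and even then the older generations remain on disk.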

Re: Core hangs: jbd2_log_wait_commit

Posted: Sat Aug 08, 2009 1:01 am
by tear
And something I've run into recently -- it happened that backup intervals lined
up (too well) with FahCore's checkpoints (and all backups for that WU were
screwed).

I incorporated kind of a fuzz factor in my backup script:
Code:
SLEEP=$((120*$RANDOM/32768))
echo sleeping $SLEEP seconds
sleep $SLEEP


This gives a random value in the 0-119 second range.

Just FYI

Cheers,
tear

Re: Core hangs: jbd2_log_wait_commit

Posted: Sat Aug 08, 2009 5:35 pm
by Anachron
Thanks for the heads up, tear. I had fun coming up with this addition to my backup script (as an alternative to the 'fuzz factor').

It checks for the modification time of the newest wudata_0*.log file, and waits if we're too close to an estimated write time.
There are internal checkpoints (that do not show up in FAHlog.txt) in at least some WUs. I have selected values for SAFE and UNSAFE based on the interval between internal checkpoints in my current WU (p2671).

Code:
#!/bin/bash

WORK_DIR="$HOME/.folding/smp-temp/work"   # ~ is not expanded inside quotes, so use $HOME
SAFE=5       # Consider it safe to start the copy once $SAFE seconds have passed since the last file modification
UNSAFE=180   # Consider it unsafe to start the copy once $UNSAFE seconds have passed since the last file modification

# Set DIFFTIME to the age, in seconds, of $FILE
check_time() {
   CURRENTTIME=$(date +%s)
   MODTIME=$(stat -c %Y "$FILE")
   DIFFTIME=$((CURRENTTIME - MODTIME))
   echo "The file is $DIFFTIME seconds old"
}

cd "$WORK_DIR" || exit 1
# Newest wudata_*.log file (ls -t lists the most recently modified first)
FILE=$(ls -t wudata_*.log | head -n 1)
[ -n "$FILE" ] || exit 1

check_time
while (( DIFFTIME < SAFE || DIFFTIME > UNSAFE ))
do
   # Wait for a safe timeslot, i.e. when the files are unlikely to be changed
   echo "Waiting..."
   sleep 5
   check_time
done
echo "It is now safe to make a backup ..."



tear wrote:I think two copies are a good idea in case power goes down the moment you're making the "current" copy...

This is true.

Edit: Fixed 1 small error in script: WORK_DIR variable was missing quotes.

Re: Core hangs: jbd2_log_wait_commit

Posted: Sat Aug 08, 2009 6:47 pm
by tear
Yeah, I too noticed those "extra" checkpoints.

Also, I think your solution is way more robust than a random sleep (as there actually is a way
to check when the last checkpoint was taken, why not make use of it?).

Well thought!


tear

Re: Core hangs: jbd2_log_wait_commit

Posted: Mon Aug 10, 2009 4:58 am
by bruce
The various FahCores use code from different sources. That means that POTENTIALLY, each FahCore can have a different way of deciding when to do the next checkpoint. This isn't officially documented anywhere that I've seen. Nevertheless, I've observed at least three different methods.

In the following, the word "frame" means the time between progress messages in FAHlog.txt -- typically every 1%. When I talk about 15 minutes, I'm really referring to the user defined checkpoint interval which has a default setting of 15 minutes. Adapt what I say to your actual setting.

1) Perform a checkpoint at the end of every frame. If the time between frames exceeds the user defined interval (15 minutes) then perform an additional checkpoint at 15/30/45/... minutes after that frame started, even if the frame will end at, say, 15.1 minutes. Reset the timer from the start of the next frame.

2) Perform a checkpoint every 15/30/45/... minutes from the time the FahCore was started (ignore frames). Checkpoints may or may not also be performed at the end of each frame.

3) There's also some strange sequence that I never figured out which didn't follow the 15 minute interval setting but somehow injected additional checkpoints such as after 12 minutes when the frame was going to be 15.1 minutes.

In some cases, I'm pretty sure that the end-of-frame checkpoints are separate files from the timer based checkpoints and if so, I'm not sure if a restart uses the one you might expect it to use.
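
To make the 15.1-minute case in method 1 concrete, here is a small sketch that lists when checkpoints would fall within a single frame. It only models the behaviour as observed above; none of this is documented FahCore behaviour, and the function name and the use of whole seconds are assumptions for the example.

```shell
#!/bin/bash
# checkpoints_in_frame FRAME_SECS INTERVAL_SECS
# Prints the offsets (seconds from the start of the frame) at which method 1
# would checkpoint: one per elapsed INTERVAL inside the frame, plus one at
# the end of the frame.
checkpoints_in_frame() {
    local frame=$1 interval=$2 t
    for (( t = interval; t < frame; t += interval )); do
        echo "$t"
    done
    echo "$frame"
}
```

For a 15.1-minute frame (906 seconds) and the default 15-minute interval (900 seconds), `checkpoints_in_frame 906 900` prints 900 and 906: two checkpoints only six seconds apart. A backup timed against just one of them can easily collide with the other.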

Re: Core hangs: jbd2_log_wait_commit

Posted: Tue Aug 11, 2009 8:33 pm
by Anachron
That is interesting.
It seems the checkpoint at the end of a frame, and the "internal" checkpoints, are logged in wudata_0x.log.
I'm not sure about the 15-minute checkpoints, though.
I would have to time the client from start and check what files are changed after 15 minutes.

However, so far I have had no problems restoring the client from backup. If problems arise I will keep your points in mind.
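
The check mentioned above (which files change after 15 minutes) could be done with a marker file and `find -newer`. This is a minimal sketch; the function name and the marker path in the usage note are hypothetical.

```shell
#!/bin/bash
# changed_since MARKER DIR
# Lists regular files under DIR whose modification time is later than MARKER's.
changed_since() {
    find "$2" -type f -newer "$1" | sort
}
```

For example, run `touch /tmp/fah_start` right when the client starts, wait a little longer than the checkpoint interval, then run `changed_since /tmp/fah_start "$HOME/.folding/smp-temp/work"` to see which files were written in the meantime.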

Re: Core hangs: jbd2_log_wait_commit

Posted: Mon Nov 16, 2009 7:50 pm
by rickoic
I noticed the same problem when I was using an ATA HD. So I changed to a SATA2 drive and found I had the same length of delay. Using Ubuntu 9.10 on a Z8PE-D12 mb with Xeon 5520's x 2.
My time from when it finishes saying 'done' to when it proceeds is 66 minutes. Transmit time is 41 minutes, but that's a function of my DSL connection.

On my i7 3.06GHz the time from 'done' to picking up again is 1 hour 21 minutes. Transmit time is 42 minutes.

5520's use a Seagate Barracuda 1.5TB
i7 3.06 uses a WD green 1.0TB

I don't know what the difference in the drives is, but the write time (hang time) amounts to a 15 minute difference.


Tks
Rick

Re: Core hangs: jbd2_log_wait_commit

Posted: Mon Nov 16, 2009 8:45 pm
by tear
Re background uploads -- you might be interested in this.

tear

Re: Core hangs: jbd2_log_wait_commit

Posted: Wed Nov 25, 2009 9:14 pm
by rickoic
I'm a noob with Linux, but is there a way to make it write more than 4096 bytes at a time? Say, 8192 would cut the time almost in half, and 16384 would bring it down to around 25%.

Is it possible, or even doable?

Tks
Rick

Re: Core hangs: jbd2_log_wait_commit

Posted: Wed Nov 25, 2009 9:38 pm
by tear
Absolutely. It is doable.

Though the big question is what the client/core is trying to accomplish there. It clearly seems
someone is trying to do something along the lines of atomic operation (perhaps to avoid
accidental reuse of stale data) except ... that's not the way to do it.



tear
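
For reference, a commonly used pattern for replacing a file atomically is to write to a temporary file in the same directory, flush it to disk once, and then rename it over the target. The sketch below illustrates that general pattern only; it is not what the client or core does, the function name is made up, and `conv=fsync` assumes GNU dd.

```shell
#!/bin/bash
# atomic_write TARGET: read stdin, then replace TARGET in a single step.
atomic_write() {
    local target=$1 tmp
    # The temporary file lives next to the target so the rename below
    # stays within one filesystem. (mktemp creates it with mode 0600.)
    tmp=$(mktemp "$target.XXXXXX") || return 1
    # conv=fsync makes dd flush the written data to disk before it exits.
    if dd of="$tmp" bs=64k conv=fsync 2>/dev/null; then
        # A rename within one filesystem is atomic: readers see either
        # the old file or the new one, never a partial write.
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would look like `some_command | atomic_write /path/to/checkpoint`, where both names are placeholders. With this pattern the data is flushed once per file rather than once per small write.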

Re: Core hangs: jbd2_log_wait_commit

Posted: Sun Nov 29, 2009 12:13 am
by rickoic
tear wrote:Absolutely. It is doable.

Though the big question is what the client/core is trying to accomplish there. It clearly seems
someone is trying to do something along the lines of atomic operation (perhaps to avoid
accidental reuse of stale data) except ... that's not the way to do it.



tear


So, do you think the programmers at Stanford would let us know their reasoning, and whether it would be permissible to change the write size?

Tks
Rick