Core hangs: jbd2_log_wait_commit

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Sun Jul 26, 2009 5:53 pm

I came up with yet another workaround for this issue -- briefly explained here

It's quite feasible if you fold 24x7 and AC power is reliable. On the other hand, extra work is needed to restart
the client after a power interruption (or an operator-initiated reboot, for that matter).

Let me know if you're interested in giving it a shot -- thanks.


tear
One man's ceiling is another man's floor.
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Wed Jul 29, 2009 6:56 pm

I made use of your hints, tear. I now have a tmpfs mounted in a directory alongside the regular folding client. I modified the init script from fahwiki, so it now runs two rather crude scripts I have made (before the client starts and after it stops, respectively).

The first one simply copies the required folding files from the permanent directory to the tmpfs.
The second script copies work results from the temporary mount back to the permanent storage.

The second script is also set up as a cron job, executing every hour, for backup.
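
A minimal sketch of what two such copy scripts might look like. The function names, directory arguments and the tmpfs size in the comment are assumptions for illustration, not the actual scripts described above.

```shell
#!/bin/bash
# Sketch only: names and paths are hypothetical.
# The tmpfs itself would be mounted beforehand (as root), for example:
#   mount -t tmpfs -o size=512m tmpfs "$HOME/.folding/smp-temp"
# (the 512m size is an arbitrary example value)

# Before the client starts: copy the client files into the tmpfs.
stage_to_tmpfs() {
    local permanent_dir=$1 tmpfs_dir=$2
    mkdir -p "$tmpfs_dir" && cp -a "$permanent_dir/." "$tmpfs_dir/"
}

# After the client stops (and hourly from cron): copy work results back.
save_from_tmpfs() {
    local tmpfs_dir=$1 permanent_dir=$2
    mkdir -p "$permanent_dir" && cp -a "$tmpfs_dir/." "$permanent_dir/"
}
```

The init script would call `stage_to_tmpfs` before starting the client and `save_from_tmpfs` after stopping it; the cron job would call only `save_from_tmpfs`.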
Time flies like an arrow; fruit flies like a banana
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Wed Jul 29, 2009 7:24 pm

Right on! Saving both time and hard drive :-)

Out of curiosity -- how many "past copies" do you keep? Is it just one or more?


tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Thu Jul 30, 2009 5:35 pm

I actually keep two past copies, but it was mostly for testing purposes. Now that it seems like the system works as intended I guess I could update the script to only keep one copy.
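
Keeping two past copies could be sketched as a simple rotation. The `backup.1` / `backup.2` / `backup.tmp` names and the function name are assumptions, not taken from the actual script. Writing the new copy under a temporary name first means an interrupted copy never replaces a good backup.

```shell
#!/bin/bash
# Sketch only: directory naming is hypothetical.

# Keep two past copies: backup.1 is the newest, backup.2 the one before it.
rotate_and_backup() {
    local source_dir=$1 backup_root=$2
    mkdir -p "$backup_root" || return 1
    # Copy under a temporary name so an interruption mid-copy
    # leaves the existing backups untouched.
    rm -rf "$backup_root/backup.tmp"
    cp -a "$source_dir" "$backup_root/backup.tmp" || return 1
    # Only after the copy has finished, shift the older copies down.
    rm -rf "$backup_root/backup.2"
    if [ -d "$backup_root/backup.1" ]; then
        mv "$backup_root/backup.1" "$backup_root/backup.2"
    fi
    mv "$backup_root/backup.tmp" "$backup_root/backup.1"
}
```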
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Thu Jul 30, 2009 5:46 pm

Anachron wrote: I actually keep two past copies, but it was mostly for testing purposes. Now that it seems like the system works as intended I guess I could update the script to only keep one copy.

I think two copies are a good idea in case power goes down the moment you're making the "current" copy...

Just a thought from me,
tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Sat Aug 08, 2009 1:01 am

And something I've run into recently -- the backup intervals happened to line
up (too well) with FahCore's checkpoints, and all backups for that WU were
screwed.

I incorporated kind of a fuzz factor in my backup script:
Code:
SLEEP=$((120*$RANDOM/32768))
echo sleeping $SLEEP seconds
sleep $SLEEP


This gives a random value in the 0-119 second range.

Just FYI

Cheers,
tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Sat Aug 08, 2009 5:35 pm

Thanks for the heads up, tear. I had fun coming up with this addition to my backup script (as an alternative to the 'fuzz factor').

It checks the modification time of the newest wudata_0*.log file, and waits if we're too close to an estimated write time.
There are internal checkpoints (that do not show up in FAHlog.txt) in at least some WUs. I have selected values for SAFE and UNSAFE based on the interval between internal checkpoints in my current WU (p2671).

Code:
#!/bin/bash

WORK_DIR="$HOME/.folding/smp-temp/work"   # "~" is not expanded inside quotes, so use $HOME
SAFE=5       # Consider it safe to start the copy once $SAFE seconds have passed since the last file modification
UNSAFE=180   # Consider it unsafe to start the copy once more than $UNSAFE seconds have passed since the last file modification

# Store the age of $FILE, in seconds, in DIFFTIME
check_time() {
   CURRENTTIME=$(date +%s)
   MODTIME=$(stat -c %Y "$FILE")
   DIFFTIME=$((CURRENTTIME - MODTIME))
   echo "The file is $DIFFTIME seconds old"
}

cd "$WORK_DIR" || exit 1
# Newest wudata_*.log file in the work directory
FILE=$(ls -t wudata_*.log | head -n 1)
[ -n "$FILE" ] || exit 1

check_time
while [[ $DIFFTIME -lt $SAFE || $DIFFTIME -gt $UNSAFE ]]
do
   # Wait for a safe timeslot, i.e. when the files are unlikely to be changed
   echo "Waiting..."
   sleep 5
   check_time
done
echo "It is now safe to make a backup ..."
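
One caveat with a loop like the one above: if the core is not running, the file only gets older, so the "unsafe" condition never clears and a cron job would wait forever. A hedged sketch of the same age check with an upper bound on the wait follows; the function name and the `max_wait` parameter are assumptions, not part of the script above. It relies on GNU `stat -c %Y`, as the script above does.

```shell
#!/bin/bash
# Sketch only: wait_for_safe_slot and max_wait are hypothetical additions.

# Wait until the file's age is between safe and unsafe seconds (inclusive),
# giving up after roughly max_wait seconds.
# Returns 0 when it is safe to copy, 1 on timeout or if the file is missing.
wait_for_safe_slot() {
    local file=$1 safe=$2 unsafe=$3 max_wait=$4
    local waited=0 age
    while :; do
        [ -e "$file" ] || return 1
        age=$(( $(date +%s) - $(stat -c %Y "$file") ))
        if [ "$age" -ge "$safe" ] && [ "$age" -le "$unsafe" ]; then
            return 0
        fi
        [ "$waited" -ge "$max_wait" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

A backup script could then run the copy only when `wait_for_safe_slot "$FILE" 5 180 600` succeeds, and skip this round otherwise.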



tear wrote: I think two copies are a good idea in case power goes down the moment you're making the "current" copy...

This is true.

Edit: Fixed 1 small error in script: WORK_DIR variable was missing quotes.
Last edited by Anachron on Sat Aug 08, 2009 9:40 pm, edited 1 time in total.
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Sat Aug 08, 2009 6:47 pm

Yeah, I too noticed those "extra" checkpoints.

Also, I think your solution is way more robust than random sleep (since there actually is a way
to check when the last checkpoint was taken, why not make use of it?).

Well thought!


tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby bruce » Mon Aug 10, 2009 4:58 am

The various FahCores use code from different sources. That means that POTENTIALLY, each FahCore can have a different way of deciding when to do the next checkpoint. This isn't officially documented anywhere that I've seen. Nevertheless, I've observed at least three different methods.

In the following, the word "frame" means the time between progress messages in FAHlog.txt -- typically every 1%. When I talk about 15 minutes, I'm really referring to the user-defined checkpoint interval, which has a default setting of 15 minutes. Adapt what I say to your actual setting.

1) Perform a checkpoint at the end of every frame. If the time between frames exceeds the user-defined interval (15 minutes), then perform an additional checkpoint at 15/30/45/... minutes after that frame started, even if the frame will end at, say, 15.1 minutes. Reset the timer from the start of the next frame.

2) Perform a checkpoint every 15/30/45/... minutes from the time the FahCore was started (ignore frames). Checkpoints may or may not also be performed at the end of each frame.

3) There's also some strange sequence that I never figured out which didn't follow the 15 minute interval setting but somehow injected additional checkpoints such as after 12 minutes when the frame was going to be 15.1 minutes.

In some cases, I'm pretty sure that the end-of-frame checkpoints are separate files from the timer based checkpoints and if so, I'm not sure if a restart uses the one you might expect it to use.
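
To make method 1 concrete, here is a small sketch that lists when checkpoints would fall within a single frame under that description. It only illustrates the timing rule as stated above; it is not FahCore code, and the function name is made up.

```shell
#!/bin/bash
# Sketch of method 1 only, following the description above.

# Print checkpoint offsets (seconds from the start of the frame):
# one at every multiple of the interval that falls inside the frame,
# plus one at the end of the frame.
frame_checkpoints() {
    local frame_secs=$1 interval_secs=$2
    local t=$interval_secs out=""
    while [ "$t" -lt "$frame_secs" ]; do
        out="$out$t "
        t=$((t + interval_secs))
    done
    echo "$out$frame_secs"
}
```

For a 15.1-minute frame (906 seconds) and the default 15-minute interval (900 seconds), `frame_checkpoints 906 900` prints `900 906`: the timer checkpoint is followed six seconds later by the end-of-frame one.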
bruce
 
Posts: 22322
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Tue Aug 11, 2009 8:33 pm

That is interesting.
It seems the checkpoints at the end of a frame, and the "internal" checkpoints, are logged in wudata_0x.log.
I'm not sure about the 15-minute checkpoints, though.
I would have to time the client from start and check which files have changed after 15 minutes.

However, so far I have had no problems restoring the client from backup. If problems arise I will keep your points in mind.
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Core hangs: jbd2_log_wait_commit

Postby rickoic » Mon Nov 16, 2009 7:50 pm

I noticed the same problem when I was using an ATA HD, so I changed to a SATA2 drive and found the delay was just as long. I'm using Ubuntu 9.10 on a Z8PE-D12 motherboard with two Xeon 5520s.
The time from when it says 'done' to when it proceeds is 66 minutes. Transmit time is 41 minutes, but that's a function of my DSL connection.

On my i7 3.06GHz the time from 'done' to picking up again is 1 hour 21 minutes. Transmit time is 42 minutes.

The 5520s use a Seagate Barracuda 1.5TB
The i7 3.06 uses a WD Green 1.0TB

I don't know what the difference between the drives is, but the write time (hang time) differs by 15 minutes.


Tks
Rick
Dual 2.8 3 250's Quad 2.4 285. 260, Quad 2.4 3 250 , i7 2.27 2 250 GPU's, i7 2.24 2 250 GPU's, i7 3.06 bigadv, dual Xeon 2.27 bigadv, AMD Phenom ][ 3 250 GPU's, Laptop GT 130M.
I'm folding because Dec 2005 I had radical prostate surgery.
rickoic
 
Posts: 306
Joined: Sat May 23, 2009 4:49 pm
Location: Mississippi near Memphis, Tn

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Mon Nov 16, 2009 8:45 pm

Re background uploads -- you might be interested in this.

tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby rickoic » Wed Nov 25, 2009 9:14 pm

I'm a Linux noob, but is there a way to make it write more than 4096 bytes at a time? Writing, say, 8192 bytes at a time would cut the time almost in half, and 16384 would bring it down to around 25%.

Is it possible, or even doable?
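
The arithmetic behind that estimate can be sketched as below. It assumes, without confirmation anywhere in this thread, that the hang time scales with the number of write calls; the function name and the 100 MiB file size in the usage note are made-up illustration values.

```shell
#!/bin/bash
# Sketch only: assumes the delay is proportional to the number of writes.

# Number of write calls needed to write file_bytes in chunks of write_bytes
# (integer ceiling division).
writes_needed() {
    local file_bytes=$1 write_bytes=$2
    echo $(( (file_bytes + write_bytes - 1) / write_bytes ))
}
```

For a hypothetical 104857600-byte (100 MiB) file, `writes_needed 104857600 4096` gives 25600, while 8192-byte and 16384-byte writes give 12800 and 6400, i.e. half and a quarter as many calls.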

Tks
Rick
rickoic
 
Posts: 306
Joined: Sat May 23, 2009 4:49 pm
Location: Mississippi near Memphis, Tn

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Wed Nov 25, 2009 9:38 pm

Absolutely. It is doable.

Though the big question is what the client/core is trying to accomplish there. It clearly seems
someone is trying to do something along the lines of an atomic operation (perhaps to avoid
accidental reuse of stale data) except ... that's not the way to do it.



tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby rickoic » Sun Nov 29, 2009 12:13 am

tear wrote: Absolutely. It is doable.

Though the big question is what client/core are trying to accomplish there. It clearly seems
someone is trying to do something along the lines of atomic operation (perhaps to avoid
accidental reuse of stale data) except ... that's not the way to do it.



tear


So, do you think the programmers at Stanford would let us know their reasoning, and whether it would be permissible to change the write size?

Tks
Rick
rickoic
 
Posts: 306
Joined: Sat May 23, 2009 4:49 pm
Location: Mississippi near Memphis, Tn
