Core hangs: jbd2_log_wait_commit

Moderators: Site Moderators, PandeGroup

Core hangs: jbd2_log_wait_commit

Postby Anachron » Tue Jul 07, 2009 5:05 pm

My smp client hangs:
Code: Select all
[16:19:42] Completed 247500 out of 250000 steps  (99%)
[16:27:03] Completed 250000 out of 250000 steps  (100%)
[16:27:04] DynamicWrapper: Finished Work Unit: sleep=10000
[16:27:14]
[16:27:14] Finished Work Unit:
[16:27:14] - Reading up to 21182544 from "work/wudata_06.trr": Read 21182544
[16:27:15] trr file hash check passed.
[16:27:15] - Reading up to 27685396 from "work/wudata_06.xtc": Read 27685396
[16:27:15] xtc file hash check passed.
[16:27:15] edr file hash check passed.
[16:27:15] logfile size: 184454
[16:27:15] Leaving Run
[16:27:20] - Writing 49197146 bytes of core data to disk...
[16:27:20]   ... Done.                                  <-- Hangs here
[16:43:37] - Shutting down core
[16:43:37]
[16:43:37] Folding@home Core Shutdown: FINISHED_UNIT
[16:44:36] CoreStatus = 64 (100)

In the system monitor I can see one FahCore_a2.exe still running, and it has the waiting code jbd2_log_wait_commit.
While the client seems to hang, my hard drive sounds like it is being written to continuously.

This is on Linux Mint x64, with kernel 2.6.28-13.
Last edited by Anachron on Wed Jul 08, 2009 12:58 pm, edited 1 time in total.
Time flies like an arrow; fruit flies like a banana
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Client hangs: jbd2_log_wait_commit

Postby tear » Tue Jul 07, 2009 8:23 pm

Any hanging (that's not sleeping) in kernel mode is a red flag.

jbd2_log_commit comes from code delivering journalled block device layer
to some filesystems (notably ext3 and ext4) and hanging there may indicate
(my immediate thoughts):
- failing hard drive, causing issues on jbd layer
- kernel bug (fs/jbd, might also be something else) manifesting this way

My suggestion? Check output of "dmesg" command for possible clues;
next you might want to check/repair your filesystem with appropriate
tools.

tear
One man's ceiling is another man's floor.
Image
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Client hangs: jbd2_log_wait_commit

Postby tear » Tue Jul 07, 2009 9:55 pm

Hmm seems I've misunderstood your post. So it does NOT hang indefinitely (and eventually proceeds), right?

I'm seeing very similar behavior with XFS. It's related to client (or was it core, can't
remember off the top of my head) clearing the result file with 4k zero-blocks
and calling fsync() after every single one (checked with strace). It's sooo silly.

I brought it up in one post, lemme look it up....

EDIT: link to referred post
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Client hangs: jbd2_log_wait_commit

Postby bruce » Tue Jul 07, 2009 10:02 pm

I think there's an important distinction here. The client did not hang . . . the core did. Externally most people only see that the client is not proceeding so just as you did in the title, it is attributed to a problem in the client, when realistically, the client is doing exactly what it is supposed to do . . . waiting on the core(s) to finish the WU. As long as FahCore_a2 is waiting on (whatever), the client is not supposed to proceed.

Actually, this is good news. Assuming that there is enough information available for someone to fix the core, it's a lot easier for the Pande Group to distribute a new core than to distribute a new client.

At the point of failure, what files are present in /work?
bruce
 
Posts: 22814
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Client hangs: jbd2_log_wait_commit

Postby tear » Tue Jul 07, 2009 10:22 pm

Dear Bruce,

It's not a failure per se; it's an inefficiency of writing/clearing result file.

Result file gets written to/cleared without problems; it just takes time
on some filesystems due to way writes to file are done:

write(fd, result, 4096);
fsync(fd);
write(fd, result, 4096);
fsync(fd);
write(fd, result, 4096);
fsync(fd);

Granted, under some circumstances doing so (or even opening file with O_SYNC
so manual fsync() is not necessary) is warranted, however I don't think the whole
file needs to be written/cleared this way (besides, why not do everything and then
call fsync() once?).


tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Client hangs: jbd2_log_wait_commit

Postby bruce » Wed Jul 08, 2009 3:14 am

Understood. 50MB in 4K blocks probably represents ~1250 unnecessary fsync calls that can probably be cached and then synced only once. There may be a reason they wrote it that way, but I can't think of what it might be.

I do believe they want to see the return code before they try to upload the file, just in case a disk got full or something like that.
bruce
 
Posts: 22814
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Client hangs: jbd2_log_wait_commit

Postby toTOW » Wed Jul 08, 2009 10:45 am

I don't know what happens here, but that's definitely doing something funky with the storage ...

The phenomenon is even worse on slow drives. Here is the log from a server that uses network storage drive (iSCSI) :

Code: Select all
[05:10:53] - Writing 48598839 bytes of core data to disk...
[05:10:55]   ... Done.
[05:48:25] - Shutting down core
[05:48:25]
[05:48:25] Folding@home Core Shutdown: FINISHED_UNIT
[05:50:32] CoreStatus = 64 (100)


37 minutes lost :roll: ... and the same phenomenon occurs when it tries to clean the uploaded files :

Code: Select all
[05:50:32] + Attempting to send results [July 6 05:50:32 UTC]
[05:50:32] - Reading file work/wuresults_03.dat from core
[05:50:33]   (Read 48598839 bytes from disk)
[05:50:33] Connecting to http://171.64.65.56:8080/
[05:50:40] Posted data.
[05:50:40] Initial: 0000; - Uploaded at ~5932 kB/s
[05:50:41] - Averaged speed for that direction ~2769 kB/s
[05:50:41] + Results successfully sent
[05:50:41] Thank you for your contribution to Folding@Home.
[05:50:41] + Number of Units Completed: 21

[06:11:23] - Warning: Could not delete all work unit files (3): Core file absent
[06:11:23] Trying to send all finished work units
[06:11:23] + No unsent completed units remaining.
[06:11:23] - Preparing to get new work unit...


Another 20 minutes lost ... :(
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

Image
User avatar
toTOW
Site Moderator
 
Posts: 8785
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Client hangs: jbd2_log_wait_commit

Postby tear » Wed Jul 08, 2009 12:32 pm

I worked around it (had only one machine with i-RAM) by running client off /dev/shm/
(a ram-disk, virtually) and taking periodic (6 hours, cron-driven) backups (/dev/shm
-> hard drive).

I had also experimented with LD_PRELOADed custom library that would provide
no-op versions of fsync()/msync() but I gave up that idea (can't remember why;
it was more than year ago).


tear


EDIT: fixed grammar
Last edited by tear on Wed Jul 08, 2009 2:42 pm, edited 1 time in total.
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Wed Jul 08, 2009 1:03 pm

bruce: It sounds more likely to be the core that hangs. I changed the topic title.
I don't know what files were in /work when it hung.

toTOW: Your logs are the same as mine. I didn't have a log for the second hang because I restarted the client.

tear, if I understand you correctly, the core does not actually hang, it just uses an inefficient routine to write to disk?

Strange thing is, I did not see this earlier when I had ext3. It is only recently, after I reformatted with ext4, that I noticed it.

Can it be a bug in ext4?
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Core hangs: jbd2_log_wait_commit

Postby toTOW » Wed Jul 08, 2009 2:52 pm

I agree that with ext3 on physical drive, this phenomenon is less visible ... I mainly see it because of the network drive mentioned above (it uses ext3 too, but its output is limited).

Here what I get on a VM using ext3 (host disk is on SATA2 interface) :

Code: Select all
[13:32:03] - Writing 49255235 bytes of core data to disk...
[13:32:04]   ... Done.
[13:32:09] - Shutting down core
[13:32:09]
[13:32:09] Folding@home Core Shutdown: FINISHED_UNIT
[13:35:22] CoreStatus = 64 (100)


Only 3 min 30 lost here ...

Code: Select all
[13:43:42] + Results successfully sent
[13:43:42] Thank you for your contribution to Folding@Home.
[13:43:42] + Number of Units Completed: 33

[13:43:45] - Warning: Could not delete all work unit files (3): Core file absent
[13:43:45] Trying to send all finished work units
[13:43:45] + No unsent completed units remaining.


And this is barely noticeable when it tries to delete sent work ...
User avatar
toTOW
Site Moderator
 
Posts: 8785
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Wed Jul 08, 2009 2:54 pm

Anachron wrote:tear, if I understand you correctly, the core does not actually hang(...)

You tell me :). I just assumed (having looked at your log) that eventually whole process
(writing results and sending WU) completed without manual intervention.

Anachron wrote:(...) it just uses an inefficient routine to write to disk?

IMNSHO, yes.

Anachron wrote:Strange thing is, I did not see this earlier when I had ext3. It is only recently, after I reformatted with ext4, that I noticed it.

Can it be a bug in ext4?

Honestly? Don't know. It might; it might not. I'm not filesystem expert though I've been
tracking XFS ml for several years now.

All I can tell you is that semantics of fsync() might have changed in ext4 as a result
of a deliberate decision (to make it more "conforming" with standards/expectations)
too. I'm not dismissing a bug either though one would need to prepare much more specific
report than "FahCore writes take too long" for ext4 folks (separate test case that makes
use of the same syscalls is a good start).

Also, there are run-time variables noticeably affecting write/fsync performance, like, for
instance, journalling mode (writeback/ordered/journal) and I don't know what the defaults
are in kernels you used (assuming you did not override them).


tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Sat Jul 11, 2009 1:48 am

Additional info (FWIW; strace -p excerpt and other minor bits):

Code: Select all
fsync(6)                                = 0
write(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(6)                                = 0
write(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(6)                                = 0
write(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(6)                                = 0
write(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(6)                                = 0
write(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(6^C <unfinished ...>
Process 2470 detached
[fah@matroska ~]$ ls -l /proc/2470/fd
total 0
lrwx------ 1 fah fah 64 Jul 10 19:41 0 -> /dev/pts/1
l-wx------ 1 fah fah 64 Jul 10 19:41 1 -> pipe:[8256]
l-wx------ 1 fah fah 64 Jul 10 19:41 2 -> pipe:[8257]
lrwx------ 1 fah fah 64 Jul 10 19:41 3 -> /fah/clients/fah/FAHlog.txt
l-wx------ 1 fah fah 64 Jul 10 19:41 4 -> /fah/clients/fah/work/logfile_04.txt
l-wx------ 1 fah fah 64 Jul 10 19:41 5 -> pipe:[8256]
l-wx------ 1 fah fah 64 Jul 10 19:41 6 -> /fah/clients/fah/work/wudata_04.trr
l-wx------ 1 fah fah 64 Jul 10 19:41 7 -> pipe:[8257]
[fah@matroska ~]$ ps faux > /tmp/faux
[fah@matroska ~]$ ps aux -L > /tmp/aux-L
[fah@matroska ~]$ ps 2470
  PID TTY      STAT   TIME COMMAND
 2470 pts/1    DNl+ 593:28 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 04 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 2461 -version 624

Soo.. it's clearing .trr file actually.

Log-wise we are...
Code: Select all
[01:40:26] xtc file hash check passed.
[01:40:26] edr file hash check passed.
[01:40:26] logfile size: 205372
[01:40:26] Leaving Run
[01:40:29] - Writing 100183988 bytes of core data to disk...
[01:40:30]   ... Done.
here <-------


EDIT: after some time it switched to .xtc (doing same thing: write 4k zeros, fsync()), then .cpt file, then .tpr file.

EDIT2: per "vmstat 1" it's doing 200kB/s (pain, pain, hurt, hurt)

EDIT3: sheesh, see this:
Code: Select all
[fah@matroska ~]$ strace -efile -p 2470
Process 2470 attached - interrupt to quit
unlink("work/wudata_04.tpr")            = 0
open("work/wudata_04.dyn", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.edr", O_WRONLY|O_SYNC) = 6
unlink("work/wudata_04.edr")            = 0
open("work/wudata_04.gro", O_WRONLY|O_SYNC) = 6
unlink("work/wudata_04.gro")            = 0
open("work/wudata_04.log", O_WRONLY|O_SYNC) = 6
unlink("work/wudata_04.log")            = 0
open("work/wudata_04.sas", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.goe", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.pdo", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.xvg", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.bxv", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.dat", O_WRONLY|O_SYNC) = 6
unlink("work/wudata_04.dat")            = 0
open("work/wuinfo_04.dat", O_WRONLY|O_SYNC) = 6
unlink("work/wuinfo_04.dat")            = 0
open("work/wudata_04.err", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.log", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.key", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.xyz", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/wudata_04.dyn", O_WRONLY|O_SYNC) = -1 ENOENT (No such file or directory)
open("work/core78.sta", O_WRONLY|O_SYNC) = 6
unlink("work/core78.sta")               = 0
Process 2470 detached

So not only it opens files with O_SYNC but explicitly calls fsync() as well (redundant).
Afterwards, it *removes* the file. I am sorry but I'm not getting this whole
"logic".

It finished...
Code: Select all
[02:13:49] - Shutting down core
[02:13:49]
[02:13:49] Folding@home Core Shutdown: FINISHED_UNIT


If file is to be removed why not just unlink it? Or do a truncate()/ftruncate()/open with O_TRUNC, whatever...

Can anyone authoritatively say what the purpose of clearing files by synchronously writing zeroes is?


tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby tear » Sat Jul 11, 2009 2:49 am

Yet another update -- it seems *client* does things the same way:

Code: Select all
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(4)                                = 0
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(4)                                = 0
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(4)                                = 0
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(4)                                = 0
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(4)                                = 0
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
fsync(4^C <unfinished ...>
Process 2462 detached
[fah@matroska ~]$ ls -l /proc/2462/fd
total 0
lrwx------ 1 fah fah 64 Jul 10 20:40 0 -> /dev/pts/1
lrwx------ 1 fah fah 64 Jul 10 20:40 1 -> /dev/pts/1
lrwx------ 1 fah fah 64 Jul 10 19:41 2 -> /dev/pts/1
lrwx------ 1 fah fah 64 Jul 10 20:40 3 -> /fah/clients/fah/FAHlog.txt
l-wx------ 1 fah fah 64 Jul 10 20:40 4 -> /fah/clients/fah/work/wuresults_04.dat
[fah@matroska ~]$ ps aux -L | grep 2462
fah   2461  2462  0.0    3  0.0  25932   488 pts/1    DNl+ 06:43   0:00 ./fah6 -oneunit
fah   8211  8211  0.0    1  0.0  85000   784 pts/2    S+   20:40   0:00 grep 2462
[fah@matroska ~]$



tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Sat Jul 11, 2009 9:47 am

That does not look good at all. Why would they do it like that?

I'm hoping this is fixed in SMP2.
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Re: Core hangs: jbd2_log_wait_commit

Postby Anachron » Sun Jul 26, 2009 10:09 am

I upgraded to kernel 2.6.30, and this continues to be a problem.
I guess there is an incompatibility between ext4's way of doing things and the folding client's way of doing things.

Could we get some official word on this perhaps?
Or, preferably, release SMP2! :)
Anachron
 
Posts: 50
Joined: Fri Mar 14, 2008 12:10 pm

Next

Return to Linux CPU V6 Client

Who is online

Users browsing this forum: No registered users and 0 guests

cron