
Memory leak @ 2416

Posted: Wed Dec 12, 2007 12:29 am
by czonkin
Please help. On my Linux machine this is running:
...
[23:49:22] Folding@Home Gromacs Core
[23:49:22] Version 1.90 (March 8, 2006)
...
[23:49:24] Project: 2416 (Run 60, Clone 62, Gen 7)
...
and it's eating almost all my memory (as sys, not nice) and growing (~1 MB/sec) ...
Cpu(s): 0.0%us, 99.3%sy, 0.7%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 953948k total, 947952k used, 5996k free, 71904k buffers
Swap: 1270072k total, 1270028k used, 44k free, 101644k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3661 root 39 19 1788m 596m 752 R 99.6 64.1 36:11.59 FahCore_78.exe
3429 stanisla 15 0 14644 708 468 R 0.7 0.1 0:07.76 top
1 root 15 0 10316 264 236 S 0.0 0.0 0:00.30 init

I'm not sure what to do, or if I make some mistake ...

Thanks!

Stanislav

Posted: Wed Dec 12, 2007 11:04 am
by toTOW
Is it doing the same if you restart your client :?:

What happens if you let it grow :?: Does it stop, or does the machine crash :?:

Posted: Wed Dec 12, 2007 11:41 am
by czonkin
Yes, the same with standard loops or optimized. No, it grows until the machine (Fedora 7 @ Sempron 2200+) slows down to an unusable state (full swap). A hard reset was the only solution.

Should I delete this unit, or try something else?

Posted: Wed Dec 12, 2007 12:53 pm
by toTOW
Make a backup of your FAH folder (including everything), just in case you need to send it to Stanford ... then delete the WU.

We'll wait for an answer from Stanford to see if they need you to send the backup or if they can have a look at that particular WU ;)

edit: I sent a mail to Paula, who is in charge of this project ... let's wait for her answer ;)

Posted: Wed Dec 12, 2007 2:43 pm
by Ivoshiee
You can send the WU to me as well.

I will take a look

Posted: Wed Dec 12, 2007 6:28 pm
by ppetrone
Hey guys!
thank you for taking care of this.
I will take a look and come back to you asap.

pau

Re: I will take a look

Posted: Thu Dec 13, 2007 12:52 am
by czonkin
Okay, if you are still interested in it, have a look at http://rapidshare.com/files/76186778/FA ... k.zip.html

Thanks!

Stanislav

Re: I will take a look

Posted: Thu Dec 13, 2007 10:00 am
by Ivoshiee
czonkin wrote:Okay, if you are still interested in it, have a look at http://rapidshare.com/files/76186778/FA ... k.zip.html

Thanks!

Stanislav
This WU is broken. I hope Pande Group can nail at least one cause of the 0x79 error with this WU.

The FAH504-Linux.exe will report this:
[10:05:40] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:05:40]
[10:05:40] Assembly optimizations on if available.
[10:05:40] Entering M.D.

Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[10:05:47] Protein: p2416_Ribosome_Na
[10:05:47]
[10:05:47] Writing local files
Fatal error: realloc for nlist->jjnr (1041039360 bytes, file ns.c, line 388, nlist->jjnr=0x0x74c48008): Cannot allocate memory
[10:07:01] Gromacs error.
[10:07:01]
[10:07:01] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:07:02] CoreStatus = 79 (121)
[10:07:02] Client-core communications error: ERROR 0x79
[10:07:02] Deleting current work unit & continuing...
[10:07:19] - Preparing to get new work unit...
[10:07:19] + Attempting to get work packet
[10:07:19] - Connecting to assignment server
[10:07:20] - Successful: assigned to (171.65.103.162).
[10:07:20] + News From Folding@Home: Welcome to Folding@Home
My box has 2 GB of memory, and the core managed to allocate about 80% of it before erroring out.


With fah6:
[10:11:36] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:11:36]
[10:11:36] Assembly optimizations on if available.
[10:11:36] Entering M.D.

Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[10:11:43] Protein: p2416_Ribosome_Na
[10:11:43]
[10:11:43] Writing local files
Fatal error: realloc for nlist->jjnr (1055457280 bytes, file ns.c, line 388, nlist->jjnr=0x0x73e61008): Cannot allocate memory
[10:13:13] Gromacs error.
[10:13:13]
[10:13:13] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:13:13] CoreStatus = 79 (121)
[10:13:13] Client-core communications error: ERROR 0x79
[10:13:13] Deleting current work unit & continuing...
[10:13:24] - Preparing to get new work unit...
[10:13:24] + Attempting to get work packet
[10:13:24] - Connecting to assignment server
This time it managed to allocate about 85% of memory before dying.

Thank you!

Posted: Thu Dec 13, 2007 6:18 pm
by ppetrone
Ok. I am convinced :?
I will remove it for now. In the meantime, I am still running it...

Thank you everybody!

pau

Posted: Thu Dec 13, 2007 8:15 pm
by gwildperson
Thank you, Paula.

We do hope that "development" can isolate this problem and deliver some new code that deals with this issue promptly.

Re: Thank you!

Posted: Thu Dec 13, 2007 8:54 pm
by Ivoshiee
ppetrone wrote:Ok. I am convinced :?
I will remove it for now. In the meantime, I am still running it...

Thank you everybody!

pau
I hope the action is not only pulling the WU, but also researching why it is doing what it is doing and implementing a fix for that issue in the FAH core files.

Re: Thank you!

Posted: Thu Dec 13, 2007 10:06 pm
by ppetrone
Yes, exactly. That is the only reason why I am running it.
The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Thanks,

Pau

Posted: Thu Dec 13, 2007 10:55 pm
by codysluder
ppetrone wrote:The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Even if you find that that specific WU has a problem, it also has a more general problem. The client deleted the WU rather than reporting the problem to the server. You (i.e.-Stanford) should have some indication that this WU has failed 100 times (or some much smaller number) so you can decide if it needs to be removed from circulation without having us keep the statistics for you.

Posted: Thu Dec 13, 2007 11:56 pm
by ppetrone
As I said before, I am currently working on this specific WU to understand whether the issue is generalized or not.

I am sorry if it seems as if "you" are keeping statistics for "us".

I understand Folding@Home as a big group of people (donors+scientists) doing statistics *together* to solve relevant biological questions. For that reason, I believe there is a tacit agreement of collaboration and patience.

Paula

Posted: Fri Dec 14, 2007 1:31 am
by codysluder
ppetrone wrote:I am sorry if it seems as if "you" are keeping statistics for "us".
Sorry, I didn't mean that there was a big distinction between "you" and "us" but I can see how it sounded like that.

In fact, there are three types of contributors collaboratively working on FAH. Some things are best done by the Pande-group-type people. Some things are best done by the donor-type people. Some things are best done by software.

If several of the donor-type-people all encounter the same error, it's statistically unlikely that they'll find each other. Nevertheless, in this instance they did. Because of that, the donor-type-people were able to generate a request to find out what's going on with this case. (And a big thank you for accepting this responsibility)

Figuring out why the WU failed is best done by you, and that issue is (probably) important in more WUs than just that one, but even if it's unique, it's important.

If the FAH client were able to report this condition to the server, it would be a statistical certainty that those various error reports could find each other and be examined by the Pande-group-type people like yourself. I decided to call that a universal bug, though maybe it can be considered an enhancement request. In any case, reporting to the server is a universal improvement that is best delivered by software (no matter what is actually wrong with the WU).

As a donor-type person, I'm also saying that it seems like we're wasting valuable resources (much more than usual) repeating the same WUs with the same errors many, many times, and there ought to be a better way to find them and transfer them from our queue to your queue with less waste.