Memory leak @ 2416

Moderators: Site Moderators, FAHC Science Team

Memory leak @ 2416

Postby czonkin » Wed Dec 12, 2007 1:29 am

Please help, my Linux machine is running this:
...
[23:49:22] Folding@Home Gromacs Core
[23:49:22] Version 1.90 (March 8, 2006)
...
[23:49:24] Project: 2416 (Run 60, Clone 62, Gen 7)
...
and it's eating almost all my memory (as sys, not nice) and growing (~ 1M/sec) ...
Cpu(s): 0.0%us, 99.3%sy, 0.7%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 953948k total, 947952k used, 5996k free, 71904k buffers
Swap: 1270072k total, 1270028k used, 44k free, 101644k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3661 root 39 19 1788m 596m 752 R 99.6 64.1 36:11.59 FahCore_78.exe
3429 stanisla 15 0 14644 708 468 R 0.7 0.1 0:07.76 top
1 root 15 0 10316 264 236 S 0.0 0.0 0:00.30 init
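
As an aside, the reported growth of roughly 1 MB per second can be confirmed by sampling the core's resident set size once a second. A minimal, hypothetical Linux-only helper (it assumes a /proc filesystem and is not part of the FAH client or of the original post) could look like this:

/* memwatch.c - print VmRSS of a process once per second (hypothetical helper).
 * Usage: ./memwatch <pid>   (stop with Ctrl-C) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/status", argv[1]);

    for (;;) {
        FILE *f = fopen(path, "r");
        if (f == NULL) {
            perror("fopen");            /* process gone or bad pid */
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof line, f) != NULL) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                fputs(line, stdout);    /* e.g. "VmRSS:   610232 kB" */
                break;
            }
        }
        fclose(f);
        sleep(1);                       /* sample once per second */
    }
}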

I'm not sure what to do, or if I made some mistake ...

Thanks!

Stanislav
czonkin
 
Posts: 3
Joined: Wed Dec 12, 2007 1:20 am
Location: Czech republic

Postby toTOW » Wed Dec 12, 2007 12:04 pm

Is it doing the same if you restart your client?

What happens if you let it grow? Does it stop, or does the machine crash?
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict: latest news, tests and reviews about the Folding@Home project.

toTOW
Site Moderator
 
Posts: 5726
Joined: Sun Dec 02, 2007 11:38 am
Location: Bordeaux, France

Postby czonkin » Wed Dec 12, 2007 12:41 pm

Yes, with standard loops or optimized, it's the same. No, it grows until the machine (Fedora 7 @ Sempron 2200+) slows down to an unusable state (full swap). A hard reset was the only solution.

Should I delete this unit, or try something else?
czonkin
 
Posts: 3
Joined: Wed Dec 12, 2007 1:20 am
Location: Czech republic

Postby toTOW » Wed Dec 12, 2007 1:53 pm

Make a backup of your FAH folder (including everything), just in case you need to send it to Stanford ... then delete the WU.

We'll wait for an answer from Stanford to see if they need you to send the backup or if they can have a look at that particular WU ;)

edit: I sent a mail to Paula, who is in charge of this project ... let's wait for her answer ;)
toTOW
Site Moderator
 
Posts: 5726
Joined: Sun Dec 02, 2007 11:38 am
Location: Bordeaux, France

Postby Ivoshiee » Wed Dec 12, 2007 3:43 pm

You can send the WU to me as well.
Ivoshiee
Site Moderator
 
Posts: 822
Joined: Sun Dec 02, 2007 1:05 am
Location: Estonia

I will take a look

Postby ppetrone » Wed Dec 12, 2007 7:28 pm

Hey guys!
Thank you for taking care of this.
I will take a look and come back to you ASAP.

pau
ppetrone
Pande Group Member
 
Posts: 115
Joined: Wed Dec 12, 2007 7:20 pm
Location: Stanford

Re: I will take a look

Postby czonkin » Thu Dec 13, 2007 1:52 am

Okay, if you are still interested in it, have a look at http://rapidshare.com/files/76186778/FAH-MemLeak.zip.html

Thanks!

Stanislav
czonkin
 
Posts: 3
Joined: Wed Dec 12, 2007 1:20 am
Location: Czech republic

Re: I will take a look

Postby Ivoshiee » Thu Dec 13, 2007 11:00 am

czonkin wrote:Okay, if you are still interested in it, have a look at http://rapidshare.com/files/76186778/FAH-MemLeak.zip.html

Thanks!

Stanislav


This WU is broken. I hope Pande Group can nail at least one cause of the 0x79 error with this WU.

The FAH504-Linux.exe will report this:
[10:05:40] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:05:40]
[10:05:40] Assembly optimizations on if available.
[10:05:40] Entering M.D.

Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[10:05:47] Protein: p2416_Ribosome_Na
[10:05:47]
[10:05:47] Writing local files
Fatal error: realloc for nlist->jjnr (1041039360 bytes, file ns.c, line 388, nlist->jjnr=0x0x74c48008): Cannot allocate memory
[10:07:01] Gromacs error.
[10:07:01]
[10:07:01] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:07:02] CoreStatus = 79 (121)
[10:07:02] Client-core communications error: ERROR 0x79
[10:07:02] Deleting current work unit & continuing...
[10:07:19] - Preparing to get new work unit...
[10:07:19] + Attempting to get work packet
[10:07:19] - Connecting to assignment server
[10:07:20] - Successful: assigned to (171.65.103.162).
[10:07:20] + News From Folding@Home: Welcome to Folding@Home

My box has 2 GB of memory, and the core managed to allocate about 80% of it before erroring out.


With fah6:
[10:11:36] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:11:36]
[10:11:36] Assembly optimizations on if available.
[10:11:36] Entering M.D.

Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[10:11:43] Protein: p2416_Ribosome_Na
[10:11:43]
[10:11:43] Writing local files
Fatal error: realloc for nlist->jjnr (1055457280 bytes, file ns.c, line 388, nlist->jjnr=0x0x73e61008): Cannot allocate memory
[10:13:13] Gromacs error.
[10:13:13]
[10:13:13] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:13:13] CoreStatus = 79 (121)
[10:13:13] Client-core communications error: ERROR 0x79
[10:13:13] Deleting current work unit & continuing...
[10:13:24] - Preparing to get new work unit...
[10:13:24] + Attempting to get work packet
[10:13:24] - Connecting to assignment server

This time it managed to allocate 85% of the memory before dying.
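
The "realloc for nlist->jjnr ... Cannot allocate memory" line suggests the core keeps enlarging its neighbour list until the operating system refuses the request. The following is only a minimal sketch of that failure pattern, not the actual Gromacs ns.c code, and the names are placeholders: a buffer is regrown without an upper bound until realloc returns NULL, which is the point the FAH log reports as a fatal error.

/* growfail.c - sketch of unbounded buffer growth that ends in
 * "Cannot allocate memory" (placeholder code, not Gromacs source). */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t capacity = 1 << 20;          /* start at 1 MiB */
    char *jjnr = malloc(capacity);      /* stands in for nlist->jjnr */
    if (jjnr == NULL)
        return 1;

    for (;;) {
        size_t wanted = capacity * 2;   /* growth with no upper bound */
        char *tmp = realloc(jjnr, wanted);
        if (tmp == NULL) {
            /* Equivalent to the fatal error seen in the logs above. */
            fprintf(stderr, "Fatal error: realloc for nlist->jjnr (%zu bytes): %s\n",
                    wanted, strerror(errno));
            free(jjnr);
            return 1;
        }
        jjnr = tmp;
        capacity = wanted;
        printf("allocated %zu bytes\n", capacity);
    }
}

On a box with about 1 GB of RAM, like the one in the first post, a single request of roughly 1,041,039,360 bytes is already close to all of physical memory, which matches the swap exhaustion the original poster described.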
Ivoshiee
Site Moderator
 
Posts: 822
Joined: Sun Dec 02, 2007 1:05 am
Location: Estonia

Thank you!

Postby ppetrone » Thu Dec 13, 2007 7:18 pm

Ok. I am convinced :?
I will remove it for now. In the meantime, I am still running it...

Thank you everybody!

pau
ppetrone
Pande Group Member
 
Posts: 115
Joined: Wed Dec 12, 2007 7:20 pm
Location: Stanford

Postby gwildperson » Thu Dec 13, 2007 9:15 pm

Thank you, Paula.

We do hope that "development" can isolate this problem and deliver some new code that deals with this issue promptly.
gwildperson
 
Posts: 450
Joined: Tue Dec 04, 2007 9:36 pm

Re: Thank you!

Postby Ivoshiee » Thu Dec 13, 2007 9:54 pm

ppetrone wrote:Ok. I am convinced :?
I will remove it for now. In the meantime, I am still running it...

Thank you everybody!

pau

I hope the action is not only pulling the WU, but also researching why it is doing what it is doing and implementing a fix for that issue in the FAH Core files.
Ivoshiee
Site Moderator
 
Posts: 822
Joined: Sun Dec 02, 2007 1:05 am
Location: Estonia

Re: Thank you!

Postby ppetrone » Thu Dec 13, 2007 11:06 pm

Yes, exactly. That is the only reason why I am running it.
The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Thanks,

Pau
ppetrone
Pande Group Member
 
Posts: 115
Joined: Wed Dec 12, 2007 7:20 pm
Location: Stanford

Postby codysluder » Thu Dec 13, 2007 11:55 pm

ppetrone wrote:The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.


Even if you find that this specific WU has a problem, there is also a more general problem: the client deleted the WU rather than reporting the problem to the server. You (i.e. Stanford) should have some indication that this WU has failed 100 times (or some much smaller number) so you can decide whether it needs to be removed from circulation, without having us keep the statistics for you.
codysluder
 
Posts: 1024
Joined: Sun Dec 02, 2007 1:43 pm

Postby ppetrone » Fri Dec 14, 2007 12:56 am

As I said before, I am currently working on this specific WU to understand whether the issue is generalized or not.

I am sorry if it seems as if "you" are keeping statistics for "us".

I understand Folding@Home as a big group of people (donors+scientists) doing statistics *together* to solve relevant biological questions. For that reason, I believe there is a tacit agreement of collaboration and patience.

Paula
ppetrone
Pande Group Member
 
Posts: 115
Joined: Wed Dec 12, 2007 7:20 pm
Location: Stanford

Postby codysluder » Fri Dec 14, 2007 2:31 am

ppetrone wrote:I am sorry if it seems as if "you" are keeping statistics for "us".


Sorry, I didn't mean that there was a big distinction between "you" and "us" but I can see how it sounded like that.

In fact, there are three kinds of participants collaboratively working on FAH: some things are best done by the Pande-group-type people, some things are best done by the donor-type people, and some things are best done by software.

If several of the donor-type-people all encounter the same error, it's statistically unlikely that they'll find each other. Nevertheless, in this instance they did. Because of that, the donor-type-people were able to generate a request to find out what's going on with this case. (And a big thank you for accepting this responsibility)

Figuring out why the WU failed is best done by you, and that issue is (probably) important in more WUs than just that one, but even if it's unique, it's important.

If the FAH client is able to report this condition to the server, it's a statistical certainty that those various error reports can find each other and be examined by the Pande-group-type people like yourself. I decided to call that a universal bug, though maybe it can be considered an enhancement request. In any case, reporting to the server is a universal need that is best handled by improved software (no matter what is actually wrong with the WU).

As a donor-type person, I'm also saying that it seems like we're wasting valuable resources (much more than usual) repeating the same WUs with the same errors many, many times, and there ought to be a better way to find them and transfer them from our queue to your queue with less wasted effort.
codysluder
 
Posts: 1024
Joined: Sun Dec 02, 2007 1:43 pm
