Project 2682 malloc error

The most demanding Projects are only available to a small percentage of very high-end servers.

Moderators: Site Moderators, PandeGroup

Re: Project 2682 malloc error

Postby brentpresley » Tue Aug 10, 2010 2:57 am

How about those of us w/ > 16 core machines that cannot utilize that power on the 2682 WUs?

This is a programming bug, plain and simple. The > 4GB RAM allocation problem does not exist, to the best of my knowledge, on Linux, since that codebase can utilize a 64-bit memory address.

tear has done some good work on isolating the bug, and providing a reasonable fix (recompile w/ the proper headers to enable 64-bit memory allocation).


Can I get a vote for version 6.31 to test his fix out? ;)
brentpresley
 
Posts: 233
Joined: Sun Jun 15, 2008 12:05 am
Location: Dallas, TX

Re: Project 2682 malloc error

Postby tear » Tue Aug 10, 2010 3:23 am

7im wrote: 2 ways to look at it... Fine is fine as long as the total memory used doesn't exceed that .5 GB per fahcore recommendation, which for most isn't a problem. But I still see people with 4 GB system memory on a 64 bit system, or people on Windows 32 bit systems trying to run 8 or more fahcores. That isn't fine.

2nd way is as you mentioned. The internal fahcore code is "blowing up" as the decomp goes about 8 cores. Which is very possible, and very likely in this case. Some proteins just don't parallelize well at increasingly higher divisors.

What happens if you divide up the work unit parts to a point where some quadrants don't have any protein data?

I don't think you really understand what the issue is. Please take the _time_ to re-read the thread.

tear
One man's ceiling is another man's floor.
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Project 2682 malloc error

Postby tear » Tue Aug 10, 2010 3:46 am

brentpresley wrote: How about those of us w/ > 16 core machines that cannot utilize that power on the 2682 WUs?

This is a programming bug, plain and simple.

Yes and no. It all depends on how you define operational constraints (maximal number of atoms, maximal number of threads, etc.).
The A3 core (2.22) works as expected for a variety of workloads (all smaller projects), and the OS/architecture constraints
aren't in the way. They are, though, with >1M-atom* systems at -smp 18 and above*. Should someone have thought about it?
Probably yes.

*) roughly


Now, what else _may_ be going on here: there might be a real bug that causes excessive (abnormal, greater-than-expected-from-the-algorithm)
memory use. The code should be given at least a brief review there.

brentpresley wrote: The > 4GB RAM allocation problem does not exist, to the best of my knowledge, on Linux, since that codebase can utilize a 64-bit memory address.

Windows FahCore A3 is a 32-bit binary (VM won't grow past 4GB). Linux SMP FahCores have been 64-bit binaries for a long, long time (if not since day one).

brentpresley wrote: tear has done some good work on isolating the bug,

Yes...
brentpresley wrote: and providing a reasonable fix (recompile w/ the proper headers to enable 64-bit memory allocation).

...and no. The LAA flag can raise the limit to 4 GB at most (on 64-bit Windows; 32-bit Windows needs extra fiddling but _can_ reach 3 GB).
Besides, as it turned out, the LAA flag is already there (I overlooked it at first, my bad).

Possible (not exclusive) approaches are:
1) Respinning the FahCore for 64-bit Windows; that's significant workload: validation is one thing, another one is management (you get to have two FahCores, one for 32-bit Windows, the other one for 64-bit Windows)
2) Examining simulation code to (a) find memory leaks and (b) try to reduce run-time memory usage (quite a challenge for devs but less validation time and no management horror)

brentpresley wrote: Can I get a vote for version 6.31 to test his fix out? ;)

That would be a FahCore fix rather than client fix. But yeah, I'm sure there's a number of people who'd like this issue taken care of one way or another.


tear
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Project 2682 malloc error

Postby vladh4x0r » Tue Aug 10, 2010 5:05 am

Since we are already bumping into VM (process address space) limits with a 32-bit FahCore, I doubt that it would get any better in the future, as BigAdv projects push the frontiers of what's possible. Sooner or later the FahCore would have to go 64-bit, I don't think there is much disagreement there. So if this will have to be done "some day" anyway, why not do it sooner rather than later? I'd be willing to test a 64-bit FahCore_a3.exe on my lowly i7, and I'm probably not alone in that.

EDIT: There are definite HPC advantages to a 64-bit binary on the Intel(R) 64 / AMD64 architecture -- namely, double the number of (directly accessible) SSE registers. It would be interesting to see whether this outweighs the increased cache pressure from 64-bit pointers and related data structures. AFAIK it's not currently possible to do an "apples to apples" comparison for an SMP FahCore -- the Windows and OS X versions are 32-bit, while the Linux one is 64-bit. Is there a platform where an SMP FahCore is currently available in both 32-bit and 64-bit versions for the same OS?

Curiously enough, the (hypothetical 32-bit Windows) A2 core could have pushed this limit out a few years, since it spawns off n processes (1 per core) with roughly 1/n address space use per each. Yeah I know, coulda woulda shoulda :)
vladh4x0r
 
Posts: 30
Joined: Tue Jul 28, 2009 5:04 am
Location: Folsom, CA, USA

Re: Project 2682 malloc error

Postby kasson » Tue Aug 10, 2010 6:26 am

We've figured out why 2682 is worse in this respect than some of the newer projects. Essentially the newer structures use a more compact representation for some of the data that was added to Gromacs after 2682 was developed. Some of these data currently exist in one copy for each thread (we'll chat with the developer responsible for that to see if it can be changed, but he has a long queue right now). I'm not sure whether it's feasible to switch over, but at least we know that.
kasson
Pande Group Member
 
Posts: 1906
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project 2682 malloc error

Postby GeneralRavel » Tue Aug 10, 2010 8:07 am

If the 2682s can be restructured like some of the newer projects, that might alleviate the "memory bloating" we are seeing on machines running > 17 threads. If that is not feasible in the short term, perhaps multiple 16-thread clients could be run on the largest machines to keep them loaded up for PPD's sake. I realize that is not an ideal solution, but it may be the pragmatic approach for donors in the short term.
But looking at the progress of the SMP projects and their growth, I believe vladh4x0r is absolutely right: the cores are going to need to be able to use a 64-bit memory space. Many projects stay in the 2 GB range for their entire runtime. That's only a factor of two less than the wall the core will hit with larger loads. In light of the recent shift over to exclusively Windows -bigadv, the WUs need to be able to "Run Large" in that environment, as they have on Linux for quite some time.
If 32-bit and 64-bit code cannot be simultaneously maintained for whatever reason, perhaps a fraction of upcoming projects will need to be 64-bit only, and employ detection for 64-bit support when work is assigned.
Anyone remember Marty's Quake II Playhouse? :)
GeneralRavel
 
Posts: 59
Joined: Sun May 23, 2010 10:18 am
Location: Ohio

Re: Project 2682 malloc error

Postby Parja » Tue Aug 10, 2010 1:20 pm

Just an FYI for those of you also having issues with FahCore_a3 crashing. Before you close the crash window, activate your Folding client window and kill it with a Ctrl-C. Then close the core crash window. That way you won't flush your WU.
Parja
 
Posts: 22
Joined: Sat Jun 28, 2008 1:38 am

Re: Project 2682 malloc error

Postby Punchy » Tue Aug 10, 2010 2:29 pm

The obvious workaround appears to be to remove 2682 from the assignment servers and let bigadv systems process functional bigadv work, pending a restructuring of 2682 or a core fix, then release the fix for beta testing before opening it up to the entire population.
Punchy
 
Posts: 218
Joined: Fri Feb 19, 2010 1:49 am

Re: Project 2682 malloc error

Postby Parja » Tue Aug 10, 2010 2:32 pm

Punchy wrote: The obvious workaround appears to be to remove 2682 from the assignment servers and let bigadv systems process functional bigadv work, pending a restructuring of 2682 or a core fix, and then releasing the fix for beta testing before opening up to the entire population.


Well, there's already a rule in place where the servers won't assign -bigadv units if the client reports fewer than 8 cores. So could it also be possible to limit the assignment of 2682s to machines reporting 12 or fewer cores?
Parja
 
Posts: 22
Joined: Sat Jun 28, 2008 1:38 am

Re: Project 2682 malloc error

Postby kasson » Tue Aug 10, 2010 2:42 pm

Unfortunately, things are a touch more complicated in that regard. All the bigadv work units are on the same work server (and the core restriction logic is on the assign server), so we'd be limiting all bigadv WU's to 12 or fewer cores. Probably not what people most want...
kasson
Pande Group Member
 
Posts: 1906
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project 2682 malloc error

Postby Punchy » Tue Aug 10, 2010 2:47 pm

Another workaround would be to move 2682 back to the Linux A2 core. Think outside the box!
Punchy
 
Posts: 218
Joined: Fri Feb 19, 2010 1:49 am

Re: Project 2682 malloc error

Postby kasson » Tue Aug 10, 2010 2:53 pm

As a matter of policy, we've discontinued A2 projects.
We're downweighting 2682's and replacing them with a test version of 2692--please see the post on that topic. If the test looks like it fixes the issue, we'll switch over entirely.
(Please direct comments on 2692 to the thread on that project.)
Thanks for your feedback and suggestions.
kasson
Pande Group Member
 
Posts: 1906
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project 2682 malloc error

Postby 10e » Tue Aug 10, 2010 3:21 pm

Two Core I7s in my setup have completed 2682s successfully.

First one is a Core I7 920 with 6GB of RAM, Win7 X64, 2048MB max allocated to the client, running -smp 8 dedicated. It finished the following work unit:

Project: 2682 (Run 4, Clone 23, Gen 18)

The second is a Core I7 875K with 4GB of RAM, WinXP x64, 2560MB max allocated to the client, running -smp 7 and it completed one, and is currently on 27% of a second:

Project: 2682 (Run 9, Clone 22, Gen 18)

I have switched it back to -smp 8 to see if it fails, because the WU is backed up.

One strange thing on the Core I7 875K: At -smp 7 it was using 2,116 MB, but at -smp 8 it is using 2,064 MB. Why would -smp 8 use LESS RAM than -smp 7 (albeit a small amount less)?

Both machines take approximately 72 seconds to decompress, and I watched the TaskMgr on the Core I7 875K during client restart/decompression and set to -smp 8, and it never rose beyond 2064 MB.

Not sure if I'm being helped by the fact that I'm limiting the max memory allocation on each box to 2048 or 2560 MB, or does it not work that way? I.e., how does this setting work? Is it a hard ceiling on RAM allocation?
10e
 
Posts: 75
Joined: Thu Mar 05, 2009 12:36 pm
Location: Toronto, Ontario, CANADA

Re: Project 2682 malloc error

Postby tear » Tue Aug 10, 2010 9:57 pm

kasson wrote: As a matter of policy, we've discontinued A2 projects.
We're downweighting 2682's and replacing them with a test version of 2692--please see the post on that topic. If the test looks like it fixes the issue, we'll switch over entirely.
(Please direct comments on 2692 to the thread on that project.)
Thanks for your feedback and suggestions.

That's good work, Peter.

I appreciate your keeping the community in the loop.


Thanks,
Kris
tear
 
Posts: 857
Joined: Sun Dec 02, 2007 4:08 am
Location: Rocky Mountains

Re: Project 2682 malloc error

Postby GeneralRavel » Tue Aug 10, 2010 10:29 pm

@10e,
The problem doesn't seem to manifest until the thread count gets above 17 (credit to tear for that discovery). You should be fine at -smp 7 and 8.

But after reading that post I had a thought for the guys running >17 threads. I wonder if the units would be able to run if the clients were reconfigured to use a max of 3.9 GB or so.
Might be worth a try, although I don't know what happens when the WU needs more than what is allocated. (Massive slowdown? Or outright crash?)
GeneralRavel
 
Posts: 59
Joined: Sun May 23, 2010 10:18 am
Location: Ohio
