Error handling in FAH

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Error handling in FAH

Post by bruce »

It depends on whay you mean by "issues"

Are your CPUs idle or are they busy with other things? If FAH can make use of additional threads to process the WU faster, it will try to make use of those those extra threads which will potentially increase the use of virtual memory. Do you want it to do that or do you want to restrict it?

It's up to you to decide if you want to adjust the BIOS to restrict your I7 to 4 threads or 8 threads. The OS will schedule work within the constrains you set. In either case, you don't want to run out of virtual memory.

Some of the 134xx projects were being slowed up by being CPU restrictions and Core_22 was recently changed to make use of (at least) one more CPU thread per GPU.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

bruce wrote:It depends on whay you mean by "issues"
What I mean by issues is that the software isn’t handling the multithread distribution over several different cores very gracefully, so when releasing heap addresses on 1 core, the software thread running on another core, since there are several software threads running at the same time for each FAHCore process, but that might be running at a different speed due to the HTT, misses the mark and fails during the end process, since the heap address and memory location is no longer valid.

This isn’t a virtual memory issue, this is a heap issue created by the software, something that can be verified by setting affinity for each running process of FAHCore manually, to not be in conflict with each other.

Doing a little research I find that this load balancing issue over multiple cores is at least a decade old issue, since someone actually made a program to handle it in 2009 WITH examples directly showing FAH use.

http://affinitychanger.sourceforge.net/

Personally I have installed an automatic affinity manager on the 4c8t system that handles this issue by distributing FAHCore processes on separate individual logical cores and are now running 8 GPUs with no more 100% fail issues, while the 4c4t system is also running 8 GPUs, but require no affinity management.

This is a FAH software issue, so the programming just needs to handle the end process better than it does now, since that part of the software clearly has issues with multithreading on CPUs with HTT.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Error handling in FAH

Post by bruce »

The example at the top of page 1 is talking about FAHCore_22* (OpenMM). Let's be sure you're not talking about setting affinity for CPU threads on FAHCore_a* (GROMACS). We need to be talking about a single issue.

Development of FAHCore_22 is still being debuged and projects 134xx have been intentionally isolated to dig deeper into the crashes found in new projects with FAHCore_22. I don't know if the success/failure reports differentiate between changes in multithreading but it's certainly an issue that needs to be looked at.

Have any of your observations been in Linux or are you talking exclusively about Windows? Debugging the Windows heap can be challenging.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

bruce wrote:We need to be talking about a single issue.
I am only talking about FAHCore_22 issues in Windows, so I don’t know if it is the same with Linux.

The 100% issue is a rather obvious heap thing, since doing the following affinity changes removes the issue completely:

Image
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Error handling in FAH

Post by mwroggenbuck »

I thought that affinity would also fix the problem viewtopic.php?f=81&t=35482&start=30 However, I still had issues. Personally, I am waiting for a new core.
Post Reply