Gromacs cannot continue further.

Postby Dinkydau » Sat Aug 06, 2011 1:32 pm

Hello,

Sometimes my monitor suddenly goes blank, then comes back on, and the Folding@home GPU client says:
Code:
[13:15:59] Completed 74%
[13:16:05] Run: exception thrown during GuardedRun
[13:16:05] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[13:16:05] Going to send back what have done -- stepsTotalG=10000000
[13:16:05] Work fraction=0.7403 steps=10000000.

Apparently something went wrong, which can happen occasionally. The pattern, though, is that it only ever happens when I start another task that uses the GPU, for example launching a 3D game or playing a video. So I don't think Folding@home needs to abandon the work unit and send back the unfinished work. I think it would work fine if, when something like this happens, Folding@home went back to the last saved state of the work unit and tried again before giving up. As long as I don't start other applications that use the GPU, Folding@home always works fine. The failure also always occurs while those applications are starting up, not while they are already running. That suggests there is nothing wrong with the work units themselves; they can actually be finished. Gromacs can continue further. I think valuable work is being wasted by returning these work units unfinished when they could have been finished if the client didn't give up so quickly.
Dinkydau
 
Posts: 15
Joined: Sat Jun 04, 2011 10:55 am

Re: Gromacs cannot continue further.

Postby patonb » Sat Aug 06, 2011 2:14 pm

Right, your GPU is not able to do both high-level calculations AND display 3d crap. The GPU calculates wrong, so it stops itself from continuing. It's a fail-safe. Remember, this is not just random numbers; it's a serious, high-profile research program. One wrong number could be a disaster. I'd rather the unit fail on the side of caution, because a 500-point loss to you isn't worth the risk to someone's life.

This is why it's recommended to pause F@H when gaming. Pause the work, then restart it when you're finished playing.
WooHoo = L5639 @ 3.3Ghz Evga SR-2 6x2gb Corsair XMS3 CM 212+ Corsair 1050hx Blackhawk Ultra EVGA 560ti

Foldie = i7 950@ 4.0Ghz x58a-ud3r 216-216 @ 850/2000 3x2gb OCZ Gold NH-u12 Heatsink Corsair hx520 Antec 900
patonb
 
Posts: 1074
Joined: Thu Oct 23, 2008 2:42 am

Re: Gromacs cannot continue further.

Postby Dinkydau » Sat Aug 06, 2011 3:04 pm

Clearly you have no idea what I'm talking about. My gpu CAN run folding@home and display what you call 3d crap at the same time. Actually 249/250 times the client doesn't crash when I start a 3d-game or watch a video (don't forget that, it happens with videos too). I don't care about points at all. I am not one of those who buy a hypermodern computer with 2 of the fastest gpu's and a 12-core cpu and leave it running 24/7 just for folding@home. I'm trying to help to improve the efficiency of the project. If my idea will not work out, don't get angry, it's only an idea. You can't expect me to know exactly how this high profile research program works. :?
Dinkydau
 
Posts: 15
Joined: Sat Jun 04, 2011 10:55 am

Re: Gromacs cannot continue further.

Postby Jesse_V » Sat Aug 06, 2011 4:09 pm

Easy. Looks like sometimes the GPU client can't back off properly on your machine when something else needs the graphics card. I'm not sure how to fix this, but make sure your client and GPU drivers are up to date and everything should be fine. Pausing the client is a good idea. Sorry that things aren't working for you.
Jesse_V
 
Posts: 2753
Joined: Mon Jul 18, 2011 4:44 am
Location: Logan, Utah, USA

Re: Gromacs cannot continue further.

Postby bruce » Sat Aug 06, 2011 8:01 pm

There's not enough information for someone to identify and fix this problem.

First, you didn't specify which WU or which FahCore was running or what game creates a potential conflict. Even if you provide that information, I really doubt that anyone can fix the problem. Collect all that data and hand it to a programmer and Murphy's law of program support says that he's going to try it less than 249 times and it will work every time he tries. He won't find anything that he can fix. Moreover, it's not clear to me whether this is something that might be fixed in the FahCore or if it's something that might be fixed in the driver or if it's something that might be fixed in the game. That means that three different programmers at three different companies will need to capture a snapshot of the problem as it occurs. :(

I understand your concern, but I don't know what can be done about it, other than pausing FAH as patonb suggested.

What happens if you (A) Pause FAH, (B) Start the game, (C) Unpause FAH, and (D) run both your game and FAH simultaneously? (Do it at least 250 times, too.)
bruce
Site Admin
 
Posts: 16851
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Gromacs cannot continue further.

Postby 7im » Sun Aug 07, 2011 6:10 am

v6 and earlier clients were always a little too quick to upload partial WUs after an error. That's just the way it was programmed a long time ago, and that carried over to the GPU3 client when it was released.

Also note that v6 client development has ended in favor of V7 software development. This behavior won't be fixed in v6, but I'm hoping V7 does better.
Please do not mistake my brevity as dispassion or condescension. I recognize the time you spend reading the forum is time you could use elsewhere, so my short responses save you time. Please do not hesitate to ask for clarification if I was too terse.
7im
 
Posts: 13314
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Gromacs cannot continue further.

Postby Dinkydau » Sun Aug 07, 2011 2:21 pm

bruce wrote:There's not enough information for someone to identify and fix this problem.

First, you didn't specify which WU or which FahCore was running or what game creates a potential conflict. Even if you provide that information, I really doubt that anyone can fix the problem. Collect all that data and hand it to a programmer and Murphy's law of program support says that he's going to try it less than 249 times and it will work every time he tries. He won't find anything that he can fix. Moreover, it's not clear to me whether this is something that might be fixed in the FahCore or if it's something that might be fixed in the driver or if it's something that might be fixed in the game. That means that three different programmers at three different companies will need to capture a snapshot of the problem as it occurs. :(

I understand your concern, but I don't know what can be done about it, other than pausing FAH as patonb suggested.

What happens if you (A) Pause FAH, (B) Start the game, (C) Unpause FAH, and (D) run both your game and FAH simultaneously? (Do it at least 250 times, too.)

It's FahCore_11.exe

Because it is rare that this happens I have only 1 line of project information in my logs left:
Project: 6606 (Run 9, Clone 894, Gen 799)

It has happened with videos on YouTube and the game Grand theft auto San Andreas.

From now on I'll pause the client every time I'm going to start something that uses the GPU and then restart it. If I find anything interesting I'll let you know.

I understand that this problem is probably incredibly difficult to fix. That's why I thought it might be a good idea to make folding@home try it again from the last saved state when it goes wrong.
Dinkydau
 
Posts: 15
Joined: Sat Jun 04, 2011 10:55 am

Re: Gromacs cannot continue further.

Postby bruce » Sun Aug 07, 2011 3:04 pm

You did get partial credit for Project: 6606 (Run 9, Clone 894, Gen 799). Because of the failure, it was reissued and two people completed it successfully. (This is how FAH is expected to respond to a processing error in a WU.)
bruce
Site Admin
 
Posts: 16851
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Gromacs cannot continue further.

Postby patonb » Mon Aug 08, 2011 1:27 am

Dinkydau wrote:Clearly you have no idea what I'm talking about. My gpu CAN run folding@home and display what you call 3d crap at the same time. Actually 249/250 times the client doesn't crash when I start a 3d-game or watch a video (don't forget that, it happens with videos too). I don't care about points at all. I am not one of those who buy a hypermodern computer with 2 of the fastest gpu's and a 12-core cpu and leave it running 24/7 just for folding@home. I'm trying to help to improve the efficiency of the project. If my idea will not work out, don't get angry, it's only an idea. You can't expect me to know exactly how this high profile research program works. :?


I've been folding on the GPU since the Nvidia client came out, on my 8800GT. I know exactly what you're talking about. It still doesn't change the fact that switching from one GPU process to another can cause a misstep in the calculations.

One misstep in a billion calculations still invalidates the results. One misstep in 250 GPU switches still screws up the results. These safeties need to be there to protect the work being done.
patonb
 
Posts: 1074
Joined: Thu Oct 23, 2008 2:42 am

Re: Gromacs cannot continue further.

Postby MtM » Mon Aug 08, 2011 8:01 am

As I understand his request, he is asking whether the client should retry the last frame after each failure, since the failure sometimes won't repeat when the frame is run again, probably because the initial EUE (Early Unit End) was caused by environmental or circumstantial influences. The problem with that request, I think, is the effectiveness of the solution. If every EUE led to the last frame being run again, that would also be duplicate work. And since a system for handling EUEs is already in place (reassigning the work unit to another donor), replacing it with a different one would need to be backed by strong evidence of increased efficiency.

If you EUE'd at a high percentage, there is something to be said for retrying the last frame, since it saves the time another donor's system would need to get to that percentage. But time has shown that one particular system can EUE at a fixed percentage while another completes the unit without problems, so reassigning the work unit to another donor remains a necessity.

I don't want to dismiss the suggestion, though. If there were abundant resources to spend on improving the client and server code, I think it would not be a bad idea to retry a frame under certain conditions (not the first frame, for instance) before returning it to Stanford with an EUE report. Since resources are limited, the focus should be on the things with the highest efficiency gains, and I'm not sure where this would rank, since it's not a win-win situation. In some or most cases the frame would EUE again when repeated, wasting time before the WU is assigned to another donor and slowing the whole chain of generations.
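
The conditional retry described above (retry a failed frame a limited number of times, but not the first frame, before giving up with an EUE report) could be sketched roughly as follows. This is only an illustration of the policy, not how any FAH client or FahCore is actually written: `FrameError`, `run_work_unit`, `make_flaky` and all parameters are hypothetical names, and "resuming from the last saved state" is simply assumed to happen inside `compute_frame`.

```python
from dataclasses import dataclass
from typing import Callable


class FrameError(Exception):
    """Hypothetical stand-in for an error raised while computing a frame."""


@dataclass
class Outcome:
    completed: bool     # True if every frame finished
    frames_done: int    # frames finished before the run stopped
    retries_used: int   # total retries spent across the run


def run_work_unit(
    compute_frame: Callable[[int], None],
    total_frames: int,
    max_retries_per_frame: int = 1,
    retry_first_frame: bool = False,
) -> Outcome:
    """Run frames in order, retrying a failed frame a bounded number of times."""
    retries_used = 0
    for frame in range(total_frames):
        attempts_left = max_retries_per_frame
        if frame == 0 and not retry_first_frame:
            attempts_left = 0  # a failure on the very first frame is never retried
        while True:
            try:
                compute_frame(frame)  # assumed to resume from the last checkpoint
                break
            except FrameError:
                if attempts_left == 0:
                    # Give up: report the partial result (the EUE case).
                    return Outcome(False, frame, retries_used)
                attempts_left -= 1
                retries_used += 1
    return Outcome(True, total_frames, retries_used)


def make_flaky(fail_frame: int, failures: int) -> Callable[[int], None]:
    """Build a fake compute function that fails `failures` times at one frame."""
    remaining = {"n": failures}

    def compute(frame: int) -> None:
        if frame == fail_frame and remaining["n"] > 0:
            remaining["n"] -= 1
            raise FrameError(f"simulated failure at frame {frame}")

    return compute


transient = run_work_unit(make_flaky(74, failures=1), total_frames=100)
persistent = run_work_unit(make_flaky(74, failures=5), total_frames=100)
first_frame = run_work_unit(make_flaky(0, failures=1), total_frames=100)
print(transient, persistent, first_frame, sep="\n")
```

A one-off failure at frame 74 is absorbed by the single retry, while a failure that repeats still ends the run after one extra attempt, which is the wasted time mentioned above. Note that the sketch only covers failures that are detected; it says nothing about errors that slip through unnoticed.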
MtM
 
Posts: 3233
Joined: Fri Jun 27, 2008 2:20 pm
Location: The Netherlands

Re: Gromacs cannot continue further.

Postby Dinkydau » Mon Aug 08, 2011 2:07 pm

If there is something with my computer that prevents a certain work unit from finishing no matter what, then the retry would still cost only a few minutes of extra work, compared to hours of extra work if it is sent to 2 other computers. I now understand what the downsides of this could be.
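
The trade-off in that paragraph can be written as a small expected-cost estimate. This is a sketch under stated assumptions: a single retry, a fixed retry cost, a fixed cost for redoing the unit elsewhere, and a probability `p_retry_succeeds` that nobody in this thread has measured. The 5-minute and 150-minute figures below are placeholders, not measurements, and the function names are made up for illustration.

```python
def expected_extra_minutes(
    retry_minutes: float,
    reassign_minutes: float,
    p_retry_succeeds: float,
) -> float:
    """Expected extra time when a failed frame is retried once before reassignment.

    The retry is always paid for; the reassignment cost is paid only when the
    retry fails as well.
    """
    if not 0.0 <= p_retry_succeeds <= 1.0:
        raise ValueError("p_retry_succeeds must be between 0 and 1")
    return retry_minutes + (1.0 - p_retry_succeeds) * reassign_minutes


def break_even_probability(retry_minutes: float, reassign_minutes: float) -> float:
    """Success probability above which retrying first is cheaper on average.

    Retry-first wins when retry + (1 - p) * reassign < reassign,
    which rearranges to p > retry / reassign.
    """
    return retry_minutes / reassign_minutes


# Placeholder numbers: a 5-minute retry versus 150 minutes of redone work.
print(expected_extra_minutes(5, 150, p_retry_succeeds=0.5))
print(break_even_probability(5, 150))
```

With these placeholder numbers the retry pays off on average as soon as it succeeds more than about 3% of the time. The estimate counts time only; it does not account for the risk of a retried unit returning a wrong result that goes unnoticed.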

I think it's after all more efficient for me not to pause the client. If I paused the client every time, that would easily mean 4 fewer work units finished per day. The one work unit per week (or something like that) that crashes doesn't make up for that.
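
Putting those two figures on the same weekly basis makes the comparison explicit. The numbers are the rough estimates from the post itself (about 4 work units per day lost to pausing, about 1 crashed work unit per week), and the crashed unit is counted as a complete loss, which is a pessimistic simplification.

```python
DAYS_PER_WEEK = 7


def weekly_units_lost_by_pausing(units_lost_per_day: float) -> float:
    """Convert a per-day loss from pausing into a per-week loss."""
    return units_lost_per_day * DAYS_PER_WEEK


pausing_loss = weekly_units_lost_by_pausing(4)  # estimate from the post
crash_loss = 1                                  # estimate from the post
net_gain_from_not_pausing = pausing_loss - crash_loss

print(f"Lost per week by pausing: {pausing_loss}")
print(f"Lost per week by crashes: {crash_loss}")
print(f"Net gain from not pausing: {net_gain_from_not_pausing}")
```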
Dinkydau
 
Posts: 15
Joined: Sat Jun 04, 2011 10:55 am

Re: Gromacs cannot continue further.

Postby bruce » Mon Aug 08, 2011 4:29 pm

Many studies of the reliability of CPUs have been done and results published. One study of the reliability of GPUs was completed by Stanford and the paper was published. These are very important considerations needed to back up the computational methods used by FAH.

Stable computers will produce the same result if you give them the same set of calculations to repeat. What you're asking for is for science to accept results from unstable computers. No reputable scientist will trust results from unstable computers and as a consequence, Stanford does not condone/support overclocking. They can't prevent you from doing it but they can insist that your hardware be inherently reliable.
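
The repeatability idea can be illustrated with a toy check: run the same deterministic calculation several times and compare a digest of each result. This is only a sketch of the principle, not how FAH validates work units; `simulate` is a stand-in calculation (a logistic map) and every name here is hypothetical.

```python
import hashlib
import struct
from typing import Callable


def simulate(steps: int, x0: float = 0.3, r: float = 3.9) -> float:
    """Toy deterministic calculation standing in for a real simulation."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def digest(value: float) -> str:
    """Hash the exact bit pattern of a float so results can be compared."""
    return hashlib.sha256(struct.pack("<d", value)).hexdigest()


def is_repeatable(calc: Callable[[], float], runs: int = 3) -> bool:
    """Return True if every run of `calc` produces a bit-identical result."""
    digests = {digest(calc()) for _ in range(runs)}
    return len(digests) == 1


print(is_repeatable(lambda: simulate(1000)))
```

On a stable machine the same inputs give the same bits every time, so the check passes. A calculation whose result changes from run to run fails it, whatever the underlying cause.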

They have no control over why your system fails, whether the failure was caused by bad hardware, by bad drivers, by overclocking, by a fault in some other program or by some unknown factor. They do have control over the results that are acceptable to their studies. The answer is pretty simple: bad results will be discarded.
bruce
Site Admin
 
Posts: 16851
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Gromacs cannot continue further.

Postby Dinkydau » Tue Aug 09, 2011 1:07 pm

That's clear. My GPU is not overclocked; I have never changed anything on it since I bought it.
Dinkydau
 
Posts: 15
Joined: Sat Jun 04, 2011 10:55 am

Re: Gromacs cannot continue further.

Postby patonb » Tue Aug 09, 2011 2:24 pm

The problem isn't overclocking. It's the GPU's inability to switch perfectly from one graphics task to another. So unless the Pande Group starts making GPUs, they have no control over the glitch.

You want more efficiency, but what if your idea is implemented? A GPU fails and restarts, but this time the error isn't caught. Two months from now, the error is noticed. By then that minor glitch has propagated through the subsequent runs, invalidating all that work and wasting two months. Whereas if the unit is dropped and resent to someone else, the project is not in jeopardy and an average of 2-3 hours is lost.

If you're all for science, this minor slowdown is well worth it.
patonb
 
Posts: 1074
Joined: Thu Oct 23, 2008 2:42 am

Re: Gromacs cannot continue further.

Postby bruce » Tue Aug 09, 2011 3:47 pm

patonb wrote:The problem isn't overclocking. It's the GPU's inability to switch perfectly from one graphics task to another. So unless the Pande Group starts making GPUs, they have no control over the glitch.
Even if we assume, as you probably have, that the problem is inherent in the GPU's hardware, it's up to the drivers to resolve conflicts, so I'd guess that a future version of the drivers will someday resolve this sort of issue. The fundamental problem is lack of repeatability, though, because those who write drivers are unlikely to ever see this failure in their testing. In any case, be sure to report the problem on the GPU's home website.

Are you still running the Nvidia 8800 GTS that's listed in your profile? There's a lot of recent discussion about NV drivers (12 pages and counting) questioning which version works correctly with specific hardware, but there are also some ATI driver issues mentioned here on our website.

It probably wouldn't hurt to report it on the game's technical support site, too, though that's less likely to matter unless your report is followed by a number of "+1" reports.
bruce
Site Admin
 
Posts: 16851
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.
