Solutions to crashes on remote systems?

Postby hiigaran » Wed Jan 25, 2017 2:34 am

I'm away from home for up to 8 days at a time, during which my folding rigs turn my bedroom into a smelting furnace. This is generally not an issue. However, I have two problems that occasionally crop up and are rather annoying.

Problem 1:
Even on stable systems, we all get the occasional FAHCore crash. I've noticed that if I leave one of the computers on for a prolonged period without checking whether anything has crashed, some or all of the GPU slots remain stuck at 0%, showing "Ready". This is fixed by restarting the computer.

Problem 2:
This problem has only happened once, just now, but in case this happens again in the future, it would be useful to know if there are solutions. Currently, in my TeamViewer list, it shows one of my three folding rigs offline. Can't be an internet issue, otherwise the other systems would be shown offline as well. Since TV starts up automatically, I know the system did not restart either. The only two possible things I can think of are either that the system has frozen and/or got a BSOD, or something died. Since the former is more likely than the latter (and I certainly hope that is the case!), it will remain stuck in that state until I return and restart the computer myself.

So, regarding the first problem, is there any way to mitigate the issue? I know I could remote in more frequently to keep the loss of computing time as small as possible, but is there a better solution than that? I've seen some people talk about scripting automatic restarts every 24 hours, and I have a feeling that this might be what I'll have to settle for, but again, is there anything better? Can the root problem, the core crashes, be solved? Is there any reason why the client does not recover after a crash and leaves the GPU slots stuck at 0%?

As for problem 2, I need to double check if auto restarts are disabled for BSODs, which might solve the issue (further investigations will need to be done when I get home obviously), but if it is a case of a system being frozen, but not in a BSOD state, can anything be done remotely, or would I be completely out of luck there?
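
For reference, the daily-restart workaround mentioned above could be set up roughly like this. It is only a sketch, assuming a Windows rig with Python installed. The task name, start time, and delay are placeholder values chosen for illustration. It registers a scheduled task with the standard `schtasks` and `shutdown` tools and does nothing FAH-specific.

```python
import subprocess
import sys


def build_daily_restart_task(task_name: str = "FoldingRigDailyRestart",
                             start_time: str = "04:00",
                             delay_seconds: int = 60) -> list[str]:
    """Build a schtasks command that registers a forced restart once a day.

    task_name, start_time and delay_seconds are illustrative defaults.
    """
    # /r = restart, /f = force running applications to close, /t = delay in seconds
    restart_cmd = f"shutdown /r /f /t {delay_seconds}"
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # task name
        "/SC", "DAILY",      # schedule: once per day
        "/ST", start_time,   # start time, 24-hour HH:mm
        "/TR", restart_cmd,  # command the task runs
        "/F",                # overwrite an existing task with the same name
    ]


if __name__ == "__main__":
    cmd = build_daily_restart_task()
    if sys.platform == "win32":
        subprocess.run(cmd, check=True)
    else:
        # On other platforms, just show what would be executed.
        print("Would run:", " ".join(cmd))
```

A forced restart interrupts whatever the slots are doing at that moment, so this trades a little lost work every day for not being stuck for a week.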
hiigaran
 
Posts: 104
Joined: Thu Nov 17, 2011 6:01 pm

Re: Solutions to crashes on remote systems?

Postby bruce » Wed Jan 25, 2017 4:53 am

In my personal experience, problems detected by a FAHCore DO NOT CRASH the OS. Occasionally something will happen to a single WU, but FAH dumps it and resumes work. It's designed to run unattended. Unfortunately, that only works if the operating system itself is reliable.

BSODs, hung systems, and system crashes tend to be a result of too much heat, too much overclocking, marginal hardware (especially power), or bad drivers. To mitigate problems like you're describing, I would reset everything to a non-overclocked system and be sure the temperatures are low.

There should be entries in the Windows event viewer (or the equivalent on Linux or OSX) indicating the nature of each BSOD.
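
As a rough sketch of how to pull those entries without clicking through the Event Viewer GUI (handy over a remote session), something like the following could work. It assumes Windows and uses the built-in `wevtutil` tool. The choice of event IDs is my assumption about what is usually logged around a crash, not an exhaustive list.

```python
import subprocess
import sys

# Event IDs commonly seen in the System log around a crash (assumed selection):
#   41   Kernel-Power: the system rebooted without shutting down cleanly
#   1001 BugCheck: the system rebooted from a bugcheck (BSOD)
#   6008 EventLog: the previous shutdown was unexpected
CRASH_EVENT_IDS = (41, 1001, 6008)


def build_crash_query(event_ids=CRASH_EVENT_IDS, max_events: int = 10) -> list[str]:
    """Build a wevtutil command listing the newest crash-related System events."""
    id_filter = " or ".join(f"EventID={i}" for i in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on event ID
        f"/c:{max_events}",          # maximum number of events to return
        "/rd:true",                  # newest first
        "/f:text",                   # human-readable output
    ]


if __name__ == "__main__":
    cmd = build_crash_query()
    if sys.platform == "win32":
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)
    else:
        print("Would run:", " ".join(cmd))
```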

Obviously if you and the system are remote from each other, there's not much you can do until you go home.
bruce
 
Posts: 21272
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Solutions to crashes on remote systems?

Postby hiigaran » Wed Jan 25, 2017 9:00 am

At the moment, I am running all video cards at 85% power through my overclocking software, to maintain temperatures of 75 degrees. Both systems, which have 4 1080s each, run on 1500 watt units. Neither heat nor power should be an issue.

Still, it would be all well and good to just restart the computer when something like this happens, as I can be made aware of a BSOD or frozen computer much more easily (Teamviewer will show the computer in question as offline). However, the biggest issue is trying to figure out why F@H does not recover after a core crash, despite the computer remaining operational. I haven't seen any GPU driver crashes, mind you. Just Windows telling me that the process has crashed.
hiigaran
 
Posts: 104
Joined: Thu Nov 17, 2011 6:01 pm

Re: Solutions to crashes on remote systems?

Postby SteveWillis » Wed Jan 25, 2017 11:44 am

If it turns out to be a BSOD this might work: https://3gstore.com/products/107_view_a ... wAodv6EI2w
My thanks to my very indulgent wife
http://folding.extremeoverclocking.com/user_summary.php?s=&u=712804

3 AMD Linux rigs 3, 4, and 5 GPUs 7 X GTX 1080, 5 X GTX 1080 TI
SteveWillis
 
Posts: 212
Joined: Fri Apr 15, 2016 12:42 am

Re: Solutions to crashes on remote systems?

Postby hiigaran » Fri Feb 24, 2017 3:09 am

Oh, right, I forgot to update this. The majority of the issues were caused by what I believe to be a faulty USB wireless device that I have been using on one of my rigs, as I had no ethernet cables to spare. After wiring it up directly to the router, almost all of the problems went away. I still have to restart the rigs, but not as frequently.

I think Windows also automatically updated the nVidia drivers while I was away. I had purposefully used an older version because at the time there were issues with newer drivers, but I'm guessing some newer versions have been released since then which finally allowed folding again. I'm still going to have to keep an eye on things though. I've seen a single instance of slots showing 'failed', but it might have just been a one off.

Currently, the most annoying things are the process crashes. When a fahcore crashes and Windows displays the message, it seems like the offending slot will not commence work on a new work unit until the crash message has been dismissed, which means that while I'm away, I would still need to frequently remote in to my systems and dismiss the messages to ensure that all slots are working. Is there a way around this particular issue?
hiigaran
 
Posts: 104
Joined: Thu Nov 17, 2011 6:01 pm

Re: Solutions to crashes on remote systems?

Postby bruce » Fri Feb 24, 2017 4:39 am

hiigaran wrote: At the moment, I am running all video cards at 85% power through my overclocking software, to maintain temperatures of 75 degrees. Both systems, which have 4 1080s each, run on 1500 watt units. Neither heat nor power should be an issue.

Still, it would be all well and good to just restart the computer when something like this happens, as I can be made aware of a BSOD or frozen computer much more easily (Teamviewer will show the computer in question as offline). However, the biggest issue is trying to figure out why F@H does not recover after a core crash, despite the computer remaining operational. I haven't seen any GPU driver crashes, mind you. Just Windows telling me that the process has crashed.


The best solution is to provide hardware/software that does not cause Windows to crash. Simply assuming that there will always be BSODs or frozen computers is a bad plan.

What does the event viewer say happened? What can you do so that event doesn't reoccur?
bruce
 
Posts: 21272
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Solutions to crashes on remote systems?

Postby PS3EdOlkkola » Fri Feb 24, 2017 1:37 pm

Have you tried pausing the slot that has crashed, waiting about two minutes, then unpausing the slot? This problem happens to my rigs about once every ~1,500 (i.e. every three days), and on random systems and GPUs. The pause, wait 2 minutes, unpause process works for me every time. If it doesn't, there is another, usually more serious, issue, like failing hardware.
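
If you'd rather not click through this by hand, the pause / wait / unpause sequence could be scripted along these lines. This is a sketch under assumptions: it relies on the v7 client's local command interface listening on port 36330 with no password set and accepting `pause <slot>` and `unpause <slot>`. The function names and the default two-minute wait are mine.

```python
import socket
import time


def bounce_slot(slot_id, send, sleep=time.sleep, wait_seconds=120):
    """Pause a slot, wait, then unpause it.

    send:  callable that delivers one command line to the client
    sleep: injectable so the sequence can be exercised without really waiting
    """
    send(f"pause {slot_id}\n")
    sleep(wait_seconds)
    send(f"unpause {slot_id}\n")


def bounce_slot_over_socket(slot_id, host="127.0.0.1", port=36330, wait_seconds=120):
    """Run the sequence against a client's command port (assumed reachable, no password)."""
    with socket.create_connection((host, port), timeout=10) as conn:
        bounce_slot(
            slot_id,
            send=lambda line: conn.sendall(line.encode("ascii")),
            wait_seconds=wait_seconds,
        )
```

Calling `bounce_slot_over_socket(1)` on the rig itself would pause slot 1, wait two minutes, and unpause it. If a command password is configured, the script would need to authenticate first, which this sketch does not do.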
PS3EdOlkkola
 
Posts: 184
Joined: Tue Aug 26, 2014 9:48 pm
Location: Dallas, TX

