Solutions to crashes on remote systems?

hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Post by hiigaran »

I'm away from home for up to 8 days at a time, during which my folding rigs turn my bedroom into a smelting furnace. This is generally not an issue. However, two rather annoying problems occasionally crop up.

Problem 1:
My rigs are generally stable, but we all get the occasional FAHCore crash. I've noticed that if I leave one of the computers running for a prolonged period without checking whether anything has crashed, some or all of the GPU slots remain stuck at 0%, showing "Ready". Restarting the computer fixes this.

Problem 2:
This problem has only happened once, just now, but in case it happens again, it would be useful to know if there are solutions. Currently, my TeamViewer list shows one of my three folding rigs as offline. It can't be an internet issue, otherwise the other systems would be shown offline as well. Since TeamViewer starts up automatically, I know the system did not restart either. The only two possibilities I can think of are that the system has frozen and/or got a BSOD, or that something died. Since the former is more likely than the latter (and I certainly hope that is the case!), it will remain stuck in that state until I return and restart the computer myself.

So, regarding the first problem, is there any way to mitigate the issue? I know I could remote in more frequently to keep the loss of computing time as small as possible, but is there a better solution than that? I've seen some people talk about scripting automatic restarts every 24 hours, and I have a feeling that this might be what I'll have to settle for, but again, is there anything better? Can the root problem, the core crashes, be solved? Is there any reason why the client does not recover after a crash and leaves the GPU slots stuck at 0%?
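
For reference, the scheduled-restart idea mentioned above could look roughly like this on Windows. This is only a minimal sketch, assuming the built-in `schtasks` and `shutdown` tools and an elevated prompt; the task name `FAHDailyReboot` and the 04:00 start time are made-up placeholders, not anything from this thread.

```python
import subprocess


def build_daily_reboot_task(task_name="FAHDailyReboot", start_time="04:00"):
    """Return the schtasks arguments that register a daily forced reboot.

    task_name and start_time are hypothetical placeholders.
    """
    return [
        "schtasks", "/Create",
        "/TN", task_name,                  # name of the scheduled task
        "/SC", "DAILY",                    # run once per day
        "/ST", start_time,                 # 24-hour HH:MM start time
        "/TR", "shutdown /r /f /t 60",     # forced restart after a 60 s delay
        "/RU", "SYSTEM",                   # run even when nobody is logged in
        "/F",                              # overwrite an existing task of the same name
    ]


def register_task(args):
    """Run the schtasks command; raises CalledProcessError if it fails."""
    return subprocess.run(args, check=True)
```

Calling `register_task(build_daily_reboot_task())` once on each rig would register the task. A blind daily reboot throws away some work in progress, so it is a fallback rather than a fix for the underlying crashes.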

As for problem 2, I need to double check if auto restarts are disabled for BSODs, which might solve the issue (further investigations will need to be done when I get home obviously), but if it is a case of a system being frozen, but not in a BSOD state, can anything be done remotely, or would I be completely out of luck there?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Solutions to crashes on remote systems?

Post by bruce »

In my personal experience, problems detected by a FAHCore DO NOT CRASH the OS. Occasionally something will happen to a single WU, but FAH dumps it and resumes work. It's designed to run unattended. Unfortunately, that only works if the operating system itself is reliable.

BSODs, hung systems, and system crashes tend to be the result of too much heat, too much overclocking, marginal hardware (especially power), or bad drivers. To mitigate problems like the ones you're describing, I would reset everything to non-overclocked settings and make sure the temperatures are low.

There should be entries in the Windows event viewer (or the equivalent on Linux or OSX) indicating the nature of each BSOD.
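
On Windows, those entries can also be pulled from a script rather than the Event Viewer GUI. The following is a minimal sketch, assuming the built-in `wevtutil` tool; the choice of event IDs 41 (Kernel-Power, unexpected shutdown) and 1001 (BugCheck) is an assumption about what is worth looking for, not something stated in this thread.

```python
import subprocess


def build_crash_query(event_ids=(41, 1001), max_events=10):
    """Return wevtutil arguments that list recent crash-related System events.

    The default event IDs are an assumption: 41 = Kernel-Power, 1001 = BugCheck.
    """
    id_filter = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",   # query the System log
        f"/q:{xpath}",                # XPath filter on event ID
        f"/c:{max_events}",           # cap the number of events returned
        "/rd:true",                   # newest events first
        "/f:text",                    # human-readable output
    ]


def read_crash_events(args):
    """Run the query and return wevtutil's text output."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout
```

Running `print(read_crash_events(build_crash_query()))` after a reboot should show whether the machine blue-screened or simply lost power.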

Obviously if you and the system are remote from each other, there's not much you can do until you go home.
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: Solutions to crashes on remote systems?

Post by hiigaran »

At the moment, I am running all video cards at 85% power through my overclocking software, to maintain temperatures of 75 degrees. Both systems, which have four 1080s each, run on 1500 watt units. Neither heat nor power should be an issue.

Still, it would be all well and good to just restart the computer when something like this happens, as I can be made aware of a BSOD or frozen computer much more easily (Teamviewer will show the computer in question as offline). However, the biggest issue is trying to figure out why F@H does not recover after a core crash, despite the computer remaining operational. I haven't seen any GPU driver crashes, mind you. Just Windows telling me that the process has crashed.
SteveWillis
Posts: 409
Joined: Fri Apr 15, 2016 12:42 am
Hardware configuration: PC 1:
Linux Mint 17.3
three gtx 1080 GPUs One on a powered header
Motherboard = [MB-AM3-AS-SB-990FXR2] qty 1 Asus Sabertooth 990FX(+59.99)
CPU = [CPU-AM3-FX-8320BR] qty 1 AMD FX 8320 Eight Core 3.5GHz(+41.99)

PC2:
Linux Mint 18
Open air case
Motherboard: ASUS Crosshair V Formula-Z AM3+ AMD 990FX SATA 6Gb/s USB 3.0 ATX AMD
AMD FD6300WMHKBOX FX-6300 6-Core Processor Black Edition with Cooler Master Hyper 212 EVO - CPU Cooler with 120mm PWM Fan
three gtx 1080,
one gtx 1080 TI on a powered header

Re: Solutions to crashes on remote systems?

Post by SteveWillis »

If it turns out to be a BSOD this might work: https://3gstore.com/products/107_view_a ... wAodv6EI2w
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: Solutions to crashes on remote systems?

Post by hiigaran »

Oh, right, I forgot to update this. The majority of the issues were caused by what I believe to be a faulty USB wireless device that I have been using on one of my rigs, as I had no ethernet cables to spare. After wiring it up directly to the router, almost all of the problems went away. I still have to restart the rigs, but not as frequently.

I think Windows also automatically updated the NVIDIA drivers while I was away. I had deliberately used an older version because at the time there were issues with newer drivers, but I'm guessing some newer versions have been released since then which finally allowed folding again. I'm still going to have to keep an eye on things though. I've seen a single instance of slots showing 'failed', but it might have just been a one-off.

Currently, the most annoying things are the process crashes. When a fahcore crashes and Windows displays the message, it seems like the offending slot will not commence work on a new work unit until the crash message has been dismissed, which means that while I'm away, I would still need to frequently remote in to my systems and dismiss the messages to ensure that all slots are working. Is there a way around this particular issue?
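
One thing that may be worth trying is telling Windows Error Reporting not to show the crash dialog at all. This is a minimal sketch, assuming the documented `DontShowUI` value under the Windows Error Reporting registry key and the built-in `reg` tool; whether suppressing the dialog actually lets the FAH slot pick up a new work unit is an untested assumption.

```python
import subprocess

# Per-user Windows Error Reporting settings key.
WER_KEY = r"HKCU\Software\Microsoft\Windows\Windows Error Reporting"


def build_hide_crash_dialog_cmd(key=WER_KEY):
    """Return the `reg add` arguments that set DontShowUI to 1 (no crash dialog)."""
    return [
        "reg", "add", key,
        "/v", "DontShowUI",     # value name read by Windows Error Reporting
        "/t", "REG_DWORD",
        "/d", "1",              # 1 = do not show the crash dialog
        "/f",                   # overwrite without prompting
    ]


def apply_setting(args):
    """Run the reg command; raises CalledProcessError if it fails."""
    return subprocess.run(args, check=True)
```

Calling `apply_setting(build_hide_crash_dialog_cmd())` under the account that runs the client would apply it. Setting the value back to 0 restores the dialog.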
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Solutions to crashes on remote systems?

Post by bruce »

hiigaran wrote:
At the moment, I am running all video cards at 85% power through my overclocking software, to maintain temperatures of 75 degrees. Both systems, which have four 1080s each, run on 1500 watt units. Neither heat nor power should be an issue.

Still, it would be all well and good to just restart the computer when something like this happens, as I can be made aware of a BSOD or frozen computer much more easily (Teamviewer will show the computer in question as offline). However, the biggest issue is trying to figure out why F@H does not recover after a core crash, despite the computer remaining operational. I haven't seen any GPU driver crashes, mind you. Just Windows telling me that the process has crashed.

The best solution is to provide hardware/software that does not cause Windows to crash. Simply assuming that there will always be BSODs or frozen computers is a bad plan.

What does the event viewer say happened? What can you do so that the event doesn't recur?
PS3EdOlkkola
Posts: 184
Joined: Tue Aug 26, 2014 9:48 pm
Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500
Location: Dallas, TX

Re: Solutions to crashes on remote systems?

Post by PS3EdOlkkola »

Have you tried pausing the slot that has crashed, waiting about two minutes, then unpausing the slot? This problem happens on my rigs roughly once every 1,500 work units (i.e. about every three days), and on random systems and GPUs. The pause, wait 2 minutes, unpause process works for me every time. If it doesn't, there is usually another, more serious issue, like failing hardware.
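
The pause, wait, unpause routine can also be scripted so it does not need a remote desktop session. This is a minimal sketch, assuming the v7 FAHClient is listening on its default local command port (36330) with no password set and accepts `pause <slot>` and `unpause <slot>`; the helper names `send_command` and `pause_cycle` are made up for illustration.

```python
import socket
import time


def send_command(command, host="127.0.0.1", port=36330, timeout=5.0):
    """Send one command line to the FAHClient command socket and return its reply text.

    The host, port and lack of authentication are assumptions about the local setup.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((command + "\n").encode("ascii"))
        time.sleep(1.0)  # give the client a moment to answer
        try:
            return conn.recv(65536).decode("utf-8", errors="replace")
        except socket.timeout:
            return ""


def pause_cycle(slot, wait_seconds=120, send=send_command, sleep=time.sleep):
    """Pause a slot, wait, then unpause it (the two-minute routine described above)."""
    send(f"pause {slot}")
    sleep(wait_seconds)
    send(f"unpause {slot}")
```

Calling `pause_cycle(0)` would cycle slot 0 with the two-minute wait. The `send` and `sleep` parameters exist only so the routine can be exercised without a running client.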