How does FAH protect against memory errors?

Moderators: Site Moderators, FAHC Science Team

NoMoreQuarantine
Posts: 182
Joined: Tue Apr 07, 2020 2:38 pm

How does FAH protect against memory errors?

Post by NoMoreQuarantine »

Most servers & supercomputers use ECC to prevent memory errors, but FAH relies on mostly consumer grade hardware that generally does not use ECC. Is anything done to detect and protect against memory errors? Are memory errors not a big deal for this kind of system?
Joe_H
Site Admin
Posts: 7856
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: How does FAH protect against memory errors?

Post by Joe_H »

In the case of processing on GPUs, periodic sanity checks are done on the WU data, with those check calculations being done by the CPU. If a sanity check fails, the WU restarts from the previous checkpoint. After too many errors causing restarts, the WU is failed out and a report is sent back so the WU gets assigned to another system.

For CPU processing these kinds of errors will usually result in a fault condition such as a NaN error. Again there will be a restart from a prior checkpoint, etc.
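
To make the restart logic concrete, here is a minimal sketch of a checkpoint-and-retry loop. It is not the FAH client code: the names `run_work_unit`, `is_sane`, `WorkUnitFailed`, `checkpoint_every` and `max_restarts` are all made up for illustration, and the "sanity check" is simply a test for NaN or infinite values.

```python
import math


class WorkUnitFailed(Exception):
    """Raised when a work unit needs more restarts than allowed."""


def is_sane(state):
    # Hypothetical sanity check: every value must be a finite number.
    return all(math.isfinite(x) for x in state)


def run_work_unit(step, initial_state, total_steps,
                  checkpoint_every=10, max_restarts=3):
    """Advance `state` with `step`, rolling back to the last good checkpoint
    whenever a sanity check fails. Returns (final_state, restart_count)."""
    state = list(initial_state)
    checkpoint = (0, list(state))  # (step index, copy of state)
    restarts = 0
    i = 0
    while i < total_steps:
        state = step(state, i)
        i += 1
        if i % checkpoint_every == 0 or i == total_steps:
            if is_sane(state):
                checkpoint = (i, list(state))  # save a new good checkpoint
            else:
                restarts += 1
                if restarts > max_restarts:
                    # Give up so the work can be reassigned elsewhere.
                    raise WorkUnitFailed(f"too many restarts at step {i}")
                i, saved = checkpoint  # roll back to the last good state
                state = list(saved)
    return state, restarts
```

A transient fault (one that does not recur when the steps are replayed) costs only the work since the last checkpoint, while a fault that recurs on every replay eventually raises `WorkUnitFailed`, which mirrors the "failed out and reassigned" case described above.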

All WUs get basic sanity and other checks when received by the servers. The next Gen WU is created from that return and sent out.

Ultimately the Markov State Model methods being used are statistical in nature, so individual trajectories are only part of the statistics being analyzed. With the calculations being spread over a range of systems, an error that escaped other checks should not be enough to change the final results.
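
As a toy illustration of that statistical argument (not the actual FAH analysis pipeline), the sketch below estimates one transition probability of a two-state Markov chain from many simulated trajectories, then replaces one trajectory with random garbage and re-estimates. The transition matrix, trajectory counts and function names are all assumptions picked for the example.

```python
import random


def simulate(rng, p, length, start=0):
    """Simulate a two-state Markov chain with transition matrix p."""
    traj = [start]
    for _ in range(length - 1):
        s = traj[-1]
        traj.append(1 if rng.random() < p[s][1] else 0)
    return traj


def estimate_p01(trajectories):
    """Estimate P(0 -> 1) by counting transitions over all trajectories."""
    from0 = to1 = 0
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            if a == 0:
                from0 += 1
                to1 += b
    return to1 / from0


def compare(seed=0, n_traj=200, length=500):
    rng = random.Random(seed)
    p = [[0.9, 0.1], [0.2, 0.8]]  # assumed "true" transition matrix
    clean = [simulate(rng, p, length) for _ in range(n_traj)]
    corrupted = [list(t) for t in clean]
    # Replace one trajectory with uniformly random states (an undetected error).
    corrupted[0] = [rng.randint(0, 1) for _ in range(length)]
    return estimate_p01(clean), estimate_p01(corrupted)


if __name__ == "__main__":
    clean_est, corrupted_est = compare()
    print(f"clean estimate:     {clean_est:.4f}")
    print(f"with one bad traj.: {corrupted_est:.4f}")
```

With 200 trajectories, one bad trajectory supplies only about 0.5% of the transition counts, so the estimate moves very little. The same reasoning is why a single error that slips past the other checks is unlikely to change the pooled result.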

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
NoMoreQuarantine
Posts: 182
Joined: Tue Apr 07, 2020 2:38 pm

Re: How does FAH protect against memory errors?

Post by NoMoreQuarantine »

Thanks Joe_H! That is super cool. I wish I could see how this is implemented in detail. While I'm a complete novice, probability theory is a big area of interest for me; particularly when applied to computers.
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: How does FAH protect against memory errors?

Post by MeeLee »

I've run GPU WUs for a few years, and I occasionally (perhaps 6 times a year, est.) run into errors that I can't classify.
Most of the errors happen due to incorrectly set voltages or overclock settings.
But with those out of the equation, I think modern memory has come a long way.
Since the WUs are only in memory for between 1 and 24 hours on most GPUs and CPUs, the chances of errors are also lower.
If WUs were to process for days, ECC memory might be necessary.