General problem with Stanford Servers?

Moderators: Site Moderators, FAHC Science Team

Mr.Nosmo
Posts: 26
Joined: Fri Oct 17, 2008 9:04 am
Hardware configuration: i7-980X@4200 (AirCooled) w/6GB DDR3-1600 (8-8-8-20), 200GB Vertex2 SSD & 3xGTS-250 w/22" Eizo Monitors.

General problem with Stanford Servers?

Post by Mr.Nosmo »

Lately I have had issues with getting work and uploading work, and there are quite a few posts in the forum about this from other folders. Maybe it's because of "Internet Security" software, ISPs, or other factors?

Does Stanford have a general issue with the servers, or is it me asking too much? I'm not used to downtime, because I used to work on IBM zSeries systems, and downtime is something "we" can't accept...

I'm not complaining, and I'll continue to fold as long as the project runs (or until I die), since I can afford the electricity bill. But I'd be happy to get a bit of info about the servers and maybe start a brainstorm on how the uptime can be improved...

Please chime in with some positive input!
John_Weatherman
Posts: 289
Joined: Sun Dec 02, 2007 4:31 am
Location: Carrizo Plain National Monument, California
Contact:

Re: General problem with Stanford Servers?

Post by John_Weatherman »

There have been some server issues, but nothing major. If you're having problems, post your log file and it can be checked out. A little downtime is something folders must get used to, but luckily it doesn't normally mess up the science. And that's what we're all here for.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General problem with Stanford Servers?

Post by bruce »

IBM is in the business of providing high-availability solutions. If a merchant can't get a credit card authorization in a few seconds, it's a serious failure. FAH is not like that.

FAH takes a distributed approach where (almost) any component of the system can fail occasionally, and it may take some time to repair. As long as the overall functionality isn't seriously impacted, some downtime is acceptable; i.e., a component can be down as long as the overall system is still functional.
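That tolerance argument can be made concrete with a back-of-the-envelope availability calculation. The 95% per-server uptime below is purely an illustrative assumption, not a measured FAH figure:

```python
# Back-of-the-envelope: if each of N interchangeable work servers is
# independently up a fraction p of the time, the chance that at least
# one of them can hand out work is 1 - (1 - p)**N.
# The 0.95 uptime figure is an assumption for illustration only.
def fleet_availability(p, n_servers):
    return 1 - (1 - p) ** n_servers

for n in (1, 2, 4):
    print(f"{n} server(s): {fleet_availability(0.95, n):.6f} effective uptime")
```

Even modest per-server uptime multiplies out to very high fleet availability, which is why a single down server rarely stops anyone from folding.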

As you collect gripes from others and attempt to draw overall conclusions about FAH (concluding that "we" can't accept downtime), consider the following:
> If I can't get work from my favorite server yet I can get work from some other server, should that be called a failure?
(>>> If the other server gives me a WU with a lower PPD, is that a failure?)
> If I can't upload right now but the result does find its way home in a few hours, should that be called a failure?
> If somebody discards a WU (intentionally or unintentionally), should that be called a FAH failure?
> If I can't get work for XX minutes, should that be called a failure?
(>>> ... and how should XX compare to the "few seconds" in my credit card authorization example?)
> If some points are temporarily "lost" but a re-credit is processed manually some time later, should that be called a failure?
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona
Contact:

Re: General problem with Stanford Servers?

Post by 7im »

Yes, Stanford does have problems. They are underfunded, understaffed, and overworked students and faculty. They are not technology professionals (well, most aren't). As Bruce indicated, the project usually has many servers to provide and collect work, so individual server uptime is not a critical issue. And even when that doesn't work, the FAH client is designed to cache the completed work unit, download new work from a different server, and keep processing until a solution is found. Even then, a FAH data packet is not the same priority as a banking transaction data packet. In real life, the bank is the priority. (To me, the FAH data packet is more important, because I'm guaranteed to die someday, but never guaranteed to have a lot of money ;))
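That fall-back-to-another-server behavior can be sketched as a simple failover loop. Everything here is hypothetical for illustration (`fetch_work`, `demo_fetch`, and the server names are stand-ins, not the real client's code or hosts):

```python
import time

def fetch_work(servers, fetch, retries=3, backoff=0.0):
    """Ask each server in turn for a work unit; retry the whole list,
    pausing between rounds. One reachable server is enough, which is
    why individual-server downtime rarely stops a client from folding."""
    for _round in range(retries):
        for server in servers:
            try:
                return fetch(server)   # success on any server ends the search
            except ConnectionError:
                continue               # server down or slow: try the next one
        time.sleep(backoff)            # wait before the next full round
    raise RuntimeError("no work server reachable")

# Hypothetical demo: the first server always times out, the second answers.
def demo_fetch(server):
    if server == "work1.example.org":
        raise ConnectionError("timed out")
    return {"server": server, "unit": "WU-1234"}

print(fetch_work(["work1.example.org", "work2.example.org"], demo_fetch))
```

The real client also keeps completed results cached on disk until an upload finally succeeds, so a collection-server outage delays credit rather than losing work.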

Brainstorm...

1. Huge donation of IBM servers, IBM service, or cash, or all three.
2. Patience while Stanford continues the process of installing, converting over to, and optimizing the new servers that $100,000 purchased for them last year.
3. Patience while the client and server code is rewritten from the ground up to be more up to date, reliable, and easier to maintain. (The second half of #1 could help speed this part along.) The V7 client may help, but the new server code is a much bigger job.
4. Patience while the Pande Group gets its new crop of researchers up to speed and expands the research and the location of FAH servers to several other universities. (Co-location can be a great uptime helper, as long as the sites are well managed and well integrated.)
5. And if no patience is available, then tolerance is an acceptable alternative. ;)

Any additions to what's already taking place? :D


P.S. The head of the project addressed a similar issue... and I hope he doesn't mind if I repost it...
VijayPande wrote:One has to put this all in perspective. Supercomputer centers have 10x to 100x the budget we have for operations and still are often down over the weekends for much longer than FAH is when something unexpected comes up. We have some very dedicated people in our team -- people willing to do fixes on weekends and holidays -- but they do have to sleep. Also, running a FAH server is not like running apache (it is a lot more complex and people aren't familiar with it), so hiring a 3rd party firm to manage off hours wouldn't work (or would be very, very expensive).

So, if you see a problem that isn't being fixed and it's in between 10:30pm and 7:30am pacific time, odds are it will have to wait until 8:30am pacific time or so for someone to deal with it. We've built a lot of redundancy into FAH operations, but there are limits to this too, especially in very early beta projects like GPU3. Hopefully with this in mind, people can have a better sense of when fixes can be made, and how hard we work to fix them as quickly as possible.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Bobby-Uschi
Posts: 70
Joined: Thu Jul 31, 2008 3:26 pm
Hardware configuration: PC1//C2Q-Q9450,GA-X48-DS5-NinjaMini,GTX285,2x160GB Western Sata2,2x1GB Geil800,Tagan 800W;XP Pro SP3-32Bit;
PC2//C2Q-Q2600k.GB-P67UD4-Freezer 7Pro,GTX285Leadtek,260 GB Western Sata2,4x2GB GeilPC3,OCZ600W;Win7-64Bit;Siemens 22"
Location: Deutschland

Re: General problem with Stanford Servers?

Post by Bobby-Uschi »

http://fah-web.stanford.edu/serverstat.html -????
It will not work for GPU2. No connection to any GPU server.
PC1//C2Q-Q9450,GA-X48-DS5-,2xGTX285,2x160GB Western Sata2,2x1GB Geil800,Tagan 800W;XP Pro SP3-32Bit
PC2//C2Q-Q2600k.GB-P67UD4-Freezer 7Pro,GTX285Leadtek,260 GB Western Sata2,4x2GB GeilPC3,OCZ600W;Win7-64Bit;Siemens 22"
toTOW
Site Moderator
Posts: 6312
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: General problem with Stanford Servers?

Post by toTOW »

Some servers are pretty slow to answer, but I got work on my two NV GPUs after a few attempts ...

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
weedacres
Posts: 138
Joined: Mon Dec 24, 2007 11:18 pm
Hardware configuration: UserNames: weedacres_gpu ...
Location: Eastern Washington

Re: General problem with Stanford Servers?

Post by weedacres »

Lots of hand-tending this morning: I've had to manually restart GPU clients that hung while trying to download from 171.67.108.20 and .21.
Bobby-Uschi
Posts: 70
Joined: Thu Jul 31, 2008 3:26 pm
Hardware configuration: PC1//C2Q-Q9450,GA-X48-DS5-NinjaMini,GTX285,2x160GB Western Sata2,2x1GB Geil800,Tagan 800W;XP Pro SP3-32Bit;
PC2//C2Q-Q2600k.GB-P67UD4-Freezer 7Pro,GTX285Leadtek,260 GB Western Sata2,4x2GB GeilPC3,OCZ600W;Win7-64Bit;Siemens 22"
Location: Deutschland

Re: General problem with Stanford Servers?

Post by Bobby-Uschi »

Thanks toTOW,
Two machines are working again, but one is without work.
The servers are so slow...
Thank you
PC1//C2Q-Q9450,GA-X48-DS5-,2xGTX285,2x160GB Western Sata2,2x1GB Geil800,Tagan 800W;XP Pro SP3-32Bit
PC2//C2Q-Q2600k.GB-P67UD4-Freezer 7Pro,GTX285Leadtek,260 GB Western Sata2,4x2GB GeilPC3,OCZ600W;Win7-64Bit;Siemens 22"