Server has no record of this unit


Re: Server has no record of this unit

Postby deesy58 » Mon Feb 15, 2010 4:17 am

Maybe nobody thinks they should have to work on weekends. Maybe the project is not sufficiently important ...
deesy58
 
Posts: 16
Joined: Mon Feb 02, 2009 8:46 pm

Re: Server has no record of this unit

Postby EvilAlchemist » Mon Feb 15, 2010 4:58 am

deesy58 wrote:Maybe nobody thinks they should have to work on weekends. Maybe the project is not sufficiently important ...


This is not a business. It is an education institution.
Believe it or not, Stanford & the Pande Group do not have the financial & personnel resources to keep someone in the labs & server rooms 24/7.
They take F@H very seriously and do everything they can to ensure the maximum amount of "uptime" on the servers.

With the sheer number of servers they have to maintain, it is a challenge to say the least.
http://fah-web.stanford.edu/serverstat.html

Give them a little time and they will get it fixed, they always do.

leexgx wrote:they need to add an remote pc that they can use to press the buttons remotely on the servers that they have (


They do already have that ability, but that assumes it is a software issue.
If a piece of hardware has failed, like a RAID card, then a "reset" will not solve the problem.
EvilAlchemist
 
Posts: 329
Joined: Fri Feb 08, 2008 4:24 pm
Location: Columbia, Tennessee

Re: Server has no record of this unit

Postby deesy58 » Mon Feb 15, 2010 5:13 am

The server that is not responding to my machine seems to have a different IP address (171.67.108.21). If several servers have failed, wouldn't it be sufficiently important to come in on a weekend to fix them? I seem to recall spending more than one evening and weekend "babysitting" a balky server for no extra pay when I worked in IT. I wonder why I did that. Hmm ...
deesy58
 
Posts: 16
Joined: Mon Feb 02, 2009 8:46 pm

Re: Server has no record of this unit

Postby VijayPande » Mon Feb 15, 2010 3:39 pm

deesy58 wrote:Maybe nobody thinks they should have to work on weekends. Maybe the project is not sufficiently important ...


Yes, quite to the contrary, we have teams working on the weekends and on holidays (today is a holiday in the US). However, we don't have support over the full 24 hours. If something happens at night (eg 10pm, 11pm, or after) it will not be something we can get to until early the next morning (eg 7am). For example, that's what happened with the NVIDIA servers this time.

However, at 7am sharp, we were on the job, diagnosing the problem, and fixing it. In this case, there was a v5 WS bug which broke some of the redundancy (a server was down in a strange way, so it was not reporting that it was down, so the AS kept on assigning to it).

Anyway, that's resolved now. However, please keep in mind that from ~10pm to 7am Pacific time, we won't be able to immediately respond to most issues. (Stanford does have basic staff for basic network connectivity, etc., but the FAH servers are complex enough that this is beyond what the basic staff would handle.)
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
 
Posts: 2651
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Server has no record of this unit

Postby bingo-dog » Mon Feb 15, 2010 4:05 pm

The issue this time around is that the problem was first reported around 7AM pacific time on Sunday, with no response until 24 hours later. I think the lack of any staff or moderator response in this thread
viewtopic.php?f=18&t=13434&start=0
is what had people quite upset.

I'm sure there are plenty of folks still wondering if any of the work that received the "server has already received unit" message actually went to advancing the projects, or just into the bit bucket.
bingo-dog
 
Posts: 47
Joined: Tue Dec 29, 2009 3:41 pm

Re: Server has no record of this unit

Postby VijayPande » Mon Feb 15, 2010 4:29 pm

bingo-dog wrote:The issue this time around is that the problem was first reported around 7AM pacific time on Sunday, with no response until 24 hours later. I think the lack of any staff or moderator response in this thread
viewtopic.php?f=18&t=13434&start=0
is what had people quite upset.

I'm sure there are plenty of folks still wondering if any of the work that received the "server has already received unit" message actually went to advancing the projects, or just into the bit bucket.


Indeed, after looking more deeply at this, I see that you are right. I'm very sorry that somehow this got missed by us, especially since many of us were working yesterday (Sunday). I'll do some checking to see how this fell between the cracks both on my team's side and the moderators'. It is unfortunate that the WS bug that caused this led to everything "looking" fine by our normal analytics (otherwise FAH serverstat would have told my team there is a problem as well).
VijayPande
Pande Group Member
 
Posts: 2651
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Server has no record of this unit

Postby seanego » Mon Feb 15, 2010 5:10 pm

And can anyone at PG please look at the problem with upload when it responds "Server does not have record of this unit", too?
seanego
 
Posts: 17
Joined: Wed Feb 10, 2010 12:35 pm
Location: Moscow

Re: Server has no record of this unit

Postby VijayPande » Mon Feb 15, 2010 5:32 pm

seanego wrote:And can anyone at PG please look at the problem with upload when it responds "Server does not have record of this unit", too?


Please see the discussion of this in the other threads (e.g., here is a recent post viewtopic.php?f=18&t=13434&p=131402#p131402 but there is a dedicated thread on this topic).
VijayPande
Pande Group Member
 
Posts: 2651
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Server has no record of this unit

Postby weedacres » Mon Feb 15, 2010 11:17 pm

VijayPande wrote:
bingo-dog wrote:The issue this time around is that the problem was first reported around 7AM pacific time on Sunday, with no response until 24 hours later. I think the lack of any staff or moderator response in this thread
viewtopic.php?f=18&t=13434&start=0
is what had people quite upset.

I'm sure there are plenty of folks still wondering if any of the work that received the "server has already received unit" message actually went to advancing the projects, or just into the bit bucket.


Indeed, after looking more deeply at this, I see that you are right. I'm very sorry that somehow this got missed by us, especially since many of us were working yesterday (Sunday). I'll do some checking to see how this fell between the cracks both on my team's side and the moderators'. It is unfortunate that the WS bug that caused this led to everything "looking" fine by our normal analytics (otherwise FAH serverstat would have told my team there is a problem as well).

This problem actually started around 2200 PST on Saturday. That is when the first of my completed work units could not be uploaded.
weedacres
 
Posts: 394
Joined: Mon Dec 24, 2007 11:18 pm
Location: Eastern Washington

Re: Server has no record of this unit

Postby VijayPande » Mon Feb 15, 2010 11:47 pm

There are several threads on this. I've been posting in viewtopic.php?f=18&t=13434&start=165 .

I'll lock this and suggest people post in the other thread.

The bottom-line update right now is that this is not a simple fix (e.g., not as simple as restarting a server) and we are hunting down a WS bug that may have been recently introduced.
VijayPande
Pande Group Member
 
Posts: 2651
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford
