plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Thu Jan 18, 2018 7:23 pm

`plfah1-*.mskcc.org` is having some temporary RAID issues, so the work servers are being suspended for a few hours.

~ The Chodera lab
JohnChodera
Pande Group Member
 
Posts: 114
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JimboPalmer » Thu Jan 18, 2018 7:37 pm

Thank you for notifying us!

(It is always comforting to know it is not something we did)
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
JimboPalmer
 
Posts: 589
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Fri Jan 19, 2018 4:47 pm

Update: The RAID rebuild failed and the controller may be faulty, but we're trying to reseat the cabling first in case that fixes the issue. If not, we'll replace the controller and start rebuilding the RAID, bringing the server back online once the rebuild is complete.

We've heard some sporadic reports that the WUs did not have any unaffected servers listed as collection servers (CSs), so we're coordinating with other FAH Consortium labs to add more offsite collection servers and keep disruption minimal should this happen again.

More updates soon. Again, our apologies for the downtime here.

The affected server IP range is 140.163.4.231-140.163.4.235

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Fri Jan 19, 2018 6:42 pm

UPDATE: Reseating the cables did not resolve the issue, so Dell is dispatching a technician with parts today, within 4 hours, to replace the RAID controller, drawer, and SAS chain cable.

We'll post an updated ETA for restoration once the RAID has started rebuilding.

~ The Chodera Lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Sat Jan 20, 2018 1:00 am

The hardware vendor apparently doesn't consider the chassis drawer to be covered by our 4-hour onsite warranty, so it is shipping a replacement drawer instead. Unfortunately, this means the earliest we expect to be back online, after the RAID rebuild, is Thu 25 Jan.

Apologies again for the disruption, and I'll update if there is any new information in the meantime.

~ The Chodera lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Sat Jan 20, 2018 6:21 pm

Update: Dell is dispatching a tech with the replacement part this morning! Hopefully we will be back online sooner than planned!

~ The Chodera Lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Sat Jan 20, 2018 8:40 pm

UPDATE: Our awesome Open Systems Group and datacenter team now have the hardware replaced and the RAID is rebuilding, with an ETA for completion of 60+ hours.

~ The Chodera lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Mon Jan 22, 2018 2:39 am

UPDATE: Estimates suggest approximately 40 hours remain for RAID rebuild.

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JimboPalmer » Mon Jan 22, 2018 4:50 pm

Dr Chodera,

The donors get error messages with IP addresses, while you have reported the downtime of a server by DNS name. Would it be possible to give us the IP address so we could use your estimated time to rebuild to address donor issues?

Yours

Jimbo

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby Joe_H » Mon Jan 22, 2018 5:06 pm

These WSs are not currently taking any connections, so I doubt there will be any donor reports for a while. But if you look at the Server Status page, the ones down are IP addresses 140.163.4.231-235 and 140.163.4.241-245.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 3973
Joined: Tue Apr 21, 2009 4:41 pm
Location: W. MA

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Mon Jan 22, 2018 9:07 pm

WSs 140.163.4.231-233 are back online! Thanks for your patience.

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby JohnChodera » Mon Jan 22, 2018 9:07 pm

I'll be sure to report IP addresses in the subject line next time. Sorry for the hassle!

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Postby ChristianVirtual » Tue Jan 23, 2018 12:05 pm

Thanks for the effort, and greetings to the IT support team!
Please contribute your logs to http://ppd.fahmm.net
ChristianVirtual
 
Posts: 1526
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo, Japan