plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Thu Jan 18, 2018 7:23 pm
by JohnChodera
`plfah1-*.mskcc.org` is having some temporary RAID issues, so the work servers are being suspended for a few hours.

~ The Chodera lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Thu Jan 18, 2018 7:37 pm
by JimboPalmer
Thank you for notifying us!

(It is always comforting to know it is not something we did)

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Fri Jan 19, 2018 4:47 pm
by JohnChodera
Update: The RAID rebuild failed and the controller may be faulty, but we're trying to reseat the cabling first in case that fixes the issue. If not, we'll replace the controller and restart the RAID rebuild, bringing the server back online once the rebuild is complete.

We've heard sporadic reports that some WUs did not have any unaffected servers listed as collection servers (CSs), so we're coordinating with other FAH Consortium labs to add more offsite collection servers. That should keep disruption minimal if this ever happens again.

More updates soon. Again, our apologies for the downtime here.

The affected server IP range is 140.163.4.231-140.163.4.235

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Fri Jan 19, 2018 6:42 pm
by JohnChodera
UPDATE: Reseating the cables did not resolve the issue, so Dell is dispatching a technician and parts within 4 hours today to replace the RAID controller, drawer, and SAS chain cable.

More updates on ETA for restoration once the RAID has started rebuilding.

~ The Chodera Lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Sat Jan 20, 2018 1:00 am
by JohnChodera
The hardware vendor apparently doesn't consider the chassis drawer to be covered by our 4-hour onsite warranty, so it is having a replacement drawer shipped instead. Unfortunately, this means the earliest we project being back online following the RAID rebuild is Thu 25 Jan.

Apologies again for the disruption, and I'll update if there is any new information in the meantime.

~ The Chodera lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Sat Jan 20, 2018 6:21 pm
by JohnChodera
Update: Dell is dispatching a tech with the replacement part this morning! Hopefully we will be back online sooner than planned!

~ The Chodera Lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Sat Jan 20, 2018 8:40 pm
by JohnChodera
UPDATE: Our awesome Open Systems Group and datacenter team now have the hardware replaced and the RAID is rebuilding, with an ETA for completion of 60+ hours.

~ The Chodera lab

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Mon Jan 22, 2018 2:39 am
by JohnChodera
UPDATE: Estimates suggest approximately 40 hours remain for the RAID rebuild.

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Mon Jan 22, 2018 4:50 pm
by JimboPalmer
Dr Chodera,

The donors get error messages with IP addresses, while you have reported the downtime of a server by DNS name. Would it be possible to give us the IP address so we could use your estimated time to rebuild to address donor issues?

Yours

Jimbo

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Mon Jan 22, 2018 5:06 pm
by Joe_H
These WSs are not currently taking any connections, so I doubt there will be any reports for a bit. But if you look at the Server Status page, the ones down are IP addresses 140.163.4.231-235 and 140.163.4.241-245.
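
For anyone matching an IP address from a client error message against the two ranges in this post, here is a minimal sketch in Python. The helper name `is_affected` is hypothetical, and the ranges are simply copied from this post rather than taken from any authoritative server list.

```python
import ipaddress

# Inclusive ranges quoted in this thread; an assumption, not an official list.
AFFECTED_RANGES = [
    ("140.163.4.231", "140.163.4.235"),
    ("140.163.4.241", "140.163.4.245"),
]

def is_affected(ip: str) -> bool:
    """Return True if ip falls inside any of the inclusive ranges above."""
    addr = ipaddress.IPv4Address(ip)
    return any(
        ipaddress.IPv4Address(low) <= addr <= ipaddress.IPv4Address(high)
        for low, high in AFFECTED_RANGES
    )

print(is_affected("140.163.4.233"))  # inside the first range
print(is_affected("140.163.4.236"))  # between the two ranges
```

The ranges are kept as explicit start/end pairs because neither one lines up with a CIDR block, so a plain inclusive comparison is simpler than subnet matching.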

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Mon Jan 22, 2018 9:07 pm
by JohnChodera
WSs 140.163.4.231-233 are back online! Thanks for your patience.

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Mon Jan 22, 2018 9:07 pm
by JohnChodera
I'll be sure to report IP addresses in the subject line next time. Sorry for the hassle!

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Posted: Tue Jan 23, 2018 12:05 pm
by ChristianVirtual
Thanks for the efforts, and greetings to the IT support team.