140.163.0.0/16: blacklisting MSKCC via FW

Moderators: Site Moderators, FAHC Science Team

rafwiewiora
Scientist
Posts: 167
Joined: Mon Aug 03, 2015 8:23 pm
Location: New York

Re: 140.163.0.0/16: blacklisting MSKCC via FW

Post by rafwiewiora »

Also it's Easter and I'm sure the people behind FAH would like to have a day of peace and quiet, to enjoy themselves away from here and spend time with their families. Given the current situation, I think they would like to enjoy time with their loved ones. Things will get back to normal after the Easter weekend. Just give it some time.
If you are not happy then don't run the software. You have absolutely no idea how hard people are working in the background to get additional servers online, upgrade the server software to handle an increased load, generate the work units and somehow actually do something with the data that we have generated over the last few weeks. They have precious little time to be answering all the "this server is down" posts. They know about it, they are working on it.
Thank you v00d00 and Nathan_P :)

It's so sad to me to see some others of you here with your selfishness and passive aggressiveness. Do you think I'm suddenly capable of doing 10 people's jobs because the number of volunteers has increased? I thank you all very much, but why you would think that the fact you provide a part of this massive network gives you the right to come here and berate our work is beyond me. How productive are you all over here in lockdown? Have some damn empathy, people. My job is to do science here and it will always come first. If you think my job should be to first keep you all happy then you're in the wrong place. I have never seen a community of people devoted to something as much as all of you are and I'll forever remember and be thankful for being part of this grand idea. But please, think about what your words can do to someone who's been fighting new problems over and over again for days, instead of focusing on doing actual science. Science is doing things that have never been done before. If you expect a completely stable product that comes with Amazon's comment section where you can angrily make yourself feel better -- you're in the wrong place.
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 140.163.0.0/16: blacklisting MSKCC via FW

Post by JohnChodera »

Hi folks! I just wanted to step in to apologize for the persistent issues with plfah1-*.mskcc.org and plfah2-*.mskcc.org over the past couple of weeks.

Our poor little servers at MSKCC were purchased in 2007 and have been chugging away serving the majority of GPU work units probably up to where FAH hit nearly 1 exaflop before they started to fall over.

Also, I apologize for the perceived lack of transparency here---it's really been just a matter of putting out tons of other fires and focusing our energies on providing critical scientific support for active COVID-19 drug discovery efforts. Huge thanks to @bruce for pointing me here to at least dash off a quick update.

We've had a few issues that have appeared:
* Throughput limits with the server code: This is now addressed, or at least much improved!
* Drive failures: Now fixed, with more shelf spares that have just shown up!
* Disk space that was burned through something like 5-20x faster due to all of you wonderful people completing WUs so quickly: We're rapidly evacuating completed projects to a new external storage unit that just came online days ago, and have 14T free on plfah1 now. We're also going to release a new core22 version this week that will allow us to send back only the solute structure data and thus solve these storage issues for good.
* Server software stability: Occasionally, it looks like the server can't operate on specific WUs because the files are reported by the OS as "busy". We don't have a solution for this yet, but so far have been restarting the server when we encounter this. We could use a rapid warning system, maybe, instead of having to have @bruce ping us when this happens for a couple of hours.
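
For anyone wondering what a rapid warning system might look like, here is a minimal sketch and not anything the FAH servers actually run. It assumes a plain TCP connection check is a useful first signal; the hostnames and port 80 are placeholders, and `tcp_probe` and `unreachable_servers` are hypothetical names made up for this example. Note that it only notices a server that is not accepting connections at all, so it would not catch the "busy" file case described above.

```python
import socket
from typing import Callable, Iterable, List


def tcp_probe(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    The port is an assumption for illustration; use whatever port the
    work server really listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def unreachable_servers(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> List[str]:
    """Return the hosts for which the probe reports failure, in input order."""
    return [host for host in hosts if not probe(host)]


if __name__ == "__main__":
    # Placeholder hostnames, not real FAH work servers.
    servers = ["work-server-1.example.org", "work-server-2.example.org"]
    down = unreachable_servers(servers)
    if down:
        print("WARNING: no response from " + ", ".join(down))
    else:
        print("All servers accepted a connection.")
```

Run from cron every few minutes, with the print swapped for an email or chat notification, something like this could raise the alarm before a person has to notice. The probe is passed in as a function so a deeper check (for example, one that actually requests a work unit) could replace the TCP test later.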

TL;DR: I think it's OK to go ahead and keep the blacklist for a few more days. We've spun up a bunch of new external servers which are going to be much more performant.
We're bringing new (actual, brand new!) physical servers online at MSK in the next few days. We got them just before the COVID-19 emergency, and it's been slower to roll them out as a result, but we hope we can get them into service to replace plfah1/2 in the next week!

Huge thanks to all of you folks, and stay safe and healthy. We love you all.

~ John Chodera @ MSKCC ~
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 140.163.0.0/16: blacklisting MSKCC via FW

Post by JohnChodera »

Also, I need to give a huge shout out to @rafwiewiora and our awesome overworked Linux sysadmin from the MSK Open Systems Group who managed to get us to 1 exaflop with two limping six-year-old servers and a 60-drive RAID that was eating 4TB disks for breakfast.

They both deserve to sleep for a good long time now that we have a bunch of external work servers spun up...

~ John Chodera @ MSKCC ~
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 140.163.0.0/16: blacklisting MSKCC via FW

Post by PantherX »

ToeBlister wrote:...it gets frustrating when WUs are assigned and crunched but servers cannot accept the results due to bandwidth or disk space limitations as donors' money (electrical cost, hardware depreciation cost, etc) has been expended and will be in vain as WUs will timeout and be re-assigned.
May I suggest a different perspective: imagine you spent a lot of time doing research and calculations and wrote a really good report. However, before you saved it, your system crashed and now you're back to square one. That's the same for these research projects. A "lost WU" is "lost work" for them. It can lead to delays in submitting their paper, issues with missing data during analysis, etc. It is not that they don't care, it is that they care very much (as mentioned above by JohnChodera) but being human, they have human limitations.
ToeBlister
Posts: 36
Joined: Thu Mar 26, 2020 3:23 pm

Re: 140.163.0.0/16: blacklisting MSKCC via FW

Post by ToeBlister »

PantherX wrote:
ToeBlister wrote:...it gets frustrating when WUs are assigned and crunched but servers cannot accept the results due to bandwidth or disk space limitations as donors' money (electrical cost, hardware depreciation cost, etc) has been expended and will be in vain as WUs will timeout and be re-assigned.
May I suggest a different perspective: imagine you spent a lot of time doing research and calculations and wrote a really good report. However, before you saved it, your system crashed and now you're back to square one. That's the same for these research projects. A "lost WU" is "lost work" for them. It can lead to delays in submitting their paper, issues with missing data during analysis, etc. It is not that they don't care, it is that they care very much (as mentioned above by JohnChodera) but being human, they have human limitations.
I don't disagree with that perspective of a scientist, but that does not discount the perspective of a donor.
And it is great that Rafal and John have addressed these issues from their POV, and especially John's reply is PR done right.

edit: sorry. typo on the "disagree". totally changed the intent of my reply. didn't realise until now.