WS 140.163.4.200 Upload blocked by Untangle

Moderators: Site Moderators, FAHC Science Team

DOMiNiON79
Posts: 5
Joined: Sat Apr 04, 2020 10:27 am

WS 140.163.4.200 Upload blocked by Untangle

Post by DOMiNiON79 »

Hello,

The upload to WS 140.163.4.200 is not possible at the moment:

12:13:50:WU00:FS01:Trying to send results to collection server
12:13:50:WU00:FS01:Uploading 8.45MiB to 140.163.4.200
12:13:50:WU00:FS01:Connecting to 140.163.4.200:8080
12:14:24:WU00:FS01:Upload 2.22%
12:14:24:ERROR:WU00:FS01:Exception: Transfer failed

That has happened several times now.

Please check this, thx!
Gnomuz
Posts: 31
Joined: Sat Nov 21, 2020 5:07 pm

Re: WS 140.163.4.200 No Upload possible

Post by Gnomuz »

Same here, so it's probably a general issue.
In fact, this server (140.163.4.200) is the collection server linked with the work server 18.188.125.154, which is also inaccessible. I currently have 3 WUs stuck on upload, all on project 13444.

Nvidia RTX 3060 Ti & GTX 1660 Super - AMD Ryzen 7 5800X - MSI MEG X570 Unify - 16 GB RAM - Ubuntu 20.04.2 LTS - Nvidia drivers 460.56
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: WS 140.163.4.200 No Upload possible

Post by Neil-B »

May well be overloaded comms trying to cope with the offload from aws3 .. I believe from a post on Discord that it may be up and running again, or at least should be soon, so things may start to sort themselves out
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
Joe_H
Site Admin
Posts: 7867
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: WS 140.163.4.200 No Upload possible

Post by Joe_H »

WS aws3 is currently down; it ran out of space. They started work to free up space yesterday evening, had freed up some by around midnight, and that all filled up again.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
ajm
Posts: 754
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: WS 140.163.4.200 No Upload possible

Post by ajm »

I don't know the constraints and I don't want to annoy people, but I see that there is well over 1,500 TiB free on the available servers (pllwskifah1.mskcc.org: 233 TiB, pllwskifah2.mskcc.org: 233 TiB, highland1.engr.wustl.edu: 545 TiB, vav17: 100 TiB, vav21: 100 TiB, vav19: 100 TiB, vav22: 100 TiB, vav23: 100 TiB, vav24: 100 TiB).

It might be worth programming something to better balance the load, no?
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: WS 140.163.4.200 No Upload possible

Post by Neil-B »

Work is, iirc, placed on kit normally located/managed by the labs running the projects .. I don't think it is as simple as just load balancing across what may look like a single extended capability but is actually made up of locally managed/controlled kit
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
ajm
Posts: 754
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: WS 140.163.4.200 No Upload possible

Post by ajm »

I understand that it may be difficult to share server space among different labs and projects.
But here, with aws3 (no space available), pllwskifah1 (233 TiB free), and pllwskifah2 (233 TiB free), it would be the same lab, the same person, and the same project type.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: WS 140.163.4.200 No Upload possible

Post by Neil-B »

If my geolocation work is correct then aws3 is hosted out of Columbus, Ohio on Amazon infrastructure ... the two pllws are hosted out of New York City through Memorial Sloan Kettering's ISP ... so very different locations to start with ... and looking at the projects currently associated with each of those WSs, it looks like there might be different groups/system maintainers for each.

There is also the situation that, with the way the code works, I believe a project has to be hosted from a single WS - yes, one or more CSs can be associated with it, but the results still need to get back to the original WS for the science to continue - CSs are simply buffers for certain conditions - therefore wherever projects are hosted there can be issues ... Yes, today you could move projects from one infrastructure to another and use up space there, but then that other infrastructure (and at least one of the servers in question already has a whole series of projects associated with it) may fill up ... but even so, relocation of projects has iirc been done - it just takes a large amount of effort and can break things such as the scripts that generate the new WUs.

... and it isn't all about the disk space - the throughput and bandwidth may well play into it - what server hosts which projects, and the decisions around this, may well take into account the size and number of the project WUs and the amount of computational effort needed to service the generation of new WUs from the returned WUs.

I think what I am trying to say is that it may not be as simple as just load balancing or relocating data (which in itself is non-trivial given the volumes) ... I deal occasionally with relocating similar amounts of data and tbh pulling the disks from one location and physically sending them to the new location - even with the pain and cost in time and security - is still in many cases the easiest, quickest and least painful approach ... just adding more disk capacity can be a short-term solution, but tbh in most cases the data needs to be triaged down to much smaller quantities by onward processing and then relocated off the generation/collection systems into longer-term analytic/storage/retrieval capabilities ... The challenge can be that triaging the data in itself requires significant compute/disk access, so in many cases it is simpler to shut down the front end whilst dealing with the data manipulation and offload (which I guess is what is happening).

My gut also tells me (but I may well be wrong) that the two pllws may be utilising some form of shared network-attached storage (hence reporting the same figure, which possibly might be a "duplicate"), and one may be a new build as it is running a newer server build than the other - aws3 matches the server build of the older of the two pllws builds ... I am also not sure how vanilla the server builds are - they may be based on a common build then bespoke-tailored to local kit/conditions - which may mean that relocating and/or load sharing might get tripped up by slight differences in configuration.

The researchers/admins will be as disappointed/stressed/overloaded by the current issues as the contributing folders are, as this will be impeding their research ... I am sure that the FAH Consortium will regularly review such issues and will be trying to move the infrastructure and software into a better place/configuration, and the current efforts/responses will be the best they can achieve at the moment ... I for one would not want to have to coordinate coherent development across multiple academic institutions in different countries and funding/authority regimes :(

... and to go back to the topic title ... .200 is the older build of the two pllws machines - it has disk space but is possibly overloaded with traffic due to the AS farming more requests at it because aws3 is down, and because it is acting as the CS for aws3? ... ouch
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
ajm
Posts: 754
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: WS 140.163.4.200 No Upload possible

Post by ajm »

Thank you Neil! Great post! I understand the situation much better now.
Alright, best wishes and Godspeed to the team that has to sort this out!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: WS 140.163.4.200 No Upload possible

Post by bruce »

The upload issue has been carefully explained above: ideally, the WUs destined to be returned to Work Server A are temporarily uploaded to server B, which is acting as a Collection Server for WS A. The projects based at A will be interrupted until service at A is restored, but you don't see that issue as long as server B can buffer the uploads. The WUs destined for A can also be buffered on your HD, but you'll start getting warning messages.

A secondary issue is whether your kit can be assigned some OTHER project, based at any OTHER server (C1, C2, C3...), that can distribute work for you to do.
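
To make that fallback order concrete, here is a minimal Python sketch of the behaviour described above. This is not the real FAHClient code; the "/results" path, the helper names, and the retry interval are made-up placeholders, used only to illustrate the ordering: try the Work Server, then the Collection Server, otherwise keep the WU buffered locally and retry later.

# Illustrative sketch only - not the actual FAHClient logic.
import time
import urllib.request
import urllib.error

WORK_SERVER = "http://18.188.125.154:8080/results"       # "server A" (aws3); path is a placeholder
COLLECTION_SERVER = "http://140.163.4.200:8080/results"  # "server B" acting as CS; path is a placeholder

def try_upload(url, payload, timeout=30):
    """Attempt one upload; True means that server accepted the result."""
    try:
        req = urllib.request.Request(url, data=payload, method="POST")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def return_wu(payload, retry_delay=3600):
    """Retry until the WS or its CS accepts the result."""
    while True:
        if try_upload(WORK_SERVER, payload) or try_upload(COLLECTION_SERVER, payload):
            return  # the CS only buffers; the science resumes once WS A gets the data back
        time.sleep(retry_delay)  # meanwhile the WU stays buffered on the client's disk
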
Badsinger
Posts: 15
Joined: Tue May 19, 2020 9:01 am

Re: WS 140.163.4.200 No Upload possible

Post by Badsinger »

It's still having problems, I'm afraid. I'm sure for most folders uploading multiple times just means things take a bit longer. For me, and others like me with a data limit, uploading 5 times instead of once means 2 WUs I'll never be able to start. The bandwidth for them is gone. I hope it gets fixed eventually.
ajm
Posts: 754
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: WS 140.163.4.200 No Upload possible

Post by ajm »

I can upload to 140.163.4.200 but I think that there is something wrong or delayed with the stats again.
I haven't checked all WUs to isolate the faulty server, but for at least 36 hours my EOC account has been getting only some 60-70% of the Total Estimated Points Per Day announced by FAHControl.
And I observe the same gap lately in the EOC aggregate summary https://folding.extremeoverclocking.com ... ary.php?s= as well as in several team summaries I checked.
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: WS 140.163.4.200 No Upload possible

Post by Neil-B »

For anyone with this "No Upload" issue, you may want to double-check your firewall/AV settings (Untangle was the firewall in question here), especially if the message is along the lines of "Received short response, expected 8 bytes, got 0" ... I'll post below a Discord post from someone with similar issues ... even though the firewall was allowing all traffic, it was inspecting the flow and that caused this issue ... Once he turned that off, all was fine.

Jason Ellingson, Today at 11:24
Well... gee... figured out the problem. I set my firewall to pass all traffic... but apparently it still "inspects" port 80 traffic for compliance. Disabled that and voila! We're in business.
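
If you suspect this kind of inspection, a rough way to probe the path yourself (assuming Python 3 is available; this sends a plain HTTP request, not the actual FAH upload protocol) is to open a connection to the WS port and see whether any bytes come back. A connection that opens but returns nothing mirrors the "expected 8 bytes, got 0" symptom and points at something in between, such as an inspecting firewall, rather than at the server itself.

# Rough connectivity probe only - plain HTTP, not the real FAH upload protocol.
import socket

HOST, PORT = "140.163.4.200", 8080

with socket.create_connection((HOST, PORT), timeout=15) as sock:
    sock.sendall(b"GET / HTTP/1.0\r\nHost: " + HOST.encode() + b"\r\n\r\n")
    data = sock.recv(4096)
    print("received %d bytes" % len(data))  # 0 bytes on an open connection suggests interference
    print(data[:200])
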
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)