
WS 140.163.4.200 Upload blocked by untangle

Posted: Thu Feb 04, 2021 12:22 pm
by DOMiNiON79
Hello,

the upload to WS 140.163.4.200 is not possible at the moment:

12:13:50:WU00:FS01:Trying to send results to collection server
12:13:50:WU00:FS01:Uploading 8.45MiB to 140.163.4.200
12:13:50:WU00:FS01:Connecting to 140.163.4.200:8080
12:14:24:WU00:FS01:Upload 2.22%
12:14:24:ERROR:WU00:FS01:Exception: Transfer failed

That has happened several times now.

Please check this, thx!

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 12:43 pm
by Gnomuz
Same here, so it's probably a general issue.
In fact, this server (140.163.4.200) is the collection server linked with the work server 18.188.125.154, which is also inaccessible. I currently have 3 WUs stuck on upload, all on project 13444.

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 12:50 pm
by Neil-B
It may well be overloaded comms trying to cope with the offload from aws3. From a post on Discord, I believe that server may be up and running again, or at least should be soon, so things may start to sort themselves out.

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 4:07 pm
by Joe_H
WS aws3 is currently down; it ran out of space. They started work to free up space yesterday evening, had freed some by around midnight, and it all filled up again.

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 4:58 pm
by ajm
I don't know the constraints and I don't want to annoy people, but I see that there is well over 1,500 TiB free on available servers (pllwskifah1.mskcc.org: 233, pllwskifah2.mskcc.org: 233, highland1.engr.wustl.edu: 545, vav17: 100, vav21: 100, vav19: 100, vav22: 100, vav23: 100, vav24: 100).

It might be worth programming something to better balance the load, no?

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 5:07 pm
by Neil-B
Work is, IIRC, placed on kit normally located and managed by the labs running the projects. I don't think it is as simple as just load balancing across what may look like a single extended capability but is actually made up of locally managed and controlled kit.

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 5:56 pm
by ajm
I understand that it may be difficult to share server space among different labs and projects.
But here, with aws3 (no space available), pllwskifah1 (233 TiB free), and pllwskifah2 (233 TiB free), it would be the same lab, the same person, and the same project type.

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 8:11 pm
by Neil-B
If my geolocation work is correct then aws3 is hosted out of Columbus Ohio on Amazon infrastructure ... the two pllws are hosted out of New York City through Memorial Sloan Kettering ISP ... so very different locations to start with ... and looking at the projects currently associated with each of those WSs it looks like there might be different groups/system maintainers for each.

There is also the situation that, with the way the code works, I believe a project has to be hosted from a single WS. Yes, one or more CSs can be associated with it, but the results still need to get back to the original WS for the science to continue; CSs are simply buffers for certain conditions. So wherever projects are hosted, there can be issues. Yes, projects could be moved today from one infrastructure to another to use spare space there, but then that other infrastructure (at least one of the servers in question already has a whole series of projects associated with it) may fill up. Even so, relocation of projects has, IIRC, been done; it just takes a large amount of effort and can break things such as the scripts that generate the new WUs.

And it isn't all about the disk space: throughput and bandwidth may well play into it. Which server hosts which projects, and the decisions around that, may well take into account the size and number of the project's WUs and the amount of computational effort needed to generate new WUs from the returned ones.

I think what I am trying to say is that it may not be as simple as just load balancing or relocating data (which in itself is non-trivial given the volumes). I occasionally deal with relocating similar amounts of data, and to be honest, pulling the disks from one location and physically shipping them to the new one, even with the pain and cost in time and security, is still in many cases the easiest, quickest and least painful approach. Just adding more disk capacity can be a short-term solution, but in most cases the data needs to be triaged down to much smaller quantities by onward processing and then moved off the generation/collection systems into longer-term analytic/storage/retrieval capabilities. The challenge is that triaging the data itself requires significant compute and disk access, so in many cases it is simpler to shut down the front end while dealing with the data manipulation and offload (which I guess is what is happening).

My gut also tells me (though it may well be wrong) that the two pllws may be using some form of shared network-attached storage, hence reporting the same free-space figure (which might possibly be a "duplicate"). One may be a new build, since it is running a newer server build than the other; aws3 matches the server build of the older of the two pllws. I am also not sure how vanilla the server builds are: they may start from a base build and then be tailored to local kit and conditions, which may mean that relocation and/or load sharing could get tripped up by slight differences in configuration.

The researchers/admins will be as disappointed/stressed/overloaded by the current issues as the contributing folders are, since this will be impeding their research. I am sure that the FAH Consortium regularly reviews such issues and is trying to move the infrastructure and software into a better place/configuration, and that the current efforts and responses are the best they can achieve at the moment. I for one would not want to have to coordinate coherent development across multiple academic institutions in different countries and funding/authority regimes :(

And to go back to the topic title: .200 is the older build of the two pllws machines. It has disk space but is possibly overloaded with traffic, both because the AS is farming more requests at it while aws3 is down and because it is acting as the CS for aws3. Ouch.

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 8:33 pm
by ajm
Thank you Neil! Great post! I understand the situation much better now.
Alright, best wishes and Godspeed to the team that has to sort this out!

Re: WS 140.163.4.200 No Upload possible

Posted: Thu Feb 04, 2021 9:06 pm
by bruce
The upload issue has been carefully explained (above) and ideally, the WUs destined to be returned to Work Server A might need to be temporarily uploaded to server B which is acting as a Collection Server for WS A. The projects based at A will be interrupted until service at A is restored but you don't see that issue as long as service at B can buffer the uploads. The projects based at A can also be buffered on your HD but you'll start getting warning messages.

A secondary issue is whether your kit can be assigned some OTHER project, i.e. whether any OTHER server (C1, C2, C3...) can distribute work for you to do.
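The retry order described above (work server first, then its collection server as a buffer, then the client's own disk) can be sketched as follows. This is a hypothetical illustration, not the actual FAH client code; the function and server names (send_result, ws_a, cs_b) are invented for the example.

```python
# Hypothetical sketch of the upload fallback order; not real FAH code.
def send_result(wu, work_server, collection_servers, local_buffer):
    """Try the work server first, then each collection server;
    if every upload fails, keep the WU buffered locally for retry."""
    for server in (work_server, *collection_servers):
        try:
            server(wu)
            return f"uploaded {wu} via {server.__name__}"
        except ConnectionError:
            continue  # server down or overloaded; try the next one
    local_buffer.append(wu)  # nothing reachable: buffer on the client's disk
    return f"buffered {wu} locally ({len(local_buffer)} pending)"

# Stand-ins for servers: ws_a is down (like aws3), cs_b accepts uploads.
def ws_a(wu):
    raise ConnectionError("work server out of space")

def cs_b(wu):
    pass  # upload accepted

pending = []
print(send_result("WU00", ws_a, [cs_b], pending))  # uploaded WU00 via cs_b
print(send_result("WU01", ws_a, [], pending))      # buffered WU01 locally (1 pending)
```

This also shows why the CS only postpones the problem: the buffered WUs (whether on the CS or the client's disk) still have to reach the original WS eventually.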

Re: WS 140.163.4.200 No Upload possible

Posted: Sat Feb 06, 2021 7:51 pm
by Badsinger
It's still having problems, I'm afraid. I'm sure that for most folders, uploading multiple times just means things take a bit longer. For me, and for others like me with a data limit, uploading 5 times instead of once means 2 WUs I'll never be able to start. The bandwidth for them is gone. I hope it gets fixed eventually.

Re: WS 140.163.4.200 No Upload possible

Posted: Sun Feb 07, 2021 4:32 pm
by ajm
I can upload to 140.163.4.200 but I think that there is something wrong or delayed with the stats again.
I haven't checked all WUs to isolate the faulty server, but for at least 36 hours my EOC account has been getting only some 60-70% of the Total Estimated Points Per Day announced by FAHControl.
And I observe the same gap lately in the EOC aggregate summary https://folding.extremeoverclocking.com ... ary.php?s= as well as in several team summaries I checked.

Re: WS 140.163.4.200 No Upload possible

Posted: Mon Feb 08, 2021 5:56 pm
by Neil-B
For anyone with this "No Upload" issue: you may want to double-check your firewall/AV settings (Untangle was the firewall in question in the thread title), especially if the message is along the lines of "Received short response, expected 8 bytes, got 0". I'll post below a Discord post from someone with similar issues. Even though the firewall was allowing all traffic, it was inspecting the flow, and that caused this issue. Once he turned that off, all was fine.

Jason Ellingson, Today at 11:24
Well... gee... figured out the problem. I set my firewall to pass all traffic... but apparently it still "inspects" port 80 traffic for compliance. Disabled that and voila! We're in business.
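The failure mode quoted above ("Received short response, expected 8 bytes, got 0") can be illustrated with a small, self-contained Python sketch. This is not the FAH client's actual code; it just shows how a middlebox that accepts a connection but swallows the reply looks to the client. The local server thread below plays the role of the inspecting firewall, and all names are invented for the example.

```python
import socket
import threading

def read_exact(sock, n, timeout=5.0):
    """Read exactly n bytes; return whatever arrived if the peer closes early."""
    sock.settimeout(timeout)
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:  # peer closed the connection without sending more
            break
        buf += chunk
    return buf

def check_server(host, port, expected=8):
    """Connect, send a minimal request, and report a short response the
    way the client log does ('expected 8 bytes, got N')."""
    with socket.create_connection((host, port), timeout=5.0) as s:
        s.sendall(b"GET / HTTP/1.0\r\n\r\n")
        data = read_exact(s, expected)
    if len(data) < expected:
        return f"Received short response, expected {expected} bytes, got {len(data)}"
    return "OK"

# Simulate the middlebox: accept the connection, read the request,
# then close without sending a single byte back.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def swallow_reply():
    conn, _ = srv.accept()
    conn.recv(1024)  # consume the request
    conn.close()     # ...and reply with nothing

t = threading.Thread(target=swallow_reply)
t.start()
print(check_server("127.0.0.1", port))  # Received short response, expected 8 bytes, got 0
t.join()
srv.close()
```

From the client's side this is indistinguishable from a dead or overloaded server, which is why "pass all traffic" firewall rules that still inspect port 80/8080 flows are so hard to diagnose.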