Response to Folding@Home Dev Chat: distributed storage

Posted: Thu Apr 09, 2020 11:11 pm
by NoMoreQuarantine
I watched the dev chat and have a lot to say, but there is too much to fit here right now, and I need to think about it and probably make some drawings. For now I just want to make a post about what seems to be the largest bottleneck in the system right now, server space, and what I currently believe is the only sustainable solution: distributed storage. These are my own naive opinions, but I hope they start some brainstorming and discussion.

The reasons Joseph gave for why a "torrent-like" system wouldn't work do not make sense to me. First, he said that torrenting works because people all want the same file. That part makes sense: P2P sharing needs incentives for people to seed after they have leeched. The incentives for seeding include reciprocity (people feel good when they give back and bad when they don't) and future benefit (if they don't seed, they may get blacklisted and be unable to download more in the future). What I think Joseph is missing is that there already are incentives for us folders. The only proof you need is that we are here running your client in the first place. The other concern was sending redundant data to people. I understand the desire to avoid repeating work, but in engineering, redundancy can often be a good thing. If people are willing to offer extra storage space on their machines, don't hesitate to use it if it solves the problem, even if it requires some duplication of data (which it would, to avoid any one machine going offline and taking the data with it).

It is likely that a distributed exponentially growing storage system will be the only way to keep up with a distributed exponentially growing processing system. No amount of data optimization or compression will fix it, and I strongly believe that anything that reduces the quality of information we send to the scientists should be an option of last resort.

I want to close by saying thank you for hosting this dev chat and for your tireless support of this daunting effort. I can't be the only one who has been humbled by your commitment to the betterment of our world. You guys are making history in the best possible way.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 1:04 am
by AmDD
In the cryptocurrency world, there are distributed storage projects that could possibly be used here. Storj and Sia are two that come to mind. The big advantage there is that the infrastructure is already built and running.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 1:10 pm
by NuovaApe
I will commit to leaving my PC running until the WUs are complete - most evenings and weekends. But not running 24/7 as I pay the electricity bills here.

So my free storage would be transient and my MTBF low. Which is why duplication is needed when using crowd instead of cloud.
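To put a rough number on the duplication argument, here is a minimal sketch in Python. It assumes every volunteer machine is online independently with the same probability, which is a simplification and not something stated in the dev chat; the function names and the 30% figure are purely illustrative.

```python
def availability(p_online: float, replicas: int) -> float:
    """Chance that at least one of `replicas` copies sits on a machine that is online.

    Assumes every host is online independently with probability p_online.
    """
    if not 0.0 <= p_online <= 1.0:
        raise ValueError("p_online must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - p_online) ** replicas


def replicas_needed(p_online: float, target: float, limit: int = 1000) -> int:
    """Smallest number of copies whose combined availability reaches `target`."""
    for k in range(1, limit + 1):
        if availability(p_online, k) >= target:
            return k
    raise ValueError("target not reachable within limit")


# Hypothetical example: a PC that is on evenings and weekends, say 30% of the time.
for copies in (1, 3, 5, 10):
    print(copies, round(availability(0.30, copies), 3))
```

The point of the sketch is only that availability climbs quickly with each extra copy, so a crowd of unreliable machines can in principle stand in for one reliable one, at the cost of storing the data several times.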

Amazon has AWS S3, Google has Cloud Storage, Microsoft has Data Lake - all off-the-shelf (but not free) enterprise level distributed storage platforms. These scale massively and have very high availability/durability.

Unlike my lame sum of 1.5TB local finite storage. My enthusiasm to calculate though is boundless :D

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 5:59 pm
by ChristianVirtual
What is the storage demand? In terabytes or petabytes?

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 6:15 pm
by Joe_H
At this point, probably petabytes to store the results. As I understand it, after getting the raw data off a WS it can be reduced a lot, but it still takes a lot of space. There are hundreds of projects running, and thousands that have been done over the years.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 6:42 pm
by NoMoreQuarantine
ChristianVirtual wrote:What is the storage demand? In terabytes or petabytes?
If I remember correctly, it is not the raw storage capacity that is currently lacking (for now), but the RAM capacity and network bandwidth. Currently, they buffer the WUs into RAM as they come in from the clients before dumping them into NVM. They are planning on fixing that with an update that streams the data into NVM as it gets uploaded. Then there's the issue of network bandwidth, as the servers need to keep their ports open as the data slowly streams in from each client with their standard (slow) home upload speeds. Joseph said they were considering removing the water atom physics to reduce the data they need to transfer, which I am adamantly opposed to if it affects the overall simulation quality for the scientists. My proposal is to use volunteer clients as a sort of cache to buffer the data until the servers can get to it.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 6:52 pm
by foldinghomealone2
@NoMoreQuarantine:
When they were talking about the torrent system they talked about distributing WUs via torrents.
The problem is that every folder gets a unique WU. However with torrent every user has to get the same WU.
That's the issue.
Why would you want folders to calculate the same WU over and over? The results won't get better, and you should use the compute power to compute new WUs, not WUs that have already been solved.

FAH is not an 'exponentially growing processing system'. It was quite stable for many years and is just seeing a Covid-19 hype.
In 2 months the hype will be over and it will be stable again.
Storage is not the problem per se. It just needs the researchers to intervene and copy data to their system and 'clean' it to reduce storage.
As Joseph mentioned, they are thinking about an automatic 'cleaning' system. That would save time for the researchers.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 7:05 pm
by NoMoreQuarantine
foldinghomealone2 wrote:Why would you want folders to calculate the same WU over and over? The results won't get better, and you should use the compute power to compute new WUs, not WUs that have already been solved.
That is not what I was proposing. Please see my previous comment for clarification on my proposal.
foldinghomealone2 wrote:FAH is not an 'exponentially growing processing system'. It was quite stable for many years and is just seeing a Covid-19 hype.
In 2 months the hype will be over and it will be stable again.
Storage is not the problem per se. It just needs the researchers to intervene and copy data to their system and 'clean' it to reduce storage.
As Joseph mentioned, they are thinking about an automatic 'cleaning' system. That would save time for the researchers.
Let me put it another way: a processing system that randomly grows and shrinks with time, sometimes at massive and unprecedented scale, would benefit from a storage system that can grow and shrink along with it.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 8:19 pm
by foldinghomealone2
NoMoreQuarantine wrote:If I remember correctly, it is not the raw storage capacity that is currently lacking (for now), but the RAM capacity and network bandwidth. Currently, they buffer the WUs into RAM as they come in from the clients before dumping them into NVM. They are planning on fixing that with an update that streams the data into NVM as it gets uploaded. Then there's the issue of network bandwidth, as the servers need to keep their ports open as the data slowly streams in from each client with their standard (slow) home upload speeds. Joseph said they were considering removing the water atom physics to reduce the data they need to transfer, which I am adamantly opposed to if it affects the overall simulation quality for the scientists. My proposal is to use volunteer clients as a sort of cache to buffer the data until the servers can get to it.
Currently they need to buffer the complete WU in RAM before it is stored on the server. (Their assumption is 50MB per WU, and 1000-1500 connections at the same time makes 50 to 75GB of RAM.)
They are thinking about an update in this regard in the near future, to start storing the WU on disk even before it is fully uploaded. This would reduce the RAM needed.
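As a quick check of that arithmetic, here is a small Python sketch. The 50MB per WU and the 1000-1500 connections are the figures quoted above; the 1MB per-connection buffer used for the "write to disk while uploading" case is purely an assumed number for comparison, and the function name is made up for this post.

```python
MB = 1_000_000        # decimal megabyte, in bytes
GB = 1_000_000_000    # decimal gigabyte, in bytes


def ram_gb(per_connection_mb: float, connections: int) -> float:
    """RAM needed if every open connection holds per_connection_mb in memory."""
    return per_connection_mb * MB * connections / GB


for connections in (1000, 1500):
    whole_wu = ram_gb(50, connections)  # buffer the complete 50MB WU in RAM
    chunked = ram_gb(1, connections)    # assumed 1MB buffer, flushed to disk as data arrives
    print(connections, "connections:", whole_wu, "GB buffered vs", chunked, "GB chunked")
```

With the quoted figures this gives 50 and 75 GB for full buffering. The chunked number depends entirely on the buffer size chosen, so treat it as an illustration of the scale of the saving rather than a prediction.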

Streaming is what they plan for the far future. One benefit would be that the client calculates the 'next WU' by itself and therefore there is no download (except for the first one). This would reduce data transfer, but it would mean many more clients connected to the work servers.

How do you want to ensure data integrity when it's stored on a home computer? Servers are much more suitable for storing data.
Another issue would be upload speed. Commercial-grade servers can distribute the needed data much faster than home computers.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 8:47 pm
by NoMoreQuarantine
foldinghomealone2 wrote:Streaming is what they plan for the far future. One benefit would be that the client calculates the 'next WU' by itself and therefore there is no download (except for the first one). This would reduce data transfer, but it would mean many more clients connected to the work servers.
I was actually just referring to a data stream from RAM to NVM in regard to this issue:
foldinghomealone2 wrote:Currently they need to buffer the complete WU in RAM before it is stored on the server. (Their assumption is 50MB per WU, and 1000-1500 connections at the same time makes 50 to 75GB of RAM.)
They are thinking about an update in this regard in the near future, to start storing the WU on disk even before it is fully uploaded. This would reduce the RAM needed.
foldinghomealone2 wrote:How do you want to ensure data integrity when it's stored on a home computer? Servers are much more suitable for storing data.
Another issue would be upload speed. Commercial-grade servers can distribute the needed data much faster than home computers.
The way I imagine it would be implemented: a client finishes its WU and checks whether the servers are ready for the data. If not, it stores the result in NVM (if the user opted into storing for FAH) and sends a duplicate to another volunteer client (or two) to prevent data loss in case of disconnection. Periodically the clients would check whether the servers are ready to start receiving the data; when they are, the data can be sent and the duplicates flushed (or possibly retained if backups are wanted and the volunteers are happy with it). Meanwhile, the original client can be working on the next WU.
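The flow above can be sketched as a toy, in-memory Python model. Everything here (the `Server` and `Client` classes, `finish_wu`, `retry_upload`) is hypothetical and has nothing to do with the real FAH client or server code; letting any holder of a copy perform the upload is also my own assumption about how the retry could work.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    ready: bool = False
    received: dict = field(default_factory=dict)

    def try_upload(self, wu_id: str, payload: bytes) -> bool:
        """Accept the result only when the server has capacity."""
        if not self.ready:
            return False
        self.received[wu_id] = payload
        return True


@dataclass
class Client:
    name: str
    opted_in: bool = True
    store: dict = field(default_factory=dict)  # stands in for local NVM

    def hold(self, wu_id: str, payload: bytes) -> bool:
        """Keep a copy locally, but only if the volunteer opted in."""
        if self.opted_in:
            self.store[wu_id] = payload
        return self.opted_in


def finish_wu(owner, peers, server, wu_id, payload, copies=2):
    """Upload directly if possible, else cache locally and on up to `copies` peers."""
    if server.try_upload(wu_id, payload):
        return []
    holders = [owner] if owner.hold(wu_id, payload) else []
    sent = 0
    for peer in peers:
        if sent == copies:
            break
        if peer.hold(wu_id, payload):
            holders.append(peer)
            sent += 1
    return holders


def retry_upload(holders, server, wu_id) -> bool:
    """Periodic check: the first holder that gets through uploads, then all copies are flushed."""
    for holder in holders:
        payload = holder.store.get(wu_id)
        if payload is not None and server.try_upload(wu_id, payload):
            for h in holders:
                h.store.pop(wu_id, None)
            return True
    return False
```

A real version would also need integrity checks (for example a hash the server can verify), encryption, and some policy for peers that disappear, none of which is modeled here.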

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 10:14 pm
by PantherX
NuovaApe wrote:...Amazon has AWS S3, Google has Cloud Storage, Microsoft has Data Lake - all off-the-shelf (but not free) enterprise level distributed storage platforms. These scale massively and have very high availability/durability....
That's a good idea until you have to pay the bill. F@H needs a mixture of storage:
1) Slow -> For storing past projects that have been analyzed.
2) Fast -> For active projects that are being distributed to Donors and when researchers need to analyze it.

I know some companies that use cloud storage in TBs and their monthly bill is way more than the operating budget (based off guesswork) of F@H for an entire year.

However, I am aware that corporate sponsorship is involved in the current situation, so I hope we can learn about it in due time, since some corporate sponsorship isn't made public at all.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 10:19 pm
by PantherX
NoMoreQuarantine wrote:...Joseph said they were considering removing the water atom physics to reduce the data they need to transfer, which I am adamantly opposed to if it affects the overall simulation quality for the scientists...
AFAIK, there are two simulation methods that F@H uses:
1) Implicit: The water molecules act as a constant and thus are not included in the simulation
2) Explicit: The water molecules are part of the simulation and calculated

The great aspect of F@H is that it is run by researchers and not corporate/marketing people. Thus, I am sure that scientific integrity will not be sacrificed. I haven't seen that happen in the past, nor do I expect it to happen in the future. However, if there's a way to mix and match, i.e. start with implicit, then transform to explicit (to get better details), and then change to implicit automatically via ML without compromising the scientific integrity, that would be brilliant.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Apr 12, 2020 10:26 pm
by PantherX
foldinghomealone2 wrote:...Streaming is what they plan for the far future. One benefit would be that the client calculates the 'next WU' by itself and therefore there is no download (except for the first one). This would reduce data transfer, but it would mean many more clients connected to the work servers...
My take on streaming would be that all the streaming clients involved need to have a fast and reliable internet connection. The streaming server would have heaps of SSD storage, RAM and internet bandwidth. The streaming client connects to the streaming server to download a "stream" and starts processing. Once it reaches a checkpoint, it streams the data back to the streaming server, which will take note of it and can perform additional validation to ensure that the stream is on the right path while the streaming client simply carries on. Thus, while there will be a bit less download, the uploads would be a lot more and the number of connections that need to be open on the streaming server would be huge. If the streaming client stops, only the work done since the last checkpoint is lost, which means that the next streaming client can pick it up from that verified checkpoint.
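A toy Python sketch of that checkpoint hand-off is below. It is only my own illustration of the idea: `advance` is a stand-in for the actual simulation, the `StreamServer` class is hypothetical, and the "validation" is reduced to rejecting stale checkpoints, which is far less than a real server would have to do.

```python
def advance(state: int, steps: int) -> int:
    """Toy deterministic 'simulation': a stand-in for real MD steps."""
    for _ in range(steps):
        state = (state * 31 + 7) % 1_000_003
    return state


class StreamServer:
    def __init__(self, initial_state: int):
        # List of verified (step, state) checkpoints, oldest first.
        self.verified = [(0, initial_state)]

    def latest(self):
        """Where the next client should resume from."""
        return self.verified[-1]

    def submit(self, step: int, state: int) -> bool:
        """Record a checkpoint; reject anything that does not move the stream forward."""
        last_step, _ = self.verified[-1]
        if step <= last_step:
            return False
        self.verified.append((step, state))
        return True


def run_client(server: StreamServer, checkpoint_every: int, checkpoints: int) -> None:
    """Resume from the last verified checkpoint and stream back new ones, then stop."""
    step, state = server.latest()
    for _ in range(checkpoints):
        state = advance(state, checkpoint_every)
        step += checkpoint_every
        server.submit(step, state)
    # Any work done after the final submit is simply lost when the client stops.


server = StreamServer(initial_state=1)
run_client(server, checkpoint_every=100, checkpoints=3)  # first client stops at step 300
run_client(server, checkpoint_every=100, checkpoints=2)  # next client resumes from step 300
print(server.latest()[0])  # 500
```

Because the second client starts from the stored checkpoint rather than from scratch, the trajectory ends up the same as if a single client had run all 500 steps.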

A wonderful concept, and hopefully it comes sooner rather than later :egeek:

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Jul 05, 2020 5:42 am
by bruce
There are several interesting comments in this topic. I apologize for my late discovery of the discussion.
NoMoreQuarantine wrote: If I remember correctly, it is not the raw storage capacity that is currently lacking (for now), but the RAM capacity and network bandwidth.
Yes, it's primarily network bandwidth ... counting each WU currently being distributed and each WU currently being returned ... both at home-networking speeds ... from each Work Server. Typically there are about 20+ active Work Servers, each with its own bandwidth limitations and its own list of active projects. For the most part, the WS is assigned a Collection Server which can cache the uploads for that WS and forward the WU to the WS whenever conditions permit. When the WS reaches the limits of its physical RAID storage, manual intervention is required.
NoMoreQuarantine wrote: Then there's the issue of network bandwidth, as the servers need to keep their ports open as the data slowly streams in from each client with their standard (slow) home upload speeds. Joseph said they were considering removing the water atom physics to reduce the data they need to transfer, which I am adamantly opposed to if it affects the overall simulation quality for the scientists. My proposal is to use volunteer clients as a sort of cache to buffer the data until the servers can get to it.

The capability of discarding the solvent atoms before uploading the results has been implemented in the latest version of FAHCore_22. Also, the capability of actively using file compression has been added, further decreasing the size of the uploaded files at the expense of a bit of extra processing time.
PantherX wrote:AFAIK, there are two simulation methods that F@H uses:
1) Implicit: The water molecules act as a constant and thus are not included in the simulation
2) Explicit: The water molecules are part of the simulation and calculated

The great aspect of F@H is that it is run by researchers and not corporate/marketing people. Thus, I am sure that scientific integrity will not be sacrificed. I haven't seen that happen in the past, nor do I expect it to happen in the future. However, if there's a way to mix and match, i.e. start with implicit, then transform to explicit (to get better details), and then change to implicit automatically via ML without compromising the scientific integrity, that would be brilliant.
Using implicit solvent is increasingly rare. Most simulations now use explicit solvent, but once the final positions of all the atoms have been calculated, the "water" in the solvent box is rarely needed.

Re: Response to Folding@Home Dev Chat: distributed storage

Posted: Sun Jul 05, 2020 5:53 am
by bruce
foldinghomealone2 wrote:...Streaming is what they plan for the far future. One benefit would be that the client calculates the 'next WU' by itself and therefore there is no download (except for the first one). This would reduce data transfer, but it would mean many more clients connected to the work servers...
I suspect this will not be implemented any time soon. It essentially locks a specific trajectory to a given client. Randomizing the WUs assigned to fast/slow hardware has a number of advantages, both from a scientific perspective and from a client's perspective.

A streaming version of FAH was tested several years ago, and both advantages and disadvantages were discovered.