How can I keep WU's queued up?

legoman666 · Post by **legoman666** » Mon Jan 05, 2009 8:54 pm

It seems that half the time, when one of my machines finishes a WU, it either cannot upload the completed WU or it cannot download a new WU. So it just sits there and wastes time. Is there a way to keep 1 or 2 WU's in a queue to avoid downtime? I looked through the advanced options but did not find anything. Any help would be appreciated.

metal03326 · Post by **metal03326** » Mon Jan 05, 2009 10:23 pm

Unfortunately, there isn't such an option.

P5-133XL · Post by **P5-133XL** » Mon Jan 05, 2009 10:23 pm

No, Stanford does not allow people to queue up WU's. As a practical matter, you could have multiple clients collect WU's, run one client at a time, and start the next client (which already has a WU) whenever the running one stalls. That requires constant monitoring, a lot of hand-holding, and rotating the clients to keep WU's from expiring.

While that may work, I'm absolutely sure it is not the way Stanford wants you to run the clients because you will be likely to be missing deadlines on a regular basis: If you don't run into a stall, then the extra clients are not needed and those WU's are likely to expire if you are not rotating the clients or if you can't finish multiple WU's in the time needed for one. Missed deadlines are very costly in terms of the science. Even delayed finishing is bad/costly for the science because if you (and others) are holding WU's for future use, then the next generation of WU's can not be released until the held WU's are finished.

bruce · Post by **bruce** » Tue Jan 06, 2009 4:45 am

P5-133XL wrote:No, Stanford does not allow people to queue up WU's. As a practical matter, you could have multiple clients collect WU's, run one client at a time, and start the next client (which already has a WU) whenever the running one stalls. That requires constant monitoring, a lot of hand-holding, and rotating the clients to keep WU's from expiring.

While that may work, I'm absolutely sure it is not the way Stanford wants you to run the clients because you will be likely to be missing deadlines on a regular basis: If you don't run into a stall, then the extra clients are not needed and those WU's are likely to expire if you are not rotating the clients or if you can't finish multiple WU's in the time needed for one. Missed deadlines are very costly in terms of the science. Even delayed finishing is bad/costly for the science because if you (and others) are holding WU's for future use, then the next generation of WU's can not be released until the held WU's are finished.

Moreover, when you have any WU that is not being worked on, you're delaying the project, even if you don't miss any deadlines. Nobody else can work on that trajectory while it's on your machine -- except after it expires.

legoman666 wrote:It seems that half the time, when one of my machines finishes a WU, it either cannot upload the completed WU or it cannot download a new WU. So it just sits there and wastes time.

This isn't strictly true.

If an upload fails, the client will move on to download a new WU. Once you get a new WU, your machine will be working on that WU and will not be wasting time, even though the previous WU is still waiting in the queue to upload.
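To make that concrete, here is a minimal sketch of the behavior described above. It is not the real client's code: the function name `run_client`, the `upload_ok` flags, and the assumption that pending results are retried once per cycle are all made up for illustration.

```python
from collections import deque

def run_client(upload_ok, cycles):
    """Simulate `cycles` download/fold/upload cycles.

    upload_ok[i] says whether the upload server is reachable during cycle i.
    (Hypothetical model: downloads are assumed to always succeed.)
    """
    pending = deque()   # finished results still waiting to be uploaded
    log = []
    for i in range(cycles):
        wu = f"WU{i}"
        log.append(f"fold {wu}")      # a new WU is downloaded and folded
        pending.append(wu)            # its result joins the upload queue
        # Try to flush the upload queue; on failure the results simply
        # stay queued and the loop moves on to the next WU.
        while pending and upload_ok[i]:
            log.append(f"upload {pending.popleft()}")
    return log, list(pending)

log, still_pending = run_client([False, True, True], cycles=3)
print(log)
```

In this run the upload of the first result fails, folding continues with the second WU anyway, and both results go out once the server is reachable again.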

Vijay has mentioned that the server code has been rewritten and that it should be possible to see a roll-out of this code soon. I'm confident that this new code will go a very long way toward solving the upload and download problems that we have been seeing.

legoman666 · Post by **legoman666** » Fri Jan 09, 2009 1:33 pm

Ah, thanks for the clarification. Still, it'd be nice if this were possible. I'd mainly like this for the GPU2 work units, which have a deadline of 3 days and which my 4850s can munch through in about 3 hours. I can see how it wouldn't be practical on big SMP units, though.
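A back-of-the-envelope sketch of the numbers in this post (3-day deadline, roughly 3 hours per WU). It assumes every queued WU is downloaded at the same moment and folded one after another; the function names are just for illustration.

```python
DEADLINE_HOURS = 3 * 24   # GPU2 deadline mentioned in the post
HOURS_PER_WU = 3          # approximate time per WU on the 4850, from the post

def return_times(queue_depth, hours_per_wu=HOURS_PER_WU):
    """Hours after download at which each queued WU would be returned."""
    return [hours_per_wu * (k + 1) for k in range(queue_depth)]

def max_queue_depth(deadline=DEADLINE_HOURS, hours_per_wu=HOURS_PER_WU):
    """Largest queue in which the last WU still meets the deadline."""
    return deadline // hours_per_wu

print(return_times(3))     # [3, 6, 9]
print(max_queue_depth())   # 24
```

So the deadline itself would not be the limit for a short queue; what grows is the return time of each extra WU, which goes up by about 3 hours per position in the queue.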

whynot · Post by **whynot** » Sat Jan 10, 2009 3:13 pm

P5-133XL wrote:While that may work, I'm absolutely sure it is not the way Stanford wants you to run the clients because you will be likely to be missing deadlines on a regular basis: If you don't run into a stall, then the extra clients are not needed and those WU's are likely to expire if you are not rotating the clients or if you can't finish multiple WU's in the time needed for one. Missed deadlines are very costly in terms of the science. Even delayed finishing is bad/costly for the science because if you (and others) are holding WU's for future use, then the next generation of WU's can not be released until the held WU's are finished.

IMHO, from a cruncher's POV that objection could be answered with "let's invent a kind of reputation rating, so that once a cruncher (me, actually) reaches some trust level, she/he is allowed to queue". But from F@H's POV, inventing new complexity adds nothing to the goal. Really, the usual cycle (download - crunch - upload) works almost all the time. When something goes wrong on the upload servers (mostly -- blocks) that's just a matter of patience.

MtM · Post by **MtM** » Sat Jan 10, 2009 3:26 pm

What does a trust level have to do with queuing?

Btw, adding to the controversy, the 3rd-party tools list contains a tool which can still cache WU's on Windows machines.

I think queueing is bad, but there are some cases in which I will not speak out against it, namely when you're on a connection where you can't know in advance that you'll be online at the time the WU is done. For that case, the deadlineless WU's have been discontinued, I think (and there have never been any for HPC anyway). Cycling through 3 or so slots, when you know that's about the interval you get with your connection, is officially frowned upon, but of the one person I know who does this, I can't say I'd rather have him donate cycles to another DC project.

I understand it's not the way it's meant to be, and I don't/won't actually recommend this.

alpha754293 · Post by **alpha754293** » Sun Jan 18, 2009 2:14 am

Can't there be WUs that do not have a deadline?

IF the WU server runs out of work to do, how do we know if and when it will have more work, especially for the beta clients?

7im · Post by **7im** » Mon Jan 19, 2009 7:18 am

alpha754293 wrote:Can't there be WUs that do not have a deadline?

IF the WU server runs out of work to do, how do we know if and when it will have more work, especially for the beta clients?

They tried WUs without deadlines for a while. It only works for a very small part of the folding science, because each work unit is part of a long chain of data: each generation of work unit builds on the previous one. The generation B work unit isn't even created until generation A is completed and returned to Stanford. What would happen if generation B sat on someone's computer for a long time because there was no deadline? It would hold up all of the future generations of that work. And that's why deadlines are not the issue here. The issue is getting each work unit back as fast as possible, so that the next one in line can be processed.

Besides, the deadlines on the CPU client are already VERY generous. A P3-500 only has to fold 8 hours a day to make the deadline. If you can't make that deadline with a P4 or newer, then you've got problems. So again, the deadline isn't the issue.

If the Server runs out of work, your client is either assigned to a different server and gets more work, or it gets assigned to the default server of 0.0.0.0 and the client sits idle waiting for new work units to be loaded on the server. When new WUs are available, the client will start folding again. Since the client starts up again on its own, knowing when there are more WUs isn't necessary. While it does happen, it is very rare. And even then, someone posts about it and Stanford fixes it quickly. There will always be more work units to process.

alpha754293 · Post by **alpha754293** » Mon Jan 19, 2009 7:35 am

7im wrote:
alpha754293 wrote:Can't there be WUs that do not have a deadline?

IF the WU server runs out of work to do, how do we know if and when it will have more work, especially for the beta clients?
They tried WUs without deadlines for a while. It only works for a very small part of the folding science, because each work unit is part of a long chain of data: each generation of work unit builds on the previous one. The generation B work unit isn't even created until generation A is completed and returned to Stanford. What would happen if generation B sat on someone's computer for a long time because there was no deadline? It would hold up all of the future generations of that work. And that's why deadlines are not the issue here. The issue is getting each work unit back as fast as possible, so that the next one in line can be processed.

Besides, the deadlines on the CPU client are already VERY generous. A P3-500 only has to fold 8 hours a day to make the deadline. If you can't make that deadline with a P4 or newer, then you've got problems. So again, the deadline isn't the issue.

If the Server runs out of work, your client is either assigned to a different server and gets more work, or it gets assigned to the default server of 0.0.0.0 and the client sits idle waiting for new work units to be loaded on the server. When new WUs are available, the client will start folding again. Since the client starts up again on its own, knowing when there are more WUs isn't necessary. While it does happen, it is very rare. And even then, someone posts about it and Stanford fixes it quickly. There will always be more work units to process.

When I was beta-testing the 5.91/5.92 Linux SMP client, I ran out of WU's for about 4 months. I know that there wasn't an issue with my network connection or anything like that because the non-SMP Windows clients were running without any problems.

As a result of that, I stopped folding for about a year.

Do they always go straight from A -> B or do they process all of the "A" data first before publishing "B" data to make it available for people to crunch?

I just think that it might be nice to "lease" the WU's from a server so that if you uninstall the program (and all of the programs, including the console clients should have an "uninstall" option), it should be able to send the WUs back.

I'm not sure how distributed.net does theirs although they do allow having your own proxy client and you can keep like a WU store/buffer while your system works on it.

I don't know what would be the best solution.

MtM · Post by **MtM** » Mon Jan 19, 2009 12:06 pm

He already explained the serial nature; accept it and move on.

Leasing in combination with trying to queue WU's is not helpful to the project. Distributed.net uses uniprocessor clients, which use the old approach to scientific research. The new HPC clients can use another approach based on a Markov chain. That is why it is serial and why the things you propose are not feasible.

7im · Post by **7im** » Tue Jan 20, 2009 9:01 pm

Additionally, the SMP servers DID NOT run out of work units for 4 months. The Stanford network has never been down more than a day or two at the most, in the entire history of the project.

If your client had stopped working, you could have come to this forum for help to figure out why the client could not connect to get new work. It is unfortunate that you lost a year of folding due to a simple misunderstanding.

alpha754293 · Post by **alpha754293** » Wed Jan 21, 2009 3:01 am

MtM wrote:He already explained the serial nature; accept it and move on.

Leasing in combination with trying to queue WU's is not helpful to the project. Distributed.net uses uniprocessor clients, which use the old approach to scientific research. The new HPC clients can use another approach based on a Markov chain. That is why it is serial and why the things you propose are not feasible.

Well...see...I don't believe in that 100% because you're running on a distributed computing platform. Therefore, if B depends on A, then there's no way that you can issue new "A" units until some of the earlier "A" units are done.

e.g. suppose you have 4 WUs, A1, A2, B1, B2

There are a number of scenarios I can think of:
i) A2 depends on A1
ii) A2 is independent of A1
iii) B1 AND B2 depend on A1 AND A2
iv) B1 AND B2 depend on either A1 OR A2
v) B1 OR B2 depends on A1 AND A2
vi) B1 OR B2 depends on either A1 OR A2
etc.

The point is that if you need to process all of the "A" units first, and "B" depends on "A", then you should be able to queue up "A" units locally. It is the entire premise of the distributed computation project. But if what you guys are saying is that A2 and A1 can't be computed separately, in that A2 depends on A1, then really A2 should be renamed B1 in order to illustrate the parent/child relationship, rather than an independent peer relationship.

And while I can understand it from a data security perspective, I think that saying "it can't be done" on account of parent/child relationships is a pathetically poor excuse for data management, especially when you are talking about a distributed computing platform of this size/type and magnitude.

Besides, if anybody actually has so much time to manufacture results in order to send crap back to the F@H servers, they have WAYYY too much time on their hands, which does the project no good, and either need a job, a life, or both.

Course if the WUs are being encrypted with RSA 2ki key, you'd pretty much need to know what the encryption key is, to which I say...good luck with that.

I DEFINITELY don't buy the "it's a parent/child dependency relationship" argument. You won't be able to generate new WUs anyways. So what are you going to do? Stop distributing WUs just to wait for that one? Yea. Sorry. Anybody with half a mind's wit ought to be able to figure out the ill-logic in that.

(I don't expect the policies to change, but come on...like really? Seriously? That's the best explanation they've got?)

Even then...crypto can be handled. (If they're running Linux/UNIX servers on the backbone, when you start up the system for the first time, it usually generates an RSA and DSA (I think) public key pair.) Key that, along with the WU itself (during the transmission), in conjunction with the User/MachineID.

Some of the biggest financial institutions piggy back off other smaller financial institutions and investment houses because of the excess computing capacity that they've got and they send data to each other all the time. Take a whiff from their pages and be "Clinique Happy".

(I wonder what excuse is going to come next, while I accept the fact that it still going to be allowed.)

7im wrote:Additionally, the SMP servers DID NOT run out of work units for 4 months. The Stanford network has never been down more than a day or two at the most, in the entire history of the project.

If your client had stopped working, you could have come to this forum for help to figure out why the client could not connect to get new work. It is unfortunate that you lost a year of folding due to a simple misunderstanding.

I didn't even know that this forum existed back then. MOST of the time, when the client runs out of work or can't connect, I let it go for a while and it re-establishes a connection.

Don't quote me on this, but I think it was around the April 2006 to August 2006 timeframe (it might have been 2007; I'm quite iffy on the actual dates because it's been so long) that the Linux SMP 5.91b/5.92b client (I think it was still 5.91b at that time, actually) failed to reconnect to the server. No matter how many times I started, stopped, rebooted, etc. to try to pick up the server, it just wouldn't do it. It kept saying that the work unit queue was empty. This was at a time when I was one of the early adopters of the SMP client: while most people were still talking about dual-core systems, I had a quad-socket system (and neither the GPU nor the PS3 clients were out yet), which gave me a significant point advantage.

I knew that it wasn't a network issue because my systems were fine and all of the uniprocessor clients were able to get WU updates just fine.

In any case, that's history.

(I don't know if there's even a way to check the history logs as to when I stopped folding, or when there was a sudden drop in my (PPD) output. But if there's a way to pull the records up, it should clearly show when the SMP work server ran out of work (BTW...I didn't say the servers went down, I said that they ran out of work), while the uniprocessor clients were still running, and when I stopped folding altogether.)

bruce · Post by **bruce** » Wed Jan 21, 2009 3:21 am

alpha754293 wrote:Well...see...I don't believe in that 100% because you're running on a distributed computing platform. Therefore, if B depends on A, then there's no way that you can issue new "A" units until some of the earlier "A" units are done.

Well, you seem to believe that you know more about FAH than the scientists that designed it -- but perhaps we can let you off the hook because of your "100%" statement.

Each project calculates a certain number of trajectories. They issue one WU for the first segment of time for that trajectory. Once those WUs are issued, no further progress can be accomplished until somebody returns a result. After the first segment of time has been processed, a WU for the next segment of time can be created. That means WUs are both serial and parallel. The serial sequence of time segments is a very significant issue when considering FAH's assignment methodology.

Suppose you download a WU from N different trajectories. Since you can only process a single WU at a time, that means that you are preventing the other (N-1) trajectories from progressing because nobody else can work on those trajectories while they are assigned to you. That added delay for (N-1) trajectories is "expensive" from a scientific standpoint.

FAH uses the resources that you provide with minimal waste. That means that they attempt to minimize the number of WUs that are assigned more than once (but they still have to reissue those which are lost). It also means that the number of WUs available is kept to a bare minimum, so when you hog more than one WU, somebody else may not be able to get something to work on.

You should always return any assignment you receive as promptly as is possible, given the limitations of your hardware.
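Here is a toy simulation of the argument above. It is only a sketch under simplifying assumptions, not how the FAH assignment servers actually work: every donor finishes exactly one WU per time step, each trajectory has at most one WU outstanding, and the names `simulate` and `hold` (how many WUs a donor keeps locally) are invented for the example.

```python
from collections import deque

def simulate(num_trajectories, generations, num_donors, hold):
    """Return the number of time steps until every trajectory is finished."""
    done = [0] * num_trajectories                # generations completed per trajectory
    available = deque(range(num_trajectories))   # trajectories with a WU ready to assign
    queues = [deque() for _ in range(num_donors)]
    steps = 0
    while any(d < generations for d in done):
        steps += 1
        # Assignment: each donor tops up its local queue from the server.
        for q in queues:
            while len(q) < hold and available:
                q.append(available.popleft())
        # Work: each donor finishes the WU at the front of its queue.
        finished = [q.popleft() for q in queues if q]
        # Return: the next generation exists only once the previous one is back.
        for t in finished:
            done[t] += 1
            if done[t] < generations:
                available.append(t)
    return steps

print(simulate(4, 3, 4, hold=1))   # 3
print(simulate(4, 3, 4, hold=2))   # 6
```

With four trajectories of three generations and four donors, holding one WU each finishes in 3 steps. Letting donors hold two WUs takes 6 steps, because two donors grab everything and the other two sit idle with nothing to work on.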

alpha754293 · Post by **alpha754293** » Wed Jan 21, 2009 8:41 pm

bruce wrote:
alpha754293 wrote:Well...see...I don't believe in that 100% because you're running on a distributed computing platform. Therefore, if B depends on A, then there's no way that you can issue new "A" units until some of the earlier "A" units are done.
Well, you seem to believe that you know more about FAH than the scientists that designed it -- but perhaps we can let you off the hook because of your "100%" statement.

It is good to question the status quo, n'est-ce pas?

bruce wrote:Each project calculates a certain number of trajectories. They issue one WU for the first segment of time for that trajectory. Once those WUs are issued, no further progress can be accomplished until somebody returns a result. After the first segment of time has been processed, a WU for the next segment of time can be created. That means WUs are both serial and parallel. The serial sequence of time segments is a very significant issue when considering FAH's assignment methodology.

Suppose you download a WU from N different trajectories. Since you can only process a single WU at a time, that means that you are preventing the other (N-1) trajectories from progressing because nobody else can work on those trajectories while they are assigned to you. That added delay for (N-1) trajectories is "expensive" from a scientific standpoint.

FAH uses the resources that you provide with minimal waste. That means that they attempt to minimize the number of WUs that are assigned more than once (but they still have to reissue those which are lost). It also means that the number of WUs available is kept to a bare minimum, so when you hog more than one WU, somebody else may not be able to get something to work on.

You should always return any assignment you receive as promptly as is possible, given the limitations of your hardware.

Actually, I have 3 systems right now that are folding (total of 5 clients running).

As I mentioned before, I do plan on going to a 16-core or 32-core system, probably within the next year or so, at which point I'd have to start up to 8 clients at the same time. So if I were to bank, say, 10 WUs, the current projection estimates mean that it would be crunching either up to 8 WUs every 22-25 hours (on average) or 4 WUs every 11-12.5 hours.

Therefore, in the event of a network outage, my systems would not stop folding. (I'm currently at 99% network uptime, averaging 4% packet loss.)

I would think that once you start getting into like 64-cores and 128-cores folding, it'd be preferred if there would be one system that handles the data communications rather than trying to have the individual clients communicate with the F@H servers since the inbound/outbound bandwidth is much less than the LAN bandwidth.

It's a thought. And I'm just thinking/planning ahead.
