How can I keep WU's queued up?

Moderators: Site Moderators, FAHC Science Team

legoman666
Posts: 15
Joined: Sat Dec 22, 2007 6:26 pm
Hardware configuration: Gigabyte GA-X38-DS4
Q6600 @ 3.1ghz
1x HD3870 @ 840/1125
1x HD4850 @ Stock
4gb DDR2 1066
XP SP3
2x GPU2 + 1 SMP

GIGABYTE GA-MA69G-S3H
A64 X2 4000+ @ 2.45ghz
1x HD2400 @ Stock
1gb DDR2 1000
XP SP2
1 SMP

MSI K9N6SGM-V
A64 3000+ @ 1.6ghz
7300GS @ Stock
512mb DDR2 800
XP SP2
1 CPU

Lenovo R61
T7200 @ 2.0ghz
2gb DDR2 667
XP SP2
1 SMP

Lenovo ThinkCentre
Q6700 @ 2.67ghz
2gb DDR2 800
XP SP2
4x CPU

Toshiba Laptop
CD T2060 @ 1.6ghz
1gb DDR2 667
Vista
2x CPU

Toshiba Laptop
Pentium M 1.6ghz
512mb DDR 333
XP SP1?
1 CPU

How can I keep WU's queued up?

Post by legoman666 »

It seems that half the time, when one of my machines finishes a WU, it either cannot upload the completed WU or it cannot download a new WU. So it just sits there and wastes time. Is there a way to keep 1 or 2 WU's in a queue to avoid downtime? I looked through the advanced options but did not find anything. Any help would be appreciated.
metal03326
Posts: 43
Joined: Fri May 30, 2008 12:36 pm
Hardware configuration: Mainboard: Asus M2R32-MVP (AMD 580X CrossFire Chipset)
CPU: AMD Athlon X2 3800+@2.4GHz
CPU cooler: Arctic Cooling Freezer 64 Pro PWM
RAM: 2x1 GB Team XTreme Dark 800MHz@5-5-5-16 in Dual Channel mode
Video Card: Sapphire HD 2600 XT@850MHz GDDR4@2200MHz
HDD: WESTERN DIGITAL 640GB Caviar® Black™ SATA2 7200rpm 32MB
OS: Microsoft Windows 7 Ultimate x64
Clients (always latest version): SMP2
-----
Dell Inspiron 1000 (Celeron@2.2GHz)
Location: Vratsa, Bulgaria.

Re: How can I keep WU's queued up?

Post by metal03326 »

Unfortunately, there isn't such an option.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: How can I keep WU's queued up?

Post by P5-133XL »

No, Stanford does not allow people to queue up WU's. As a practical matter, you could always have multiple clients collect WU's and then simply run one client at a time; when a client stalls, start up the next client (which already has a WU). It will require constant monitoring, excessive hand-holding, and rotating the clients to prevent WU's from expiring.

While that may work, I'm absolutely sure it is not the way Stanford wants you to run the clients, because you are likely to miss deadlines on a regular basis: if you don't run into a stall, then the extra clients are not needed, and those WU's are likely to expire if you are not rotating the clients or if you can't finish multiple WU's in the time needed for one. Missed deadlines are very costly in terms of the science. Even delayed finishing is bad/costly for the science, because if you (and others) are holding WU's for future use, then the next generation of WU's cannot be released until the held WU's are finished.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: How can I keep WU's queued up?

Post by bruce »

P5-133XL wrote: No, Stanford does not allow people to queue up WU's. As a practical matter, you could always have multiple clients collect WU's and then simply run one client at a time; when a client stalls, start up the next client (which already has a WU). It will require constant monitoring, excessive hand-holding, and rotating the clients to prevent WU's from expiring.

While that may work, I'm absolutely sure it is not the way Stanford wants you to run the clients, because you are likely to miss deadlines on a regular basis: if you don't run into a stall, then the extra clients are not needed, and those WU's are likely to expire if you are not rotating the clients or if you can't finish multiple WU's in the time needed for one. Missed deadlines are very costly in terms of the science. Even delayed finishing is bad/costly for the science, because if you (and others) are holding WU's for future use, then the next generation of WU's cannot be released until the held WU's are finished.
Moreover, when you have any WU that is not being worked on, you're delaying the project, even if you don't miss any deadlines. Nobody else can work on that trajectory while it's on your machine -- except after it expires.
legoman666 wrote:It seems that half the time, when one of my machines finishes a WU, it either cannot upload the completed WU or it cannot download a new WU. So it just sits there and wastes time.
This isn't strictly true.

If an upload fails, the client will move on to download a new WU. Once you get a new WU, your machine will be working on that WU and will not be wasting time, even though the previous WU is still waiting in the queue to upload.
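To make that behavior concrete, here is a minimal sketch of the end-of-WU cycle as described above. It is only a model, not the actual client code: the function name and the `try_upload` / `try_download` callables are hypothetical stand-ins for the real network code.

```python
from collections import deque


def finish_work_unit(finished, pending_uploads, try_upload, try_download):
    """Model one end-of-WU cycle: queue the result, retry uploads, fetch new work.

    try_upload(wu) -> bool and try_download() -> WU-or-None are hypothetical
    callables; they are assumptions made for this illustration.
    """
    pending_uploads.append(finished)
    # Attempt every queued result once; keep the ones that still fail.
    for _ in range(len(pending_uploads)):
        wu = pending_uploads.popleft()
        if not try_upload(wu):
            pending_uploads.append(wu)
    # A failed upload does not block the download of the next WU.
    return try_download()


pending = deque()
next_wu = finish_work_unit(
    "wu-1",
    pending,
    try_upload=lambda wu: False,   # upload server unreachable
    try_download=lambda: "wu-2",   # work server reachable
)
print(next_wu, list(pending))
```

In this model the machine only sits idle when the download fails as well; a stuck upload just stays in the queue and is retried later.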

Vijay has mentioned that the server code has been rewritten and that it should be possible to see a roll-out of this code soon. I'm confident that this new code will go a very long way toward solving the upload and download problems that we have been seeing.
legoman666
Posts: 15
Joined: Sat Dec 22, 2007 6:26 pm

Re: How can I keep WU's queued up?

Post by legoman666 »

Ah, thanks for the clarification. Still, it'd be nice if this were possible. I'd mainly like this for the GPU2 work units, which have a deadline of 3 days and which my 4850 can munch through in about 3 hours. I can see how it wouldn't be practical on big SMP units though.
whynot
Posts: 91
Joined: Wed Mar 26, 2008 9:02 pm
Location: Kyiv, Ukraine

Re: How can I keep WU's queued up?

Post by whynot »

P5-133XL wrote: While that may work, I'm absolutely sure it is not the way Stanford wants you to run the clients, because you are likely to miss deadlines on a regular basis: if you don't run into a stall, then the extra clients are not needed, and those WU's are likely to expire if you are not rotating the clients or if you can't finish multiple WU's in the time needed for one. Missed deadlines are very costly in terms of the science. Even delayed finishing is bad/costly for the science, because if you (and others) are holding WU's for future use, then the next generation of WU's cannot be released until the held WU's are finished.
IMHO, from a cruncher's POV that could be countered with "let's invent a kind of reputation rating, so that once a cruncher (me, actually) reaches some trust level, she/he is allowed to queue". But from F@H's POV, inventing new complexity adds nothing to the goal. Really, the usual cycle (download - crunch - upload) works almost all the time. When something goes wrong on the upload servers (mostly -- blocks), it's just a matter of patience.
--
I'm counting for science.
Points just make me sick.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: How can I keep WU's queued up?

Post by MtM »

What does a trust level have to do with queuing?

Btw, adding to the controversy, the 3rd party tools list contains a tool which can still cache WU's on Windows machines.

I think queueing is bad, but there are some cases in which I will not speak out against it, namely when you're running on a connection that won't let you download, crunch, and upload with prior knowledge that you'll have a connection at the time the WU is done. For that case, the deadlineless WU's have been discontinued, I think (and there have never been any for HPC anyway), and cycling through 3 or so slots, when you know that's about the interval you get with your connection, is something which is officially frowned upon. But of the one person I know of who does this, I can't say I'd rather have him donate cycles to another DC project.

I understand it's not the way it's meant to be, and I don't/won't actually recommend this.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: How can I keep WU's queued up?

Post by alpha754293 »

Can't there be WUs that do not have a deadline?

IF the WU server runs out of work to do, how do we know if and when it will have more work, especially for the beta clients?
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: How can I keep WU's queued up?

Post by 7im »

alpha754293 wrote:Can't there be WUs that do not have a deadline?

IF the WU server runs out of work to do, how do we know if and when it will have more work, especially for the beta clients?
They tried WUs without deadlines for a while. It only works for a very small part of the folding science. For the rest, each work unit is a part of a long chain of data, and each generation of work unit builds on the previous one. The generation B work unit isn't even created until generation A is completed and returned to Stanford. What would happen if generation B sat on someone's computer for a long time because there was no deadline? It would hold up all of the future generations of that work. And that's why deadlines are not the issue here. The issue is getting each work unit back as fast as possible, so that the next one in line can be processed.
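As a rough illustration of that chain, here is a small sketch. It is not Stanford's server code, and the function name and the hour figures are made up for the example. It assumes only what is stated above: generation g+1 cannot be created until generation g has been returned.

```python
def simulate_trajectory(return_delays_hours):
    """Return the hour at which each generation's result arrives.

    return_delays_hours[g] is how long the donor holding generation g takes
    to return it (compute time plus any time it sat idle in a queue).
    Generation g+1 is only created once generation g is back.
    """
    finished_at = []
    clock = 0
    for delay in return_delays_hours:
        clock += delay  # the next generation cannot start any earlier
        finished_at.append(clock)
    return finished_at


prompt = simulate_trajectory([3, 3, 3, 3])
queued = simulate_trajectory([3, 3 + 48, 3, 3])  # generation 1 sat idle for 48 h
print(prompt)  # [3, 6, 9, 12]
print(queued)  # [3, 54, 57, 60]
```

One donor sitting on a single generation pushes back every later generation of that chain by the same amount, even though no deadline was missed.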

Besides, the deadlines on the CPU client are already VERY generous. A P3-500 only has to fold 8 hours a day to make the deadline. If you can't make that deadline with a P4 or newer, then you've got problems. So again, the deadline isn't the issue.

If the Server runs out of work, your client is either assigned to a different server and gets more work, or it gets assigned to the default server of 0.0.0.0 and the client sits idle waiting for new work units to be loaded on the server. When new WUs are available, the client will start folding again. Since the client starts up again on its own, knowing when there are more WUs isn't necessary. While it does happen, it is very rare. And even then, someone posts about it and Stanford fixes it quickly. There will always be more work units to process.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: How can I keep WU's queued up?

Post by alpha754293 »

7im wrote:
alpha754293 wrote:Can't there be WUs that do not have a deadline?

IF the WU server runs out of work to do, how do we know if and when it will have more work, especially for the beta clients?
They tried WUs without deadlines for a while. It only works for a very small part of the folding science. For the rest, each work unit is a part of a long chain of data, and each generation of work unit builds on the previous one. The generation B work unit isn't even created until generation A is completed and returned to Stanford. What would happen if generation B sat on someone's computer for a long time because there was no deadline? It would hold up all of the future generations of that work. And that's why deadlines are not the issue here. The issue is getting each work unit back as fast as possible, so that the next one in line can be processed.

Besides, the deadlines on the CPU client are already VERY generous. A P3-500 only has to fold 8 hours a day to make the deadline. If you can't make that deadline with a P4 or newer, then you've got problems. So again, the deadline isn't the issue.

If the Server runs out of work, your client is either assigned to a different server and gets more work, or it gets assigned to the default server of 0.0.0.0 and the client sits idle waiting for new work units to be loaded on the server. When new WUs are available, the client will start folding again. Since the client starts up again on its own, knowing when there are more WUs isn't necessary. While it does happen, it is very rare. And even then, someone posts about it and Stanford fixes it quickly. There will always be more work units to process.
When I was beta-testing the 5.91/5.92 Linux SMP client, I ran out of WU's for like 4 months. I know that there wasn't an issue with my network connection or anything like that because the non-SMP Windows clients were running without any problems.

As a result of that, I stopped folding for about a year.

Do they always go straight from A -> B or do they process all of the "A" data first before publishing "B" data to make it available for people to crunch?

I just think that it might be nice to "lease" the WU's from a server so that if you uninstall the program (and all of the programs, including the console clients should have an "uninstall" option), it should be able to send the WUs back.

I'm not sure how distributed.net does theirs although they do allow having your own proxy client and you can keep like a WU store/buffer while your system works on it.

I don't know what would be the best solution.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Location: The Netherlands

Re: How can I keep WU's queued up?

Post by MtM »

He already explained the serial nature, accept it and move on ;)

Leasing in combination with trying to queue WU's is not helpful to the project. Distributed.net uses uniprocessor clients, which use the old approach to scientific research. The new HPC clients can use another approach based on a Markov chain. That is why it is serial, and why the things you propose are not feasible.
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: How can I keep WU's queued up?

Post by 7im »

Additionally, the SMP servers DID NOT run out of work units for 4 months. The Stanford network has never been down more than a day or two at the most, in the entire history of the project.

If your client had stopped working, you could have come to this forum for help to figure out why the client could not connect to get new work. It is unfortunate that you lost a year of folding due to a simple misunderstanding.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: How can I keep WU's queued up?

Post by alpha754293 »

MtM wrote:He already explained the serial nature, accept it and move on ;)

Leasing in combination with trying to queue WU's is not helpful to the project. Distributed.net uses uniprocessor clients, which use the old approach to scientific research. The new HPC clients can use another approach based on a Markov chain. That is why it is serial, and why the things you propose are not feasible.
Well...see...I don't believe in that 100% because you're running on a distributed computing platform. Therefore; if B depends on A, then there's no way that you can issue new "A" units until some of the earlier "A" units are done.

e.g. suppose you have 4 WUs, A1, A2, B1, B2

There are a number of scenarios I can think of:
i) A2 depends on A1
ii) A2 is independent of A1
iii) B1 AND B2 depend on A1 AND A2
iv) B1 AND B2 depend on either A1 OR A2
v) B1 OR B2 depends on A1 AND A2
vi) B1 OR B2 depends on either A1 OR A2
etc.
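One way to pin down what "depends on" means in scenarios like these is to write the dependencies as a map and ask which WUs could be handed out at a given moment. The sketch below is only an illustration: the `issuable` helper and the chosen dependency map (A1 and A2 independent, B1 following A1, B2 following A2) are assumptions for the example, not a description of how the F@H servers actually track work.

```python
def issuable(deps, done, issued):
    """Work units whose prerequisites are all complete and that are not yet handed out."""
    return sorted(
        wu
        for wu, needs in deps.items()
        if wu not in done and wu not in issued and needs <= done
    )


# Assumed scenario: A1 and A2 are independent; B1 follows A1, B2 follows A2.
deps = {"A1": set(), "A2": set(), "B1": {"A1"}, "B2": {"A2"}}

print(issuable(deps, done=set(), issued=set()))    # ['A1', 'A2']
print(issuable(deps, done={"A1"}, issued={"A2"}))  # ['B1']
```

Under this reading, B2 stays unavailable to everyone for as long as A2 is held by whoever downloaded it, whether or not that machine is actually working on it.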

The point is that if you need to process all of the "A" units first, and "B" depends on "A", then you should be able to queue up "A" units locally. It is the entire premise of the distributed computation project. But if what you guys are saying is that A2 and A1 can't be computed separately, in that A2 depends on A1, then really A2 should be renamed as B1 in order to illustrate the parent/child relationship, rather than an independent peer relationship.

And while I can understand it from a data security perspective, I think that saying "it can't be done" on account of parent/child relationships is a pathetically poor excuse for data management, especially when you are talking about a distributed computing platform of this size/type and magnitude.

Besides, if anybody actually has so much time to manufacture results in order to send crap back to the F@H servers, they have WAYYY too much time on their hands, which does the project no good, and either need a job, a life, or both.

Of course, if the WUs are being encrypted with an RSA 2ki key, you'd pretty much need to know what the encryption key is, to which I say...good luck with that.

I DEFINITELY don't buy the "it's a parent/child dependency relationship" argument. You won't be able to generate new WUs anyways. So what are you going to do? Stop distributing WUs just to wait for that one? Yea. Sorry. Anybody with half a mind's wit ought to be able to figure out the ill-logic in that.

(I don't expect the policies to change, but come on...like really? Seriously? That's the best explanation they've got?)

Even then...crypto can be handled. (If they're running Linux/UNIX servers on the backbone, when you start up the system for the first time, it usually generates an RSA and DSA (I think) public key pair.) Key that, along with the WU itself (during the transmission), in conjunction with the User/MachineID.

Some of the biggest financial institutions piggy back off other smaller financial institutions and investment houses because of the excess computing capacity that they've got and they send data to each other all the time. Take a whiff from their pages and be "Clinique Happy".

(I wonder what excuse is going to come next, while I accept the fact that it still going to be allowed.)
7im wrote:Additionally, the SMP servers DID NOT run out of work units for 4 months. The Stanford network has never been down more than a day or two at the most, in the entire history of the project.

If your client had stopped working, you could have come to this forum for help to figure out why the client could not connect to get new work. It is unfortunate that you lost a year of folding due to a simple misunderstanding.
I didn't even know that this forum existed back then. MOST of the time, when the client runs out of work, or can't connect, I let it go for a while and it will re-establish a connection.

Don't quote me on this, but I think that it was maybe around the April 2006 to August 2006-ish timeframe (might have been 2007; I'm quite iffy on the actual dates because it's been so long), but the Linux SMP 5.91b/5.92b client (I think it was still 5.91b at that time, actually) failed to reconnect to the server during that period. No matter how many times I started, stopped, rebooted, etc. to try and pick up the server, it just wouldn't do it. It kept saying that the work unit queue was empty, and this was at a time when I was one of the early adopters of the SMP client: while most people were still talking about dual-core systems, I had a quad-socket system (and neither the GPU nor the PS3 clients were out yet), which gave me a significant point advantage.

I knew that it wasn't a network issue because my systems were fine and all of the uniprocessor clients were able to get WU updates just fine.

In any case, that's history.

(I don't know if there's even a way to check the history logs as to when I stopped folding, or when there was a sudden drop in my (PPD) output. But if there's a way to pull the records up, it should clearly show when the SMP work server ran out of work (BTW...I didn't say the servers went down, I said that they ran out of work), while the uniprocessor clients were still running, and when I stopped folding altogether.)
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: How can I keep WU's queued up?

Post by bruce »

alpha754293 wrote:Well...see...I don't believe in that 100% because you're running on a distributed computing platform. Therefore; if B depends on A, then there's no way that you can issue new "A" units until some of the earlier "A" units are done.
Well, you seem to believe that you know more about FAH than the scientists that designed it -- but perhaps we can let you off the hook because of your "100%" statement.

Each project calculates a certain number of trajectories. They issue one WU for the first segment of time for that trajectory. Once those WUs are issued, no further progress can be accomplished until somebody returns a result. After the first segment of time has been processed, a WU for the next segment of time can be created. That means WUs are both serial and parallel. The serial sequence of time segments is a very significant issue when considering FAH's assignment methodology.

Suppose you download a WU from N different trajectories. Since you can only process a single WU at a time, that means that you are preventing the other (N-1) trajectories from progressing because nobody else can work on those trajectories while they are assigned to you. That added delay for (N-1) trajectories is "expensive" from a scientific standpoint.
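A back-of-the-envelope sketch of that cost, assuming one machine that processes its held WUs strictly one after another and takes the same time for each. The `hoarding_delay` name and the 3-hour figure are only illustrative assumptions.

```python
def hoarding_delay(n_wus, hours_per_wu):
    """Hours each held WU waits before the donor even starts working on it."""
    return [i * hours_per_wu for i in range(n_wus)]


waits = hoarding_delay(4, 3)
print(waits)       # [0, 3, 6, 9]
print(sum(waits))  # 18
```

With 4 held WUs at 3 hours each, the first trajectory is not delayed at all, but the other three lose 3, 6, and 9 hours, for 18 trajectory-hours in total during which nobody else could advance them.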

FAH uses the resources that you provide with minimal waste. That means that they attempt to minimize the number of WUs that are assigned more than once (but they still have to reissue those which are lost). It also means that the number of WUs available is kept to a bare minimum, so when you hog more than one WU, somebody else may not be able to get something to work on.

You should always return any assignment you receive as promptly as is possible, given the limitations of your hardware.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: How can I keep WU's queued up?

Post by alpha754293 »

bruce wrote:
alpha754293 wrote:Well...see...I don't believe in that 100% because you're running on a distributed computing platform. Therefore; if B depends on A, then there's no way that you can issue new "A" units until some of the earlier "A" units are done.
Well, you seem to believe that you know more about FAH than the scientists that designed it -- but perhaps we can let you off the hook because of your "100%" statement.
It is good to question the status quo, n'est-ce pas?
bruce wrote:Each project calculates a certain number of trajectories. They issue one WU for the first segment of time for that trajectory. Once those WUs are issued, no further progress can be accomplished until somebody returns a result. After the first segment of time has been processed, a WU for the next segment of time can be created. That means WUs are both serial and parallel. The serial sequence of time segments is a very significant issue when considering FAH's assignment methodology.

Suppose you download a WU from N different trajectories. Since you can only process a single WU at a time, that means that you are preventing the other (N-1) trajectories from progressing because nobody else can work on those trajectories while they are assigned to you. That added delay for (N-1) trajectories is "expensive" from a scientific standpoint.

FAH uses the resources that you provide with minimal waste. That means that they attempt to minimize the number of WUs that are assigned more than once (but they still have to reissue those which are lost). It also means that the number of WUs available is kept to a bare minimum, so when you hog more than one WU, somebody else may not be able to get something to work on.

You should always return any assignment you receive as promptly as is possible, given the limitations of your hardware.
Actually, I have 3 systems right now that are folding (total of 5 clients running).

As I mentioned before, I do plan on going to a 16-core or 32-core system probably within the next year or so, at which point I'd have to start up up to 8 clients at the same time. So if I were to bank, say, 10 WUs, the current projection estimates would mean either that it would be crunching up to 8 WUs every 22-25 hours (on average), or that it would be crunching 4 WUs every 11-12.5 hours.

Therefore, in the event of a network outage, that wouldn't stop my systems from folding. (I'm currently at 99% network uptime, averaging 4% packet loss.)
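For what it's worth, the arithmetic behind those figures can be sketched like this. It assumes every client takes the full quoted time per WU and that all clients run in parallel; `hours_to_drain` is a made-up helper, not part of any F@H tool.

```python
import math


def hours_to_drain(banked, clients, hours_per_wu):
    """Hours until the last banked WU is finished, with all clients running in parallel."""
    rounds = math.ceil(banked / clients)
    return rounds * hours_per_wu


for clients, (low, high) in [(8, (22, 25)), (4, (11, 12.5))]:
    print(clients, hours_to_drain(10, clients, low), hours_to_drain(10, clients, high))
# 8 44 50
# 4 33 37.5
```

So a bank of 10 WUs would cover roughly 33 to 50 hours of outage, and the last WUs in the bank would only be returned that long after they were downloaded.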

I would think that once you start getting into 64-core and 128-core folding, it'd be preferable to have one system that handles the data communications, rather than having the individual clients communicate with the F@H servers, since the inbound/outbound bandwidth is much less than the LAN bandwidth.

It's a thought. And I'm just thinking/planning ahead.