Project: 3062 (Run 2, Clone 93, Gen 8)

Moderators: Site Moderators, FAHC Science Team

314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Project: 3062 (Run 2, Clone 93, Gen 8)

Post by 314159 »

While I must admit that the vast majority of SMP WUs assigned to my 20 SMP boxen complete normally, it CONTINUES to appall me that there appears to be absolutely NO error trapping in these.

Opinionated as I am :) I would also suggest that this be placed as a HIGH PRIORITY ISSUE by the Pande Group.
To me, it is inexcusable.

I might add that I have had the pleasure of running a "defective" WU three times with Comm error at step 80 plus (on another machine), received several subsequently that completed normally, and THEN WAS ASSIGNED THE IDENTICAL defective WU once again on that same machine.

At minimum, SOME form of communication should be originated in a 0Xwhatever case to at least indicate that I had not "dumped" an assigned WU.

Not a happy camper here. :(

(perhaps better described as frustrated and confused by the Project's apparent priorities, that at least to me do not properly address their #1 Asset - i.e. their VOLUNTEERS)


[04:10:23] + Results successfully sent
[04:10:23] Thank you for your contribution to Folding@Home.
[04:14:30] 
[04:14:30] *------------------------------*
[04:14:30] Folding@Home Gromacs SMP Core
[04:14:30] Version 1.74 (November 27, 2006)
[04:14:30] 
[04:14:30] Preparing to commence simulation
[04:14:30] - Ensuring status. Please wait.
[04:14:47] - Assembly optimizations manually forced on.
[04:14:47] - Not checking prior termination.
[04:14:47] - Expanded 609362 -> 3263133 (decompressed 535.4 percent)
[04:14:47] - Starting from initial work packet
[04:14:47] 
[04:14:47] Project: 3062 (Run 2, Clone 93, Gen 8)
[04:14:47] 
[04:14:47] Assembly optimizations on if available.
[04:14:47] Entering M.D.
[04:14:53] Rejecting checkpoint
[04:14:53] Protein: p3062_lambda5_99sbExtra SSE boost OK.
[04:14:53] 
[04:14:53] Extra SSE boost OK.
[04:14:53] Writing local files
[04:14:53] Completed 0 out of 5000000 steps  (0 percent)
[04:25:01] Writing local files
[04:25:01] Completed 50000 out of 5000000 steps  (1 percent)
[20:43:22] Writing local files
[20:43:22] Completed 4900000 out of 5000000 steps  (98 percent)
[20:53:29] Writing local files
[20:53:29] Completed 4950000 out of 5000000 steps  (99 percent)
[20:53:50] CoreStatus = 1 (1)
[20:53:50] Client-core communications error: ERROR 0x1
[20:53:50] Deleting current work unit & continuing...
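
For anyone watching several clients at once, a log like the one above can be scanned automatically for the work unit, the last completed step, and any error code. This is a minimal sketch, assuming the line formats match this excerpt; the regular expressions and the function name `summarize_log` are hypothetical and not part of any FAH tool.

```python
import re

# Patterns inferred from the log excerpt above, not from an official format spec.
UNIT = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
ERROR = re.compile(r"Client-core communications error: ERROR (0x[0-9a-fA-F]+)")


def summarize_log(lines):
    """Return the last work unit seen, its last completed step, and any error code."""
    summary = {"unit": None, "steps_done": 0, "steps_total": 0, "error": None}
    for line in lines:
        unit = UNIT.search(line)
        progress = PROGRESS.search(line)
        error = ERROR.search(line)
        if unit:
            # A new work unit starts: reset progress and error state.
            summary = {
                "unit": tuple(int(g) for g in unit.groups()),
                "steps_done": 0,
                "steps_total": 0,
                "error": None,
            }
        elif progress:
            summary["steps_done"] = int(progress.group(1))
            summary["steps_total"] = int(progress.group(2))
        elif error:
            summary["error"] = error.group(1)
    return summary
```

Called on the lines of FAHlog.txt, this would show at a glance that the unit died after step 4950000 with ERROR 0x1 rather than having been dumped by the donor.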
John (from the central part of the Commonwealth of Virginia, U.S.A.)

A friendly visitor to what hopefully will remain a friendly Forum.
With thanks to all of the dedicated volunteers on the staff here!!
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 3026 (2,93,8)

Post by bruce »

Yes, when there's an error, the MPI services die and there's nothing to trap or to recover.

I don't suppose you've tried the error trapping in the v5.92 beta. Many things that fail in 5.91 do not fail in 5.92.

If you've got a copy of that WU, send it to me and I'll see what it does in 5.92 -- or you can try it yourself.
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: Project 3026 (2,93,8)

Post by 314159 »

Thank you for the comments Bruce.
Rumor has it that you share at least some of my "general" concerns. 8-)
Wow! that's unusual, if true! :)

You are aware that I am an "individual" folder and not one of our 800 lb. Gorilla Corporate friends.
Monitoring and administrating a modest (plus) and reasonably productive "farm", given my personal situation, is quite difficult.

The Pande Group has apparently elected not to cooperate, in a manner that many feel is mandatory, with our talented and highly respected third party developers of monitoring software. Pande does not provide these tools in any usable form - the queue command in some clients is an anachronism, at least IMHO, an opinion which I might mention is shared by the majority of folders with more than a "few" dedicated machines with whom I maintain contact. No one uses it anyway. :ewink:

Without the application "InCrease" (Mac OSX), I would probably not be in a position to participate in the project at all. It runs on my primary server and monitors virtually everything (OSX, WIN, mainly Linux - oops, not the PS3's).

I ramble but the fact is:
I have elected to run the same clients on each class of WU for now, be they Win, Linux, or PS3.
(haven't tried GPU due to potential heat problem that has existed for some time (now keep ALL rooms at constant 78F MAX) - & would you believe that the PS3's are located on an attractive 6 foot chrome rack in our Master Bedroom due to the potential but non-existent heat issues?) :eugeek:

I have absolutely no interest in running a client other than a console version.
(Can you tell that I did not take the time to look at the version that you recommended? - I am using the current Linux Console.) :)
The simple fact is that with my programming experience, I consider the present SMP Beta to be an Alpha release, BUT under no circumstances should anyone take this comment wrong. I understand the reason for it and have no problems with the approach other than as indicated elsewhere. As mentioned earlier, it is relatively dependable - BUT incomplete.

I want to log a formal protest:
1. Error trapping - why not save a checkpoint (i.e. two vs. one) and have the client attempt continuation from the former point if the error is as you described?
Note that I have baby-sat MANY 0x? SMP's to completion on their second run simply by backing up, exiting client, restarting client (several times). 100% success to date. If that additional checkpoint had been available, I would bet that the stack error or whatever it was that caused the 0x? would NOT have occurred. This technique might be something that could be applied to all classes of WUs besides the SMPs (perhaps not the PS3's). Coding would be trivial.

2. At minimum, modify the client to report the 0x? back to our friends at Stanford AND, instead of forcing our folks to rerun the same WU once again, assign a new one. I do NOT (Duh) have access to any data that would put me in a position to opine on the overall effect of doing this, but intuitively, I think that it would result in a better % of WU returns. The only downside that I can see is that the particular R/C/G would be delayed until someone completed it. It would, however, have been reported back to Pande as a 0x?.
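
The two-checkpoint idea in item 1 can be sketched as follows. This is only an illustration under stated assumptions: the class `TwoSlotCheckpoint`, the function `run_with_rollback`, and the file names are hypothetical, the state is a plain dictionary rather than a real simulation, and the failure is assumed to surface as a catchable exception (which may not hold when the MPI services die outright).

```python
import os
import pickle


class TwoSlotCheckpoint:
    """Keep the newest checkpoint plus the one before it (hypothetical helper)."""

    def __init__(self, directory):
        self.current = os.path.join(directory, "checkpoint.cur")
        self.previous = os.path.join(directory, "checkpoint.prev")

    def save(self, state):
        # Write to a temporary file first so a crash mid-write cannot corrupt a slot.
        tmp = self.current + ".tmp"
        with open(tmp, "wb") as fh:
            pickle.dump(state, fh)
        if os.path.exists(self.current):
            os.replace(self.current, self.previous)  # rotate newest -> older slot
        os.replace(tmp, self.current)

    def load(self, fall_back=False):
        path = self.previous if fall_back else self.current
        with open(path, "rb") as fh:
            return pickle.load(fh)


def run_with_rollback(store, step_fn, total_steps, interval, max_retries=3):
    """Advance one step at a time; on an error, resume from the older checkpoint."""
    state = {"step": 0}
    store.save(state)
    retries = 0
    while state["step"] < total_steps:
        try:
            state = step_fn(state)
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise  # give up and let the caller report the failure
            # Prefer the older slot; fall back to the only slot if none exists yet.
            state = store.load(fall_back=os.path.exists(store.previous))
            continue
        if state["step"] % interval == 0:
            store.save(state)
    return state
```

The design choice here is to roll back one checkpoint further than the latest, on the guess that the most recent state may already carry whatever triggered the error. Whether that guess holds for real work units is exactly what the manual restarts described above would suggest but not prove.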

I may write to you privately about another issue that I consider absolutely critical but not yet appropriate for inclusion in a public Forum.

In ending, I fall on my sword. When this 0x? occurred at frame 99, I (ACCIDENTALLY) deleted my first SMP WU.
I hope that Dr. Pande will understand this given my advanced age, tenure with the project, and EXTREME frustration. :oops:
Last edited by 314159 on Wed Apr 02, 2008 8:00 am, edited 2 times in total.
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 3062 (Run 2, Clone 93, Gen 8)

Post by Ivoshiee »

Generally these features are good to have. The 0x0 reporting will require major server side changes. Multiple checkpoints will require a major policy shift - data integrity vs user experience.
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: Project: 3062 (Run 2, Clone 93, Gen 8)

Post by 314159 »

If Ivo sees this (Hi!): it is my personal opinion that the modifications on the server side would be mildly cumbersome. An experienced programmer might require a week to code and thoroughly test them.

On the client side, I contend that the changes would be substantially easier and solve many problems that we have all had to contend with for ages.
I have the Gromacs (not SMP) source here locally.
I stick with the concept that the overall science produced would be enhanced BUT of course cannot prove this without the proper dataset.
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: Project: 3062 (Run 2, Clone 93, Gen 8)

Post by 314159 »

# SMP Client ##################################################################
###############################################################################

Folding@Home Client Version 6.01beta2

http://folding.stanford.edu

###############################################################################
###############################################################################

FYI, I run NO Windows SMP clients. That's probably why your 5.x reference confused me.

Linux rules! :D

John
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 3062 (Run 2, Clone 93, Gen 8)

Post by Ivoshiee »

314159 wrote: If Ivo sees this (Hi!): it is my personal opinion that the modifications on the server side would be mildly cumbersome. An experienced programmer might require a week to code and thoroughly test them.

On the client side, I contend that the changes would be substantially easier and solve many problems that we have all had to contend with for ages.
I have the Gromacs (not SMP) source here locally.
I stick with the concept that the overall science produced would be enhanced BUT of course cannot prove this without the proper dataset.
Multiple checkpoints have been a topic for years, but I do not recall reading why they are not in use. The same goes for "0x1" reporting.

For me personally these issues are not that important at the moment, because as long as the computer is still usable to the users who need to use that computer, the FAH client can "EUE" or "0x1" as much as it likes (*).
(*) Only if it is a software issue. If it is caused by the hardware then I am pretty upset and try to fix the box, but the FAH client is solely in the Pande Group's domain. I'll report the issues, try to find out why it is doing what it is doing and try to propose ("obvious") fixes or workarounds.

FAH development is somewhat different from what you may be familiar with. There is no single FAH client entity that is used by the project scientists. In fact there are several branches of the code base being hacked away at by whoever needs one for his/her project at the moment. Almost all coding is for one sole purpose - to get MY project going and finished as soon as possible. All the rest is not (that) important and may come (much) later, if ever. A random donor probably does not see the project that way, but that is my experience. What to do then? You have to sell the idea you are promoting to the public at large, to the mods and, more importantly, to the Pande Group. Try to establish a direct connection with the Pande Group members; it may prove fruitful. Good luck. And after a couple of years of pushing you'll have the feature you seek.
314159
Posts: 232
Joined: Sun Dec 02, 2007 2:46 am
Location: http://www.teammacosx.org/

Re: Project: 3062 (Run 2, Clone 93, Gen 8)

Post by 314159 »

Ivoshiee wrote:Multiple checkpoints have been a topic for years, but I do not recall reading why they are not in use. The same goes for "0x1" reporting.
Ok my friend. Can you come up with just one logical reason why neither has been implemented?
Can you identify one reason why either would not be in the BEST interest of the project, and its #1 ASSET - the VOLUNTEERS?
You know how to code. I do also, but sold my business years ago. That was back in the assembler days. We NEVER had a customer complaint (or code bug) and sold our products mainly to MAJOR U.S. and Foreign Universities and Corporations. Wish that I could still code like that. Age and technological advances just caught up on me. :D
Ivoshiee wrote:For me personally these issues are not that important at the moment, because as long as the computer is still usable to the users who need to use that computer, the FAH client can "EUE" or "0x1" as much as it likes (*).
All but 7 of my machines are TOTALLY dedicated to folding. They do ABSOLUTELY nothing else.
Our perspectives may be different.
I fold mainly in remembrance of my father who was diagnosed with a nasty disease back in the G@H days, and passed away recently.
Ivoshiee wrote:(*) Only if it is a software issue. If it is caused by the hardware then I am pretty upset and try to fix the box, but the FAH client is solely in the Pande Group's domain. I'll report the issues, try to find out why it is doing what it is doing and try to propose ("obvious") fixes or workarounds.
The issues that I am trying to address are software issues as you know. As I understand it, many people running Quads or better are running multiple instances. To eliminate possible problems created by what is cobbled (but pretty good) TRUE Beta Software, I run only one instance per Quad.

I purchased ALL of my equipment from personal funds, gave virtually everyone in my family computers for Christmas presents, birthdays, etc. - AND suspect that my emotions sometimes influence my posts when I see a potential alternative or better solution. This Forum is one that I refer to as a "love and kisses" Forum since issues involving the slightest bit of controversy do not appear, and if they do, they quickly disappear.

I am not certain that the Pande Group truly understands, or cares, what its "temporary" volunteers (in many cases) think, although I have the highest respect for Dr. Pande and the basic research.

In numerous communications over the years, I have heard the comment that Scientists are not necessarily well schooled in public relations. :D
Dr. Pande's news has been a major improvement and I applaud his effort.
Ivoshiee wrote:FAH development is somewhat different from what you may be familiar with. There is no single FAH client entity that is used by the project scientists.
Disagree here but only from my experience. There are certainly a number of clients - Win, OSX, Linux, flavors of same for SMP, etc. I have never received, and to my knowledge the project does not have the capability of sending me a "new" client to correspond to a new WU assignment. I run all WU types except GPU and WIN SMP in my little farm.

Cores are a separate issue BUT my excellent third party monitoring app displays both the Core version and gives me ready access to FAHlog.txt where download of a new core is obvious.

Educate me.
Ivoshiee wrote: In fact there are several branches of the code base being hacked away at by whoever needs one for his/her project at the moment. Almost all coding is for one sole purpose - to get MY project going and finished as soon as possible. All the rest is not (that) important and may come (much) later, if ever.
Agree to some extent but strictly related to the particular WU class. I assume that you are addressing the actual WU in your comment.
Ivoshiee wrote: A random donor probably does not see the project that way, but that is my experience. What to do then? You have to sell the idea you are promoting to the public at large, to the mods and, more importantly, to the Pande Group. Try to establish a direct connection with the Pande Group members; it may prove fruitful. Good luck. And after a couple of years of pushing you'll have the feature you seek.
Your last sentence is the one that upsets me constantly. I have spent a LOT of time giving presentations to whomever will listen concerning joining the project. I hang around at the local Apple Store passing out cards and explaining the project. I do not recruit for my Team Association, but for the project as a whole.

In addition, prior to my illness, I had been VERY active in recruiting for my specific Team.

I had the pleasure of making many friends in the process. I was alarmed by the attrition rate, and when I asked, I was told several different stories which I need not post here. The major reason for leaving was, however, frustration with some uncontrollable aspect of the project - mainly 0x0's, being relegated to what they perceived as a second or third class folder due to PUBLIC STATEMENTS, and obsolescence of new computers in less than 1 1/2 years (for those folding for points). I am sure that you get the picture. To me, it is sad, but due to my reasons for participation as stated above, I will continue and do what I can to alert those "in power" to the situation.

Got some free time? Look at the attrition rates in the top 10 or 20 Teams. As a previous Senior Bank Executive, if we had lost that magnitude of customers, some heads would have rolled. We never did because we had a department exclusively dedicated to customer development and retention. That, IMHO, is lacking here.

I was so pleased to see your post in this thread Ivo. I have very high respect for you and your contributions. You also opened up an opportunity to address some issues separate from my post.

Other comments and opinions are totally welcomed.

If I am wrong, tell me so (nicely, please - I am an old guy) :)
As an old guy, be aware that I am a MAJOR proponent of the K.I.S.S. principle, a concept that occasionally escapes the Pande folks, at least IMHO.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project: 3062 (Run 2, Clone 93, Gen 8)

Post by VijayPande »

There are two key issues addressed here. One is related to a server code bug: we've been looking into a server bug with certain SMP WU's over the last week and pressing hard on it. I think we've made some progress yesterday and last night and we've been testing the new code this morning. I'll give an update when we know more. Peter has put a lot of time on this, but to speed the process, I've been working on it personally at night.

The other is what Bruce mentions regarding MPI and its limits on error reporting.

As for KISS, I'm a big fan of KISS as well. However, I'm also a big fan of doing cutting edge science (if we weren't interested in doing that, FAH wouldn't exist). We were pretty bleeding edge when we started FAH and we've been constantly pushing. A year ago, I set an internal stop on any new clients until we push out our existing issues (GPU2 and SMP). We've been making alternating pushes on the GPU and SMP, making progress on both albeit slowly (it's been tough). We'll continue to refine these clients before any new ones come out. GPU2 is looking like it's shaping up. SMP will (surprisingly in some ways) be much harder, mainly due to MPI libraries on Windows (although Peter's recent Deino builds are looking promising).

I agree we can't always be pushing, both for donors (and the needed client stability) and just for my group's sanity -- we put lots and lots of time into clients, especially over the last year -- and it's very challenging work. I look forward to GPU2 looking like it's stabilizing and then hopefully nailing down SMP/Windows (SMP Linux and SMP OSX look much more robust even now).
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project 3026 (2,93,8)

Post by VijayPande »

PS Some more specific comments
314159 wrote:
I want to log a formal protest:
1. Error trapping - why not save a checkpoint (i.e. two vs. one) and have the client attempt continuation from the former point if the error is as you described?
Note that I have baby-sat MANY 0x? SMP's to completion on their second run simply by backing up, exiting client, restarting client (several times). 100% success to date. If that additional checkpoint had been available, I would bet that the stack error or whatever it was that caused the 0x? would NOT have occurred. This technique might be something that could be applied to all classes of WUs besides the SMPs (perhaps not the PS3's). Coding would be trivial.
This has come up in our dev discussions. However, getting SMP/Windows to run more stably has taken precedence here. We've been putting dev time into this, since there are so many potential SMP/Windows clients out there, but we can't tap them until SMP/Windows becomes more stable. If you feel like this issue is as significant in donor base and donor interest as a stable SMP/Windows client, let me know and we can consider rearranging priorities.
314159 wrote:
2. At minimum, modify the client to report the 0x? back to our friends at Stanford AND, instead of forcing our folks to rerun the same WU once again, assign a new one. I do NOT (Duh) have access to any data that would put me in a position to opine on the overall effect of doing this, but intuitively, I think that it would result in a better % of WU returns. The only downside that I can see is that the particular R/C/G would be delayed until someone completed it. It would, however, have been reported back to Pande as a 0x?.
While I see your argument here from the donor perspective, this turns out to be a very, very bad idea from the science-side. We used to do this and what would happen is that unstable machines would halt a project, by marking large fractions of trajectories as having bad WU's. Even losing 20% can have a big impact on what we want to do at times. Perhaps we need to have better communication and education about this issue so donors see this point of view? We've thought about doing something more sophisticated, where we assess whether a given machine is a "trouble maker" and based on this stop resending a WU earlier. However, I've usually nix'd this on KISS concerns, but I'm curious to reopen this discussion.
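
One way to picture the "trouble maker" assessment mentioned above is the sketch below. It is purely illustrative and is not how the FAH servers work: the class `ResendPolicy`, its thresholds, and the machine and work unit identifiers are all hypothetical. The idea is that a failure only counts against a work unit when it comes from a machine with an established good record, so unstable machines cannot mark large fractions of trajectories as bad.

```python
from collections import defaultdict


class ResendPolicy:
    """Hypothetical server-side bookkeeping; every threshold here is illustrative."""

    def __init__(self, max_failure_rate=0.2, min_history=5, failures_to_retire=3):
        self.max_failure_rate = max_failure_rate
        self.min_history = min_history
        self.failures_to_retire = failures_to_retire
        self.results = defaultdict(lambda: [0, 0])  # machine -> [completed, failed]
        self.trusted_failures = defaultdict(set)    # unit -> trusted machines that failed it

    def is_trusted(self, machine):
        completed, failed = self.results[machine]
        total = completed + failed
        if total < self.min_history:
            return False  # too little history to judge either way
        return failed / total <= self.max_failure_rate

    def record(self, machine, unit, ok):
        # Judge the machine on its record before this result is added.
        trusted = self.is_trusted(machine)
        self.results[machine][0 if ok else 1] += 1
        if not ok and trusted:
            self.trusted_failures[unit].add(machine)

    def should_resend(self, unit):
        # Keep resending until enough distinct trusted machines have failed the unit.
        return len(self.trusted_failures[unit]) < self.failures_to_retire
```

With these example settings, a unit is retired only after three different reliable machines fail it, while any number of failures from an unstable box leaves it in circulation. The cost is the extra state and tuning that the KISS concern above points at.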

Finally, we've completely revamped our server code, which got to be quite a mess over the years. It's written from scratch and should allow us to move more quickly on server code mods and bugs in the future. The code is still not quite done, but will likely go into testing in ~2 months.
314159 wrote:
I hope that Dr. Pande will understand this given my advanced age, tenure with the project, and EXTREME frustration. :oops:
Thanks for letting me know about these issues. I'm sorry for the frustration for you (and other donors). My team is literally working day and night on FAH and I've even started to hire contract professional programmers to help (e.g. the new server code is contract code), but all of this takes time. Hopefully with a limit on new clients, GPU2 looking promising, and SMP/Win shaping up, we'll have more time to address these other serious concerns.
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project 3026 (2,93,8)

Post by 7im »

314159 wrote:...
The Pande Group has apparently elected not to cooperate in a manner, that many feel is mandatory...
IMO, you could help your case by asking these many people to express their "mandatory" concerns on this forum in addition to wherever they had expressed them to you.

And as you have seen from Vijay's comments, a lack of change in one area is not the same as electing to ignore those issues. Other areas have taken a higher priority, and not all improvements are readily apparent.

Helpful feedback or constructive criticism is always welcome.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.