other than FAH work

Please confine these topics to things that would be of general interest to those who are interested in FAH which don't fall into any other category.

Moderator: Site Moderators

beer
Posts: 179
Joined: Tue Dec 13, 2011 11:18 am

other than FAH work

Post by beer »

As far as I know (please correct me if I am wrong), FAH uses almost nothing but the floating-point operations of the CPU. I am wondering whether anyone has tried running folding@home SMP alongside another scientific program that is not heavy on floating-point operations?
Napoleon
Posts: 887
Joined: Wed May 26, 2010 2:31 pm
Hardware configuration: Atom330 (overclocked):
Windows 7 Ultimate 64bit
Intel Atom330 dualcore (4 HyperThreads)
NVidia GT430, core_15 work
2x2GB Kingston KVR1333D3N9K2/4G 1333MHz memory kit
Asus AT3IONT-I Deluxe motherboard
Location: Finland

Re: other than FAH work

Post by Napoleon »

I haven't done so specifically for scientific apps, but I've been pleasantly surprised by how well HyperThreading does on my 2C/4T Atom330, allowing me to run my occasional ALU-intensive workloads concurrently with CPU folding. Not quite as good as having four real CPU cores, no, but I can get quite close in some specific real-life scenarios of mine. One recent example is scenario 7 in viewtopic.php?f=66&t=20407&start=30#p203413. CPU TPF increased roughly 13min => 17min and combined compression speed dropped about 800kBps => 600kBps, a far cry from the 13min => 26min and 800kBps => 400kBps I would expect without HT doing its magic. :D
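To put those scenario 7 figures on one scale, here is a minimal Python sketch. It only re-uses the numbers quoted above. The variable names and the "relative speed" measure are assumptions made for illustration, not anything taken from FAH or the linked post.

```python
# Figures quoted in the post (scenario 7 on a 2C/4T Atom330).
tpf_alone_min = 13.0      # FAH time per frame with folding running alone
tpf_shared_min = 17.0     # FAH time per frame with compression running too
comp_alone_kbps = 800.0   # compression speed running alone
comp_shared_kbps = 600.0  # compression speed with folding running too

# Frame time is inversely proportional to speed, so the relative folding
# speed is alone / shared. Compression speed is already a rate.
fah_relative = tpf_alone_min / tpf_shared_min
comp_relative = comp_shared_kbps / comp_alone_kbps

# Plain time-slicing (13 => 26 min, 800 => 400 kBps) would give 0.5 + 0.5.
observed_total = fah_relative + comp_relative
timesliced_total = 13.0 / 26.0 + 400.0 / 800.0

print(f"FAH relative speed:         {fah_relative:.2f}")
print(f"Compression relative speed: {comp_relative:.2f}")
print(f"Combined, observed:         {observed_total:.2f}")
print(f"Combined, time-sliced:      {timesliced_total:.2f}")
```

By this measure the two jobs together get roughly one and a half times the work done that plain time-slicing would allow.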

Then again, my scenario 7 was a nice symmetric workload, quite friendly for FAH and HT. Got to take another look at BOINC, for example, in case there is something similar available.
Win7 64bit, FAH v7, OC'd
2C/4T Atom330 3x667MHz - GT430 2x832.5MHz - ION iGPU 3x466.7MHz
NaCl - Core_15 - display
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: other than FAH work

Post by bruce »

beer wrote: As far as I know (please correct me if I am wrong), FAH uses almost nothing but the floating-point operations of the CPU. I am wondering whether anyone has tried running folding@home SMP alongside another scientific program that is not heavy on floating-point operations?
Napoleon has explained it very well. Here's some additional detail:

There are two limiting factors here. One is whether the FPU and the ALU can both be active at the same time and the other is whether the OS can manage your workload.

HyperThreading (or Bulldozer) gives the OS the capability of running two threads that have to share the same FPU. If both tasks need the FPU, they compete with each other and both slow down to about half speed. If only one needs the FPU and the other uses the ALU, both tasks can run at (almost) full speed.

Assume a quad-core with HT, which gives your OS 8 threads to work with. A) Run SMP8 and there's nothing free to run anything else. Add 4 ALU tasks and they'll have to compete for OS resources, slowing things down. B) Run SMP4 plus 4 other tasks that use just the ALU and (if the OS assigns them in the optimum order) it's possible that all 8 will run at "normal" speed. In one case, HT gives you no extra performance; in the other case, you get twice the throughput you would get without HT. That's why the advertisements for HT are very careful to use the words "depending on..." or "as much as..."

In fact, in an empty machine, SMP8 does give maybe 15% faster results than SMP4 but that's because no code uses ONLY the FPU. One FAH task uses maybe 90% of the FPU and uses the ALU the rest of the time. Another SMP thread can use the unused resources. My "about half speed" allows for that extra capacity.
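The reasoning above can be sketched as a toy model in Python. This is a deliberately crude assumption, not a description of how real HT hardware schedules work: each physical core is treated as one FPU plus one ALU shared by two threads, and each thread needs the FPU for a fixed share of its time and the ALU for the rest. The function name `pair_speed` and the 0.9 share (taken from the "maybe 90%" estimate) are illustrative only.

```python
def pair_speed(fpu_share_a, fpu_share_b):
    """Toy model: per-thread speed (1.0 = full speed) for two threads
    sharing one physical core with a single FPU and a single ALU.

    fpu_share_x is the fraction of time thread x needs the FPU;
    the remainder of its time is assumed to go to the ALU.
    """
    fpu_demand = fpu_share_a + fpu_share_b
    alu_demand = (1.0 - fpu_share_a) + (1.0 - fpu_share_b)
    # Whichever unit is most oversubscribed sets the slowdown.
    slowdown = max(1.0, fpu_demand, alu_demand)
    return 1.0 / slowdown


both_pure_fpu = pair_speed(1.0, 1.0)   # two FPU-only tasks
fpu_plus_alu = pair_speed(1.0, 0.0)    # one FPU-only, one ALU-only task
two_fah = pair_speed(0.9, 0.9)         # two FAH-like threads
fah_plus_alu = pair_speed(0.9, 0.0)    # FAH-like thread + ALU-only task

for label, speed in [
    ("FPU-only + FPU-only", both_pure_fpu),
    ("FPU-only + ALU-only", fpu_plus_alu),
    ("FAH-like + FAH-like", two_fah),
    ("FAH-like + ALU-only", fah_plus_alu),
]:
    print(f"{label}: {speed:.2f} per thread, {2 * speed:.2f} per core")
```

In this model two FAH-like threads each run at a bit over half speed, so a core gets slightly more than one thread's worth of work done, which is in the same ballpark as the SMP8 versus SMP4 gain mentioned above. Pairing a FAH-like thread with an ALU-only task leaves both at roughly 0.9.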
Napoleon
Posts: 887
Joined: Wed May 26, 2010 2:31 pm
Location: Finland

Re: other than FAH work

Post by Napoleon »

Unless I'm mistaken, distributed.net OGR is ALU-only. This particular math puzzle seems to have some interesting practical applications:
OGR's have many applications including sensor placements for X-ray crystallography and radio astronomy. Golomb rulers can also play a significant role in combinatorics, coding theory and communications, and Dr. Golomb was one of the first to analyze them for use in these areas.
X-ray crystallography is used in protein studies. Who knows, OGR might even benefit FAH indirectly. According to Wikipedia, one of the coordinators of distributed.net in its current form is a certain Adam L. Beberg. Hmm, why does the name Beberg sound vaguely familiar... :ewink:

For starters, I was surprised to see that running 4 OGR crunchers instead of just 2 gave me an over 1.5x performance boost. Then again, ALUs are much simpler than FPUs and HyperThreads have some resources duplicated, so I presume the ALU side of my 2C/4T gets closer to a true quad than the FPU side. Consider AMD Bulldozer, for example: the ALU side is a true octocore, but it actually has only 4 FPUs... Without further ado, let's have FAH and OGR duke it out on my 2C/4T CPU.

OGR only:
  • 28 Mnodes / s == 36 ms / Mnode (2 crunchers, 50% CPU)
  • 43 Mnodes / s == 23 ms / Mnode (4 crunchers, 100% CPU)
2x P6892 uniprocessor CPU WUs only:
  • 27min TPF (50% CPU)
2x P6892 + 2x OGR:
  • 29min TPF + 20 Mnodes / s == 50 ms / Mnode (50% uni + 48% OGR, 100% total)
2x P6892, P5770 and P7630:
  • 29min TPF (50% uni + 0.5% GPU2 + 3.5% GPU3)
2x P6892, P5770 and P7630 + 2x OGR:
  • 34min TPF + 17 Mnodes / s == 59 ms / Mnode (50% uni + 0.5% GPU2 + 3.5% GPU3 + 43% OGR, 100% total)
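The Mnodes / s and ms / Mnode figures in the list are two views of the same measurement. A minimal Python sketch of the conversion, using only the rates listed above (the function and dictionary names are made up for illustration):

```python
def ms_per_mnode(mnodes_per_s):
    """Convert an OGR rate in Mnodes/s into milliseconds per Mnode."""
    return 1000.0 / mnodes_per_s


# OGR rates from the list above, keyed by scenario.
ogr_rates = {
    "OGR only, 2 crunchers": 28.0,
    "OGR only, 4 crunchers": 43.0,
    "2x P6892 + 2x OGR": 20.0,
    "2x P6892 + 2x GPU + 2x OGR": 17.0,
}

for scenario, rate in ogr_rates.items():
    print(f"{scenario}: {rate:.0f} Mnodes/s == {ms_per_mnode(rate):.0f} ms/Mnode")

# Scaling from 2 to 4 crunchers, the "over 1.5x" boost mentioned earlier.
ht_scaling = ogr_rates["OGR only, 4 crunchers"] / ogr_rates["OGR only, 2 crunchers"]
print(f"4 vs 2 crunchers: {ht_scaling:.2f}x")
```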
I use Process Lasso to tweak priorities and affinities. Here are some further details and observations:
  • OGR is Low priority and running on cores 0 and 3 along with GPU cores, ensuring access to both physical ALUs
  • Uniprocessor slots are Above Normal priority and running on cores 1 and 2, ensuring access to both physical FPUs as well as minimal preemption from normal processes
  • GPU2 slot is High priority and running on core 0, preempting just about everything on it
  • GPU3 slot is High priority and running on core 3, preempting just about everything on it
  • GPU folding performance remains unaffected in all cases, no surprises there
  • GPU folding also requires some kernel time because it needs to access the GPU hardware through drivers
  • CPU kernel time produced by GPU folding seems to stick to cores 0 and 3 according to Task Manager graphs
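For anyone setting the same layout without Process Lasso, the core assignments above correspond to affinity bitmasks. The sketch below only does the mask arithmetic, assuming the usual convention that logical core n maps to bit n. It does not touch any process, and the names `affinity_mask` and `layout` are hypothetical.

```python
def affinity_mask(cores):
    """Build an affinity bitmask with bit n set for each logical core n."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


# Core assignments as described in the list above (4 logical cores, 0..3).
layout = {
    "OGR (Low)": [0, 3],
    "uniprocessor slots (Above Normal)": [1, 2],
    "GPU2 slot (High)": [0],
    "GPU3 slot (High)": [3],
}

masks = {name: affinity_mask(cores) for name, cores in layout.items()}
for name, mask in masks.items():
    print(f"{name}: cores {layout[name]} -> mask 0x{mask:X} ({mask:04b})")

# OGR and the uniprocessor slots never share a logical core,
# and between them they cover all four.
ogr = masks["OGR (Low)"]
uni = masks["uniprocessor slots (Above Normal)"]
print("overlap:", ogr & uni, "coverage:", f"{ogr | uni:04b}")
```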
HT is providing decent concurrency in the 2x P6892 + 2x OGR case. The CPU is fully utilized and P6892 frame times increase only about (29-27) / 27 * 100% == 7.4%. Since FAH is my priority charity, it's nice to see that the uniprocessor slots are going strong while OGR takes the bigger hit in milliseconds per Mnode, (50-36) / 36 * 100% == 39%. I don't quite understand why uniprocessor frame times increase from 29min to 34min when I run everything concurrently. Maybe Task Manager doesn't show every little detail after all, and with 2x uni + 2x GPU + 2x OGR there are bound to be frequent scheduling clashes on cores 0 and 3. OGR's 50 ==> 59 ms / Mnode increase is easily explained by the CPU overhead from GPU folding, though.
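The percentages in this paragraph all come from the same relative-increase formula. A small Python sketch that recomputes them from the measured values (the helper name `percent_increase` is made up for illustration):

```python
def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0


# (label, before, after) taken from the measurements in this post.
comparisons = [
    ("uni TPF, adding OGR (27 -> 29 min)", 27.0, 29.0),
    ("OGR ms/Mnode, adding uni (36 -> 50)", 36.0, 50.0),
    ("uni TPF, adding OGR to uni + GPU (29 -> 34 min)", 29.0, 34.0),
    ("OGR ms/Mnode, adding GPU folding (50 -> 59)", 50.0, 59.0),
]

results = {label: percent_increase(b, a) for label, b, a in comparisons}
for label, pct in results.items():
    print(f"{label}: +{pct:.1f}%")
```

Higher is worse for every row, since both TPF and ms / Mnode are times rather than rates.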

Conclusion: I'm going to stick with 2x uni + 2x GPU + 2x OGR. The roughly (34-29) / 29 * 100% == 17% increase in uniprocessor TPF introduced by adding OGR to the mix isn't that bad. Uniprocessor WUs have long deadlines anyway, and for some reason I never get any A4 WUs, so it's not like I'm losing any QRBs either.
Last edited by Napoleon on Mon Jan 23, 2012 6:59 pm, edited 5 times in total.
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Re: other than FAH work

Post by gwildperson »

Napoleon wrote:Unless I'm mistaken, distributed.net OGR is ALU-only, and they even have a 64bit client available.
I don't understand why you mentioned 64bit. If we're still talking about HT and sharing ALU-only with FPU-mostly code, it shouldn't matter whether it's 32bit or 64bit.

Perhaps you're responding to the discussions about the absence of a 64bit v7 client for F@h. If so, perhaps someone needs to remind you that F@h has a 64bit SMP core which works with the 32bit client and covers those bigadv cases where large amounts of RAM are needed.

Perhaps you meant something else.
Napoleon
Posts: 887
Joined: Wed May 26, 2010 2:31 pm
Location: Finland

Re: other than FAH work

Post by Napoleon »

Okay, I edited the offending sentence and corrected a typo or two. What I meant is that distributed.net has both 32bit and 64bit client versions available. All the rest is strictly about "HT and sharing ALU-only with FPU-mostly code", as you put it. My choice of words would have been "HT and running ALU-only code concurrently with FPU-mostly code".

Gee, I had no idea that merely mentioning 64bit in a single subordinate clause could make all the other sentences in my post appear to be off-topic. Better now? Bear with me, English isn't my native language. :roll:
Amaruk
Posts: 254
Joined: Fri Jun 20, 2008 3:57 am
Location: Watching from the Woods

Re: other than FAH work

Post by Amaruk »

Napoleon wrote:...English isn't my native language...
Trust me, your English is much better than that of many 'native' speakers... :wink: