Testing domain decomposition for high CPU counts

Moderators: Site Moderators, FAHC Science Team

Re: Testing domain decomposition for high CPU counts

Postby _r2w_ben » Mon Apr 20, 2020 10:54 pm

Data for two 5x5x4 projects. Thread support is the same except for two data points: only p13851 can use 85 threads, while only p16422 can use 78 threads.

p13851 - max 5x5x4 - PME load 0.19
Code: Select all
  2 = 2x1x1
  3 = 3x1x1
  4 = 4x1x1
  5 = 5x1x1
  6 = 3x2x1
  8 = 4x2x1
  9 = 3x3x1
 10 = 5x2x1
 12 = 4x3x1
 15 = 5x3x1
 16 = 4x4x1
 18 = 2x3x3
 20 = 4x4x1  16 +  4 PME
 21 = 4x4x1  16 +  5 PME
 24 = 3x2x3  18 +  6 PME
 25 = 5x4x1  20 +  5 PME
 27 = 2x3x3  18 +  9 PME
 28 = 5x4x1  20 +  8 PME
 30 = 3x4x2  24 +  6 PME
 32 = 3x4x2  24 +  8 PME
 35 = 5x5x1  25 + 10 PME
 36 = 3x3x3  27 +  9 PME
 40 = 4x4x2  32 +  8 PME
 42 = 4x4x2  32 + 10 PME
 44 = 3x4x3  36 +  8 PME
 45 = 4x3x3  36 +  9 PME
 48 = 4x3x3  36 + 12 PME
 50 = 4x5x2  40 + 10 PME
 52 = 5x4x2  40 + 12 PME
 54 = 3x3x4  36 + 18 PME
 55 = 3x5x3  45 + 10 PME
 56 = 5x4x2  40 + 16 PME
 60 = 4x3x4  48 + 12 PME
 64 = 4x4x3  48 + 16 PME
 65 = 5x5x2  50 + 15 PME
 75 = 5x3x4  60 + 15 PME
 80 = 4x4x4  64 + 16 PME
 85 = 5x4x3  60 + 25 PME
 95 = 5x5x3  75 + 20 PME
 98 = 4x5x4  80 + 18 PME
100 = 5x4x4  80 + 20 PME
125 = 5x5x4 100 + 25 PME


p16422 - max 5x5x4 - PME load 0.18
Code: Select all
  2 = 2x1x1
  3 = 3x1x1
  4 = 4x1x1
  5 = 5x1x1
  6 = 3x2x1
  8 = 4x2x1
  9 = 3x3x1
 10 = 5x2x1
 12 = 4x3x1
 15 = 5x3x1
 16 = 4x4x1
 18 = 3x3x2
 20 = 4x4x1  16 +  4 PME
 21 = 4x4x1  16 +  5 PME
 24 = 3x2x3  18 +  6 PME
 25 = 5x4x1  20 +  5 PME
 27 = 3x3x2  18 +  9 PME
 28 = 4x5x1  20 +  8 PME
 30 = 4x3x2  24 +  6 PME
 32 = 4x2x3  24 +  8 PME
 35 = 5x5x1  25 + 10 PME
 36 = 3x3x3  27 +  9 PME
 40 = 4x4x2  32 +  8 PME
 42 = 4x4x2  32 + 10 PME
 44 = 4x3x3  36 +  8 PME
 45 = 3x3x4  36 +  9 PME
 48 = 4x3x3  36 + 12 PME
 50 = 5x4x2  40 + 10 PME
 52 = 5x4x2  40 + 12 PME
 54 = 3x3x4  36 + 18 PME
 55 = 5x3x3  45 + 10 PME
 56 = 4x5x2  40 + 16 PME
 60 = 4x3x4  48 + 12 PME
 64 = 4x4x3  48 + 16 PME
 65 = 5x5x2  50 + 15 PME
 75 = 5x3x4  60 + 15 PME
 78 = 4x4x4  64 + 14 PME
 80 = 4x4x4  64 + 16 PME
 95 = 5x5x3  75 + 20 PME
 98 = 4x5x4  80 + 18 PME
100 = 5x4x4  80 + 20 PME
125 = 5x5x4 100 + 25 PME
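The listings above follow a simple feasibility rule: a thread count can run as pure PP if it factors into an x*y*z grid that fits the project's maximum box, and larger counts work by handing the remainder to PME ranks. A minimal sketch of that rule (an assumed simplification for illustration only; the real GROMACS selection also weighs PME load, cell sizes, and communication cost):

```python
# Simplified sketch of domain decomposition feasibility.
# NOT the actual GROMACS heuristics.

def pp_grid(n, max_box):
    """Return an (x, y, z) grid with x*y*z == n that fits within max_box,
    or None if no such grid exists."""
    mx, my, mz = max_box
    for x in range(1, mx + 1):
        for y in range(1, my + 1):
            if n % (x * y) == 0 and n // (x * y) <= mz:
                return (x, y, n // (x * y))
    return None

# p13851's maximum box is 5x5x4 (from the listing above).
MAX_BOX = (5, 5, 4)

# 125 threads cannot run as pure PP (5x5x5 exceeds the z limit of 4)...
print(pp_grid(125, MAX_BOX))   # None
# ...but 100 PP ranks fit, so 125 can run as 100 PP + 25 PME.
print(pp_grid(100, MAX_BOX))   # (5, 5, 4)
```

This reproduces the shape of the entries above, though the actual PP/PME split GROMACS picks depends on the project's PME load.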


All data for 1-128 threads: a 1 indicates success, while a blank indicates failure. The previously unknown project has the same box and PME load as p14542 and has been relabelled.
Code: Select all
# Threads p14336   p14365   p16501   p14542  p13832  p14378  p13851  p16422  p16423  p14574  p14576
Box       18x18x18 13x13x13 11x11x10 6x6x5   6x6x5   5x5x5   5x5x4   5x5x4   4x4x4   4x4x3   4x4x3
PME       0.22     0.36     0.1      0.2     0.19    0.37    0.19    0.18    0.19    0.18    0.17
  1       1        1        1        1       1       1       1       1       1       1       1
  2       1        1        1        1       1       1       1       1       1       1       1
  3       1        1        1        1       1       1       1       1       1       1       1
  4       1        1        1        1       1       1       1       1       1       1       1
  5       1        1        1        1       1       1       1       1                       
  6       1        1        1        1       1       1       1       1       1       1       1
  7       1        1        1                                                                 
  8       1        1        1        1       1       1       1       1       1       1       1
  9       1        1        1        1       1       1       1       1       1       1       1
 10       1        1        1        1       1       1       1       1                       
 11       1        1        1                                                                 
 12       1        1        1        1       1       1       1       1       1       1       1
 13                                                                                           
 14                                                                                           
 15       1        1        1        1       1       1       1       1                       
 16       1        1        1        1       1       1       1       1       1       1       1
 17                                                                                           
 18       1        1        1        1       1       1       1       1       1       1       1
 19                                                                                           
 20       1        1        1        1       1       1       1       1       1       1       1
 21       1        1        1        1       1       1       1       1       1       1       1
 22                                                                                           
 23                                                                                           
 24       1        1        1        1       1       1       1       1       1       1       
 25       1        1        1        1       1       1       1       1                       
 26                                                                                           
 27       1        1        1        1       1       1       1       1       1       1       1
 28       1        1        1        1       1       1       1       1                       
 29                                                                                           
 30       1        1        1        1       1       1       1       1       1       1       
 31                                                                                           
 32       1        1        1        1       1       1       1       1       1       1       1
 33                                                                                           
 34                                                                                           
 35       1        1        1        1       1       1       1       1                       
 36       1        1        1        1       1       1       1       1       1       1       
 37                                                                                           
 38                                                                                           
 39                                                                                           
 40       1        1        1        1       1       1       1       1       1       1       1
 41                                                                                           
 42       1        1        1        1       1       1       1       1       1       1       1
 43                                                                                           
 44       1        1        1        1       1       1       1       1       1       1       1
 45       1        1        1        1       1       1       1       1       1       1       1
 46                                                                                           
 47                                                                                           
 48       1        1        1        1       1       1       1       1       1       1       
 49                                                                                           
 50       1        1        1        1       1       1       1       1                       
 51                                                                                           
 52       1        1        1        1       1       1       1       1                       
 53                                                                                           
 54       1        1        1        1       1       1       1       1       1       1       
 55       1        1        1        1       1       1       1       1                       
 56       1        1        1        1       1       1       1       1                       
 57                                                                                           
 58                                                                                           
 59                                                                                           
 60       1        1        1        1       1       1       1       1       1       1       
 61                                                                                           
 62                                                                                           
 63       1        1        1                        1                                       
 64       1        1        1        1       1       1       1       1       1       1       1
 65       1        1        1        1       1       1       1       1                       
 66       1        1        1        1       1       1                                       
 67                                                                                           
 68                                                                                           
 69                                                                                           
 70       1        1        1                        1                                       
 71                                                                                           
 72       1        1        1        1       1       1                                       
 73                                                                                           
 74                                                                                           
 75       1        1        1        1       1       1       1       1                       
 76                                                                                           
 77       1        1        1                                                                 
 78       1        1        1                1       1               1                       
 79                                                                                           
 80       1        1        1        1       1       1       1       1       1               
 81       1        1        1        1       1       1                                       
 82                                                                                           
 83                                                                                           
 84       1        1        1                                                                 
 85       1        1        1        1       1       1       1                               
 86                                                                                           
 87                                                                                           
 88       1        1        1                1                                               
 89                                                                                           
 90       1        1        1        1       1                                               
 91       1        1        1                                                                 
 92                                                                                           
 93                                                                                           
 94                                                                                           
 95       1        1        1        1       1       1       1       1                       
 96       1        1        1        1       1       1                                       
 97                                                                                           
 98       1        1        1                1               1       1                       
 99       1        1        1        1                                                       
100       1        1        1        1       1       1       1       1                       
101                                                                                           
102       1        1        1                                                                 
103                                                                                           
104       1        1        1                        1                                       
105       1        1        1                                                                 
106                                                                                           
107                                                                                           
108       1        1        1                        1                                       
109                                                                                           
110       1        1        1        1       1                                               
111                                                                                           
112       1        1        1                                                                 
113                                                                                           
114       1        1        1        1       1                                               
115       1        1        1        1       1                                               
116                                                                                           
117       1        1        1        1       1       1                                       
118                                                                                           
119       1        1        1                                                                 
120       1        1        1        1       1       1                                       
121                                                                                           
122                                                                                           
123                                                                                           
124                                                                                           
125       1        1        1        1       1       1       1       1                       
126       1        1        1                                                                 
127                                                                                           
128       1        1        1        1       1       1                                       
_r2w_ben
 
Posts: 278
Joined: Wed Apr 23, 2008 4:11 pm

Re: Testing domain decomposition for high CPU counts

Postby uyaem » Mon Apr 20, 2020 11:38 pm

Neil-B wrote:Currently for 7 cpu using a single 6 slot is probably better for the science than a 4 slot and a 3 slot .. for 11 cpu an 8 slot and a 3 slot would most likely be best for the science .. but you are free to make own choices :)

I switched from 1x12 + 1x8 to 1x21. Yes, one extra core, but +10% PPD and lower CPU usage under Windows. The scheduler plays a role too :)
CPU: Ryzen 9 3900X (1x21 CPUs) ~ GPU: nVidia GeForce GTX 1660 Super (Asus)
uyaem
 
Posts: 222
Joined: Sat Mar 21, 2020 8:35 pm
Location: Esslingen, Germany

Re: Testing domain decomposition for high CPU counts

Postby _r2w_ben » Tue Apr 21, 2020 12:10 am

uyaem wrote:
Neil-B wrote:Currently for 7 cpu using a single 6 slot is probably better for the science than a 4 slot and a 3 slot .. for 11 cpu an 8 slot and a 3 slot would most likely be best for the science .. but you are free to make own choices :)

I switched from 1x12 + 1x8 to 1x21. Yes, one extra core, but +10% PPD and lower CPU usage under Windows. The scheduler plays a role too :)

Those are interesting observations. Thanks for sharing!

QRB would give part of the 10%. Having all cores working on the same task might help a bit with L1D/L2/L3 cache hits. Does the lower CPU usage translate into a slightly higher average clock speed?

Re: Testing domain decomposition for high CPU counts

Postby Neil-B » Tue Apr 21, 2020 8:08 am

I'd be interested to see how the 21core slot works for stability over time being a multiple of seven ... traditional thinking would be to use an 18core ... maybe with a second 4core in your case
1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent, Quadro K420 1GB, FAH 7.6.13
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro, Quadro M1000M 2GB, FAH 7.6.13
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro, GTX 750Ti 2GB, FAH 7.6.13
Neil-B
 
Posts: 1339
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: Testing domain decomposition for high CPU counts

Postby muziqaz » Tue Apr 21, 2020 10:40 am

Neil-B wrote:I'd be interested to see how the 21core slot works for stability over time being a multiple of seven ... traditional thinking would be to use an 18core ... maybe with a second 4core in your case


21 threads has been the single most stable setup ;)
I had a couple of 20-thread failures, and a lot of 10- and 15-thread failures
muziqaz
 
Posts: 684
Joined: Sun Dec 16, 2007 7:22 pm
Location: London

Re: Testing domain decomposition for high CPU counts

Postby muziqaz » Tue Apr 21, 2020 10:45 am

By the way, if you guys are doing these types of tests, to save yourselves some time, skip everything below 24 threads, as that range is tested by researchers before it hits public. Everything above 24 threads is most welcome :)

Re: Testing domain decomposition for high CPU counts

Postby Neil-B » Tue Apr 21, 2020 11:13 am

The "conventional wisdom" (which may well be less than wise) is that F@H has difficulty with large primes and their multiples when core counts are set, where 5 is sometimes large and where 7 and larger primes are "always" large … This is a very simplified approach, however, and doesn't mean the larger counts will always fail - just that they are more "at risk" … I have wondered whether the number of factors available to a core count plays into this - if a number has a large prime factor but also a number of smaller non-large prime factors available, is it less likely to fail? … _r2w_ben is expanding the general knowledge with his research on what actually happens in the GROMACS code … even so, the traditional set of "safe" core counts would be 1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 24, 27, 32 … with 5, 10, 15, 20, 25 as "possibles" … and the rest riskier.
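The heuristic above can be made concrete: a decomposition grid can only use factors of the thread count, so its largest prime factor is a quick proxy for risk. A rough classifier in that spirit (my own framing for illustration, not an official FAH rule - and, as the data elsewhere in this thread shows, separate PME ranks can rescue counts like 21):

```python
def largest_prime_factor(n):
    """Largest prime factor of n (returns 1 for n == 1)."""
    best, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            best, n = p, n // p
        p += 1
    return max(best, n) if n > 1 else best

def risk(n):
    """Rough risk label in the spirit of the 'conventional wisdom' above."""
    lpf = largest_prime_factor(n)
    if lpf <= 3:
        return "safe"      # e.g. 12, 16, 18, 24, 27, 32
    if lpf == 5:
        return "possible"  # e.g. 5, 10, 15, 20, 25
    return "risky"         # e.g. 7, 14, 21, 22, 26

print(risk(18), risk(20), risk(21))  # safe possible risky
```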

I have seen a number of logs where the client has stepped 21 cores down to 20 to get them to work - and also some where 20 has then also had issues … I know from past experience that there have been some projects with high failure rates on 28 due to it being a multiple of 7 - hence my interest in your stability … Much depends, however, on how each individual project is set up - there appears to have been a run of them recently that were sensitive to multiples of 5.

I am hoping that _r2w_ben's work might help future Cores/Projects be more stable/adaptable and help researchers choose configurations that will minimise these types of errors and maximise their returns.

… and @muziqaz … I test on 24 and 32 in Beta … I am happy to change my slots to other settings (25+31, 26+30, 27+29, 2x28s) if that would help - even if it means a few/some/most/all of my beta WUs fail.

Re: Testing domain decomposition for high CPU counts

Postby _r2w_ben » Tue Apr 21, 2020 11:40 am

Neil-B wrote:I'd be interested to see how the 21core slot works for stability over time being a multiple of seven ... traditional thinking would be to use an 18core ... maybe with a second 4core in your case

Since 21 is greater than 18, some threads are allocated to PME. Conventional thought was that 21 would be used as 3x7x1, which would prevent many small work units with a maximum box size like 4x4x4 or 6x6x5 from running. With separate PME ranks, 21 ends up being used as 18, 16, or 12 PP threads, with the remainder going to PME. Those break down nicely into multiples of 2, 3, and 4.
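The splits described here are easy to verify: 21 itself only factors as 3x7x1, which cannot fit a small box, but 18, 16, and 12 all can, leaving 3, 5, or 9 threads for PME. A self-contained check (6x6x5 is an assumed example box, and the grids returned are merely valid ones, not necessarily the ones GROMACS would pick):

```python
def pp_grid(n, max_box):
    """Return some (x, y, z) factorization of n fitting max_box, or None."""
    mx, my, mz = max_box
    for x in range(1, mx + 1):
        for y in range(1, my + 1):
            if n % (x * y) == 0 and n // (x * y) <= mz:
                return (x, y, n // (x * y))
    return None

BOX = (6, 6, 5)  # assumed box size for illustration

print(pp_grid(21, BOX))  # None: 21 = 3x7x1 needs a dimension of 7
for pp in (18, 16, 12):  # the PP counts mentioned above
    print(pp, "PP as", pp_grid(pp, BOX), "+", 21 - pp, "PME")
```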

Re: Testing domain decomposition for high CPU counts

Postby Neil-B » Tue Apr 21, 2020 11:42 am

That makes sense and also supports (in an odd way) the traditional wisdom as well :)

Re: Testing domain decomposition for high CPU counts

Postby muziqaz » Tue Apr 21, 2020 12:24 pm

Neil-B wrote:
… and @muziqaz … I test on 24 and 32 in Beta … I am happy to change my slots to other settings (25+31, 26+30, 27+29, 2x28s) if that would help - even if it means a few/some/most/all of my beta WUs fail.


Every current and future project is/will be tested up to 24 threads. If you want (and you are in no way obliged to do this), you can test from 25 to 32 threads. You can always post your findings to the project owner's thread in the beta forum; that way, the project owner can set constraints to exclude failing thread counts.
Keep in mind that thread counts which do not fail, but automatically and gracefully lower the thread count to a working count, will not be excluded. For example:
Code: Select all
p00000
SMP24 (-nt 24) - Folds
SMP23 (-nt 23) - Reduces to 22 threads, which reduces to 21 threads, which folds
SMP22 (-nt 22) - Reduces to 21 threads, which folds
SMP21 (-nt 21) - Folds
<...>
SMP15 (-nt 15) - Fails (There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.xxxxx nm)
<...>
SMP11 (-nt 11) - Reduces to 10, which fails
SMP10 (-nt 10) - Fails (There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.xxxxx nm)

Etc, etc.
So given a similar report, the project owner will exclude that project from being assigned to slots with SMP10, 11, and 15 before it goes to Advanced and FAH
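The step-down behaviour in the example above can be sketched as a simple loop (assumed logic for illustration only - the actual FAHCore/GROMACS fallback is more involved):

```python
def settle(requested, works, floor=2):
    """Step down from the requested thread count until one works,
    mirroring the SMP23 -> 22 -> 21 reduction reported above.
    Returns the count actually used, or None if nothing >= floor works."""
    for n in range(requested, floor - 1, -1):
        if works(n):
            return n
    return None

# Toy predicate standing in for "this count has a valid decomposition";
# here only 21 and 24 fold, as in the hypothetical p00000 report above.
def folds(n):
    return n in {21, 24}

print(settle(23, folds))  # 21: tries 23, then 22, then lands on 21
```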

If anyone else wants to do similar reports with 33 and higher thread counts, they are more than welcome to do so, but I understand these types of tests, especially with higher core counts, are time consuming. On top of that, if your slot fails, you have a chance to download some other project, which makes the whole testing thing very difficult for a single project :) To avoid that, it would be advisable to first test thread counts which will most likely work, to confirm they work, and then, once the majority of working thread counts are confirmed, go on and start breaking WUs to confirm which thread counts break the project :D

Again, I repeat, these kinds of tests are very time consuming and you are in no way obliged to do them :)

Re: Testing domain decomposition for high CPU counts

Postby _r2w_ben » Wed Apr 22, 2020 1:04 am

Neil-B wrote:… and @muziqaz … I test on 24 and 32 in Beta … I am happy to change my slots to other settings (25+31, 26+30, 27+29, 2x28s) if that would help - even if it means a few/some/most/all of my beta WUs fail.

2x28 would catch 3 different problematic varieties (the last 3 columns in the 1-128 thread data):
4x4x4 PME 0.19
4x4x3 PME 0.18
4x4x3 PME 0.17

muziqaz wrote:If anyone else wants to do similar reports with 33 and higher thread counts, they are more than welcome to do so, but I understand these types of tests, especially with higher core counts, are time consuming. On top of that, if your slot fails, you have a chance to download some other project, which makes the whole testing thing very difficult for a single project :) To avoid that, it would be advisable to first test thread counts which will most likely work, to confirm they work, and then, once the majority of working thread counts are confirmed, go on and start breaking WUs to confirm which thread counts break the project :D

Again, I repeat, these kinds of tests are very time consuming and you are in no way obliged to do them :)

The time factor was the motivation for this research. My modified GROMACS code takes about a minute to test 2-128 threads.

Re: Testing domain decomposition for high CPU counts

Postby muziqaz » Wed Apr 22, 2020 1:10 am

_r2w_ben wrote:The time factor was the motivation for this research. My modified GROMACS code takes about a minute to test 2-128 threads.


That would free up a lot of time for researchers :)

Re: Testing domain decomposition for high CPU counts

Postby Neil-B » Wed Apr 22, 2020 5:28 am

Given the issues some people seem to be having with 24 cores at the moment on the forums, I'll probably leave my slots as they are … 28s have been a pain for me in the past - happy not to go there for the moment. However, of the four pairings (25+31, 26+30, 27+29, 2x28s), which would you reckon has the best chance of proving useful were I to use them in beta testing? … I am leaning towards the idea that 26+30 ought to have the best chance of success - or more realistically 25+30, given 26 looks "rubbish" in your modelling.

Re: Testing domain decomposition for high CPU counts

Postby muziqaz » Wed Apr 22, 2020 12:10 pm

24-thread failures are purely due to some weird cloning issue, for projects which were directly cloned from existing projects. If a project is created by conventional means, it does not fail at 24 threads, nor does it drop the thread count from 24 to 20.
As far as I'm aware, researchers are dialing back the cloning method until they find out what is causing the issue, or, if they really need to clone, they run the clones through testing as well.
What I mean by cloning:
A researcher creates p11111; it is tested from 24 down to 2 threads and does not fail at any of them. You get a few reductions in thread count, but things keep folding.
Now, for some scientific reason, another researcher asks for that p11111 (maybe for its simulated trajectories) so that he/she can follow up with their own ideas. Because they cannot reuse the p11111 designation, they give it another project number (p11112), make a few very minor tweaks, and release it to beta.
Since p11111 was tested and did not fail at any of the thread counts, it is assumed that its clone p11112 will not fail either.
But then in comes the imperfect cloning process (damn you, Dolly) and somehow messes up the decomposition.
I'm not a scientist, so all of the above is from my peasant's perspective ;)

Re: Testing domain decomposition for high CPU counts

Postby _r2w_ben » Wed Apr 22, 2020 12:17 pm

Neil-B wrote:Given the issues some people seem to be having with 24 cores at the moment on the forums, I'll probably leave my slots as they are … 28s have been a pain for me in the past - happy not to go there for the moment. However, of the four pairings (25+31, 26+30, 27+29, 2x28s), which would you reckon has the best chance of proving useful were I to use them in beta testing? … I am leaning towards the idea that 26+30 ought to have the best chance of success - or more realistically 25+30, given 26 looks "rubbish" in your modelling.

Do you want to catch as many domain decomposition failures as possible? 2x28 would do that. 28 fails at the same points as 24, plus a couple more.

For maximum throughput, 24+32 would catch the odd failure and test as many work units as possible. Those other combinations are going to auto-reduce the core count to avoid primes and leave idle cores.
