Detected instruction sets incorrect?

puuteknikko
Posts: 177
Joined: Thu Mar 19, 2020 6:20 am

Detected instruction sets incorrect?

Post by puuteknikko »

I happened to take a look at the files written by FAH and noticed this strange message in md.log:

Code:

Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: AuthenticAMD
Brand:  AMD Ryzen 9 3900X 12-Core Processor            
Family: 23  Model: 113  Stepping:  0
Features: aes apic avx clfsh cmov cx8 cx16 f16c fma htt lahf_lm misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4a sse4.1 sse4.2 ssse3
SIMD instructions most likely to fit this hardware: AVX_128_FMA
SIMD instructions selected at GROMACS compile time: AVX_256


Binary not matching hardware - you might be losing performance.
SIMD instructions most likely to fit this hardware: AVX_128_FMA
SIMD instructions selected at GROMACS compile time: AVX_256
Is it really not detecting the new Zen's AVX unit properly? It is 256 bits wide, so AVX_256 should be the right match.
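For anyone who wants to check their own md.log for the same mismatch, here is a minimal sketch in Python. The function name `simd_mismatch` is made up for illustration; the two line prefixes are copied from the log above, and other GROMACS versions may word them differently.

```python
import re

# Line prefixes as they appear in the md.log excerpt above.
SUGGESTED = r"SIMD instructions most likely to fit this hardware:\s*(\S+)"
COMPILED = r"SIMD instructions selected at GROMACS compile time:\s*(\S+)"


def simd_mismatch(log_text):
    """Return (suggested, compiled) if the two SIMD lines disagree, else None."""
    suggested = re.search(SUGGESTED, log_text)
    compiled = re.search(COMPILED, log_text)
    if not suggested or not compiled:
        return None  # one or both lines are missing from the log
    if suggested.group(1) == compiled.group(1):
        return None  # binary matches what the detection code suggests
    return suggested.group(1), compiled.group(1)


sample = """\
SIMD instructions most likely to fit this hardware: AVX_128_FMA
SIMD instructions selected at GROMACS compile time: AVX_256
"""
print(simd_mismatch(sample))
```

A mismatch only says that the detection code and the build disagree; it does not say which of the two is the better fit for the CPU.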
Ryzen 3900X, 12c/24t @ 3.8GHz
JimboPalmer
Posts: 2573
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Detected instruction sets incorrect?

Post by JimboPalmer »

This is what I think I know.
All Zen chips can run AVX_256 code.
Zen 1 and Zen+ execute AVX 128 bits at a time, so each 256-bit instruction is split into two macro-ops. Zen 2 executes all 256 bits at once, so the same instruction takes about half as long.

"Like the Zen/Zen+ microarchitecture, the Zen 2 floating point unit utilizes a coprocessor architectural model comprising a dedicated rename unit, a single 4-issue, out-of-order scheduler, a 160-entry physical register file (PRF), and four execution pipelines. The in-order retire queue is shared with the integer unit. The FPU handles x87, MMX, SSE, and AVX instructions. FP loads and stores co-opt the EX unit for address calculations and the LS unit for memory accesses.

In the Zen/Zen+ microarchitecture the floating point physical registers, execution units, and data paths are 128 bits wide. For efficiency AVX-256 instructions which perform the same operation on the 128-bit upper and lower half of a YMM register are decoded into two macro-ops which pass through the FPU individually as execution resources become available and retire together. Accordingly the peak throughput is four SSE/AVX-128 instructions or two AVX-256 instructions per cycle.

Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 FLOPs/cycle in single precision or up to 16 FLOPs/cycle in double precision. Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles." - https://en.wikichip.org/wiki/amd/microa ... Point_Unit
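The FLOPs figures in the quote can be reproduced with a little arithmetic. This is a rough sketch, assuming that two of the four FP pipelines execute FMA and that one FMA counts as two floating-point operations per lane. The pipe count and the function name `peak_flops_per_cycle` are my assumptions and are not stated in the quoted text.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_pipes=2):
    """Peak FLOPs per cycle per core under the assumptions stated above."""
    lanes = vector_bits // element_bits  # elements processed per instruction
    flops_per_fma = 2                    # one multiply plus one add per lane
    return lanes * flops_per_fma * fma_pipes


print(peak_flops_per_cycle(256, 32))  # Zen 2, single precision
print(peak_flops_per_cycle(256, 64))  # Zen 2, double precision
print(peak_flops_per_cycle(128, 32))  # Zen/Zen+ data path, single precision
```

With these assumptions the first two calls give 32 and 16, matching the quoted Zen 2 numbers, and the 128-bit case gives half the single-precision figure.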

What I do not know is whether the version of GROMACS currently in use can tell Zen 2 from Zen 1/+. You would get the same answers even if it couldn't, but it might not have the same optimization. I would guess the change log over at GROMACS.org would show when, or if, that optimization was added.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
puuteknikko
Posts: 177
Joined: Thu Mar 19, 2020 6:20 am

Re: Detected instruction sets incorrect?

Post by puuteknikko »

http://manual.gromacs.org/documentation ... hlight=zen

Looks like the core is not built from a recent enough source to cover Zen 2. There's quite a nice speed boost right there if it were.

EDIT: or hopefully it's the other way around -- that is just a warning and 256-bit instructions are used nevertheless.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Detected instruction sets incorrect?

Post by PantherX »

I know that currently, FahCore_a7 has two "paths" to choose:
SSE
AVX_256

The question is whether, if FahCore_a7 is upgraded to support Zen 2, that would be an additional choice:
SSE (for CPUs without AVX support)
AVX_256 (for CPUs with AVX support)
AVX_128_FMA (For Zen 2)

Having multiple versions of FahCore_a7 to support might not be ideal if the gains aren't scientifically justifiable based on the resources available.
JimboPalmer
Posts: 2573
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Detected instruction sets incorrect?

Post by JimboPalmer »

I believe Core_a7 uses AVX_256 on any Zen.

Changes in opcodes per instruction and cycles per opcode may mean the code is not as fast as it might be if it were tuned specifically for Zen 2.
"Also the non-bonded kernel parameters have been tuned for Zen 2. This has a significant impact on performance."
As far as I know, Core_a7 is not that new; it dates back to 2017.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Detected instruction sets incorrect?

Post by _r2w_ben »

FAH uses GROMACS 5.0.4, which was released in 2014 and predates Zen. The message about AVX_128_FMA is relevant to Bulldozer/Piledriver that were available at the time. For those architectures, AVX_128_FMA > AVX_256 > SSE2.

With Zen 2 the message doesn't appear to be correct. The code probably checks if the CPU supports AVX_128_FMA and outputs the message because that was an accurate test at the time.
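To illustrate the kind of check that would produce this message, here is a hypothetical sketch in Python. It is not the actual GROMACS source; the function name `suggest_simd` and the rule "AMD plus AVX implies AVX_128_FMA" are assumptions chosen only because they are consistent with the log in the first post, where the feature list shows avx and fma but no fma4.

```python
def suggest_simd(vendor, features):
    """Hypothetical 2014-era heuristic, NOT copied from GROMACS."""
    if "avx" in features:
        # Assumption: every AMD CPU with AVX is treated like Bulldozer or
        # Piledriver, which was a reasonable guess before Zen existed.
        if vendor == "AuthenticAMD":
            return "AVX_128_FMA"
        return "AVX_256"
    if "sse4.1" in features:
        return "SSE4.1"
    return "SSE2"


# Feature list copied from the md.log excerpt in the first post.
zen2_features = set(
    "aes apic avx clfsh cmov cx8 cx16 f16c fma htt lahf_lm misalignsse mmx "
    "msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 "
    "sse4a sse4.1 sse4.2 ssse3".split()
)
print(suggest_simd("AuthenticAMD", zen2_features))
```

Under a rule like this, a Zen 2 chip gets the Bulldozer-era suggestion simply because of its vendor string, which would explain the warning.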

A researcher using GROMACS might compile the code on their computer and then run it on a cluster with a different CPU architecture. This message is a friendly reminder to choose compiler flags for optimum performance.
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Detected instruction sets incorrect?

Post by Joe_H »

Basically, the version of GROMACS used by the A7 core does not include the ability to use run-time optimizations, so a separate core executable is needed for each architecture. The decision was made to create two: one that uses SSE2 to support older CPUs that do not have AVX, and a generic AVX_256 core for newer processors that do.
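The two-build choice described above can be sketched roughly like this in Python. Both function names are invented for illustration, and the real client may decide differently; the sketch assumes a "Features:" line formatted like the one in the md.log excerpt earlier in the thread.

```python
def features_from_log(log_text):
    """Collect the CPU feature flags from a 'Features:' line, if present."""
    for line in log_text.splitlines():
        if line.startswith("Features:"):
            return set(line.split(":", 1)[1].split())
    return set()


def pick_core_variant(features):
    """Choose between the two builds: AVX_256 if AVX is present, else SSE2."""
    # Set membership avoids matching longer flag names that merely contain "avx".
    return "AVX_256" if "avx" in features else "SSE2"


log_line = "Features: aes apic avx cmov fma sse2 sse4.1 ssse3"
print(pick_core_variant(features_from_log(log_line)))
```

Keeping the rule this simple is what makes two executables enough: no vendor or model checks are needed.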

This may mean some loss of efficiency on a particular system, but tests done at the time put it in the range of a few percent. What was gained was a distribution system that is easier to support, with just two folding cores to keep in development and synchronized.

I haven't followed the later versions of GROMACS to see whether they added run-time code selection back in.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3