3.7.1.4: Compilation with -O3

Let's call MAQAO to analyse the hadamard_product function:
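
The source of the kernel is not reproduced in this section; given the signature CQA reports and the source lines it points at (main.cpp:19-20), it presumably boils down to something like this minimal sketch (an assumption, not the tutorial's actual code):

#include <cstddef>  // std::size_t

// Hypothetical reconstruction of the analysed kernel: the element-wise
// (Hadamard) product of two float arrays.
void hadamard_product(float *out, const float *a, const float *b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)  // the loop CQA reports at lines 19-20
        out[i] = a[i] * b[i];
}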


Here is the full output:
maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O3
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004).
It is not optimized for later processors (AVX etc.).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp

Section 1.1: Source loop ending at line 20
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (19-20)
 - 1 (20-20)
and is unrolled by 4 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 1
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 1

Section 1.1.1: Binary (unrolled and/or vectorized) loop #1
==========================================================

The loop is defined in Examples/1-HadamardProduct/main.cpp:20-20.

It is main loop of related source loop which is unrolled by 4 (including vectorization).

12% of peak computational performance is used (4.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))
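
A quick sanity check on these numbers (based on standard Kaby Lake characteristics, not stated in the report itself): the 128-bit MULPS in the loop body produces 4 single-precision multiplies per cycle, while the 32 FLOP/cycle peak corresponds to two 256-bit FMA units, each completing 8 single-precision elements with 2 operations (multiply + add) per element; 4/32 gives the reported 12%.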

Vectorization
-------------
Your loop is vectorized, but using only 128 out of 256 bits (SSE/AVX-128 instructions on AVX/AVX2 processors).
By fully vectorizing your loop, you can lower the cost of an iteration from 1.00 to 0.50 cycles (2.00x speedup).
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround(s):
 - Recompile with march=skylake. CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
   1) align your arrays on 32 bytes boundaries: replace void *p = malloc (size); with void *p; posix_memalign (&p, 32, size);
   2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.
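
Put together, the alignment workarounds look roughly like the sketch below (illustrative names, GCC/Clang builtins; the tutorial's build line is not shown, but the first workaround simply amounts to adding -march=skylake, or -march=native on the target machine, to the compile command):

#include <cstdlib>   // posix_memalign
#include <cstddef>   // std::size_t

// Allocate a float array on a 32-byte boundary instead of plain malloc.
float *alloc_aligned(std::size_t n)
{
    void *p = nullptr;
    if (posix_memalign(&p, 32, n * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float *>(p);
}

// Variant of the kernel that promises 32-byte alignment to the compiler,
// so it can emit aligned vector loads/stores instead of MOVUPS.
void hadamard_product_aligned(float *out, const float *a, const float *b, std::size_t n)
{
    float       *o = static_cast<float *>(__builtin_assume_aligned(out, 32));
    const float *x = static_cast<const float *>(__builtin_assume_aligned(a, 32));
    const float *y = static_cast<const float *>(__builtin_assume_aligned(b, 32));
    for (std::size_t i = 0; i < n; ++i)
        o[i] = x[i] * y[i];
}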

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
   * hardcode the bounds of the corresponding 'for' loop
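
The last suggestion deserves a concrete (hypothetical) example: if the array length is known at build time, exposing it as a compile-time constant removes the runtime trip-count logic and lets the compiler generate exactly the vector body it wants:

#include <cstddef>  // std::size_t

// Illustrative sketch of "hardcode the bounds of the 'for' loop":
// the size becomes a template parameter instead of a runtime argument.
template <std::size_t N>
void hadamard_product_fixed(float *out, const float *a, const float *b)
{
    for (std::size_t i = 0; i < N; ++i)
        out[i] = a[i] * b[i];
}

// Usage, e.g.: hadamard_product_fixed<1024>(out, a, b);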



All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.


Let's rerun it with the conf=all option (this is mainly for experts, but now you know how to get this information):


Here is the full output:
maqao.intel64 cqa conf=all  fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O3
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004).
It is not optimized for later processors (AVX etc.).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp

Section 1.1: Source loop ending at line 20
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (19-20)
 - 1 (20-20)
and is unrolled by 4 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 1
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 1

Section 1.1.1: Binary (unrolled and/or vectorized) loop #1
==========================================================

The loop is defined in Examples/1-HadamardProduct/main.cpp:20-20.

It is main loop of related source loop which is unrolled by 4 (including vectorization).

12% of peak computational performance is used (4.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is vectorized, but using only 128 out of 256 bits (SSE/AVX-128 instructions on AVX/AVX2 processors).
By fully vectorizing your loop, you can lower the cost of an iteration from 1.00 to 0.50 cycles (2.00x speedup).
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround(s):
 - Recompile with march=skylake. CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
   1) align your arrays on 32 bytes boundaries: replace void *p = malloc (size); with void *p; posix_memalign (&p, 32, size);
   2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
   * hardcode the bounds of the corresponding 'for' loop

Vector unaligned load/store instructions
----------------------------------------
Detected 2 suboptimal vector unaligned load/store instructions.

- MOVUPS: 2 occurrences

Workaround(s):
 - Recompile with march=skylake. CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
   1) align your arrays on 32 bytes boundaries: replace void *p = malloc (size); with void *p; posix_memalign (&p, 32, size);
   2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.

Type of elements and instruction set
------------------------------------
1 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (four at a time).

Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 4 FP arithmetical operations:
 - 4: multiply
The binary loop is loading 32 bytes (8 single precision FP elements).
The binary loop is storing 16 bytes (4 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.
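
This is just the ratio of the counts in the previous section: 4 FP multiplies per iteration against 32 + 16 = 48 bytes moved, i.e. 4 / 48 ≈ 0.08 FLOP per byte.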

ASM code
--------
In the binary file, the address of the loop is: 1110

Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------------------------------
MOVUPS (%R11,%RAX,1),%XMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
ADD $0x1,%R9                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
MULPS (%RBX,%RAX,1),%XMM0                    | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 1
MOVUPS %XMM0,(%R8,%RAX,1)                    | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 1       | 1
ADD $0x10,%RAX                               | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RBP,%R9                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JB 1110 <_Z16hadamard_productPfPKfS1_m+0xd0> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50
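
A note on reading this table (a general property of CQA's port model, not specific to this loop): the per-port numbers are average dispatch counts per iteration, so an instruction that can execute on several ports is split fractionally among them; MULPS, for instance, can go to either FP unit, hence 0.50 on both P0 and P1.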

General properties
------------------
nb instructions    : 7
nb uops            : 6
loop length        : 27
used x86 registers : 6
used mmx registers : 0
used xmm registers : 1
used ymm registers : 0
used zmm registers : 0
nb stack references: 0

Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.00 cycles
front end            : 1.00 cycles

Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cycles | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00

Cycles summary
--------------
Front-end : 1.00
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.00

Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all    : 50%
load   : 50%
store  : 50%
mul    : 50%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)
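
Read together, the two ratio tables tell the same story as the vectorization section above: every load, store and multiply is a vector instruction (100% vectorization ratio), but each one uses only half of the available 256-bit width because it operates on 128-bit XMM registers (50% efficiency), which is exactly where the projected 2.00x speedup comes from.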

Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles.
At this rate:
 - 50% of peak load performance is reached (32.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 50% of peak store performance is reached (16.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))

Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).



All innermost loops were analyzed.