3.7.3.2: Classical intrinsic interleaved version



Let's call MAQAO to analyse the hadamard_product function:


Here is the full output:
maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics_interleaved2 
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
=========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics_interleaved2.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
- /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
- Examples/1-HadamardProduct/main_intrinsics_interleaved2.cpp: 24-24

The related source loop is not unrolled or unrolled with no peel/tail loop.
25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
- reading data from caches/RAM (load units are a bottleneck)
- writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
- Read less array elements
- Write less array elements
- Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop



All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.


Let's rerun it with the conf=all option (this mode is mainly aimed at experts, but now you know how to get this information):


Here is the full output:
maqao.intel64 cqa conf=all  fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics_interleaved2
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
=========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics_interleaved2.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
- /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
- Examples/1-HadamardProduct/main_intrinsics_interleaved2.cpp: 24-24

The related source loop is not unrolled or unrolled with no peel/tail loop.
25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
- reading data from caches/RAM (load units are a bottleneck)
- writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
- Read less array elements
- Write less array elements
- Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop

Type of elements and instruction set
------------------------------------
2 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).

Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 16 FP arithmetical operations:
- 16: multiply
The binary loop is loading 128 bytes (32 single precision FP elements).
The binary loop is storing 64 bytes (16 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.

Unroll opportunity
------------------
Loop is data access bound.
Workaround(s):
Unroll your loop if trip count is significantly higher than target unroll factor and if some data references are common to consecutive iterations. This can be done manually. Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.
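
As an aside, the manual-unroll suggestion can be illustrated on the scalar form of the loop (an illustrative sketch only; the analysed binary is already a 2x-interleaved AVX version, and since consecutive iterations share no data references here, the expected gain is small):

```cpp
#include <cstddef>

// Illustrative 4x manual unroll of a scalar Hadamard product. The unrolled
// body handles 4 elements per iteration; the tail loop mops up the rest,
// so any n is accepted.
void hadamard_product_unrolled(float* c, const float* a, const float* b,
                               size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // unrolled body: 4 elements per iteration
        c[i]     = a[i]     * b[i];
        c[i + 1] = a[i + 1] * b[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)             // tail loop for remaining elements
        c[i] = a[i] * b[i];
}
```

Compiling with -funroll-loops asks the compiler to perform this transformation itself, without source changes.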

ASM code
--------
In the binary file, the address of the loop is: 1020

Instruction                                   | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
------------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0              | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                   | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
VMOVAPS 0x20(%RSI,%RAX,1),%YMM0               | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
VMULPS 0x20(%RDX,%RAX,1),%YMM0,%YMM0          | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,0x20(%RDI,%RAX,1)               | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x40,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RAX,%RCX                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JNE 1020 <_Z16hadamard_productPfPKfS1_m+0x10> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50-1

General properties
------------------
nb instructions    : 9
nb uops            : 8
loop length        : 42
used x86 registers : 5
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0

Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.67 cycles
front end            : 1.67 cycles

Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | 1.00 | 1.00 | 2.00
cycles | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | 1.00 | 1.00 | 2.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00

Cycles summary
--------------
Front-end : 1.67
Dispatch  : 2.00
Data deps.: 1.00
Overall L1: 2.00

Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 2.00 cycles.
At this rate:
- 100% of peak load performance is reached (64.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
- 100% of peak store performance is reached (32.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))

Front-end bottlenecks
---------------------
Found no such bottlenecks.

All innermost loops were analyzed.