Previous: Compilation Ofast | Parent: Code analysis with maqao | Outline | Next: Intrinsics compilation analysis with maqao
maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_vectorize

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================
Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_vectorize.cpp
Section 1.1: Source loop ending at line 24
==========================================
Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.
Section 1.1.1: Binary loop #0
=============================
The loop is defined in Examples/1-HadamardProduct/main_vectorize.cpp:24-24.
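The report only names main_vectorize.cpp:24; the source itself is not reproduced here. A minimal sketch of what the analyzed kernel presumably looks like (the signature is recovered from the demangled symbol above; the body and line placement are assumptions):

```cpp
#include <cstddef>

// Presumed shape of the analyzed kernel. The signature matches the
// demangled symbol hadamard_product(float*, float const*, float const*,
// unsigned long); the loop body is an illustrative assumption.
void hadamard_product(float* out, const float* a, const float* b,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)  // presumably line 24
        out[i] = a[i] * b[i];            // element-wise (Hadamard) product
}
```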
The related source loop is not unrolled or unrolled with no peel/tail loop.
21% of peak computational performance is used (6.86 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))
Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
Workaround(s):
 - Try to reorganize arrays of structures to structures of arrays
 - Consider to permute loops (see vectorization gain report)
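The scalar integer instructions CQA flags here correspond to the loop's counter maintenance, visible in the ASM dump further down (an element counter and a byte offset are both updated each iteration). A hand rewrite sometimes tried in this situation is to carry a single pointer induction; the function name below is illustrative, and whether the compiler actually drops the extra ADD/CMP depends on its code generation:

```cpp
#include <cstddef>

// Hypothetical rewrite with a single moving end-pointer comparison.
// Whether this removes the extra scalar integer work is compiler-dependent.
void hadamard_product_ptr(float* out, const float* a, const float* b,
                          std::size_t n) {
    const float* a_end = a + n;
    for (; a != a_end; ++a, ++b, ++out)
        *out = *a * *b;
}
```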
Vectorization
-------------
Your loop is fully vectorized, using full register length.
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
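The aligned VMOVAPS loads in the ASM dump further down suggest the compiler could assume 32-byte alignment and non-aliasing pointers. A hedged sketch of giving it both guarantees explicitly, using the GCC/Clang `__builtin_assume_aligned` intrinsic (the function name is illustrative, not the original code):

```cpp
#include <cstddef>

// Illustrative variant: __restrict promises no aliasing between the three
// arrays, and __builtin_assume_aligned (GCC/Clang) promises 32-byte
// alignment, so the compiler can emit aligned full-width AVX accesses
// without runtime checks. Callers must actually uphold both promises.
void hadamard_product_hint(float* __restrict out,
                           const float* __restrict a,
                           const float* __restrict b,
                           std::size_t n) {
    out = static_cast<float*>(__builtin_assume_aligned(out, 32));
    a   = static_cast<const float*>(__builtin_assume_aligned(a, 32));
    b   = static_cast<const float*>(__builtin_assume_aligned(b, 32));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}
```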
Execution units bottlenecks
---------------------------
Found no such bottlenecks but see expert reports for more complex bottlenecks.
All innermost loops were analyzed.
Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
maqao.intel64 cqa conf=all fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_vectorize

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_vectorize.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in Examples/1-HadamardProduct/main_vectorize.cpp:24-24.

The related source loop is not unrolled or unrolled with no peel/tail loop.
21% of peak computational performance is used (6.86 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
Workaround(s):
 - Try to reorganize arrays of structures to structures of arrays
 - Consider to permute loops (see vectorization gain report)

Vectorization
-------------
Your loop is fully vectorized, using full register length.
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Found no such bottlenecks but see expert reports for more complex bottlenecks.
Type of elements and instruction set
------------------------------------
1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 8: multiply
The binary loop is loading 64 bytes (16 single precision FP elements).
The binary loop is storing 32 bytes (8 single precision FP elements).
Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.
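The 0.08 figure follows directly from the counts in the matching report above: 8 multiplies against 64 loaded plus 32 stored bytes per iteration. As a sanity check:

```cpp
// Arithmetic intensity from the per-iteration counts reported by CQA.
double arithmetic_intensity() {
    const double flops        = 8.0;   // 8 single precision multiplies
    const double bytes_loaded = 64.0;  // 16 floats loaded
    const double bytes_stored = 32.0;  // 8 floats stored
    return flops / (bytes_loaded + bytes_stored);  // 8/96 = 1/12, ~0.083
}
```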
Unroll opportunity
------------------
Loop body is too small to efficiently use resources.
Workaround(s): Unroll your loop if trip count is significantly higher than target unroll factor. This can be done manually. Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.
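Manual unrolling, as the workaround suggests, could look like the following sketch (factor 4 with a scalar remainder loop; the factor and the function name are illustrative, and recompiling with -funroll-loops is usually the less error-prone route):

```cpp
#include <cstddef>

// Illustrative 4x manual unroll: the main body amortizes the counter
// updates and branch over four elements; a remainder loop handles n % 4.
void hadamard_product_unrolled(float* out, const float* a, const float* b,
                               std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     * b[i];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)  // remainder iterations
        out[i] = a[i] * b[i];
}
```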
ASM code
--------
In the binary file, the address of the loop is: 1038
Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                  | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
ADD $0x1,%R8                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0             | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                  | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x20,%RAX                               | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %R9,%R8                                  | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JB 1038 <_Z16hadamard_productPfPKfS1_m+0x28> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50
General properties
------------------
nb instructions    : 7
nb uops            : 6
loop length        : 28
used x86 registers : 6
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0
Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.17 cycles
front end            : 1.17 cycles
Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cycles | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00
Cycles summary
--------------
Front-end : 1.17
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.17
Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)
Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)
Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.17 cycles. At this rate:
 - 85% of peak load performance is reached (54.86 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 85% of peak store performance is reached (27.43 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))
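The bandwidth figures are consistent with the iteration cost. Assuming the printed 1.17 cycles is the rounded value of 7/6 (an assumption, but one that reproduces the report's own 54.86 and 27.43 numbers), dividing the per-iteration byte counts by the cycle count gives:

```cpp
// Achieved L1 bandwidth implied by the reported iteration cost.
// Assumption: the printed 1.17 cycles is exactly 7/6.
double achieved_load_bytes_per_cycle() {
    const double cycles_per_iter = 7.0 / 6.0;  // reported as 1.17
    return 64.0 / cycles_per_iter;             // 64 bytes loaded/iteration
}

double achieved_store_bytes_per_cycle() {
    const double cycles_per_iter = 7.0 / 6.0;
    return 32.0 / cycles_per_iter;             // 32 bytes stored/iteration
}
```

Both values land at roughly 85.7% of the 64- and 32-byte peaks, matching the "85%" lines above.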
Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core): the front-end is a bottleneck.
By removing all these bottlenecks, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
All innermost loops were analyzed.