3.7.2: Vectorization compilation analysis with MAQAO

Let's call MAQAO to analyse the hadamard_product function.
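The source of main_vectorize.cpp is not reproduced in this section, but the signature reported by CQA (hadamard_product(float*, float const*, float const*, unsigned long)) and the single packed multiply it finds per iteration are consistent with a plain element-wise product. The following is only a sketch of what the analysed kernel presumably looks like, not the literal file:

#include <cstddef>

// Hypothetical reconstruction of the analysed kernel: only the signature is
// confirmed by the CQA report below; the body and parameter names are assumptions.
void hadamard_product(float* out, const float* a, const float* b, std::size_t n)
{
    // Reported by CQA as the source loop ending at line 24 of main_vectorize.cpp.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}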


Here is the full output:
maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_vectorize
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
=========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_vectorize.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in Examples/1-HadamardProduct/main_vectorize.cpp:24-24.

The related source loop is not unrolled or unrolled with no peel/tail loop.
21% of peak computational performance is used (6.86 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation).
By removing them, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
Workaround(s):
 - Try to reorganize arrays of structures to structures of arrays
 - Consider to permute loops (see vectorization gain report)

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Found no such bottlenecks but see expert reports for more complex bottlenecks.

All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
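A quick sanity check on the 21% figure: the loop is fully vectorized with AVX, so each iteration performs 8 single-precision multiplies, and the code clean check estimates an iteration at 1.17 cycles, which gives roughly 8 / 1.17 ≈ 6.86 FLOP per cycle. The 32.00 FLOP-per-cycle peak corresponds to the two 256-bit FMA units of this core (2 units × 8 single-precision lanes × 2 operations per fused multiply-add), and 6.86 / 32.00 ≈ 21%. A multiply-only kernel can never use the "2 operations per FMA" half of that peak, so anything above 50% is unreachable for this loop by construction. Also keep in mind the warning at the top of the report: the binary was specialized for Broadwell, so recompiling with an explicit target matching the Kaby Lake test machine (for instance -march=native) is worth doing before drawing firm conclusions.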


Let's rerun it with the conf=all option (this report is mainly intended for experts, but now you know how to obtain this information):


Here is the full output:
maqao.intel64 cqa conf=all  fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_vectorize
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
=========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_vectorize.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in Examples/1-HadamardProduct/main_vectorize.cpp:24-24.

The related source loop is not unrolled or unrolled with no peel/tail loop.
21% of peak computational performance is used (6.86 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation).
By removing them, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
Workaround(s):
 - Try to reorganize arrays of structures to structures of arrays
 - Consider to permute loops (see vectorization gain report)

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Found no such bottlenecks but see expert reports for more complex bottlenecks.

Type of elements and instruction set
------------------------------------
1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).

Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 8: multiply
The binary loop is loading 64 bytes (16 single precision FP elements).
The binary loop is storing 32 bytes (8 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.

Unroll opportunity
------------------
Loop body is too small to efficiently use resources.
Workaround(s):
Unroll your loop if trip count is significantly higher than target unroll factor.
This can be done manually. Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.

ASM code
--------
In the binary file, the address of the loop is: 1038

Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                  | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
ADD $0x1,%R8                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0             | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                  | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x20,%RAX                               | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %R9,%R8                                  | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JB 1038 <_Z16hadamard_productPfPKfS1_m+0x28> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50

General properties
------------------
nb instructions    : 7
nb uops            : 6
loop length        : 28
used x86 registers : 6
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0

Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.17 cycles
front end            : 1.17 cycles

Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cycles | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00

Cycles summary
--------------
Front-end : 1.17
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.17

Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.17 cycles.
At this rate:
 - 85% of peak load performance is reached (54.86 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 85% of peak store performance is reached (27.43 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))

Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).

By removing all these bottlenecks, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).



All innermost loops were analyzed.
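Two numbers in this expert report are worth unpacking. The arithmetic intensity of 0.08 follows directly from the traffic per iteration: 8 FP multiplies against 64 loaded plus 32 stored bytes, i.e. 8 / 96 ≈ 0.08 FLOP per byte, which tells you the kernel will be memory-bound as soon as the arrays no longer fit in cache. For the unroll opportunity, the simplest route is the one the report itself names (recompiling with -funroll-loops); as an illustration only, a manual unrolling by two vector widths could look like the sketch below. This is not code from the original example: the function name is made up, and it assumes 32-byte-aligned pointers and a length that is a multiple of 16.

#include <immintrin.h>
#include <cstddef>

// Hypothetical manually unrolled variant (not part of the original example).
// Two 8-float multiplies per iteration, so the loop-control instructions
// (add/cmp/jb) that the front-end bottleneck report points at are paid once
// per 16 elements instead of once per 8.
// Assumes out, a and b are 32-byte aligned and n is a multiple of 16.
void hadamard_product_unroll2(float* out, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16) {
        __m256 a0 = _mm256_load_ps(a + i);
        __m256 a1 = _mm256_load_ps(a + i + 8);
        __m256 b0 = _mm256_load_ps(b + i);
        __m256 b1 = _mm256_load_ps(b + i + 8);
        _mm256_store_ps(out + i,     _mm256_mul_ps(a0, b0));
        _mm256_store_ps(out + i + 8, _mm256_mul_ps(a1, b1));
    }
}

Whether this (or the compiler flags) actually improves on the 1.17-cycle iteration should of course be checked by rerunning CQA and timing the result.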