3.7.3.1: Classical intrinsics version

Let's call MAQAO to analyse the hadamard_product function:


Here is the full output:
maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
 - /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
 - Examples/1-HadamardProduct/main_intrinsics.cpp: 24-24

The related source loop is not unrolled or unrolled with no peel/tail loop.

25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop



All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.


Let's rerun it with the conf=all option (this output is mainly intended for experts, but it is useful to know how to obtain it):


Here is the full output:
maqao.intel64 cqa conf=all fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
 - /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
 - Examples/1-HadamardProduct/main_intrinsics.cpp: 24-24

The related source loop is not unrolled or unrolled with no peel/tail loop.

25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop

Type of elements and instruction set
------------------------------------
1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).

Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 8: multiply
The binary loop is loading 64 bytes (16 single precision FP elements).
The binary loop is storing 32 bytes (8 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.

Unroll opportunity
------------------
Loop body is too small to efficiently use resources.

Workaround(s):
 - Unroll your loop if trip count is significantly higher than target unroll factor. This can be done manually. Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.

ASM code
--------
In the binary file, the address of the loop is: 1020

Instruction                                   | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0              | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                   | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x20,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RAX,%RCX                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JNE 1020 <_Z16hadamard_productPfPKfS1_m+0x10> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50-1

General properties
------------------
nb instructions    : 6
nb uops            : 5
loop length        : 24
used x86 registers : 5
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0

Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.00 cycles
front end            : 1.00 cycles

Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 0.75 | 1.00 | 1.00 | 1.00 | 0.75 | 0.50 | 1.00
cycles | 1.00 | 0.75 | 1.00 | 1.00 | 1.00 | 0.75 | 0.50 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00

Cycles summary
--------------
Front-end : 1.00
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.00

Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles. At this rate:
 - 100% of peak load performance is reached (64.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 100% of peak store performance is reached (32.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))

Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).

By removing all these bottlenecks, you can lower the cost of an iteration from 1.00 to 0.75 cycles (1.33x speedup).



All innermost loops were analyzed.