Intrinsics compilation analysis with maqao
$ maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).

These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics.cpp
Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.
Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
 - /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
 - Examples/1-HadamardProduct/main_intrinsics.cpp: 24-24

The related source loop is not unrolled or unrolled with no peel/tail loop.
25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))
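The 32 FLOP/cycle peak and the 25% figure can be sanity-checked with a back-of-the-envelope model. The numbers below (two FMA-capable 256-bit ports, counting an FMA as two FLOPs) are an assumption about how CQA models this Kaby Lake core, not something the report states:

```cpp
// Assumed CQA peak model for single precision on this core:
// 2 FMA-capable 256-bit ports x 8 SP lanes x 2 FLOP per FMA lane.
constexpr int fma_ports = 2, sp_lanes = 8, flop_per_fma = 2;
constexpr int peak_flop_per_cycle = fma_ports * sp_lanes * flop_per_fma;  // 32

// The loop issues a single 8-lane multiply (no FMA) per cycle: 8 FLOP/cycle,
// hence the reported 8.00 out of 32.00, i.e. 25% of peak.
constexpr int loop_flop_per_cycle = 1 * sp_lanes;
```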
Vectorization
-------------
Your loop is fully vectorized, using full register length.
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
   * hardcode the bounds of the corresponding 'for' loop
All innermost loops were analyzed.
Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
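The report never shows the source of the analyzed loop, so here is a hypothetical reconstruction of what `main_intrinsics.cpp` line 24 likely looks like: one aligned load, one multiply whose second operand the compiler folds into a memory operand, and one aligned store per iteration, which is exactly the VMOVAPS / VMULPS / VMOVAPS sequence in the ASM listing of the expert report. The function name matches the report; everything else is an assumption:

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical reconstruction of the analyzed loop (actual source not shown
// in the report). Assumes n is a multiple of 8 and 32-byte-aligned pointers,
// since the compiler emitted aligned moves (VMOVAPS) with no tail loop.
__attribute__((target("avx")))  // allows AVX intrinsics without a global -mavx
void hadamard_product(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {             // 8 floats per YMM register
        __m256 va = _mm256_load_ps(a + i);               // VMOVAPS (%RSI,%RAX,1),%YMM0
        __m256 vb = _mm256_load_ps(b + i);               // folded into VMULPS operand
        _mm256_store_ps(out + i, _mm256_mul_ps(va, vb)); // VMULPS + VMOVAPS store
    }
}
```

This is only one plausible shape; a compiler can produce the same binary loop from several equivalent intrinsic spellings.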
$ maqao.intel64 cqa conf=all fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
[With conf=all, CQA first repeats the default report (Sections 1 through 1.1.1, identical to the output above); that repetition is omitted here.]
Type of elements and instruction set
------------------------------------
1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 8: multiply
The binary loop is loading 64 bytes (16 single precision FP elements).
The binary loop is storing 32 bytes (8 single precision FP elements).
Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.
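The 0.08 figure follows directly from the counts in the matching section above: 8 multiplies against 64 bytes loaded plus 32 bytes stored. A minimal sketch of the calculation (the function name is ours, not CQA's):

```cpp
// Arithmetic intensity as CQA reports it: FP operations per byte moved
// between the core and the memory hierarchy (loads + stores combined).
double arithmetic_intensity(double fp_ops, double bytes_loaded, double bytes_stored) {
    return fp_ops / (bytes_loaded + bytes_stored);
}
// For this loop: 8 / (64 + 32) = 0.0833..., which CQA rounds to 0.08.
```

Such a low intensity confirms the report's verdict: the loop is bound by data movement, not by arithmetic.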
Unroll opportunity
------------------
Loop body is too small to efficiently use resources.
Workaround(s):
 - Unroll your loop if trip count is significantly higher than target unroll factor. This can be done manually. Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.
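The manual unrolling CQA suggests can be sketched in scalar form. This is an illustration of the transformation only (the tutorial's actual interleaved intrinsic version may differ): several independent element operations per trip amortize the ADD/CMP/JNE loop overhead visible in the ASM listing, with a tail loop covering leftover elements:

```cpp
#include <cstddef>

// Hypothetical scalar illustration of manual unrolling by 4.
// The four statements in the body are independent, so the counter and
// branch overhead is paid once per four elements instead of once per one.
void hadamard_product_unrolled(float* out, const float* a, const float* b,
                               std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {       // unrolled main loop
        out[i]     = a[i]     * b[i];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)                 // tail loop for the remainder
        out[i] = a[i] * b[i];
}
```

As the tool notes, this only pays off when the trip count is well above the unroll factor; otherwise the tail loop dominates.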
ASM code
--------
In the binary file, the address of the loop is: 1020

Instruction                                   | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0              | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                   | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x20,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RAX,%RCX                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JNE 1020 <_Z16hadamard_productPfPKfS1_m+0x10> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50-1
General properties
------------------
nb instructions    : 6
nb uops            : 5
loop length        : 24
used x86 registers : 5
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0
Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.00 cycles
front end            : 1.00 cycles
Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 0.75 | 1.00 | 1.00 | 1.00 | 0.75 | 0.50 | 1.00
cycles | 1.00 | 0.75 | 1.00 | 1.00 | 1.00 | 0.75 | 0.50 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00
Cycles summary
--------------
Front-end : 1.00
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.00
Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)
Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles. At this rate:
 - 100% of peak load performance is reached (64.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 100% of peak store performance is reached (32.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))
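The 64 B/cycle and 32 B/cycle ceilings are consistent with a core that has two 32-byte load ports and one 32-byte store port; that port model is our assumption about CQA's machine description, not something the report states. The loop moves exactly 64 loaded plus 32 stored bytes in its 1.00-cycle iteration, so both limits are saturated:

```cpp
// Assumed L1 port model behind CQA's per-cycle bandwidth ceilings:
constexpr int peak_load_bytes_per_cycle  = 2 * 32;  // two 32 B load ports
constexpr int peak_store_bytes_per_cycle = 1 * 32;  // one 32 B store port

// Per 1.00-cycle iteration of the binary loop (from the matching section):
constexpr int loop_load_bytes  = 64;  // two 32 B YMM loads
constexpr int loop_store_bytes = 32;  // one 32 B YMM store
```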
Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).
By removing all these bottlenecks, you can lower the cost of an iteration from 1.00 to 0.75 cycles (1.33x speedup).
All innermost loops were analyzed.