maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O0

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================
Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004). It is not optimized for later processors (AVX etc.). These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp
Section 1.1: Source loop ending at line 20
==========================================
Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.
Section 1.1.1: Binary loop #0
=============================
The loop is defined in Examples/1-HadamardProduct/main.cpp:19-20.
The related source loop is not unrolled or unrolled with no peel/tail loop.
0% of peak computational performance is used (0.18 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))
Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 5.50 to 5.00 cycles (1.10x speedup).
Workaround(s):
- Try to reorganize arrays of structures to structures of arrays
- Consider to permute loops (see vectorization gain report)
Vectorization
-------------
Your loop is not vectorized. Only 15% of vector register length is used (average across all SSE/AVX instructions). By vectorizing your loop, you can lower the cost of an iteration from 5.50 to 0.75 cycles (7.33x speedup). All SSE/AVX instructions are used in scalar version (process only one data element in vector registers). Since your execution units are vector units, only a vectorized loop can use their full power.
Workaround(s):
- Try another compiler or update/tune your current one:
  * recompile with ftree-vectorize (included in O3) to enable loop vectorization and with fassociative-math (included in Ofast or ffast-math) to extend vectorization to FP reductions.
- Remove inter-iterations dependences from your loop and make it unit-stride:
  * If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly
  * If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA)
Execution units bottlenecks
---------------------------
Performance is limited by:
- reading data from caches/RAM (load units are a bottleneck)
- writing data to caches/RAM (the store unit is a bottleneck)
By removing all these bottlenecks, you can lower the cost of an iteration from 5.50 to 3.67 cycles (1.50x speedup).
Workaround(s):
- Read less array elements
- Write less array elements
- Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop
All innermost loops were analyzed.
Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
maqao.intel64 cqa conf=all fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O0

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
(The conf=all run first repeats the default report verbatim — the function header, source-loop composition, code clean check, vectorization, and execution-units-bottlenecks sections shown above — and then appends the expert sections that follow.)
Complex instructions
--------------------
Detected COMPLEX INSTRUCTIONS.
These instructions generate more than one micro-operation and only one of them can be decoded during a cycle and the extra micro-operations increase pressure on execution units.
- ADD: 1 occurrences
Type of elements and instruction set
------------------------------------
1 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in scalar mode (one at a time).
Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 1 FP arithmetical operations:
- 1: multiply
The binary loop is loading 80 bytes (20 single precision FP elements).
The binary loop is storing 12 bytes (3 single precision FP elements).
Arithmetic intensity
--------------------
Arithmetic intensity is 0.01 FP operations per loaded or stored byte.
Unroll opportunity
------------------
Loop is data access bound.
Workaround(s):
- Unroll your loop if trip count is significantly higher than target unroll factor and if some data references are common to consecutive iterations. This can be done manually or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.
ASM code
--------
In the binary file, the address of the loop is: da6

Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------
MOV -0x8(%RBP),%RAX                          | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
CMP -0x30(%RBP),%RAX                         | 1     | 0.25 | 0.25 | 0.50 | 0.50 | 0  | 0.25 | 0.25 | 0    | 1       | 0.50
JAE e00 <_Z16hadamard_productPfPKfS1_m+0x76> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50-1
MOV -0x8(%RBP),%RAX                          | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
LEA (,%RAX,4),%RDX                           | 1     | 0    | 0.50 | 0    | 0    | 0  | 0.50 | 0    | 0    | 1       | 0.50
MOV -0x20(%RBP),%RAX                         | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
ADD %RDX,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
MOVSS (%RAX),%XMM1                           | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 0       | 0.50
MOV -0x8(%RBP),%RAX                          | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
LEA (,%RAX,4),%RDX                           | 1     | 0    | 0.50 | 0    | 0    | 0  | 0.50 | 0    | 0    | 1       | 0.50
MOV -0x28(%RBP),%RAX                         | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
ADD %RDX,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
MOVSS (%RAX),%XMM0                           | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 0       | 0.50
MOV -0x8(%RBP),%RAX                          | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
LEA (,%RAX,4),%RDX                           | 1     | 0    | 0.50 | 0    | 0    | 0  | 0.50 | 0    | 0    | 1       | 0.50
MOV -0x18(%RBP),%RAX                         | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 2       | 0.50
ADD %RDX,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
MULSS %XMM1,%XMM0                            | 1     | 0.50 | 0.50 | 0    | 0    | 0  | 0    | 0    | 0    | 4       | 1
MOVSS %XMM0,(%RAX)                           | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADDQ $0x1,-0x8(%RBP)                         | 2     | 0.50 | 0.50 | 0.83 | 0.83 | 1  | 0.50 | 0.50 | 0.33 | 5       | 1
JMP da6 <_Z16hadamard_productPfPKfS1_m+0x1c> | 1     | 0    | 0    | 0    | 0    | 0  | 0    | 1    | 0    | 0       | 1-2
General properties
------------------
nb instructions    : 21
nb uops            : 22
loop length        : 90
used x86 registers : 3
used mmx registers : 0
used xmm registers : 2
used ymm registers : 0
used zmm registers : 0
nb stack references: 5
Front-end
---------
MACRO FUSION NOT POSSIBLE
FIT IN UOP CACHE
micro-operation queue: 3.67 cycles
front end            : 3.67 cycles
Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 3.00 | 3.00 | 5.50 | 5.50 | 2.00 | 3.00 | 3.00 | 2.00
cycles | 3.00 | 3.00 | 5.50 | 5.50 | 2.00 | 3.00 | 3.00 | 2.00
Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 0.00
Cycles summary
--------------
Front-end : 3.67
Dispatch  : 5.50
Data deps.: 0.00
Overall L1: 5.50
Vectorization ratios
--------------------
INT
  all    : 0%
  load   : 0%
  store  : 0%
  mul    : NA (no mul vectorizable/vectorized instructions)
  add-sub: NA (no add-sub vectorizable/vectorized instructions)
  other  : NA (no other vectorizable/vectorized instructions)
FP
  all    : 0%
  load   : 0%
  store  : 0%
  mul    : 0%
  add-sub: NA (no add-sub vectorizable/vectorized instructions)
  other  : NA (no other vectorizable/vectorized instructions)
INT+FP
  all    : 0%
  load   : 0%
  store  : 0%
  mul    : 0%
  add-sub: NA (no add-sub vectorizable/vectorized instructions)
  other  : NA (no other vectorizable/vectorized instructions)
Vector efficiency ratios
------------------------
INT
  all    : 25%
  load   : 25%
  store  : 25%
  mul    : NA (no mul vectorizable/vectorized instructions)
  add-sub: NA (no add-sub vectorizable/vectorized instructions)
  other  : NA (no other vectorizable/vectorized instructions)
FP
  all    : 12%
  load   : 12%
  store  : 12%
  mul    : 12%
  add-sub: NA (no add-sub vectorizable/vectorized instructions)
  other  : NA (no other vectorizable/vectorized instructions)
INT+FP
  all    : 15%
  load   : 16%
  store  : 18%
  mul    : 12%
  add-sub: NA (no add-sub vectorizable/vectorized instructions)
  other  : NA (no other vectorizable/vectorized instructions)
Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 5.50 cycles. At this rate:
- 22% of peak load performance is reached (14.55 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
- 6% of peak store performance is reached (2.18 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))
Front-end bottlenecks
---------------------
Found no such bottlenecks.
All innermost loops were analyzed.
Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m