Previous: Compilation Ofast | Parent: Code analysis with maqao | Outline | Next: Intrinsics compilation analysis with maqao
maqao.intel64 cqa fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_vectorize

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================
Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_vectorize.cpp
Section 1.1: Source loop ending at line 24
==========================================
Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.
Section 1.1.1: Binary loop #0
=============================
The loop is defined in Examples/1-HadamardProduct/main_vectorize.cpp:24-24.
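The report only names main_vectorize.cpp:24; the source itself is not reproduced here. A minimal sketch of what the analyzed kernel presumably looks like (the signature is recovered from the demangled symbol above; the body and line placement are assumptions):

```cpp
#include <cstddef>

// Presumed shape of the analyzed kernel. The signature matches the
// demangled symbol hadamard_product(float*, float const*, float const*,
// unsigned long); the loop body is an illustrative assumption.
void hadamard_product(float* out, const float* a, const float* b,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)  // presumably line 24
        out[i] = a[i] * b[i];            // element-wise (Hadamard) product
}
```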
The related source loop is not unrolled or unrolled with no peel/tail loop.
21% of peak computational performance is used (6.86 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))
Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
Workaround(s):
 - Try to reorganize arrays of structures to structures of arrays
 - Consider to permute loops (see vectorization gain report)
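The scalar integer instructions CQA flags here correspond to the loop's counter maintenance, visible in the ASM dump further down (an element counter and a byte offset are both updated each iteration). A hand rewrite sometimes tried in this situation is to carry a single pointer induction; the function name below is illustrative, and whether the compiler actually drops the extra ADD/CMP depends on its code generation:

```cpp
#include <cstddef>

// Hypothetical rewrite with a single moving end-pointer comparison.
// Whether this removes the extra scalar integer work is compiler-dependent.
void hadamard_product_ptr(float* out, const float* a, const float* b,
                          std::size_t n) {
    const float* a_end = a + n;
    for (; a != a_end; ++a, ++b, ++out)
        *out = *a * *b;
}
```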
Vectorization
-------------
Your loop is fully vectorized, using full register length.
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
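The aligned VMOVAPS loads in the ASM dump further down suggest the compiler could assume 32-byte alignment and non-aliasing pointers. A hedged sketch of giving it both guarantees explicitly, using the GCC/Clang `__builtin_assume_aligned` intrinsic (the function name is illustrative, not the original code):

```cpp
#include <cstddef>

// Illustrative variant: __restrict promises no aliasing between the three
// arrays, and __builtin_assume_aligned (GCC/Clang) promises 32-byte
// alignment, so the compiler can emit aligned full-width AVX accesses
// without runtime checks. Callers must actually uphold both promises.
void hadamard_product_hint(float* __restrict out,
                           const float* __restrict a,
                           const float* __restrict b,
                           std::size_t n) {
    out = static_cast<float*>(__builtin_assume_aligned(out, 32));
    a   = static_cast<const float*>(__builtin_assume_aligned(a, 32));
    b   = static_cast<const float*>(__builtin_assume_aligned(b, 32));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}
```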
Execution units bottlenecks
---------------------------
Found no such bottlenecks but see expert reports for more complex bottlenecks.
All innermost loops were analyzed.
Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
maqao.intel64 cqa conf=all fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_vectorize

Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).
Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m

Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_vectorize.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0 and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in Examples/1-HadamardProduct/main_vectorize.cpp:24-24.

The related source loop is not unrolled or unrolled with no peel/tail loop.
21% of peak computational performance is used (6.86 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation). By removing them, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
Workaround(s):
 - Try to reorganize arrays of structures to structures of arrays
 - Consider to permute loops (see vectorization gain report)

Vectorization
-------------
Your loop is fully vectorized, using full register length.
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Found no such bottlenecks but see expert reports for more complex bottlenecks.
Type of elements and instruction set
------------------------------------
1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 8: multiply
The binary loop is loading 64 bytes (16 single precision FP elements).
The binary loop is storing 32 bytes (8 single precision FP elements).
Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.
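The 0.08 figure follows directly from the counts in the matching report above: 8 multiplies against 64 loaded plus 32 stored bytes per iteration. As a sanity check:

```cpp
// Arithmetic intensity from the per-iteration counts reported by CQA.
double arithmetic_intensity() {
    const double flops        = 8.0;   // 8 single precision multiplies
    const double bytes_loaded = 64.0;  // 16 floats loaded
    const double bytes_stored = 32.0;  // 8 floats stored
    return flops / (bytes_loaded + bytes_stored);  // 8/96 = 1/12, ~0.083
}
```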
Unroll opportunity
------------------
Loop body is too small to efficiently use resources.
Workaround(s): Unroll your loop if trip count is significantly higher than target unroll factor. This can be done manually. Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.
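Manual unrolling, as the workaround suggests, could look like the following sketch (factor 4 with a scalar remainder loop; the factor and the function name are illustrative, and recompiling with -funroll-loops is usually the less error-prone route):

```cpp
#include <cstddef>

// Illustrative 4x manual unroll: the main body amortizes the counter
// updates and branch over four elements; a remainder loop handles n % 4.
void hadamard_product_unrolled(float* out, const float* a, const float* b,
                               std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     * b[i];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)  // remainder iterations
        out[i] = a[i] * b[i];
}
```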
ASM code
--------
In the binary file, the address of the loop is: 1038
Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                  | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
ADD $0x1,%R8                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0             | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                  | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x20,%RAX                               | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %R9,%R8                                  | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JB 1038 <_Z16hadamard_productPfPKfS1_m+0x28> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50
General properties
------------------
nb instructions    : 7
nb uops            : 6
loop length        : 28
used x86 registers : 6
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0
Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.17 cycles
front end            : 1.17 cycles
Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cycles | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00
Cycles summary
--------------
Front-end : 1.17
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.17
Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)
Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)
Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.17 cycles. At this rate:
 - 85% of peak load performance is reached (54.86 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 85% of peak store performance is reached (27.43 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))
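The bandwidth figures are consistent with the iteration cost. Assuming the printed 1.17 cycles is the rounded value of 7/6 (an assumption, but one that reproduces the report's own 54.86 and 27.43 numbers), dividing the per-iteration byte counts by the cycle count gives:

```cpp
// Achieved L1 bandwidth implied by the reported iteration cost.
// Assumption: the printed 1.17 cycles is exactly 7/6.
double achieved_load_bytes_per_cycle() {
    const double cycles_per_iter = 7.0 / 6.0;  // reported as 1.17
    return 64.0 / cycles_per_iter;             // 64 bytes loaded/iteration
}

double achieved_store_bytes_per_cycle() {
    const double cycles_per_iter = 7.0 / 6.0;
    return 32.0 / cycles_per_iter;             // 32 bytes stored/iteration
}
```

Both values land at roughly 85.7% of the 64- and 32-byte peaks, matching the "85%" lines above.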
Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core): the front-end is a bottleneck.
By removing all these bottlenecks, you can lower the cost of an iteration from 1.17 to 1.00 cycles (1.17x speedup).
All innermost loops were analyzed.