Code analysis with maqao

Chapter 3.7 : Code analysis with maqao

Maqao is a program that analyses binaries and checks whether a given loop in a function is vectorized, and whether the number of loads relative to the number of writes is efficient. Here is the maqao main page. You can download the binaries for Linux here.

Maqao can be used on the fly (like Valgrind), but one of its most powerful features is that it can analyse the binaries produced by the compiler without executing them. This static binary analysis is performed by the cqa tool (Code Quality Analyser).

We will use one parameter :

fct-loops : specifies the name of the function to be analysed

Calls to Maqao's cqa will look something like :

maqao.intel64 cqa fct-loops=MyFunction ./myBinary

The last parameter is always the binary to analyse (it can be a library or a program).

Note : Maqao will not analyse a function that has more than 8 branches, so you should use it only on computing kernels (where performance matters).

Note : like Valgrind, Maqao uses the debugging symbols to give you feedback about the functions and loops you want to analyse.
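As a minimal sketch of the kind of function this workflow targets, here is a small kernel with a single loop and no branch in its body. The function name hadamard_product, the file name kernel.c and the GCC flags in the closing comment are assumptions made for illustration; they do not come from the Maqao documentation. The kernel is built as a library, since the binary given to cqa can be a library or a program.

```c
#include <stddef.h>

/* Hypothetical kernel, for illustration only: element-wise product of
 * two arrays. It has one loop and no branch in the loop body, so it
 * stays well under the branch limit mentioned above. The restrict
 * qualifiers promise the compiler that the arrays do not overlap. */
void hadamard_product(float *restrict out, const float *restrict x,
                      const float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = x[i] * y[i];
    }
}

/* Assumed build and analysis commands (GCC; adapt to your compiler).
 * -g keeps the debugging symbols that the report relies on.
 *   gcc -O3 -g -shared -fPIC kernel.c -o libkernel.so
 *   maqao.intel64 cqa fct-loops=hadamard_product ./libkernel.so
 */
```

Keeping the kernel in its own function, rather than inline in a larger routine, makes it easy to name in fct-loops.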

The Maqao report provides several pieces of information :

Composition and unrolling : is the loop vectorized ? Is there a peel or tail loop ?
Code clean check : are there bottlenecks ? They can be due to scalar instructions
Vectorization : is the code vectorized ? This time in detail, with the number of reads versus the number of writes
Workaround : tricks to speed up execution (is it possible to change the data structure or the compilation options ?)
Execution units bottlenecks : list of the bottlenecks of the code
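To make the peel and tail vocabulary of the report concrete, the following hand-written scalar sketch mimics the shape a compiler often gives to a vectorized loop: a peel loop up to an aligned address, a main loop over full vector-sized blocks, and a tail loop for the remainder. The names scale and VEC_WIDTH, and the choice of 8 floats per vector (a 256-bit register), are assumptions for illustration; real compiler output differs in detail.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_WIDTH 8 /* assumed: 8 floats fit in one 256-bit vector register */

/* Hypothetical kernel: multiply every element of data by factor. */
void scale(float *data, float factor, size_t n)
{
    size_t i = 0;

    /* Peel loop: scalar iterations until data + i is aligned on a
     * vector boundary (32 bytes here). */
    while (i < n &&
           ((uintptr_t)(data + i) % (VEC_WIDTH * sizeof(float))) != 0) {
        data[i] *= factor;
        ++i;
    }

    /* Main loop: full blocks of VEC_WIDTH elements. The inner loop
     * stands in for what would be a single vector instruction. */
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH) {
        for (size_t j = 0; j < VEC_WIDTH; ++j) {
            data[i + j] *= factor;
        }
    }

    /* Tail loop: the fewer than VEC_WIDTH elements that remain. */
    for (; i < n; ++i) {
        data[i] *= factor;
    }
}
```

When the peel and tail loops handle a large share of the iterations (short arrays, unaligned data), the vectorized main loop contributes less to the overall speed.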

Note : For each workaround, Maqao tries to estimate the speed-up of the change it suggests. However, this estimate generally does not take the CPU pipeline into account, so the real speed-up can differ.

In advanced mode, more information is available :

Complex instructions : list of complex instructions
Arithmetic intensity : the arithmetic intensity of the computation
Unroll opportunity : is it possible to unroll the loop ?
Assembler instructions : details of the binary implementation
Vectorization ratios : ratio of vectorized to non-vectorized computation
Vector efficiency ratios : is the vectorization efficient ? (with details for each operator)
Cycles and memory resources usage : are you making good use of the L1 cache ?
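As a rough illustration of arithmetic intensity, the sketch below counts the work of a saxpy-like loop by hand, assuming the common definition: floating-point operations divided by bytes loaded and stored. The exact formula Maqao uses may differ, and the names saxpy and arithmetic_intensity are hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * Per iteration: 2 floating-point operations (one multiply, one add),
 * 2 loads (x[i], y[i]) and 1 store (y[i]) of 4 bytes each. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Assumed definition: operations per byte moved between memory and
 * registers, computed from per-iteration counts. */
double arithmetic_intensity(double flops_per_iter, double loads_per_iter,
                            double stores_per_iter, double bytes_per_element)
{
    return flops_per_iter /
           ((loads_per_iter + stores_per_iter) * bytes_per_element);
}
```

For the saxpy loop, arithmetic_intensity(2, 2, 1, 4) gives 2 / 12, about 0.17 operations per byte. A value this low suggests the loop is limited by memory traffic rather than by computation.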

Generally, the default report is sufficient, but the complete report matters to a ninja programmer. In the following we will look at both reports; in our case the advanced report is very useful for understanding what is going on and why one compilation is better than another.

The sections Vector efficiency ratios and Cycles and memory resources usage will be particularly important for evaluating the efficiency of our functions.