Outline

2 : Basic use of CMake

2.1 : What is CMake ?

2.2 : Hello world with CMake

3 : Starting the project

4 : Several useful CMake functions

4.1 : The runExample function

4.2 : The runPythonExample function

4.3 : The plotPerf function

4.4 : Summary

4.5 : Functions to check the Python environment and build a Python module

4.5.1 : Check the environment

4.5.2 : Make python module

4.5.3 : Summary

5 : Creation of an HPC/Timer library

5.1 : The rdtsc files

5.1.1 : The header (timer.h)

5.1.2 : The source (timer.cpp)

5.2 : The allocation/deallocation files

5.2.1 : The header (asterics_alloc.h)

5.2.2 : The source (asterics_alloc.cpp)

5.3 : The main header (asterics_hpc.h)

5.4 : The CMakeLists.txt

5.5 : The compilation

5.6 : The associated python module

5.6.1 : Wrapper of the timer

5.6.2 : Wrapper of the tables allocation

5.6.3 : Wrapper of the matrices allocation

5.6.4 : The wrapper module source : astericshpc.cpp

5.6.5 : The module configuration : setup.py

5.6.6 : The python install cmake script

5.6.7 : The CMakeLists.txt

6 : Optimisation of Hadamard product

6.1 : What is the Hadamard product ?

6.2 : Main to evaluate the Hadamard product

6.3 : The CMakeLists.txt file

6.4 : Get the performances

6.5 : The first performances

6.6 : How to vectorize the computation

6.6.1 : What is vectorization ?

6.6.2 : Automatic vectorization (by the compiler)

6.6.2.1 : Things to verify before vectorizing

6.6.2.2 : The full main_vectorize.cpp file

6.6.2.3 : The CMakeLists.txt file

6.6.2.4 : Compilation

6.6.2.5 : The performances with vectorization by the compiler

6.6.3 : Manual vectorization (by Intrinsic functions)

6.6.3.1 : Beginning of the main_intrinsics.cpp file

6.6.3.2 : The hadamard_product function

6.6.3.3 : The function to evaluate performances

6.6.3.4 : The main function

6.6.3.5 : Full main_intrinsics.cpp file

6.6.3.6 : The CMakeLists.txt file

6.6.3.7 : Compilation

6.6.3.8 : The performances with Intrinsics

6.6.4 : Conclusion on vectorization

6.7 : How to create a hadamard python module

6.7.1 : The C++ kernel

6.7.2 : The wrapper function

6.7.3 : The C++ module file

6.7.4 : The setup.py file

6.7.5 : Performance tests

6.7.5.1 : A naive implementation of the hadamard product

6.7.5.2 : Hadamard product with numpy functions

6.7.5.3 : Hadamard product with our intrinsics pitch implementation

6.7.5.4 : What happens if I use Python lists instead of numpy arrays for the naive implementation ?

6.7.6 : The CMakeLists.txt file

6.7.7 : Performance results

6.7.7.1 : Basic performances

6.7.7.2 : And the lists ?

6.7.7.3 : Summary

7 : Optimisation of saxpy

7.1 : What is a Saxpy ?

7.2 : The classical approach

7.2.1 : The main.cpp

7.2.2 : The CMakeLists.txt

7.2.3 : The compilation

7.2.4 : The performances

7.3 : The vectorization of Saxpy

7.3.1 : The main_vectorize.cpp

7.3.2 : The CMakeLists.txt

7.3.3 : The compilation

7.3.4 : The performances

7.4 : The intrinsics version of Saxpy

7.4.1 : The main_intrinsics.cpp

7.4.2 : The CMakeLists.txt

7.4.3 : The compilation

7.4.4 : The performances

7.5 : How to create a saxpy python module

7.5.1 : The C++ kernel

7.5.2 : The wrapper function

7.5.3 : The C++ module file

7.5.4 : The setup.py file

7.5.5 : Performance tests

7.5.5.1 : A naive implementation of the saxpy

7.5.5.2 : Saxpy with numpy functions

7.5.5.3 : Saxpy with our intrinsics implementation

7.5.6 : The CMakeLists.txt file

7.5.7 : Performance results

7.5.7.1 : Basic performances

7.5.7.2 : Summary

8 : Optimisation of a reduction

8.1 : What is a reduction ?

8.2 : The classical approach

8.2.1 : The main.cpp

8.2.2 : The CMakeLists.txt

8.2.3 : The compilation

8.2.4 : The performances

8.2.5 : Solving the performance problem

8.2.5.1 : The reduction.h file

8.2.5.2 : The reduction.cpp file

8.2.5.3 : The main_reduction.cpp file

8.2.5.4 : The CMakeLists.txt file

8.2.5.5 : The compilation

8.2.5.6 : The performances

8.3 : The vectorization of reduction

8.3.1 : The reduction_vectorize.h

8.3.2 : The reduction_vectorize.cpp

8.3.3 : The main_vectorize.cpp

8.3.4 : The CMakeLists.txt

8.3.4.1 : The compilation

8.3.4.2 : The performances

8.4 : The vectorization of reduction with intrinsic functions

8.4.1 : The reduction_intrinsics.h file

8.4.2 : The reduction_intrinsics.cpp file

8.4.3 : The main_intrinsics.cpp

8.4.4 : The CMakeLists.txt file

8.4.5 : The compilation

8.4.6 : The performances

8.5 : How to optimize more

8.5.1 : Interleaving 2 times

8.5.1.1 : The reduction_intrinsics_interleave2.h file

8.5.1.2 : The reduction_intrinsics_interleave2.cpp file

8.5.1.3 : The main_intrinsics_interleave2.cpp file

8.5.1.4 : The CMakeLists.txt file

8.5.1.5 : The compilation

8.5.1.6 : The performances

8.5.2 : Interleaving 4 times

8.5.2.1 : The reduction_intrinsics_interleave4.h file

8.5.2.2 : The reduction_intrinsics_interleave4.cpp file

8.5.2.3 : The main_intrinsics_interleave4.cpp file

8.5.2.4 : The CMakeLists.txt file

8.5.2.5 : The compilation

8.5.2.6 : The performances

8.5.3 : Interleaving 8 times

8.5.3.1 : The reduction_intrinsics_interleave8.h file

8.5.3.2 : The reduction_intrinsics_interleave8.cpp file

8.5.3.3 : The main_intrinsics_interleave8.cpp file

8.5.3.4 : The CMakeLists.txt file

8.5.3.5 : The compilation

8.5.3.6 : The performances

8.5.4 : Summary

8.6 : How to create a reduction python module

8.6.1 : The wrapper function

8.6.2 : The C++ module file

8.6.3 : The setup.py file

8.6.4 : Performance tests

8.6.4.1 : Reduction with numpy functions

8.6.4.2 : Reduction with our intrinsics implementation

8.6.5 : The CMakeLists.txt file

8.6.6 : Performance results

8.6.6.1 : Basic performances

8.6.6.2 : Summary

9 : Application/exercise : Optimisation of barycentre computation

9.1 : What is a barycentre ?

9.2 : The classical approach

9.2.1 : The barycentre.h file

9.2.2 : The barycentre.cpp file

9.2.3 : The main_barycentre.cpp file

9.2.4 : The CMakeLists.txt file

9.2.5 : The compilation

9.2.6 : The performances

9.3 : The vectorization of barycentre

9.3.1 : The barycentre_vectorize.h

9.3.2 : The barycentre_vectorize.cpp

9.3.3 : The main_barycentre_vectorize.cpp

9.3.4 : The barycentre_vectorizeSplit.h

9.3.5 : The barycentre_vectorizeSplit.cpp

9.3.6 : The main_barycentre_vectorizeSplit.cpp

9.3.7 : The CMakeLists.txt

9.3.8 : The compilation

9.3.9 : The performances

9.4 : The intrinsics version of barycentre

9.4.1 : The barycentre_intrinsics.h file

9.4.2 : The barycentre_intrinsics.cpp file

9.4.3 : The CMakeLists.txt file

9.4.4 : The compilation

9.4.5 : The performances

9.5 : How to create a barycentre python module

9.5.1 : The wrapper function

9.5.2 : The C++ module file

9.5.3 : The setup.py file

9.5.4 : Performance tests

9.5.4.1 : Barycentre with numpy functions

9.5.4.2 : Barycentre with our intrinsics implementation

9.5.5 : The CMakeLists.txt file

9.5.6 : Performance results

9.5.6.1 : Basic performances

9.5.6.2 : Summary

10 : Optimisation of Dense Matrix-Matrix multiplication

10.1 : What is a SGEMM ?

10.2 : The classical approach

10.2.1 : The sgemm.h file

10.2.2 : The sgemm.cpp file

10.2.3 : The main_sgemm.cpp file

10.2.4 : The CMakeLists.txt file

10.2.5 : The compilation

10.2.6 : The performances

10.3 : Let's swap the loops over j and k

10.3.1 : The sgemm_swap.h file

10.3.2 : The sgemm_swap.cpp file

10.3.3 : The main_sgemm_swap.cpp file

10.3.4 : The CMakeLists.txt file

10.3.5 : The compilation

10.3.6 : The performances

10.4 : Vectorization

10.4.1 : The sgemm_vectorize.h file

10.4.2 : The sgemm_vectorize.cpp file

10.4.3 : The main_sgemm_vectorize.cpp file

10.4.4 : The CMakeLists.txt file

10.4.5 : The compilation

10.4.6 : The performances

10.5 : Intrinsics implementation

10.5.1 : The sgemm_intrinsics.h file

10.5.2 : The sgemm_intrinsics.cpp file

10.5.3 : The main_sgemm_intrinsics.cpp file

10.5.4 : The CMakeLists.txt file

10.5.5 : The compilation

10.5.6 : The performances

10.6 : Intrinsics implementation with a pitch

10.6.1 : The sgemm_intrinsics_pitch.h file

10.6.2 : The sgemm_intrinsics_pitch.cpp file

10.6.3 : The main_sgemm_intrinsics_pitch.cpp file

10.6.4 : The CMakeLists.txt file

10.6.5 : The compilation

10.6.6 : The performances

10.7 : How to create a sgemm python module

10.7.1 : The wrapper function

10.7.2 : The C++ module file

10.7.3 : The setup.py file

10.7.4 : Performance tests

10.7.4.1 : Sgemm with numpy functions

10.7.4.2 : Sgemm with our intrinsics implementation

10.7.5 : The CMakeLists.txt file

10.7.6 : Performance results

10.7.6.1 : Basic performances

10.7.6.2 : Summary

11 : What about branching ? (bonus)

11.1 : Classical implementation

11.1.1 : The main.cpp file

11.1.2 : The CMakeLists.txt file

11.1.3 : The compilation

11.1.4 : The performances

11.2 : Implementation without if

11.2.1 : The main_optimise.cpp file

11.2.2 : The CMakeLists.txt file

11.2.3 : The compilation

11.2.4 : The performances

11.3 : Vectorization Implementation

11.3.1 : The main_vectorize.cpp file

11.3.2 : The CMakeLists.txt file

11.3.3 : The compilation

11.3.4 : The performances

11.4 : Intrinsics Implementation

11.4.1 : The main_intrinsics.cpp file

11.4.2 : The CMakeLists.txt file

11.4.3 : The compilation

11.4.4 : The performances