Outline
Main Page
1 : Introduction to High Performance Computing
2 : Basic use of CMake
2.1 : What is CMake ?
2.2 : Hello world with CMake
3 : Starting the project
4 : Several useful CMake functions
4.1 : The runExample function
4.2 : The runPythonExample function
4.3 : The plotPerf function
4.4 : Summary
4.5 : Functions to check the Python environment and build a Python module
4.5.1 : Check the environment
4.5.2 : Make the Python module
4.5.3 : Summary
5 : Creation of an HPC/Timer library
5.1 : The rdtsc files
5.1.1 : The header (timer.h)
5.1.2 : The source (timer.cpp)
5.2 : The allocation/deallocation files
5.2.1 : The header (asterics_alloc.h)
5.2.2 : The source (asterics_alloc.cpp)
5.3 : The main header (asterics_hpc.h)
5.4 : The CMakeLists.txt
5.5 : The compilation
5.6 : The associated python module
5.6.1 : Wrapper of the timer
5.6.2 : Wrapper of the tables allocation
5.6.3 : Wrapper of the matrices allocation
5.6.4 : The wrapper module source : astericshpc.cpp
5.6.5 : The module configuration : setup.py
5.6.6 : The python install cmake script
5.6.7 : The CMakeLists.txt
6 : Optimisation of Hadamard product
6.1 : What is the Hadamard product ?
6.2 : Main to evaluate the Hadamard product
6.3 : The CMakeLists.txt file
6.4 : Get the performances
6.5 : The first performances
6.6 : How to vectorize the computation
6.6.1 : What is vectorization ?
6.6.2 : Automatic vectorization (by the compiler)
6.6.2.1 : Things to verify before vectorizing
6.6.2.2 : The full main_vectorize.cpp file
6.6.2.3 : The CMakeLists.txt file
6.6.2.4 : Compilation
6.6.2.5 : The performances with vectorization by the compiler
6.6.3 : Manual vectorization (by Intrinsic functions)
6.6.3.1 : Beginning of the main_intrinsics.cpp file
6.6.3.2 : The hadamard_product function
6.6.3.3 : The function to evaluate performances
6.6.3.4 : The main function
6.6.3.5 : Full main_intrinsics.cpp file
6.6.3.6 : The CMakeLists.txt file
6.6.3.7 : Compilation
6.6.3.8 : The performances with Intrinsics
6.6.4 : Conclusion on vectorization
6.7 : How to create a hadamard python module
6.7.1 : The C++ kernel
6.7.2 : The wrapper function
6.7.3 : The C++ module file
6.7.4 : The setup.py file
6.7.5 : Performance tests
6.7.5.1 : A naive implementation of the hadamard product
6.7.5.2 : Hadamard product with numpy functions
6.7.5.3 : Hadamard product with our intrinsics pitch implementation
6.7.5.4 : What happens if I use Python lists instead of numpy arrays for the naive implementation ?
6.7.6 : The CMakeLists.txt file
6.7.7 : Performance results
6.7.7.1 : Basic performances
6.7.7.2 : And the lists ?
6.7.7.3 : Summary
7 : Optimisation of saxpy
7.1 : What is a Saxpy ?
7.2 : The classical approach
7.2.1 : The main.cpp
7.2.2 : The CMakeLists.txt
7.2.3 : The compilation
7.2.4 : The performances
7.3 : The vectorization of Saxpy
7.3.1 : The main_vectorize.cpp
7.3.2 : The CMakeLists.txt
7.3.3 : The compilation
7.3.4 : The performances
7.4 : The intrinsics version of Saxpy
7.4.1 : The main_intrinsics.cpp
7.4.2 : The CMakeLists.txt
7.4.3 : The compilation
7.4.4 : The performances
7.5 : How to create a saxpy python module
7.5.1 : The C++ kernel
7.5.2 : The wrapper function
7.5.3 : The C++ module file
7.5.4 : The setup.py file
7.5.5 : Performance tests
7.5.5.1 : A naive implementation of the saxpy
7.5.5.2 : Saxpy with numpy functions
7.5.5.3 : Saxpy with our intrinsics implementation
7.5.6 : The CMakeLists.txt file
7.5.7 : Performance results
7.5.7.1 : Basic performances
7.5.7.2 : Summary
8 : Optimisation of a reduction
8.1 : What is a reduction ?
8.2 : The classical approach
8.2.1 : The main.cpp
8.2.2 : The CMakeLists.txt
8.2.3 : The compilation
8.2.4 : The performances
8.2.5 : Solving the performance problem
8.2.5.1 : The reduction.h file
8.2.5.2 : The reduction.cpp file
8.2.5.3 : The main_reduction.cpp file
8.2.5.4 : The CMakeLists.txt file
8.2.5.5 : The compilation
8.2.5.6 : The performances
8.3 : The vectorization of reduction
8.3.1 : The reduction_vectorize.h
8.3.2 : The reduction_vectorize.cpp
8.3.3 : The main_vectorize.cpp
8.3.4 : The CMakeLists.txt
8.3.4.1 : The compilation
8.3.4.2 : The performances
8.4 : The vectorization of reduction with intrinsic functions
8.4.1 : The reduction_intrinsics.h file
8.4.2 : The reduction_intrinsics.cpp file
8.4.3 : The main_intrinsics.cpp
8.4.4 : The CMakeLists.txt file
8.4.5 : The compilation
8.4.6 : The performances
8.5 : How to optimize more
8.5.1 : Interleaving 2 times
8.5.1.1 : The reduction_intrinsics_interleave2.h file
8.5.1.2 : The reduction_intrinsics_interleave2.cpp file
8.5.1.3 : The main_intrinsics_interleave2.cpp file
8.5.1.4 : The CMakeLists.txt file
8.5.1.5 : The compilation
8.5.1.6 : The performances
8.5.2 : Interleaving 4 times
8.5.2.1 : The reduction_intrinsics_interleave4.h file
8.5.2.2 : The reduction_intrinsics_interleave4.cpp file
8.5.2.3 : The main_intrinsics_interleave4.cpp file
8.5.2.4 : The CMakeLists.txt file
8.5.2.5 : The compilation
8.5.2.6 : The performances
8.5.3 : Interleaving 8 times
8.5.3.1 : The reduction_intrinsics_interleave8.h file
8.5.3.2 : The reduction_intrinsics_interleave8.cpp file
8.5.3.3 : The main_intrinsics_interleave8.cpp file
8.5.3.4 : The CMakeLists.txt file
8.5.3.5 : The compilation
8.5.3.6 : The performances
8.5.4 : Summary
8.6 : How to create a reduction python module
8.6.1 : The wrapper function
8.6.2 : The C++ module file
8.6.3 : The setup.py file
8.6.4 : Performance tests
8.6.4.1 : Reduction with numpy functions
8.6.4.2 : Reduction with our intrinsics implementation
8.6.5 : The CMakeLists.txt file
8.6.6 : Performance results
8.6.6.1 : Basic performances
8.6.6.2 : Summary
9 : Application/exercise : Optimisation of barycentre computation
9.1 : What is a barycentre ?
9.2 : The classical approach
9.2.1 : The barycentre.h file
9.2.2 : The barycentre.cpp file
9.2.3 : The main_barycentre.cpp file
9.2.4 : The CMakeLists.txt file
9.2.5 : The compilation
9.2.6 : The performances
9.3 : The vectorization of barycentre
9.3.1 : The barycentre_vectorize.h
9.3.2 : The barycentre_vectorize.cpp
9.3.3 : The main_barycentre_vectorize.cpp
9.3.4 : The barycentre_vectorizeSplit.h
9.3.5 : The barycentre_vectorizeSplit.cpp
9.3.6 : The main_barycentre_vectorizeSplit.cpp
9.3.7 : The CMakeLists.txt
9.3.8 : The compilation
9.3.9 : The performances
9.4 : The intrinsics version of barycentre
9.4.1 : The barycentre_intrinsics.h file
9.4.2 : The barycentre_intrinsics.cpp file
9.4.3 : The CMakeLists.txt file
9.4.4 : The compilation
9.4.5 : The performances
9.5 : How to create a barycentre python module
9.5.1 : The wrapper function
9.5.2 : The C++ module file
9.5.3 : The setup.py file
9.5.4 : Performance tests
9.5.4.1 : Barycentre with numpy functions
9.5.4.2 : Barycentre with our intrinsics implementation
9.5.5 : The CMakeLists.txt file
9.5.6 : Performance results
9.5.6.1 : Basic performances
9.5.6.2 : Summary
10 : Optimisation of Dense Matrix-Matrix multiplication
10.1 : What is a SGEMM ?
10.2 : The classical approach
10.2.1 : The sgemm.h file
10.2.2 : The sgemm.cpp file
10.2.3 : The main_sgemm.cpp file
10.2.4 : The CMakeLists.txt file
10.2.5 : The compilation
10.2.6 : The performances
10.3 : Let's swap the loops over j and k
10.3.1 : The sgemm_swap.h file
10.3.2 : The sgemm_swap.cpp file
10.3.3 : The main_sgemm_swap.cpp file
10.3.4 : The CMakeLists.txt file
10.3.5 : The compilation
10.3.6 : The performances
10.4 : Vectorization
10.4.1 : The sgemm_vectorize.h file
10.4.2 : The sgemm_vectorize.cpp file
10.4.3 : The main_sgemm_vectorize.cpp file
10.4.4 : The CMakeLists.txt file
10.4.5 : The compilation
10.4.6 : The performances
10.5 : Intrinsics implementation
10.5.1 : The sgemm_intrinsics.h file
10.5.2 : The sgemm_intrinsics.cpp file
10.5.3 : The main_sgemm_intrinsics.cpp file
10.5.4 : The CMakeLists.txt file
10.5.5 : The compilation
10.5.6 : The performances
10.6 : Intrinsics implementation with a pitch
10.6.1 : The sgemm_intrinsics_pitch.h file
10.6.2 : The sgemm_intrinsics_pitch.cpp file
10.6.3 : The main_sgemm_intrinsics_pitch.cpp file
10.6.4 : The CMakeLists.txt file
10.6.5 : The compilation
10.6.6 : The performances
10.7 : How to create a sgemm python module
10.7.1 : The wrapper function
10.7.2 : The C++ module file
10.7.3 : The setup.py file
10.7.4 : Performance tests
10.7.4.1 : Sgemm with numpy functions
10.7.4.2 : Sgemm with our intrinsics implementation
10.7.5 : The CMakeLists.txt file
10.7.6 : Performance results
10.7.6.1 : Basic performances
10.7.6.2 : Summary
11 : What about branching ? (bonus)
11.1 : Classical implementation
11.1.1 : The main.cpp file
11.1.2 : The CMakeLists.txt file
11.1.3 : The compilation
11.1.4 : The performances
11.2 : Implementation without if
11.2.1 : The main_optimise.cpp file
11.2.2 : The CMakeLists.txt file
11.2.3 : The compilation
11.2.4 : The performances
11.3 : Vectorization Implementation
11.3.1 : The main_vectorize.cpp file
11.3.2 : The CMakeLists.txt file
11.3.3 : The compilation
11.3.4 : The performances
11.4 : Intrinsics Implementation
11.4.1 : The main_intrinsics.cpp file
11.4.2 : The CMakeLists.txt file
11.4.3 : The compilation
11.4.4 : The performances