Hi everyone!

I have successfully passed the Google Summer of Code 2016 and below is the copy of my final report.

Dear LLVM contributors,

during the Google Summer of Code 2016 I have been working on the "Improvement of vectorization process in Polly" project.

I planned to add a mode that uses the same Polly tool flow and processes statements, which contain only matrix multiplication of the following form, in a special way.

C := AB + C, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively.

It was also planned to implement determination of vectorization factors and profitability of vectorization.

We decided to adjust the goals and focus only on the optimization of matrix multiplication and its generalization. In the result, we optimize a class of statements (see [1] for further details), which also contains the matrix multiplication.

The result of evaluations of the corresponding code compiled with r279395 of Clang on the Intel Core i7-3820 SandyBridge, for different values of m, n, k and different compiler options can be found on the following link [2].

According to the figure [2], the algorithm for the computation of the optimal values of the parameters of matrix multiplication can be improved. One way is a new algorithm for the computation of the parameter Mc (see [1] for further details). If we, for example, manually equate the Mc to Nc we get the following results [3]. Its improvement is planned for the future.

The class also contains general matrix multiplication of the form C := αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively. In case, for example, of m = n = 1056 and k = 1024, α = 32412, β = 2123, the corresponding code compiled with r278666 of Clang with following options

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-througput-vector-fma=1 -mllvm -polly-target-latency-vector-fma=8

takes about 0.30 on the Intel Core i7-3820 SandyBridge. In case the described optimization is disabled, it takes about 0.75 and 2.1 in case Polly is not used.

Furthermore, using the optimization we can optimize the following examples of code:

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += 32412 Op1 B[k][j]

endfor

endfor

endfor

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] = C[i][j] Op1 A[i][k] Op2 B[k][j]

endfor

endfor

endfor

where Opi is an operation from the set {+, -, /, *} (see [4], [5], [6], [7] for results of their evaluations).

The described optimization is implemented in [8], [9], [10], [11], [12], [13], [14] that are already committed and [15] that is under review. For further details about the project, what code got merged, what code did not get merged, and what is left to do, please see my blog posts [1], [16].

I am planning to continue working on the project as part of my Ph.D. thesis and implement the following:

1. Prefetching of the certain cache lines

The BLIS implementation of matrix-matrix multiplication prefetches only the first cache line, before traversing a given interval of memory. This could confirm that the implementation relies on hardware preteching to prefetch the subsequent lines [17]. Consequently, there is a possibility that its utilization could improve the described optimization.

2. Reduction of the number of the parameters of a target architecture that require to be specified as command line parameters

At the moment, to be able to use the implemented optimization, one should specify parameters of a target architecture like, for example, throughput of vector instructions per clock cycle (see [1] for further details). Reduction of the number of such parameters using the information about the target architecture of the LLVM (e.g., TargetTransformInfo::getArithmeticInstrCost) could be preferable.

3. Generalization of determination of parameters of the transformation

At the moment, for this purpose, we use an algorithm described in "Analytical Modeling is Enough for High Performance BLIS" [1] that was designed to optimize a subclass of the class and could possibly cause performance regression, if we try to apply it to all statements of the class. We could try to create a new algorithm specialized for the whole class.

I would like to thank Michael Kruse, Albert Cohen, Johannes Doerfert and the LLVM community for your comments, reviews, ideas and the opportunity to work on this project.

I would like to especially thank Tobias Grosser for his encouragement and excellent support.

Refs.:

[1] - http://romangareev.blogspot.ru/2016/06/gsoc-2016-report-i.html

[2] - https://drive.google.com/file/d/0B2Wloo-931AoVTRINXp0dDZPSjQ/view?usp=sharing

[3] - https://drive.google.com/file/d/0B2Wloo-931AoWG9jTjg3aEVxRG8/view?usp=sharing

[4] - https://docs.google.com/spreadsheets/d/1RtioBokRePdX2zdFxGxOxTbtY_WnI4moYFCfbujr7S8/pubhtml?gid=0&single=true

[5] - https://docs.google.com/spreadsheets/d/1hUtXH9JsgYUhHCdydVtE7ayNupc9ViZuAG8z_JKevP0/pubhtml?gid=0&single=true

[6] - https://docs.google.com/spreadsheets/d/1NQHqCuhc1A8pVJtN3FuTHrQ2gcyXUYs5T2L8tMXqLE4/pubhtml?gid=0&single=true

[7] - https://docs.google.com/spreadsheets/d/1nr4qZpG5V--Cq5cTLX1TLA7OpfdFqhIGr6-IFuoq1yw/pubhtml?gid=0&single=true

[8] - http://reviews.llvm.org/D21140

[9] - http://reviews.llvm.org/D21491

[10] - http://reviews.llvm.org/D20575

[11] - https://reviews.llvm.org/D22187

[12] - https://reviews.llvm.org/D22828

[13] - https://reviews.llvm.org/D23740

[14] - https://reviews.llvm.org/D23759

[15] - https://reviews.llvm.org/D23260

[16] - http://romangareev.blogspot.ru/2016/08/gsoc-2016-report-ii.html

[17] - http://lists.llvm.org/pipermail/llvm-dev/2016-May/100229.html

I have successfully passed the Google Summer of Code 2016 and below is the copy of my final report.

Dear LLVM contributors,

during the Google Summer of Code 2016 I have been working on the "Improvement of vectorization process in Polly" project.

I planned to add a mode that uses the same Polly tool flow and processes statements, which contain only matrix multiplication of the following form, in a special way.

C := AB + C, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively.

It was also planned to implement determination of vectorization factors and profitability of vectorization.

We decided to adjust the goals and focus only on the optimization of matrix multiplication and its generalization. In the result, we optimize a class of statements (see [1] for further details), which also contains the matrix multiplication.

The result of evaluations of the corresponding code compiled with r279395 of Clang on the Intel Core i7-3820 SandyBridge, for different values of m, n, k and different compiler options can be found on the following link [2].

According to the figure [2], the algorithm for the computation of the optimal values of the parameters of matrix multiplication can be improved. One way is a new algorithm for the computation of the parameter Mc (see [1] for further details). If we, for example, manually equate the Mc to Nc we get the following results [3]. Its improvement is planned for the future.

The class also contains general matrix multiplication of the form C := αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively. In case, for example, of m = n = 1056 and k = 1024, α = 32412, β = 2123, the corresponding code compiled with r278666 of Clang with following options

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-througput-vector-fma=1 -mllvm -polly-target-latency-vector-fma=8

takes about 0.30 on the Intel Core i7-3820 SandyBridge. In case the described optimization is disabled, it takes about 0.75 and 2.1 in case Polly is not used.

Furthermore, using the optimization we can optimize the following examples of code:

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += 32412 Op1 B[k][j]

endfor

endfor

endfor

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] = C[i][j] Op1 A[i][k] Op2 B[k][j]

endfor

endfor

endfor

where Opi is an operation from the set {+, -, /, *} (see [4], [5], [6], [7] for results of their evaluations).

The described optimization is implemented in [8], [9], [10], [11], [12], [13], [14] that are already committed and [15] that is under review. For further details about the project, what code got merged, what code did not get merged, and what is left to do, please see my blog posts [1], [16].

I am planning to continue working on the project as part of my Ph.D. thesis and implement the following:

1. Prefetching of the certain cache lines

The BLIS implementation of matrix-matrix multiplication prefetches only the first cache line, before traversing a given interval of memory. This could confirm that the implementation relies on hardware preteching to prefetch the subsequent lines [17]. Consequently, there is a possibility that its utilization could improve the described optimization.

2. Reduction of the number of the parameters of a target architecture that require to be specified as command line parameters

At the moment, to be able to use the implemented optimization, one should specify parameters of a target architecture like, for example, throughput of vector instructions per clock cycle (see [1] for further details). Reduction of the number of such parameters using the information about the target architecture of the LLVM (e.g., TargetTransformInfo::getArithmeticInstrCost) could be preferable.

3. Generalization of determination of parameters of the transformation

At the moment, for this purpose, we use an algorithm described in "Analytical Modeling is Enough for High Performance BLIS" [1] that was designed to optimize a subclass of the class and could possibly cause performance regression, if we try to apply it to all statements of the class. We could try to create a new algorithm specialized for the whole class.

I would like to thank Michael Kruse, Albert Cohen, Johannes Doerfert and the LLVM community for your comments, reviews, ideas and the opportunity to work on this project.

I would like to especially thank Tobias Grosser for his encouragement and excellent support.

Refs.:

[1] - http://romangareev.blogspot.ru/2016/06/gsoc-2016-report-i.html

[2] - https://drive.google.com/file/d/0B2Wloo-931AoVTRINXp0dDZPSjQ/view?usp=sharing

[3] - https://drive.google.com/file/d/0B2Wloo-931AoWG9jTjg3aEVxRG8/view?usp=sharing

[4] - https://docs.google.com/spreadsheets/d/1RtioBokRePdX2zdFxGxOxTbtY_WnI4moYFCfbujr7S8/pubhtml?gid=0&single=true

[5] - https://docs.google.com/spreadsheets/d/1hUtXH9JsgYUhHCdydVtE7ayNupc9ViZuAG8z_JKevP0/pubhtml?gid=0&single=true

[6] - https://docs.google.com/spreadsheets/d/1NQHqCuhc1A8pVJtN3FuTHrQ2gcyXUYs5T2L8tMXqLE4/pubhtml?gid=0&single=true

[7] - https://docs.google.com/spreadsheets/d/1nr4qZpG5V--Cq5cTLX1TLA7OpfdFqhIGr6-IFuoq1yw/pubhtml?gid=0&single=true

[8] - http://reviews.llvm.org/D21140

[9] - http://reviews.llvm.org/D21491

[10] - http://reviews.llvm.org/D20575

[11] - https://reviews.llvm.org/D22187

[12] - https://reviews.llvm.org/D22828

[13] - https://reviews.llvm.org/D23740

[14] - https://reviews.llvm.org/D23759

[15] - https://reviews.llvm.org/D23260

[16] - http://romangareev.blogspot.ru/2016/08/gsoc-2016-report-ii.html

[17] - http://lists.llvm.org/pipermail/llvm-dev/2016-May/100229.html