Monday, August 22, 2016

#GSoC 2016 Report II

Hi everyone!

As was mentioned previously, my next task is to implement the packing transformation. It can be described as a data-layout transformation that requires to introduce a new array, copy data to it and change memory access locations to reference the array (see [1], p. 6 for further details).

By the moment of implementation of the packing transformation, Polly supported the only one data-layout transformation, namely changing of memory access locations. The remaining tasks were to implement creation of empty arrays and copying data to it. However, it was first required to identify the locations that should be changed.

In case of the following code that could correspond to the matrix-matrix multiplication,

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
                   C[i][j] += A[i][k] * B[k][j]

the following access function can be built:

S[i, j, k]->A[i][k], S[i, j, k]->B[k][j].

According to the algorithm used to optimize matrix multiplication and described in "Analytical Modeling is Enough for High Performance BLIS" [1] they should be changed to the following access relations:

S[O0, O1, O2, O3, O4, O5, O6, O7, O8]->Ac[O5 + K * O3, O6] and

S[O0, O1, O2, O3, O4, O5, O6, O7, O8]->Bc[O5 + K * O4, O7], respectively.

The transformation helps to access elements of operands of a matrix multiplication after creation of the BLIS micro and macro kernels. The described approach is implemented in [3] and also already
available from the main repository.

The next task was to create an empty array. To do it, we use an object of the ScopArrayInfo class
that is designed to store information (e.g., the type of the elements, dimension sizes) about arrays in the SCoP (see [2] for further details). In our case, it is not a part of the SCoP originally. Furthermore, the array would be allocated only during the work of the CodeGeneration pass that takes a mathematical model derived and optimized using Polly and translates it back to LLVM-IR. The described approach is implemented in [3] and already available from the main repository.

The last task was to copy data to the newly created arrays according to the packing transformation.

To implement it, we decided to introduce a new type of SCoP statement that has one read memory access and one write memory access (in this very order). In particular, data is loaded from the location described by the read memory access and written to the location described by the
write memory access. To describe the domain of such a statement, we use extension nodes of the integer set library [4].

The described approach is implemented in [5]. It helps to get the following execution time improvements of benchmarks from PolyBench [6], which correspond to matrix multiplications.

Performance Improvements - Execution Time                                                   Δ       Previous  Current
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm     -70.59%  7.1680    2.1080
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm     -68.85%  4.7640    1.4840
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm -66.61%  2.3720    0.7920

The results obtained from the Polly buildbot perf-x86_64-penryn-O3-polly are available on the following link [7].

Although it is already implemented, we continue to improve the patch and general Polly infrastructure related to it. In particular, we are planning to implement the ability to introduce the SCoP statements via the JSCoP that helps to test new optimizations without any knowledge about compiler internals [8]. During this GSoC we’ve already extended the JSCoP to allow the user to declare new arrays and to reference these arrays from access expressions. It is implemented in [9] and
already available from the main repository.


[1] -
[2] -
[3] -
[4] -
[5] -
[6] -
[7] -
[8] -
[9] -

No comments:

Post a Comment