Hi everyone!

As was mentioned in my previous blog post, one of the first tasks is to implement tilings and interchangings of specific loops based on the algorithm used to optimize matrix multiplication and described in "Analytical Modeling is Enough for High Performance BLIS" [1]. It was decided to do it step-by-step. This blog post is devoted to its description.

Let's assume that matrix multiplication has the following form and matrices A, B and C stored into row-major order, in contrast to column-major order used in [1].

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += A[i][k] * B[k][j]

endfor

endfor

endfor

In this case, the optimized code could look like this,

Loop 1 for ic = 0, ..., UpperBound1 − 1 in steps of mc

Loop 2 for pc = 0, ..., UpperBound3 − 1 in steps of kc

Ac = Pack(A(ic : ic + mc − 1, pc : pc + kc − 1))

Loop 3 for jc = 0, ..., UpperBound2 − 1 in steps of nc

Bc = Pack(B(pc : pc + kc − 1, jc : jc + nc − 1))

Loop 4 for ir = 0, ..., mc − 1 in steps of mr

Loop 5 for jr = 0, ..., nc − 1 in steps of nr

Loop 6 for pr = 0, ..., kc − 1 in steps of 1

Cc(ir :ir +mr 1,jr :jr +nr 1) += Ac (ir : ir + mr 1, pr ) * Bc(pr, jr:jr+nr 1)

endfor

endfor

endfor

endfor

endfor

endfor

where Loop 4 and Loop 5 form a macro-kernel around Loop 6, which contains rank-1 updates and forms a micro-kernel.

The first step was to create a micro-kernel. To do it, we tile the first two loops of the matrix multiplication and consequently unroll them. In the result we get this code:

Loop 1 for ir = 0, …, UpperBound1 - 1 in steps of mr

Loop 2 for jr = 0, …, UpperBound2 - 1 in steps of nr

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += A[i][k] * B[k][j]

…

C[i][j + nr - 1] += A[i][k] * B[k][j + nr - 1]

…

C[i + mr - 1][j + nr - 1] += A[i + mr - 1][k] * B[k][j + nr - 1]

endfor

endfor

endfor

If auto-vectorization is enabled in LLVM, the body of Loop 3 may be replaced with rank-1 updates and we'll get a mirco-kernel. It's utilized in [2].

The second step was the creation of a macro-kernel by applying a combination of tiling of the first three loops and interchanging Loop 3 and Loop 2. This approach was implemented in [3].

However, without the optimal parameter values creation of the described kernels could lead to not highly optimized implementations or even increase execution time of generated code. To address the issue, we use the optimization model for mapping the GEMM algorithm described in [1].

It requires information about the following parameters of a target architecture:

1. Size of double-precision floating-point number.

2. Number of double-precision floating-point numbers that can be hold by a vector register.

3. Throughput of vector instructions per clock cycle.

4. Latency of instructions (i.e., the minimum number of cycles between the issuance of two dependent consecutive instructions).

5. Paramaters of cache levels (size of cache lines, associativity degrees, size, number of cache sets).

The first two parameters can be obtained from the TargetTransformInfo class. Therefore, since the product of a number of cache sets and a size of cache lines is a size divided by an associativity degree, we should determine only the four parameters. At the moment they can be specified as command line parameters.

Even if we able to get close-to-peak performance of matrix multiplication, we have to determine that the code under consideration contains matrix multiplication. To address the issue, we specified the following class of statements, which contains matrix multiplication:

1. There are exactly three input dimensions

2. All memory accesses of the statement will have stride 0 or 1, if we interchange loops (switch the variable used in the inner loop to the outer loop)

3. All memory accesses of the statement except from the last one, are read memory access and the last one is write memory access

4. All subscripts of the last memory access of the statement don't contain the variable used in the inner loop

The larger class was introduced in [4] to simplify its determination and to make a step toward generalization, which is planned for the future.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

[2] - http://reviews.llvm.org/D21140

[3] - http://reviews.llvm.org/D21491

[4] - http://reviews.llvm.org/D20575

As was mentioned in my previous blog post, one of the first tasks is to implement tilings and interchangings of specific loops based on the algorithm used to optimize matrix multiplication and described in "Analytical Modeling is Enough for High Performance BLIS" [1]. It was decided to do it step-by-step. This blog post is devoted to its description.

Let's assume that matrix multiplication has the following form and matrices A, B and C stored into row-major order, in contrast to column-major order used in [1].

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += A[i][k] * B[k][j]

endfor

endfor

endfor

In this case, the optimized code could look like this,

Loop 1 for ic = 0, ..., UpperBound1 − 1 in steps of mc

Loop 2 for pc = 0, ..., UpperBound3 − 1 in steps of kc

Ac = Pack(A(ic : ic + mc − 1, pc : pc + kc − 1))

Loop 3 for jc = 0, ..., UpperBound2 − 1 in steps of nc

Bc = Pack(B(pc : pc + kc − 1, jc : jc + nc − 1))

Loop 4 for ir = 0, ..., mc − 1 in steps of mr

Loop 5 for jr = 0, ..., nc − 1 in steps of nr

Loop 6 for pr = 0, ..., kc − 1 in steps of 1

Cc(ir :ir +mr 1,jr :jr +nr 1) += Ac (ir : ir + mr 1, pr ) * Bc(pr, jr:jr+nr 1)

endfor

endfor

endfor

endfor

endfor

endfor

where Loop 4 and Loop 5 form a macro-kernel around Loop 6, which contains rank-1 updates and forms a micro-kernel.

The first step was to create a micro-kernel. To do it, we tile the first two loops of the matrix multiplication and consequently unroll them. In the result we get this code:

Loop 1 for ir = 0, …, UpperBound1 - 1 in steps of mr

Loop 2 for jr = 0, …, UpperBound2 - 1 in steps of nr

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += A[i][k] * B[k][j]

…

C[i][j + nr - 1] += A[i][k] * B[k][j + nr - 1]

…

C[i + mr - 1][j + nr - 1] += A[i + mr - 1][k] * B[k][j + nr - 1]

endfor

endfor

endfor

If auto-vectorization is enabled in LLVM, the body of Loop 3 may be replaced with rank-1 updates and we'll get a mirco-kernel. It's utilized in [2].

The second step was the creation of a macro-kernel by applying a combination of tiling of the first three loops and interchanging Loop 3 and Loop 2. This approach was implemented in [3].

However, without the optimal parameter values creation of the described kernels could lead to not highly optimized implementations or even increase execution time of generated code. To address the issue, we use the optimization model for mapping the GEMM algorithm described in [1].

It requires information about the following parameters of a target architecture:

1. Size of double-precision floating-point number.

2. Number of double-precision floating-point numbers that can be hold by a vector register.

3. Throughput of vector instructions per clock cycle.

4. Latency of instructions (i.e., the minimum number of cycles between the issuance of two dependent consecutive instructions).

5. Paramaters of cache levels (size of cache lines, associativity degrees, size, number of cache sets).

The first two parameters can be obtained from the TargetTransformInfo class. Therefore, since the product of a number of cache sets and a size of cache lines is a size divided by an associativity degree, we should determine only the four parameters. At the moment they can be specified as command line parameters.

Even if we able to get close-to-peak performance of matrix multiplication, we have to determine that the code under consideration contains matrix multiplication. To address the issue, we specified the following class of statements, which contains matrix multiplication:

1. There are exactly three input dimensions

2. All memory accesses of the statement will have stride 0 or 1, if we interchange loops (switch the variable used in the inner loop to the outer loop)

3. All memory accesses of the statement except from the last one, are read memory access and the last one is write memory access

4. All subscripts of the last memory access of the statement don't contain the variable used in the inner loop

The larger class was introduced in [4] to simplify its determination and to make a step toward generalization, which is planned for the future.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

[2] - http://reviews.llvm.org/D21140

[3] - http://reviews.llvm.org/D21491

[4] - http://reviews.llvm.org/D20575