Use L3 BLAS in LARFT #799

Draft

AGonzales-amd wants to merge 8 commits into ROCm:develop from AGonzales-amd:gemm_larft

+191 −19

AGonzales-amd commented Aug 20, 2024

This PR introduces a potential optimization to the LARFT routine. The modification reduces the size of the gemv computations by offloading the block part of the computation to a gemm call. In some cases the modified method performs worse than the original, so the original is dispatched in those cases instead.
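As a rough illustration of the idea (not the PR's actual kernels), the inner products that LARFT needs between columns of V can be split by rows: the rows below the leading k-by-k block can be handled for all column pairs at once, which is a gemm-shaped computation, while only the leading k rows remain for the per-column gemv. The sketch below assumes a real, column-major, forward/column-wise V with n > k, ignores the implicit unit-triangular structure and the tau scaling, and uses plain loops in place of BLAS calls. The names `Matrix`, `products_full_gemv`, `products_gemm_split`, and `strict_upper_equal` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major storage; the leading dimension equals the row count (assumption).
using Matrix = std::vector<double>;

// Baseline shape: for each column i, a gemv over all n rows of V computes
// G(0:i-1, i) = V(:, 0:i-1)^T * V(:, i).
Matrix products_full_gemv(const Matrix& V, std::size_t n, std::size_t k)
{
    assert(V.size() == n * k);
    Matrix G(k * k, 0.0);
    for(std::size_t i = 0; i < k; ++i)
        for(std::size_t j = 0; j < i; ++j)
            for(std::size_t r = 0; r < n; ++r)
                G[j + i * k] += V[r + j * n] * V[r + i * n];
    return G;
}

// Split shape: rows k..n-1 are handled for every column pair in one
// gemm-like pass; only rows 0..k-1 are left for the short per-column gemv.
Matrix products_gemm_split(const Matrix& V, std::size_t n, std::size_t k)
{
    assert(V.size() == n * k && n > k);
    Matrix G(k * k, 0.0);
    // Block part (gemm-shaped): G = V2^T * V2 with V2 = V(k:n-1, :).
    for(std::size_t i = 0; i < k; ++i)
        for(std::size_t j = 0; j < k; ++j)
            for(std::size_t r = k; r < n; ++r)
                G[j + i * k] += V[r + j * n] * V[r + i * n];
    // Remaining part (small gemv per column): rows 0..k-1 only.
    for(std::size_t i = 0; i < k; ++i)
        for(std::size_t j = 0; j < i; ++j)
            for(std::size_t r = 0; r < k; ++r)
                G[j + i * k] += V[r + j * n] * V[r + i * n];
    return G;
}

// Compares only the strictly upper triangle, the part both versions define.
bool strict_upper_equal(const Matrix& A, const Matrix& B, std::size_t k)
{
    for(std::size_t i = 0; i < k; ++i)
        for(std::size_t j = 0; j < i; ++j)
            if(A[j + i * k] != B[j + i * k])
                return false;
    return true;
}
```

Both functions produce the same strictly upper-triangular entries; the split version simply moves most of the arithmetic into a single matrix-matrix product, which is typically where a tuned gemm is expected to help when n is much larger than k.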

The following figures show LARFT performance with k=64, rocblas_forward_direction, and rocblas_column_wise (the configuration used by QR factorization).

[Figures: LARFT performance for Single Precision, Double Precision, Complex Single Precision, and Complex Double Precision]

column_wise and forward_direction have similar performance characteristics, while row_wise configurations show improvements for all data types, e.g.,

[Figure: Single Precision]

This also shows that the gains with the row_wise configuration are more significant (~18x vs ~3x speedup).

Curiously, GEQRF shows irregular performance with this modification. The following figures were generated with square matrices.

[Figures: GEQRF performance for Single Precision, Double Precision, Complex Single Precision, and Complex Double Precision]

Although there are mostly performance gains, performance degrades in some cases with the modification.

AGonzales-amd added 6 commits June 25, 2024 13:29

implementation (ec8d369)
reference (a41fc7f)
Merge branch 'develop' into larft_syrk
use gemm in larft (abf57d4)
dispatch l3 larft with bounds (3bf4974)
undo syrk changes (dc49736)

tfalders added the noOptimizations label

tfalders reviewed

Collaborator

tfalders left a comment

I need to do a deep dive and remember how LARFT works, but I have some general comments right off the bat.

library/src/auxiliary/rocauxiliary_larft.hpp Outdated

@@ -4,6 +4,10 @@
                *     Univ. of Tennessee, Univ. of California Berkeley,
                *     Univ. of Colorado Denver and NAG Ltd..
                *     December 2016
+               * and
+               * Joffrain, Low, Quintana-Ortí, et al. (2006). Accumulating householder
+               * transformations, revisited.

Collaborator

tfalders Sep 18, 2024

It would be nice if this line were indented.

Author

AGonzales-amd Sep 24, 2024

done, please check if I did it correctly.

library/src/auxiliary/rocauxiliary_larft.hpp Outdated

Comment on lines 86 to 87

		Fp[j + i * ldf] *= -tp[i];
		Fp[j + i * ldf] += -tp[i] * Vp[i + j * ldv];

Collaborator

tfalders Sep 18, 2024

You've got two writes to global memory here, as well as two reads from tp[i]. You may want to cache tp[i] and Fp[j + i * ldf] * -tp[i] in separate local variables, which will hopefully speed up the kernel.

Author

AGonzales-amd Sep 24, 2024

changed to reduce redundant reads/writes
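A minimal host-side sketch of the suggestion in this thread, assuming real-valued data and using hypothetical function names (`update_two_writes`, `update_cached`); the actual change in the PR is not shown on this page and may differ. The first function mirrors the two-statement form quoted above, and the second caches `-tp[i]` and the scaled value in locals so that each element is read once and written once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Form quoted in the review: two read-modify-writes of Fp and two reads of tp[i].
void update_two_writes(std::vector<double>& Fp, const std::vector<double>& tp,
                       const std::vector<double>& Vp, std::size_t i, std::size_t j,
                       std::size_t ldf, std::size_t ldv)
{
    assert(j + i * ldf < Fp.size() && i < tp.size() && i + j * ldv < Vp.size());
    Fp[j + i * ldf] *= -tp[i];
    Fp[j + i * ldf] += -tp[i] * Vp[i + j * ldv];
}

// Cached form: one read of tp[i], one read of Fp, one write of Fp.
void update_cached(std::vector<double>& Fp, const std::vector<double>& tp,
                   const std::vector<double>& Vp, std::size_t i, std::size_t j,
                   std::size_t ldf, std::size_t ldv)
{
    assert(j + i * ldf < Fp.size() && i < tp.size() && i + j * ldv < Vp.size());
    const double neg_tau = -tp[i];                   // cached -tp[i]
    const double scaled  = Fp[j + i * ldf] * neg_tau; // cached Fp * -tp[i]
    Fp[j + i * ldf] = scaled + neg_tau * Vp[i + j * ldv];
}
```

On a GPU the benefit would come from keeping the intermediate in a register rather than round-tripping through global memory; whether the compiler already does this for the two-statement form depends on what it can prove about aliasing between the pointers.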

library/src/auxiliary/rocauxiliary_larft.hpp Outdated

                   rocblas_fill uplo;
                   rocblas_operation trans;
+                  const bool call_l3 = larft_do_l3<T>(n, direct, storev) && n > k;

Collaborator

tfalders Sep 18, 2024

I'm not a fan of these names. Maybe something like use_gemm and larft_use_gemm.

Author

AGonzales-amd Sep 24, 2024

done

library/src/auxiliary/rocauxiliary_larft.hpp

+                  if(call_l3)
+                  {
+                      if(direct == rocblas_forward_direction && storev == rocblas_column_wise)

Collaborator

tfalders Sep 18, 2024

A short comment explaining what the gemm is doing would be appreciated.

Author

AGonzales-amd Sep 24, 2024

added a comment

library/src/auxiliary/rocauxiliary_larft.hpp

+                  return value[get_index(intervals, max, dim)];
+              }
+              template <typename T, typename I, std::enable_if_t<std::is_same_v<T, rocblas_float_complex>, int> = 0>

Collaborator

tfalders Sep 18, 2024

A short comment describing why this exception for float complex exists would be appreciated.

Author

AGonzales-amd Sep 24, 2024

added a comment
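The dispatch discussed in this review (take the gemm path only for sizes where it is expected to win, and only when n > k) can be pictured as a lookup over size intervals. The sketch below is an assumption about the general shape only: `find_interval` and `use_gemm_for` are hypothetical helpers, not the PR's `get_index` or `larft_do_l3`, and any bounds or decisions passed in are placeholders rather than tuned values from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: index of the first interval whose upper bound is >= dim,
// or N if dim is larger than every bound.
template <std::size_t N>
std::size_t find_interval(const std::array<int, N>& upper_bounds, int dim)
{
    std::size_t idx = 0;
    while(idx < N && dim > upper_bounds[idx])
        ++idx;
    return idx;
}

// Hypothetical dispatch: one decision per interval (N bounds give N + 1 intervals),
// combined with the n > k requirement shown in the diff.
template <std::size_t N>
bool use_gemm_for(const std::array<int, N>& upper_bounds,
                  const std::array<bool, N + 1>& decision, int n, int k)
{
    assert(n >= 0 && k >= 0);
    return decision[find_interval(upper_bounds, n)] && n > k;
}
```

A per-type exception, such as the one for `rocblas_float_complex` mentioned above, would then amount to supplying a different table for that type.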

AGonzales-amd added 2 commits September 24, 2024 18:17

fixes (c421dde)
formatting (c20db67)


Reviewers

tfalders left review comments

Awaiting requested review from code owners (to be requested when the pull request is marked ready for review): jzuniga-amd, cgmb, qjojo, EdDAzevedo, jmachado-amd

At least 2 approving reviews are required to merge this pull request.

Labels

noOptimizations