Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use L3 BLAS in LARFT #799

Draft
wants to merge 8 commits into
base: develop
Choose a base branch
from
Draft

Conversation

AGonzales-amd
Copy link

@AGonzales-amd AGonzales-amd commented Aug 20, 2024

This PR introduces a potential optimization to the LARFT routine. The modification aims to reduce the size of the gemv computations and instead offloads the block part of the computation to a call to gemm. Additionally, in some cases the modified method performs worse than the original so the latter is dispatched instead.

Following are performance LARFT figures with configuration k=64, rocblas_forward_direction and rocblas_column_wise (config for QR factorization).

  • Single Precision
    log_compare_slarftcf_m
  • Double Precision
    log_compare_dlarftcf_m
  • Complex Single Precision
    log_compare_clarftcf_m
  • Complex Double Precision
    log_compare_zlarftcf_m

column_wise and forward_direction has similar performance characteristic while row_wise configurations show improvements for all data types, e.g.,

  • Single Precision
    compare_slarftrf_m
    This also shows that the gains with the row_wise configuration is more significant (~x18 vs ~x3 speedup).

Curiously, GEQRF shows weird performance with this modification. These figures were generated with square matrices.

  • Single Precision
    compare_sgeqrf_m
  • Double Precision
    compare_dgeqrf_m
  • Complex Single Precision
    compare_cgeqrf_m
  • Complex Double Precision
    compare_zgeqrf_m

Although there is mostly performance gains, there are some cases where performance degrades with the modification.

@tfalders tfalders added the noOptimizations Disable optimized kernels for small sizes for some routines label Sep 11, 2024
Copy link
Collaborator

@tfalders tfalders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to do a deep dive and remember how LARFT works, but I have some general comments right off the bat.

@@ -4,6 +4,10 @@
* Univ. of Tennessee, Univ. of California Berkeley,
* Univ. of Colorado Denver and NAG Ltd..
* December 2016
* and
* Joffrain, Low, Quintana-Ortí, et al. (2006). Accumulating householder
* transformations, revisited.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if this line were indented.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, please check if I did it correctly.

Comment on lines 86 to 87
Fp[j + i * ldf] *= -tp[i];
Fp[j + i * ldf] += -tp[i] * Vp[i + j * ldv];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got two writes to global memory here, as well as two reads from tp[i]. You may want to cache tp[i] and Fp[j + i * ldf] * -tp[i] in separate local variables, which will hopefully speed up the kernel.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to reduce redundant reads/writes

@@ -269,13 +390,50 @@ rocblas_status rocsolver_larft_template(rocblas_handle handle,
rocblas_fill uplo;
rocblas_operation trans;

const bool call_l3 = larft_do_l3<T>(n, direct, storev) && n > k;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of these names. Maybe something like use_gemm and larft_use_gemm.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


if(call_l3)
{
if(direct == rocblas_forward_direction && storev == rocblas_column_wise)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A short comment explaining what the gemm is doing would be appreciated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment

return value[get_index(intervals, max, dim)];
}

template <typename T, typename I, std::enable_if_t<std::is_same_v<T, rocblas_float_complex>, int> = 0>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A short comment describing why this exception for float complex exists would be appreciated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
noOptimizations Disable optimized kernels for small sizes for some routines
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants