-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use L3 BLAS in LARFT #799
base: develop
Are you sure you want to change the base?
Use L3 BLAS in LARFT #799
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to do a deep dive and remember how LARFT works, but I have some general comments right off the bat.
@@ -4,6 +4,10 @@ | |||
* Univ. of Tennessee, Univ. of California Berkeley, | |||
* Univ. of Colorado Denver and NAG Ltd.. | |||
* December 2016 | |||
* and | |||
* Joffrain, Low, Quintana-Ortí, et al. (2006). Accumulating householder | |||
* transformations, revisited. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if this line were indented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, please check if I did it correctly.
Fp[j + i * ldf] *= -tp[i]; | ||
Fp[j + i * ldf] += -tp[i] * Vp[i + j * ldv]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've got two writes to global memory here, as well as two reads from tp[i]
. You may want to cache tp[i]
and Fp[j + i * ldf] * -tp[i]
in separate local variables, which will hopefully speed up the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to reduce redundant reads/writes
@@ -269,13 +390,50 @@ rocblas_status rocsolver_larft_template(rocblas_handle handle, | |||
rocblas_fill uplo; | |||
rocblas_operation trans; | |||
|
|||
const bool call_l3 = larft_do_l3<T>(n, direct, storev) && n > k; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of these names. Maybe something like use_gemm
and larft_use_gemm
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
if(call_l3) | ||
{ | ||
if(direct == rocblas_forward_direction && storev == rocblas_column_wise) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A short comment explaining what the gemm is doing would be appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
return value[get_index(intervals, max, dim)]; | ||
} | ||
|
||
template <typename T, typename I, std::enable_if_t<std::is_same_v<T, rocblas_float_complex>, int> = 0> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A short comment describing why this exception for float complex exists would be appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
This PR introduces a potential optimization to the LARFT routine. The modification aims to reduce the size of the gemv computations and instead offloads the block part of the computation to a call to gemm. Additionally, in some cases the modified method performs worse than the original so the latter is dispatched instead.
Following are performance LARFT figures with configuration k=64, rocblas_forward_direction and rocblas_column_wise (config for QR factorization).
column_wise and forward_direction has similar performance characteristic while row_wise configurations show improvements for all data types, e.g.,
This also shows that the gains with the row_wise configuration is more significant (~x18 vs ~x3 speedup).
Curiously, GEQRF shows weird performance with this modification. These figures were generated with square matrices.
Although there is mostly performance gains, there are some cases where performance degrades with the modification.