
Move sterf to CPU; Add experimental parallelism for sterf #462

Open · wants to merge 3 commits into base: develop

Conversation


@mdvizov mdvizov commented Aug 23, 2022

Hi, I've been trying to improve the performance of the SYEVD function lately. The sterf kernel is the most time-consuming part of the code. I tried two approaches to improve it:

  1. split the tridiagonal matrix based on Gershgorin intervals (similar to ScaLAPACK, https://github.com/Reference-ScaLAPACK/scalapack/blob/master/SRC/slarre2.f);
  2. move the sterf execution to the CPU.

The efficiency of the first approach depends on the input values of the tridiagonal matrix and doesn't provide a consistent improvement. The second variant provides significant acceleration in all cases (roughly a 10× speedup).

[Chart: Dsyevd performance comparison]
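To illustrate approach 1, here is a minimal CPU-side sketch of splitting a tridiagonal matrix into independent blocks. It uses the simple negligible-off-diagonal criterion rather than the full Gershgorin-interval test from slarre2, and all names are illustrative, not the patch's actual kernel:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Find index ranges [begin, end) of independent tridiagonal blocks by
// checking where an off-diagonal entry e[i] is negligible relative to its
// diagonal neighbors. The eigenvalues of each block can then be computed
// independently (and in parallel).
std::vector<std::pair<int, int>> split_tridiag(const std::vector<double>& d,
                                               const std::vector<double>& e,
                                               double tol)
{
    std::vector<std::pair<int, int>> blocks;
    const int n = static_cast<int>(d.size());
    int begin = 0;
    for(int i = 0; i < n - 1; ++i)
    {
        if(std::abs(e[i]) <= tol * (std::abs(d[i]) + std::abs(d[i + 1])))
        {
            blocks.emplace_back(begin, i + 1); // e[i] treated as zero
            begin = i + 1;
        }
    }
    blocks.emplace_back(begin, n); // trailing block
    return blocks;
}
```

The real slarre2-style criterion derives the split tolerance from Gershgorin interval bounds instead of a fixed `tol`, but the block structure it produces is consumed the same way.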

I am attaching the relevant patch.

@cgmb
Collaborator

cgmb commented Aug 24, 2022

Thanks. I'm working on SYEVD performance myself, but this is a very different approach. I'm not able to review your changes right this moment, but I will take a look as soon as possible. I know @jzuniga-amd has talked about hybrid host/device algorithms before. Thus far, rocSOLVER has been a purely GPU library, but I don't think we've ruled out hybrid approaches.

I wonder if an environment variable might be better than a compile-time flag for switching from a pure GPU algorithm to a hybrid algorithm. In general, that sort of decision may depend on the balance of CPU and GPU resources available, which is a very dynamic condition. It's a function of both the hardware capacity and the mix of other jobs running on the same hardware.
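As a sketch of the run-time switch idea (the variable name ROCSOLVER_STERF_MODE is hypothetical; rocSOLVER defines no such variable today):

```cpp
#include <cstdlib>
#include <cstring>

enum class SterfMode { gpu, hybrid };

// Hypothetical environment variable, read once on first use so the check
// stays cheap on every subsequent dispatch. Defaults to the pure-GPU path.
SterfMode sterf_mode()
{
    static const SterfMode mode = [] {
        const char* v = std::getenv("ROCSOLVER_STERF_MODE");
        return (v && std::strcmp(v, "hybrid") == 0) ? SterfMode::hybrid
                                                    : SterfMode::gpu;
    }();
    return mode;
}
```

Caching the value does mean the mode can't change mid-process, which may or may not be acceptable depending on how dynamic the CPU/GPU balance needs to be.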

Substituting it here with a simple sorting algorithm. If more performance is required in
the future, lasrt_increasing should be debugged or another quick-sort method
could be implemented) **/
for(int ii = 1; ii < n; ii++)
Contributor

One idea may be to use shell sort (https://en.wikipedia.org/wiki/Shellsort) or to call LAPACK lasrt.
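A self-contained sketch of the shell-sort suggestion, sorting eigenvalues in place (illustrative only, not tied to the PR's data layout):

```cpp
#include <vector>

// Shell sort with the simple gap sequence n/2, n/4, ..., 1: in-place,
// no extra memory, and much faster in practice than the O(n^2) insertion
// loop for large n, while staying short enough to verify by inspection.
void shell_sort(std::vector<double>& w)
{
    const int n = static_cast<int>(w.size());
    for(int gap = n / 2; gap > 0; gap /= 2)
        for(int i = gap; i < n; ++i)
        {
            const double tmp = w[i];
            int j = i;
            // Shift larger elements of the gap-strided subsequence right
            for(; j >= gap && w[j - gap] > tmp; j -= gap)
                w[j] = w[j - gap];
            w[j] = tmp;
        }
}
```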

Author

Note that this is copy-pasted from sterf (rocauxiliary_sterf.hpp). The original code contained a TODO mentioning problems in lasrt, so I decided to leave the sort as is. I can implement shell sort if needed, though.

In any case, sorting takes a small fraction of the total kernel time (the most time-consuming part is the QR/QL iteration), so this is probably lower priority...

to compute the eigenvalues of a symmetric tridiagonal matrix given by D
and E on CPU, non-batched version **/
template <typename T>
void sterf_cpu(const rocblas_int n,
Contributor

How about calling lapack dsterf directly?

Is there OpenMP parallelism for sterf on the CPU host, perhaps via "-fopenmp" as supported by clang/flang?
One might also consider OpenMP parallelism across the batch on the CPU host.

If it is a serial algorithm on a single CPU core, would sterf still be an expensive part, or take too much time, when n is large?
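A minimal sketch of the batch-level OpenMP idea, assuming one batch instance per loop iteration; std::sort stands in for the per-instance sterf work, since the point here is the loop structure. Without -fopenmp the pragma is ignored and the loop simply runs serially:

```cpp
#include <algorithm>
#include <vector>

// One OpenMP thread per batch instance. The instances are independent, so
// there is no shared mutable state and the pragma needs no extra clauses.
// std::sort is a stand-in for a real per-instance sterf call.
void batched_cpu_work(std::vector<std::vector<double>>& batch)
{
#pragma omp parallel for
    for(int b = 0; b < static_cast<int>(batch.size()); ++b)
        std::sort(batch[b].begin(), batch[b].end());
}
```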

Collaborator

@cgmb cgmb Aug 27, 2022

How about calling lapack dsterf directly?

That might be nice to control with a CMake configuration option.

Is there OpenMP parallelism for sterf on CPU host, perhaps "-fopenmp" supported by clang/flang ?

I'm not sure if support for CPU OpenMP with ROCm is something that just happens to work or if it's an officially supported feature. If it just happens to work, the feature might not be maintained and that could leave us in a tough spot in the future. OpenMP GPU offloading is explicitly supported with ROCm, but I'm not sure how OpenMP CPU and GPU offloading interact.

If you know more, I'd be interested to learn about how OpenMP is supposed to work in ROCm.

Contributor

My understanding is that something like "#pragma omp parallel for" will be ignored (treated as a comment) if the compiler does not support that pragma. There might be a warning at compile time, but it should not affect correctness, just performance. I think the current ROCm/hipcc does support OpenMP via "-fopenmp". Perhaps the hipcc/clang compiler group can verify whether OpenMP will be supported. I suspect OpenMP for the CPU host is mature, but support for OpenMP 5.0 "target offload" to the GPU may still be under development.

Author

How about calling lapack dsterf directly?

We could do that, but dsterf comes from lapack/cblas, which seems to be used only in tests and benchmarks. Is it OK to introduce such a dependency into rocSOLVER proper?

If it is a serial algorithm on a single cpu core, would sterf still be an expensive part or take too much time when n is large?

Sterf on the CPU seems to be much faster than on the GPU for all sizes that I was able to measure:
[Chart: Ssyevd_ratio]

Contributor

I think the routine dsterf() is part of LAPACK (liblapack.a).
One may need to call the routine with the "Fortran" version of the subroutine arguments.
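For reference, calling LAPACK's dsterf from C++ would look roughly like this. All arguments are passed by pointer, and the trailing-underscore symbol name is a common but compiler-dependent convention, so treat this as an assumption to verify rather than a drop-in interface:

```cpp
// Fortran routines take every argument by reference (pointer from C/C++),
// and most Fortran compilers append an underscore to the symbol name.
// On exit, d holds the eigenvalues in ascending order, e is destroyed,
// and info is 0 on success.
extern "C" void dsterf_(const int* n, double* d, double* e, int* info);

// A possible call site (not compiled against LAPACK here):
//   int n = static_cast<int>(d.size()), info = 0;
//   dsterf_(&n, d.data(), e.data(), &info);
```

Linking would additionally require -llapack (or an equivalent BLAS/LAPACK provider) in the build configuration.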


const rocblas_int tid = hipThreadIdx_x;

l0 = l = split_ranges[2 * tid];
Contributor

A minor comment: the letter "l" (el) can look like "1" (one) in some fonts.
Perhaps another name could be used so that "l" doesn't read as "1", or "l0" as "10"?
Just a suggestion.

Author

Ah, sure. I can fix this.

T* h_E = new T[n];
rocblas_int h_info = 0;

hipDeviceSynchronize();
Contributor

Is there asynchronous kernel launching on streams?
If so, perhaps a stream synchronize could be used instead of a whole-device synchronize?
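A sketch of the suggested change, assuming a rocblas_handle named handle is in scope (as it is inside rocSOLVER routines):

```cpp
// Synchronize only the stream associated with the rocBLAS handle instead
// of the whole device, so work queued on unrelated streams keeps running.
hipStream_t stream;
rocblas_get_stream(handle, &stream);
hipStreamSynchronize(stream); // instead of hipDeviceSynchronize()
```

Error checking on the two return codes is omitted for brevity; production code should check both.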

Author

Yes, I'll fix this too.

@mdvizov
Author

mdvizov commented Aug 25, 2022

Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.

@tfalders
Collaborator

Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.

It looks like there are failures in several functions unrelated to sterf. This seems to be affecting other PRs too, so it doesn't appear to have been caused by your changes.

@cgmb
Collaborator

cgmb commented Aug 29, 2022

Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.

I'm afraid it's not possible for community members to see the logs. If it were an actual issue, I'd copy the output into this thread. However, the cause of the failure has nothing to do with your changes.

rocSOLVER PRs are tested using the latest build of the corresponding rocBLAS branch, and it seems there's been a bug in syr2k on the rocBLAS develop branch. The rocBLAS team is aware of the problem and will fix it before release, but it will cause failures in the rocSOLVER CI until then.

@mdvizov
Author

mdvizov commented Sep 22, 2022

I believe I addressed all code review remarks. Let me know if anything else is needed.

@jzuniga-amd
Collaborator

The team is still discussing the best way to introduce pure-CPU or hybrid algorithms into the library. There is nothing wrong with offloading computations to the CPU (especially when the problem/algorithm is not well suited to GPUs), whether by calling internal code (sterf_cpu) or by linking to another CPU library (lapack_sterf), but the change must be introduced carefully.

Currently, rocSOLVER is a 100% GPU library; calls to rocSOLVER APIs are asynchronous on the host, so they can return immediately even if the computations on the device are not done yet, and, in practice, no CPU cycles are used. This behavior is expected by some users, and changing it could be problematic for some workflows, especially if the switch is embedded in the build process (at compile time). Letting users purposely switch between CPU, GPU, or hybrid modes at run time could provide more flexibility, but we need to plan the design carefully, as this is also related to other upcoming features on our roadmaps.

(The part of this PR that optimizes/parallelizes the GPU code would be easier to review and merge at this time, and we can leave the CPU/hybrid work for a future PR, but it is up to you whether you want to divide your contribution into two different PRs.)

Normally, when we add a new function to the library, we initially focus on the correctness of the algorithm (accuracy of results, etc.) and the functionality of the API. For the optimization round, we want the performance gain to really justify the introduction of possibly more complicated code and its implications for the code-maintenance process. It is clear that the sterf_cpu code performs better on the tested cases, but there are other questions I would like to explore as well:

Is there any difference between the in-house sterf_cpu code and lapack_sterf? Which one is faster?

The CPU options run batch problems in a sequential for-loop; is the gain in performance enough to outperform the GPU code, which runs the batch in parallel? And what about the GPU code with the optimizations? If not, we should find and add a crossover size for the batched and strided_batched routines, or limit the CPU code to normal, non-batched executions (especially if the CPU code options can only be enabled/disabled at compile time).

I also wonder what the effect of the optimizations/parallelization is on the GPU code; how does it compare against the original code and against sterf_cpu or lapack_sterf? Is the performance gain enough to justify merging only the optimized GPU code for now?

We also need to think of a rationale that justifies the proposed approach at a higher level in a workflow. Advanced users may know whether their problems are better suited to CPU or GPU computation, and there is nothing preventing them from using different libraries according to their needs; rocSOLVER is not intended to be a single-entry-point, fit-all-cases solution. If a user doesn't get enough performance out of a GPU solution, they can directly call a CPU library instead (an AMD CPU library or others). Right now, with the memory model that we are using, I don't see why a user would want to transfer data to the GPU just to call a rocSOLVER API that will simply copy the data back to the host to perform the computations on the CPU (and this is essentially what rocsolver_sterf will end up doing).

A scenario like this (with internal data transfers) makes more sense with a truly hybrid code, i.e. code that processes the data via GPU-CPU collaboration, which is not the case for the proposed rocsolver_sterf. Another scenario that, IMO, may justify the use of the pure-CPU rocsolver_sterf is when the data is already on the GPU, as is the case for syevd, for example. rocsolver_syevd needs the data on the GPU to perform the initial tridiagonalization before calling the pure-CPU routines sterf_cpu or lapack_sterf. These kinds of "hybrid" functions could make sense but, as I said, we just need to think of the best way to integrate them and, especially, how to document them in the User's Guide.

In the meantime, please note that syevd is not the only function that calls sterf; it could be called by rocsolver_syev and rocsolver_stedc as well. These functions could then be "hybrid" in the same sense and take advantage of the more efficient sterf implementation, so they may need the same amendments as rocsolver_syevd. (In particular, looking only at the optimizations to the GPU code, I am not sure whether rocsolver_syev or rocsolver_stedc will work if the library is built in EXPERIMENTAL mode, as the workspace requirements of sterf are changing.) This is something that would need to be addressed, either in this PR or a future one.

5 participants