
Move sterf to CPU; Add experimental parallelism for sterf #462

Open · wants to merge 3 commits into base: develop

Conversation


@mdvizov mdvizov commented Aug 23, 2022

Hi, I've been trying to improve the performance of the SYEVD function lately. The sterf kernel is the most time-consuming part of the code. I tried two approaches to improve it:

  1. split the tridiagonal matrix based on Gershgorin intervals (similar to ScaLAPACK, https://github.com/Reference-ScaLAPACK/scalapack/blob/master/SRC/slarre2.f);
  2. move the sterf execution to the CPU.

The efficiency of the first approach depends on the input values of the tridiagonal matrix and doesn't provide a consistent improvement. The second variant provides significant acceleration in all cases (roughly a 10× speedup).

[Chart: Dsyevd performance comparison]
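To illustrate approach 1, here is a minimal CPU-side sketch of splitting a tridiagonal matrix into independent blocks. It uses the simple negligible-off-diagonal criterion rather than the full Gershgorin-interval test from slarre2, and all names are illustrative, not the patch's actual kernel:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Find index ranges [begin, end) of independent tridiagonal blocks by
// checking where an off-diagonal entry e[i] is negligible relative to its
// diagonal neighbors. The eigenvalues of each block can then be computed
// independently (and in parallel).
std::vector<std::pair<int, int>> split_tridiag(const std::vector<double>& d,
                                               const std::vector<double>& e,
                                               double tol)
{
    std::vector<std::pair<int, int>> blocks;
    const int n = static_cast<int>(d.size());
    int begin = 0;
    for(int i = 0; i < n - 1; ++i)
    {
        if(std::abs(e[i]) <= tol * (std::abs(d[i]) + std::abs(d[i + 1])))
        {
            blocks.emplace_back(begin, i + 1); // e[i] treated as zero
            begin = i + 1;
        }
    }
    blocks.emplace_back(begin, n); // trailing block
    return blocks;
}
```

The real slarre2-style criterion derives the split tolerance from Gershgorin interval bounds instead of a fixed `tol`, but the block structure it produces is consumed the same way.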

I am attaching the relevant patch.

@cgmb
Collaborator

cgmb commented Aug 24, 2022

Thanks. I'm working on SYEVD performance myself, but this is a very different approach. I'm not able to review your changes right this moment, but I will take a look as soon as possible. I know @jzuniga-amd has talked about hybrid host/device algorithms before. Thus far, rocSOLVER has been a purely GPU library, but I don't think we've ruled out hybrid approaches.

I wonder if an environment variable might be better than a compile-time flag for switching from a pure GPU algorithm to a hybrid algorithm. In general, that sort of decision may depend on the balance of CPU and GPU resources available, which is a very dynamic condition. It's a function of both the hardware capacity and the mix of other jobs running on the same hardware.
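As a sketch of the run-time switch idea (the variable name ROCSOLVER_STERF_MODE is hypothetical; rocSOLVER defines no such variable today):

```cpp
#include <cstdlib>
#include <cstring>

enum class SterfMode { gpu, hybrid };

// Hypothetical environment variable, read once on first use so the check
// stays cheap on every subsequent dispatch. Defaults to the pure-GPU path.
SterfMode sterf_mode()
{
    static const SterfMode mode = [] {
        const char* v = std::getenv("ROCSOLVER_STERF_MODE");
        return (v && std::strcmp(v, "hybrid") == 0) ? SterfMode::hybrid
                                                    : SterfMode::gpu;
    }();
    return mode;
}
```

Caching the value does mean the mode can't change mid-process, which may or may not be acceptable depending on how dynamic the CPU/GPU balance needs to be.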

Substituting it here with a simple sorting algorithm. If more performance is required in
the future, lasrt_increasing should be debugged or another quick-sort method
could be implemented) **/
for(int ii = 1; ii < n; ii++)
Contributor

One idea may be to use shell sort (https://en.wikipedia.org/wiki/Shellsort) or to call LAPACK lasrt.
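A self-contained sketch of the shell-sort suggestion, sorting eigenvalues in place (illustrative only, not tied to the PR's data layout):

```cpp
#include <vector>

// Shell sort with the simple gap sequence n/2, n/4, ..., 1: in-place,
// no extra memory, and much faster in practice than the O(n^2) insertion
// loop for large n, while staying short enough to verify by inspection.
void shell_sort(std::vector<double>& w)
{
    const int n = static_cast<int>(w.size());
    for(int gap = n / 2; gap > 0; gap /= 2)
        for(int i = gap; i < n; ++i)
        {
            const double tmp = w[i];
            int j = i;
            // Shift larger elements of the gap-strided subsequence right
            for(; j >= gap && w[j - gap] > tmp; j -= gap)
                w[j] = w[j - gap];
            w[j] = tmp;
        }
}
```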

Author

Note that this is copy-pasted from sterf (rocauxiliary_sterf.hpp). The original code contained a TODO mentioning problems in lasrt, so I decided to leave the sort as is. I can implement shell sort if needed, though.

In any case, sorting takes a small fraction of the total kernel time (the most time-consuming part is the QR/QL iteration), so this is probably lower priority...

to compute the eigenvalues of a symmetric tridiagonal matrix given by D
and E on CPU, non-batched version **/
template <typename T>
void sterf_cpu(const rocblas_int n,
Contributor

How about calling lapack dsterf directly?

Is there OpenMP parallelism for sterf on the CPU host, perhaps via "-fopenmp" as supported by clang/flang?
One might also consider OpenMP parallelism across the batch on the CPU host.

If it is a serial algorithm on a single CPU core, would sterf still be an expensive part, or take too much time, when n is large?
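A minimal sketch of the batch-level OpenMP idea, assuming one batch instance per loop iteration; std::sort stands in for the per-instance sterf work, since the point here is the loop structure. Without -fopenmp the pragma is ignored and the loop simply runs serially:

```cpp
#include <algorithm>
#include <vector>

// One OpenMP thread per batch instance. The instances are independent, so
// there is no shared mutable state and the pragma needs no extra clauses.
// std::sort is a stand-in for a real per-instance sterf call.
void batched_cpu_work(std::vector<std::vector<double>>& batch)
{
#pragma omp parallel for
    for(int b = 0; b < static_cast<int>(batch.size()); ++b)
        std::sort(batch[b].begin(), batch[b].end());
}
```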

Collaborator

@cgmb cgmb Aug 27, 2022

How about calling lapack dsterf directly?

That might be nice to control with a CMake configuration option.

Is there OpenMP parallelism for sterf on CPU host, perhaps "-fopenmp" supported by clang/flang ?

I'm not sure if support for CPU OpenMP with ROCm is something that just happens to work or if it's an officially supported feature. If it just happens to work, the feature might not be maintained and that could leave us in a tough spot in the future. OpenMP GPU offloading is explicitly supported with ROCm, but I'm not sure how OpenMP CPU and GPU offloading interact.

If you know more, I'd be interested to learn about how OpenMP is supposed to work in ROCm.

Contributor

My understanding is that something like "#pragma omp parallel for" will be ignored (treated as a comment) if the compiler does not support that pragma. There might be a warning at compile time, but it should not affect correctness, just performance. I think the current ROCm/hipcc does support OpenMP via "-fopenmp". Perhaps the hipcc/clang compiler group can verify whether OpenMP will be supported. I suspect OpenMP for the CPU host is mature, but support for OpenMP 5.0 "target offload" to the GPU may still be under development.

Author

How about calling lapack dsterf directly?

We could do that, but dsterf comes from lapack/cblas, which seems to be used only in tests and benchmarks. Is it OK to introduce such a dependency into rocSOLVER proper?

If it is a serial algorithm on a single cpu core, would sterf still be an expensive part or take too much time when n is large?

Sterf on the CPU seems to be much faster than on the GPU for all sizes that I was able to measure:
[Chart: Ssyevd_ratio]

Contributor

I think the routine dsterf() is part of LAPACK (liblapack.a).
One may need to call the routine with the "Fortran" version of the subroutine arguments.
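For reference, calling LAPACK's dsterf from C++ would look roughly like this. All arguments are passed by pointer, and the trailing-underscore symbol name is a common but compiler-dependent convention, so treat this as an assumption to verify rather than a drop-in interface:

```cpp
// Fortran routines take every argument by reference (pointer from C/C++),
// and most Fortran compilers append an underscore to the symbol name.
// On exit, d holds the eigenvalues in ascending order, e is destroyed,
// and info is 0 on success.
extern "C" void dsterf_(const int* n, double* d, double* e, int* info);

// A possible call site (not compiled against LAPACK here):
//   int n = static_cast<int>(d.size()), info = 0;
//   dsterf_(&n, d.data(), e.data(), &info);
```

Linking would additionally require -llapack (or an equivalent BLAS/LAPACK provider) in the build configuration.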


const rocblas_int tid = hipThreadIdx_x;

l0 = l = split_ranges[2 * tid];
Contributor

A minor comment: the letter "l" (el) can look like "1" (one) in some fonts.
Perhaps another name could be used so that "l" doesn't read as "1", or "l0" as "10"?
Just a suggestion.

Author

Ah, sure. I can fix this.

T* h_E = new T[n];
rocblas_int h_info = 0;

hipDeviceSynchronize();
Contributor

Is there asynchronous kernel launching on streams?
If so, perhaps a stream synchronize could be used instead of a whole-device synchronize?
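A sketch of the suggested change, assuming a rocblas_handle named handle is in scope (as it is inside rocSOLVER routines):

```cpp
// Synchronize only the stream associated with the rocBLAS handle instead
// of the whole device, so work queued on unrelated streams keeps running.
hipStream_t stream;
rocblas_get_stream(handle, &stream);
hipStreamSynchronize(stream); // instead of hipDeviceSynchronize()
```

Error checking on the two return codes is omitted for brevity; production code should check both.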

Author

Yes, I'll fix this too.

@mdvizov
Author

mdvizov commented Aug 25, 2022

Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.

@tfalders
Collaborator

Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.

It looks like there are failures in several functions unrelated to sterf. This seems to be affecting other PRs too, so it doesn't appear to have been caused by your changes.

@cgmb
Collaborator

cgmb commented Aug 29, 2022

Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.

I'm afraid it's not possible for community members to see the logs. If it were an actual issue, I'd copy the output into this thread. However, the cause of the failure has nothing to do with your changes.

rocSOLVER PRs are tested using the latest build of the corresponding rocBLAS branch, and it seems there's been a bug in syr2k on the rocBLAS develop branch. The rocBLAS team is aware of the problem and will fix it before release, but it will cause failures in the rocSOLVER CI until then.

@mdvizov
Author

mdvizov commented Sep 22, 2022

I believe I addressed all code review remarks. Let me know if anything else is needed.

@jzuniga-amd
Collaborator

The team is still discussing the best way to introduce pure-CPU or hybrid algorithms into the library. There is nothing wrong with offloading computations to the CPU (especially when the problem/algorithm is not well suited to GPUs), whether by calling internal code (sterf_cpu) or by linking to another CPU library (lapack_sterf), but the change must be introduced carefully.

Currently, rocSOLVER is a 100% GPU library; calls to rocSOLVER APIs are asynchronous on the host, so they can return immediately even if the computations on the device are not done yet, and, in practice, no CPU cycles are used. This behavior is expected by some users, and changing it could be problematic for some workflows, especially if the switch is embedded in the build process (at compile time). Letting users purposely switch between CPU, GPU, or hybrid modes at run time could provide more flexibility, but we need to plan the design carefully, as this is also related to other upcoming features on our roadmaps.

(The part of this PR that optimizes/parallelizes the GPU code would be easier to review and merge at this time, and we can leave the CPU/hybrid work for a future PR, but it is up to you whether you want to divide your contribution into two different PRs.)

Normally, when we add a new function to the library, we initially focus on the correctness of the algorithm (accuracy of results, etc.) and the functionality of the API. For the optimization round, we want the performance gain to really justify the introduction of possibly more complicated code and its implications for the code-maintenance process. It is clear that the sterf_cpu code performs better on the tested cases, but there are other questions I would like to explore as well:

Is there any difference between the in-house sterf_cpu code and lapack_sterf? Which one is faster?

The CPU options run batch problems in a sequential for-loop; is the gain in performance enough to outperform the GPU code, which runs the batch in parallel? And what about the GPU code with the optimizations? If not, we should find and add a crossover size for the batched and strided_batched routines, or limit the CPU code to normal, non-batched executions (especially if the CPU code options can only be enabled/disabled at compile time).

I also wonder what the effect of the optimizations/parallelization is on the GPU code; how does it compare against the original code and against sterf_cpu or lapack_sterf? Is the performance gain enough to justify merging only the optimized GPU code for now?

We also need to think of a rationale that justifies the proposed approach at a higher level in a workflow. Advanced users may know whether their problems are better suited to CPU or GPU computation, and there is nothing preventing them from using different libraries according to their needs; rocSOLVER is not intended to be a single-entry-point, fit-all-cases solution. If a user doesn't get enough performance out of a GPU solution, they can directly call a CPU library instead (an AMD CPU library or others). Right now, with the memory model that we are using, I don't see why a user would want to transfer data to the GPU just to call a rocSOLVER API that will simply copy the data back to the host to perform the computations on the CPU (and this is essentially what rocsolver_sterf will end up doing).

A scenario like this (with internal data transfers) makes more sense with a truly hybrid code, i.e. code that processes the data via GPU-CPU collaboration, which is not the case for the proposed rocsolver_sterf. Another scenario that, IMO, may justify the use of the pure-CPU rocsolver_sterf is when the data is already on the GPU, as is the case for syevd, for example. rocsolver_syevd needs the data on the GPU to perform the initial tridiagonalization before calling the pure-CPU routines sterf_cpu or lapack_sterf. These kinds of "hybrid" functions could make sense but, as I said, we just need to think of the best way to integrate them and, especially, how to document them in the User's Guide.

In the meantime, please note that syevd is not the only function that calls sterf; it could be called by rocsolver_syev and rocsolver_stedc as well. These functions could then be "hybrid" in the same sense and take advantage of the more efficient sterf implementation, so they may need the same amendments as rocsolver_syevd. (In particular, looking only at the optimizations to the GPU code, I am not sure whether rocsolver_syev or rocsolver_stedc will work if the library is built in EXPERIMENTAL mode, as the workspace requirements of sterf are changing.) This is something that would need to be addressed, either in this PR or a future one.

5 participants