Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix using BLAS for all compatible cases of memory layout #1419

Merged
merged 8 commits into from
Aug 8, 2024
Merged

Conversation

bluss
Copy link
Member

@bluss bluss commented Aug 7, 2024

With the blas (cblas) interface it supports matrices that adhere to certain
criteria. They should be contiguous on one dimension (stride=1).

We glance a little at how numpy does this to try to catch all cases.

Compute A B -> C:
We require for BLAS compatibility that: A, B, C are
"weakly" contiguous (stride=1) in their fastest dimension, but it can be
either first or second axis (either rowmajor/"c" or colmajor/"f").

The "normal case" is CblasRowMajor for cblas. Select CblasRowMajor /
CblasColMajor to fit C's memory order.

Apply transpose to A, B as needed if they differ from row major. If C
is CblasColMajor then transpose both A, B (again!)

(Weakly = contiguous with stride=1 on that fastest axis, but stride for the
other axis can be arbitrary large; to differentiate from strictly whole
array contiguous.)

A first commit simplified and corrected the logic, while still using
ndarray's reversed axes. But a further commit simplified it even further, to
a satisfying little function in mat_mul_impl as the final result.

I have kept both states (both commits) because I think the first version is
a useful guide if we would ever go to use plain BLAS instead of CBLAS(?).

Fixes #1278

@bluss bluss force-pushed the blas-layout branch 4 times, most recently from 4e25c2c to 248109d Compare August 7, 2024 11:44
@bluss bluss added this to the 0.16.x milestone Aug 7, 2024
@bluss bluss force-pushed the blas-layout branch 3 times, most recently from be22336 to 5c8b9de Compare August 7, 2024 13:56
@bluss bluss changed the title Fix using BLAS for all possible cases (of memory layout) Fix using BLAS for all compatible cases (of memory layout) Aug 7, 2024
@bluss bluss changed the title Fix using BLAS for all compatible cases (of memory layout) Fix using BLAS for all compatible cases of memory layout Aug 7, 2024
Lost in the recent workspace refactor.
We compute A B -> C with matrices A, B, C

With the blas (cblas) interface it supports matrices that adhere to
certain criteria. They should be contiguous on one dimension (stride=1).

We glance a little at how numpy does this to try to catch all cases.

In short, we accept A, B contiguous on either axis (row or column
major). We use the case where C is (weakly) row major, but if it is
column major we transpose A, B, C => A^t, B^t, C^t so that we are back
to the C row major case.

(Weakly = contiguous with stride=1 on that inner dimension, but stride
for the other dimension can be larger; to differentiate from strictly
whole array contiguous.)

Minor change to the gemv function, no functional change, only updating
due to the refactoring of blas layout functions.

Fixes #1278
If we have a matrix of dimension say 5 x 5, BLAS requires the leading
stride to be >= 5. Smaller cases are possible for read-only array views
in ndarray(broadcasting and custom strides).

In this case we mark the array as not BLAS compatible
Using cblas we can simplify this further to a more satisfying
translation (from ndarray to BLAS), much simpler logic.

Avoids creating and handling an extra layer of array views.
Add a crate with a mock blas implementation, so that we can assert that
cblas_sgemm etc are called (depending on memory layout).
@bluss bluss added the blas label Aug 8, 2024
@bluss bluss merged commit f563af0 into master Aug 8, 2024
12 checks passed
@bluss bluss deleted the blas-layout branch August 8, 2024 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

X^T*B doesn't call BLAS gemm
1 participant