Aot change merge #549
base: triton-mlir
Conversation
@@ -544,132 +591,198 @@ def _bwd_kernel_dk_dv(
stride_qz, stride_qh, stride_qm, stride_qk,
stride_kz, stride_kh, stride_kn, stride_kk,
stride_vz, stride_vh, stride_vk, stride_vn,
stride_oz, stride_oh, stride_om, stride_ok,
The dkdv kernel needs strides from do, dk and dv (o's strides are not used, contradicting the stride_o* naming).
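A minimal sketch of what the renamed stride arguments might look like; apart from the q/k/v strides shown in the diff, the argument list and the stride_do*/stride_dk*/stride_dv* names are illustrative, not the PR's exact signature.

```python
import triton
import triton.language as tl

# Hypothetical sketch: the dk/dv backward kernel takes strides for do, dk and
# dv (the tensors it actually indexes) instead of o's strides.
@triton.jit
def _bwd_kernel_dk_dv_sketch(
    Q, K, V, sm_scale, DO, DK, DV, L, D,
    stride_qz, stride_qh, stride_qm, stride_qk,
    stride_kz, stride_kh, stride_kn, stride_kk,
    stride_vz, stride_vh, stride_vk, stride_vn,
    stride_doz, stride_doh, stride_dom, stride_dok,  # replaces stride_o*
    stride_dkz, stride_dkh, stride_dkn, stride_dkk,
    stride_dvz, stride_dvh, stride_dvk, stride_dvn,
    seqlen_q, seqlen_k,
    BLOCK_M: tl.constexpr, BLOCK_DMODEL: tl.constexpr, BLOCK_N: tl.constexpr,
):
    pass  # body elided; only the stride argument naming matters here
```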
q = tl.load(Q_block_ptr, boundary_check=(0,1), padding_option="zero")
do = tl.load(DO_block_ptr, boundary_check=(0,1), padding_option="zero")
else:
q = tl.load(Q_block_ptr, boundary_check=(0,), padding_option="zero")
This needs one more level of branching.
However, if we are aiming for performance, we should consider commenting out the boundary_check for now.
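A sketch of the extra branching level, assuming two hypothetical tl.constexpr flags (PADDED_SEQ and PADDED_HEAD, not names from the PR) that tell the kernel which dimensions can actually run out of bounds; the innermost path drops boundary_check entirely, which also covers the performance suggestion above.

```python
import triton
import triton.language as tl

# Device-side helper sketch.  PADDED_SEQ / PADDED_HEAD are hypothetical
# compile-time flags; the last branch loads without any boundary check.
@triton.jit
def load_q_do(Q_block_ptr, DO_block_ptr,
              PADDED_SEQ: tl.constexpr, PADDED_HEAD: tl.constexpr):
    if PADDED_SEQ:
        if PADDED_HEAD:
            q = tl.load(Q_block_ptr, boundary_check=(0, 1), padding_option="zero")
            do = tl.load(DO_block_ptr, boundary_check=(0, 1), padding_option="zero")
        else:
            q = tl.load(Q_block_ptr, boundary_check=(0,), padding_option="zero")
            do = tl.load(DO_block_ptr, boundary_check=(0,), padding_option="zero")
    else:
        if PADDED_HEAD:
            q = tl.load(Q_block_ptr, boundary_check=(1,), padding_option="zero")
            do = tl.load(DO_block_ptr, boundary_check=(1,), padding_option="zero")
        else:
            # Shapes divide the block sizes: no masking needed at all.
            q = tl.load(Q_block_ptr)
            do = tl.load(DO_block_ptr)
    return q, do
```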
@@ -680,82 +793,118 @@ def _bwd_kernel_dq(
stride_qz, stride_qh, stride_qm, stride_qk,
stride_kz, stride_kh, stride_kn, stride_kk,
stride_vz, stride_vh, stride_vk, stride_vn,
seqlen_q, seqlen_k, dropout_p, philox_seed, philox_offset_base,
stride_oz, stride_oh, stride_om, stride_ok,
Similar to dkdv, the dq kernel needs strides from do and dq (again, not strides from o).
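The corresponding sketch for the dq kernel, again with illustrative names: stride_o* is replaced by strides of do and dq, and the dropout arguments visible in the diff are kept.

```python
import triton
import triton.language as tl

# Hypothetical sketch of the dq kernel's arguments; only the stride naming
# is the point, the rest mirrors the diff above.
@triton.jit
def _bwd_kernel_dq_sketch(
    Q, K, V, sm_scale, DO, DQ, L, D,
    stride_qz, stride_qh, stride_qm, stride_qk,
    stride_kz, stride_kh, stride_kn, stride_kk,
    stride_vz, stride_vh, stride_vk, stride_vn,
    stride_doz, stride_doh, stride_dom, stride_dok,  # replaces stride_o*
    stride_dqz, stride_dqh, stride_dqm, stride_dqk,
    seqlen_q, seqlen_k, dropout_p, philox_seed, philox_offset_base,
    BLOCK_M: tl.constexpr, BLOCK_DMODEL: tl.constexpr, BLOCK_N: tl.constexpr,
):
    pass  # body elided
```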
strides=(stride_qm, stride_qk),
offsets=(start_m, 0),
block_shape=(BLOCK_M, BLOCK_DMODEL),
order=(1, 0)
)
tl.store(DQ_block_ptr, (dq * sm_scale).to(DQ_block_ptr.type.element_ty))
tl.store(DQ_block_ptr, (dq * sm_scale).to(DQ_block_ptr.type.element_ty), boundary_check=(0,1))
tl.store also causes some performance penalty with boundary checks, although it shouldn't.
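One way to keep the unconditional store on the common case is to guard the boundary check behind a compile-time flag; PADDED is a hypothetical tl.constexpr name, not from the PR.

```python
import triton
import triton.language as tl

# Sketch: only pay for the boundary-checked store when the tile can actually
# be ragged; otherwise use the plain store.
@triton.jit
def store_dq(DQ_block_ptr, dq, sm_scale, PADDED: tl.constexpr):
    dq = (dq * sm_scale).to(DQ_block_ptr.type.element_ty)
    if PADDED:
        tl.store(DQ_block_ptr, dq, boundary_check=(0, 1))
    else:
        tl.store(DQ_block_ptr, dq)  # fast path, no masking
```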
q, k, v, ctx.sm_scale,
o, do_scaled,
dk, dv,
L, delta,
q.stride(0), q.stride(1), q.stride(2), q.stride(3),
k.stride(0), k.stride(1), k.stride(2), k.stride(3),
v.stride(0), v.stride(1), v.stride(2), v.stride(3),
o.stride(0), o.stride(1), o.stride(2), o.stride(3),
One more reminder: o's strides are not used by the backward kernels.
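A small host-side sketch of the idea (a hypothetical helper, not the PR's code): collect only the stride arguments the backward kernels actually read, i.e. do/dk/dv instead of o.

```python
import torch

# Hypothetical helper: gather the stride arguments for the backward launch.
# o's strides are deliberately absent because the kernels never use them.
def _bwd_stride_args(q, k, v, do, dk, dv):
    def strides(t: torch.Tensor):
        return (t.stride(0), t.stride(1), t.stride(2), t.stride(3))
    return (*strides(q), *strides(k), *strides(v),
            *strides(do), *strides(dk), *strides(dv))
```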
@@ -893,28 +1046,41 @@ def backward(ctx, do, _):
seqlen_q = q.shape[2]
Remove these two assertions above, since they are not needed anymore.
q = torch.empty((Z, H, N_CTX, D_HEAD), dtype=dtype, device="cuda").normal_(mean=0., std=0.5).requires_grad_()
k = torch.empty((Z, H, N_CTX, D_HEAD), dtype=dtype, device="cuda").normal_(mean=0., std=0.5).requires_grad_()
v = torch.empty((Z, H, N_CTX, D_HEAD), dtype=dtype, device="cuda").normal_(mean=0., std=0.5).requires_grad_()
dropout_p = 0
dropout_p = 0?
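If the question is why dropout is never exercised, one option (a sketch, not the PR's test) is to parametrize dropout_p instead of hard-coding 0.

```python
import pytest
import torch

# Hypothetical test sketch: cover both the dropout and no-dropout paths.
# The test name and default shapes are illustrative, not taken from the PR.
@pytest.mark.parametrize('dropout_p', [0.0, 0.5])
def test_bwd_dropout(dropout_p, Z=4, H=48, N_CTX=1024, D_HEAD=64, dtype=torch.float16):
    torch.manual_seed(20)
    q = torch.empty((Z, H, N_CTX, D_HEAD), dtype=dtype, device="cuda").normal_(mean=0., std=0.5).requires_grad_()
    k = torch.empty((Z, H, N_CTX, D_HEAD), dtype=dtype, device="cuda").normal_(mean=0., std=0.5).requires_grad_()
    v = torch.empty((Z, H, N_CTX, D_HEAD), dtype=dtype, device="cuda").normal_(mean=0., std=0.5).requires_grad_()
    # ... run forward/backward with dropout_p and compare against a reference
    # implementation that uses the same dropout mask.
```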
@@ -1186,45 +1361,88 @@ def test_op_varlen_mqa_fwd(Z, HQ, HK, N_CTX, D_HEAD, causal, dtype=torch.float16
(4, 48, 2048, 64),
(4, 48, 4096, 64),
(1, 16, 8192, 64),
(1, 16, 128, 32),
One major concern is the unit test (UT) coverage here.
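A sketch of the kind of shapes that would broaden coverage (values illustrative, apart from the one already in the diff): sequence lengths and head dims that do not divide the kernel's block sizes, so the boundary-check paths are actually exercised.

```python
# Illustrative additions to the (Z, H, N_CTX, D_HEAD) parametrization.
EXTRA_SHAPES = [
    (1, 16, 128, 32),   # already added in this PR
    (1, 16, 1000, 64),  # seqlen not a multiple of the block sizes
    (2, 4, 512, 80),    # head dim that is not a power of two
    (1, 1, 37, 128),    # very short, ragged sequence
]
```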
Merge in AOTriton backward kernel changes
Bring in changes from AOT, unedited.