[Feature Request] Support for Direct GPU Kernel Definitions and Execution in Mojo #3788

guna-sd opened this issue Nov 20, 2024

Current Behavior:

Writing GPU kernels directly in Mojo is not yet natively supported. As a workaround, the kernel can be written in C or C++ (e.g., marked __global__ or __device__), compiled to PTX, and used from Mojo by dynamically loading the CUDA library with a DLLHandle. The kernel can then be launched via cuLaunchKernel or similar driver APIs.

Example:

from mccl.cuda import CudaLib, cudaDeviceProp
from memory import UnsafePointer

fn main():
    var cudaLib = CudaLib()
    var get_device_count = cudaLib.load_function[fn (UnsafePointer[Int32]) -> Int32]("cudaGetDeviceCount")
    var get_device_properties = cudaLib.load_function[fn (UnsafePointer[cudaDeviceProp], Int32) -> Int32]("cudaGetDeviceProperties")
    
    var deviceCount: Int32 = 0
    _ = get_device_count(UnsafePointer.address_of(deviceCount))
    
    print("Number of CUDA devices:", deviceCount)
    
    for i in range(deviceCount):
        var deviceProp = cudaDeviceProp()
        _ = get_device_properties(UnsafePointer.address_of(deviceProp), i)
        print("Device", i, ":", deviceProp.name)
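For reference, the PTX consumed by this workaround has to be produced separately from a CUDA C++ source file. A minimal sketch of that side (the file and kernel names are hypothetical):

```cuda
// addvec.cu — hypothetical kernel source for the workaround above.
// Compile to PTX with:  nvcc -ptx addvec.cu -o addvec.ptx
// extern "C" prevents C++ name mangling so the kernel can be looked
// up by name through the driver API.
extern "C" __global__ void addvec(const float* A, const float* B,
                                  float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```

The resulting addvec.ptx can then be loaded and launched through the driver API (cuModuleLoad, cuLaunchKernel) using the same DLLHandle approach shown above.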

Proposed Solution:

Mojo could introduce a new keyword or decorator (e.g., @kernel or @global) to explicitly declare a function as a GPU kernel.
This would let the compiler handle such functions differently from regular ones, generating the corresponding low-level LLVM IR for the GPU target.

Example:

@kernel
fn addveckernel(A: UnsafePointer[Float32], B: UnsafePointer[Float32], C: UnsafePointer[Float32], n: Int):
    # Hypothetical built-ins exposing the usual CUDA thread indices.
    var tid = threadIdx.x
    var bid = blockIdx.x
    var block_size = blockDim.x

    var global_idx = tid + bid * block_size

    if global_idx < n:
        # UnsafePointer supports direct indexing, so no separate
        # load()/store() calls are needed.
        var a_val = A[global_idx]
        var b_val = B[global_idx]
        var c_val = a_val + b_val

        C[global_idx] = c_val

This could be translated to LLVM IR along the lines of:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() readnone nounwind
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() readnone nounwind

define void @addveckernel(ptr addrspace(1) %A,
                          ptr addrspace(1) %B,
                          ptr addrspace(1) %C,
                          i32 %n) 
{
entry:
    %tid_x = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() ; threadIdx.x
    %bid_x = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() ; blockIdx.x
    %bsize_x = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() ; blockDim.x
    
    %block_offset = mul i32 %bid_x, %bsize_x
    %global_idx = add i32 %tid_x, %block_offset
    
    %in_bounds = icmp slt i32 %global_idx, %n
    br i1 %in_bounds, label %compute, label %exit

compute:
    %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %global_idx
    %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %global_idx
    %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %global_idx
    %valA = load float, ptr addrspace(1) %ptrA, align 4
    %valB = load float, ptr addrspace(1) %ptrB, align 4
    %valC = fadd float %valA, %valB
    store float %valC, ptr addrspace(1) %ptrC, align 4
    br label %exit

exit:
    ret void
}

!nvvm.annotations = !{!0}
!0 = !{ptr @addveckernel, !"kernel", i32 1}

References

mojotoir

See the official Modular blogs for details:

Modular blog on Mojo, LLVM, and MLIR

Modular blog on the gpu module in Mojo

Kernel Launch Mechanism:

Mojo could introduce a LaunchKernel function that launches a GPU kernel with the necessary grid and block dimensions. The kernel function would be compiled to PTX and executed directly on the GPU. The call might look like this:
LaunchKernel(addveckernel, gridDim=dim3(1, 1, 1), blockDim=dim3(256, 1, 1), args=(a, b, c, n))
Alternatively, a direct launch similar to the <<<...>>> syntax in C/C++ could be implemented as a parameterized call in Mojo:
addveckernel[dim3(1, 1, 1), dim3(256, 1, 1)](a, b, c, n)
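For comparison, whichever surface syntax is chosen, the launch would lower to roughly the CUDA Driver API sequence the manual workaround uses today. A sketch of that host side (assumes a current CUDA context and a prebuilt addvec.ptx on disk; the helper name is hypothetical and error handling is omitted):

```cuda
#include <cuda.h>

// Hypothetical host-side helper: load a prebuilt PTX module and launch
// the "addvec" kernel over n elements with 256-thread blocks.
void launch_addvec(CUdeviceptr a, CUdeviceptr b, CUdeviceptr c, int n) {
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, "addvec.ptx");        // PTX produced by nvcc -ptx
    cuModuleGetFunction(&fn, mod, "addvec");

    void* args[] = { &a, &b, &c, &n };
    unsigned grid_x = (n + 255) / 256;       // one thread per element
    // gridDim=(grid_x,1,1), blockDim=(256,1,1), 0 bytes of shared
    // memory, default stream, kernel parameters, no extra options.
    cuLaunchKernel(fn, grid_x, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();
    cuModuleUnload(mod);
}
```

The gridDim/blockDim arguments of the proposed LaunchKernel map one-to-one onto the gridDimX..Z and blockDimX..Z parameters of cuLaunchKernel.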

Conclusion:

Adding support for direct GPU kernel definitions and execution would significantly enhance Mojo's capabilities for high-performance parallel computing.
