[Feature Request] Support for Direct GPU Kernel Definitions and Execution in Mojo #3788

guna-sd opened this issue Nov 20, 2024

Current Behavior:

Writing GPU kernels directly in Mojo is not yet natively supported. As a workaround, the kernel can be written in C or C++ (e.g., marked __global__ or __device__), compiled to PTX, and used from Mojo by dynamically loading the CUDA library with a DLLHandle. The kernel can then be launched via cuLaunchKernel or similar driver APIs.

Example:

from mccl.cuda import CudaLib, cudaDeviceProp
from memory import UnsafePointer

fn main():
    var cudaLib = CudaLib()
    var get_device_count = cudaLib.load_function[fn (UnsafePointer[Int32]) -> Int32]("cudaGetDeviceCount")
    var get_device_properties = cudaLib.load_function[fn (UnsafePointer[cudaDeviceProp], Int32) -> Int32]("cudaGetDeviceProperties")
    
    var deviceCount: Int32 = 0
    _ = get_device_count(UnsafePointer.address_of(deviceCount))
    
    print("Number of CUDA devices:", deviceCount)
    
    for i in range(deviceCount):
        var deviceProp = cudaDeviceProp()
        _ = get_device_properties(UnsafePointer.address_of(deviceProp), i)
        print("Device", i, ":", deviceProp.name)
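For reference, the PTX consumed by this workaround has to be produced separately from a CUDA C++ source file. A minimal sketch of that side (the file and kernel names are hypothetical):

```cuda
// addvec.cu — hypothetical kernel source for the workaround above.
// Compile to PTX with:  nvcc -ptx addvec.cu -o addvec.ptx
// extern "C" prevents C++ name mangling so the kernel can be looked
// up by name through the driver API.
extern "C" __global__ void addvec(const float* A, const float* B,
                                  float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```

The resulting addvec.ptx can then be loaded and launched through the driver API (cuModuleLoad, cuLaunchKernel) using the same DLLHandle approach shown above.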

Proposed Solution:

Mojo could introduce a new keyword or decorator (e.g., @kernel or @global) to explicitly declare a function as a GPU kernel.
This would let the compiler handle such functions differently from regular ones, generating the corresponding low-level LLVM IR for the GPU target.

Example:

@kernel
fn addveckernel(A: UnsafePointer[Float32], B: UnsafePointer[Float32], C: UnsafePointer[Float32], n: Int):
    # Hypothetical built-ins exposing the usual CUDA thread indices.
    var tid = threadIdx.x
    var bid = blockIdx.x
    var block_size = blockDim.x

    var global_idx = tid + bid * block_size

    if global_idx < n:
        # UnsafePointer supports direct indexing, so no separate
        # load()/store() calls are needed.
        var a_val = A[global_idx]
        var b_val = B[global_idx]
        var c_val = a_val + b_val

        C[global_idx] = c_val

This could be translated to LLVM IR along the lines of:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() readnone nounwind
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() readnone nounwind

define void @addveckernel(ptr addrspace(1) %A,
                          ptr addrspace(1) %B,
                          ptr addrspace(1) %C,
                          i32 %n) 
{
entry:
    %tid_x = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() ; threadIdx.x
    %bid_x = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() ; blockIdx.x
    %bsize_x = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() ; blockDim.x
    
    %block_offset = mul i32 %bid_x, %bsize_x
    %global_idx = add i32 %tid_x, %block_offset
    
    %in_bounds = icmp slt i32 %global_idx, %n
    br i1 %in_bounds, label %compute, label %exit

compute:
    %ptrA = getelementptr float, ptr addrspace(1) %A, i32 %global_idx
    %ptrB = getelementptr float, ptr addrspace(1) %B, i32 %global_idx
    %ptrC = getelementptr float, ptr addrspace(1) %C, i32 %global_idx
    %valA = load float, ptr addrspace(1) %ptrA, align 4
    %valB = load float, ptr addrspace(1) %ptrB, align 4
    %valC = fadd float %valA, %valB
    store float %valC, ptr addrspace(1) %ptrC, align 4
    br label %exit

exit:
    ret void
}

!nvvm.annotations = !{!0}
!0 = !{ptr @addveckernel, !"kernel", i32 1}

References

mojotoir

See the official Modular blogs for details:

Modular blog on Mojo, LLVM, and MLIR

Modular blog on the gpu module in Mojo

Kernel Launch Mechanism:

Mojo could introduce a LaunchKernel function that launches a GPU kernel with the necessary grid and block dimensions. The kernel function would be compiled to PTX and executed directly on the GPU. The call might look like this:
LaunchKernel(addveckernel, gridDim=dim3(1, 1, 1), blockDim=dim3(256, 1, 1), args=(a, b, c, n))
Alternatively, a direct launch similar to the <<<...>>> syntax in C/C++ could be implemented as a parameterized call in Mojo:
addveckernel[dim3(1, 1, 1), dim3(256, 1, 1)](a, b, c, n)
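For comparison, whichever surface syntax is chosen, the launch would lower to roughly the CUDA Driver API sequence the manual workaround uses today. A sketch of that host side (assumes a current CUDA context and a prebuilt addvec.ptx on disk; the helper name is hypothetical and error handling is omitted):

```cuda
#include <cuda.h>

// Hypothetical host-side helper: load a prebuilt PTX module and launch
// the "addvec" kernel over n elements with 256-thread blocks.
void launch_addvec(CUdeviceptr a, CUdeviceptr b, CUdeviceptr c, int n) {
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, "addvec.ptx");        // PTX produced by nvcc -ptx
    cuModuleGetFunction(&fn, mod, "addvec");

    void* args[] = { &a, &b, &c, &n };
    unsigned grid_x = (n + 255) / 256;       // one thread per element
    // gridDim=(grid_x,1,1), blockDim=(256,1,1), 0 bytes of shared
    // memory, default stream, kernel parameters, no extra options.
    cuLaunchKernel(fn, grid_x, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();
    cuModuleUnload(mod);
}
```

The gridDim/blockDim arguments of the proposed LaunchKernel map one-to-one onto the gridDimX..Z and blockDimX..Z parameters of cuLaunchKernel.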

Conclusion:

Adding support for direct GPU kernel definitions and execution would significantly enhance Mojo's capabilities for high-performance parallel computing.
