Allocate managed memory if device memory runs out #709

ngc92 · 2024-07-24T13:50:31Z

Use cudaMallocManaged to allocate the optimizer states if we run out of device memory, so we can still train (slowly) even when the optimizer state does not fit on the device.
This is based on #694, which should be merged first.
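
For illustration, the fallback described above could look roughly like the sketch below. This is not the code from this PR: the helper name `alloc_device_or_managed` and its signature are hypothetical, and the `cudaMemAdvise` call assumes the pre-CUDA-13 signature that takes an integer device id. The sketch tries a plain device allocation first, and only on an out-of-memory error switches to managed memory and hints that the pages should preferably stay on the host.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: name and signature are illustrative, not taken from the PR.
// Tries cudaMalloc first; if that fails with an out-of-memory error, falls back to
// cudaMallocManaged. On success, *used_managed reports which path was taken.
inline cudaError_t alloc_device_or_managed(void** out, size_t bytes, bool* used_managed) {
    assert(out != nullptr && used_managed != nullptr);
    *used_managed = false;

    cudaError_t err = cudaMalloc(out, bytes);
    if (err != cudaErrorMemoryAllocation) {
        // Either cudaSuccess, or a failure that is unrelated to running out of memory.
        return err;
    }

    // Clear the recorded out-of-memory error so it does not leak into later checks.
    cudaGetLastError();

    // Managed (unified) memory can be backed by host RAM and is migrated on demand.
    err = cudaMallocManaged(out, bytes, cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        return err;
    }
    *used_managed = true;

    // Assumption: pre-CUDA-13 cudaMemAdvise signature (int device id).
    // Hint that the buffer should preferably reside in host memory, leaving
    // device memory to the allocations that did fit.
    return cudaMemAdvise(*out, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
}
```

A caller would check the returned `cudaError_t` as usual and could log `used_managed` to warn that training will be slower. Both kinds of pointer are released with `cudaFree`.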


karpathy · 2024-08-16T16:43:26Z

train_gpt2.cu

@@ -393,13 +393,13 @@ void gpt2_allocate_state(GPT2 *model, int B, int T) {
    printf0("allocating %zu MiB for AdamW optimizer state v\n", (shard_num_parameters * sizeof(float)) >> 20);
    assert(model->m_memory == nullptr);
    assert(model->v_memory == nullptr);
-    cudaCheck(cudaMalloc((void**)&model->m_memory, shard_num_parameters * sizeof(float)));
-    cudaCheck(cudaMalloc((void**)&model->v_memory, shard_num_parameters * sizeof(float)));
+    cudaMallocConditionallyManaged((void**)&model->m_memory, shard_num_parameters * sizeof(float));


imo we should try to update the import statements to show which file any function (e.g. the new Managed malloc) comes from (here cuda_utils)

ngc92 force-pushed the managed-2 branch from 97eb262 to 71655e7 Compare July 24, 2024 15:01

fall back to cudaMallocManaged for optimizer states if we're out of memory

0d52d2a

ngc92 force-pushed the managed-2 branch from 8618349 to 3828482 Compare August 15, 2024 22:42

ngc92 added 2 commits August 16, 2024 01:43

just try to allocate on device; fallback if that fails

c845757

hint to host

f72c1f2

ngc92 force-pushed the managed-2 branch from 3828482 to f72c1f2 Compare August 15, 2024 22:44

karpathy reviewed Aug 16, 2024


karpathy merged commit f72c1f2 into karpathy:master Aug 16, 2024
13 checks passed
