Hi, I'm not reporting a bug here, just asking for your advice.
I have a 6B model (chatglm1-6b), and a machine with 256 GB of CPU memory and 4 GPUs (3090s, each with 24 GB of GPU memory).
I want to fine-tune a LoRA adapter on top of the original model (the trainable parameters are about 0.5% of the total, as reported in the training logs).
Whether I use one, two, or four 3090 GPUs, it always reports CUDA out of memory during back-propagation. The full logs (with 4 GPUs) are below.
What I want to know is whether there is any DeepSpeed configuration that could support fine-tuning this 6B model on the current hardware (256 GB of CPU memory and four 3090s with 24 GB of VRAM each), or whether this is simply not achievable on such hardware.
Below are the DeepSpeed configuration I used and the logs reported by DeepSpeed.
And here is the core logic of model initialization and training:
import deepspeed
from peft import LoraConfig, get_peft_model
# ChatGLMForConditionalGeneration and ChatGLMTokenizer come from the chatglm-6b repo's custom modeling/tokenization code.

# Load the pretrained model into CPU memory only.
model = ChatGLMForConditionalGeneration.from_pretrained(
    args.model_dir,
    device_map="cpu")
tokenizer = ChatGLMTokenizer.from_pretrained(args.model_dir)

config = LoraConfig(r=args.lora_r,
                    lora_alpha=32,
                    target_modules=["query_key_value"],
                    lora_dropout=0.1,
                    bias="none",
                    task_type="CAUSAL_LM",
                    inference_mode=False)

# Wrap the base model with LoRA adapters and cast to fp16.
model = get_peft_model(model, config).half()

# Dataset and dataloader setup omitted.
...

# Initialize the DeepSpeed engine.
model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                      model=model,
                                                      model_parameters=model.parameters())

# The forward and backward passes are omitted; they are roughly
# `outputs = model_engine.forward(input_ids=input_ids, labels=labels)` and `model_engine.backward(loss)`.
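For completeness, here is a minimal sketch of what that elided training step could look like, assuming a `train_dataloader` that yields `input_ids` and `labels` tensors (the dataloader name and batch layout are my assumptions, not taken from the original code):

for step, batch in enumerate(train_dataloader):
    # Move the batch to this rank's GPU; model_engine.device is set by DeepSpeed at initialize time.
    input_ids = batch["input_ids"].to(model_engine.device)
    labels = batch["labels"].to(model_engine.device)
    outputs = model_engine(input_ids=input_ids, labels=labels)
    loss = outputs.loss
    model_engine.backward(loss)  # DeepSpeed applies fp16 loss scaling here
    model_engine.step()          # optimizer step + zero_grad, honoring gradient accumulation boundaries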
96 GB of VRAM should be plenty for a 6B model. I would start with a simpler ds_config (no offload, and leave the communication values at their defaults) and see the result. Use the DeepSpeed function see_memory_usage to monitor GPU memory from code, or watch nvidia-smi manually.
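For reference, a minimal sketch of such a simplified config, written as the Python dict passed to deepspeed.initialize, plus the see_memory_usage call (the batch size, accumulation steps, and ZeRO stage below are illustrative assumptions, not values from the original report):

from deepspeed.runtime.utils import see_memory_usage

# Illustrative "simpler" config: ZeRO-2, fp16, no offload, default communication settings.
conf = {
    "train_micro_batch_size_per_gpu": 1,   # assumed value, tune for your data
    "gradient_accumulation_steps": 8,      # assumed value
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# Call at interesting points (e.g. after forward, after backward, after step) to print
# allocated/reserved GPU memory and CPU virtual memory on rank 0.
see_memory_usage("after backward", force=True)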