gpu out of memory for lora model when using zero 3 #6679

Open
xuanhua opened this issue Oct 28, 2024 · 2 comments

xuanhua (Contributor) commented Oct 28, 2024

Hi, I'm not reporting a bug here, but rather asking for your advice.

I have a 6B model (chatglm1-6b) and a machine with 256 GB of CPU memory and four GPUs (RTX 3090, 24 GB of VRAM each).
I want to fine-tune a LoRA adapter on top of the original model (the trainable parameters are only about 0.06% of the total, as reported in the training logs below).

Whether I use one, two, or four 3090 GPUs, it always reports CUDA out of memory during back-propagation. The full logs (with 4 GPUs) are included below.

What I want to know is whether there is any DeepSpeed configuration that could support fine-tuning this 6B model on the current hardware (256 GB of CPU memory and four 3090 GPUs with 24 GB of VRAM each), or whether this is simply not achievable with such hardware.

Below is the DeepSpeed configuration I used (a Python dict passed to deepspeed.initialize), followed by the logs reported by DeepSpeed.

{"train_micro_batch_size_per_gpu": 1,
            "gradient_accumulation_steps": 1,
            "optimizer": {
                "type": "Adam",
                "params": {
                    "lr": 1e-5,
                    "betas": [
                        0.9,
                        0.95
                    ],
                    "eps": 1e-8,
                    "weight_decay": 5e-4
                }
            },
            "fp16": {
                "enabled": True
            },
            "activation_checkpointing": {
                "partition_activations": True,
                "contiguous_memory_optimization": True,
                "profile": True
            },
            "zero_optimization": {
                "stage": 3,
                "offload_optimizer": {
                    "device": "cpu",
                    "pin_memory": False
                },
                "offload_param": {
                    "device": "cpu"
                },

                "allgather_partitions": True,
                "allgather_bucket_size": 2e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 2e8,
                "contiguous_gradients": True
            },
            "steps_per_print": 10
            }

Full logs from DeepSpeed:

[2024-10-28 17:56:27,869] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-28 17:56:38,640] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2024-10-28 17:56:38,675] [INFO] [runner.py:568:main] cmd = /root/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=5524 --enable_each_rank_log=None /data/finetuning_chatglm/finetuning_lora.py
[2024-10-28 17:56:40,343] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-28 17:56:43,287] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.21.5-1+cuda12.4
[2024-10-28 17:56:43,289] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.21.5-1
[2024-10-28 17:56:43,289] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.21.5-1
[2024-10-28 17:56:43,290] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-10-28 17:56:43,290] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.21.5-1+cuda12.4
[2024-10-28 17:56:43,290] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-10-28 17:56:43,290] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.21.5-1
[2024-10-28 17:56:43,291] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-10-28 17:56:43,292] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-10-28 17:56:43,292] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-10-28 17:56:43,292] [INFO] [launch.py:163:main] dist_world_size=4
[2024-10-28 17:56:43,292] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-10-28 17:56:43,317] [INFO] [launch.py:253:main] process 109788 spawned with command: ['/root/anaconda3/bin/python', '-u', '/data/finetuning_chatglm/finetuning_lora.py', '--local_rank=0']
[2024-10-28 17:56:43,337] [INFO] [launch.py:253:main] process 109789 spawned with command: ['/root/anaconda3/bin/python', '-u', '/data/finetuning_chatglm/finetuning_lora.py', '--local_rank=1']
[2024-10-28 17:56:43,356] [INFO] [launch.py:253:main] process 109790 spawned with command: ['/root/anaconda3/bin/python', '-u', '/data/finetuning_chatglm/finetuning_lora.py', '--local_rank=2']
[2024-10-28 17:56:43,377] [INFO] [launch.py:253:main] process 109791 spawned with command: ['/root/anaconda3/bin/python', '-u', '/data/finetuning_chatglm/finetuning_lora.py', '--local_rank=3']
[2024-10-28 17:56:52,712] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-28 17:56:55,018] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-28 17:56:57,406] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-28 17:56:58,583] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00,  1.69s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00,  1.48s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00,  1.47s/it]
Loading checkpoint shards:  62%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                                                        | 5/8 [00:06<00:03,  1.32s/it]/root/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:173: UserWarning: fan_in_fan_out is set to True but the target module is not a Conv1D. Setting fan_in_fan_out to False.
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:09<00:00,  1.19s/it]
/root/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:173: UserWarning: fan_in_fan_out is set to True but the target module is not a Conv1D. Setting fan_in_fan_out to False.
  warnings.warn(
/root/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:173: UserWarning: fan_in_fan_out is set to True but the target module is not a Conv1D. Setting fan_in_fan_out to False.
  warnings.warn(
/root/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py:173: UserWarning: fan_in_fan_out is set to True but the target module is not a Conv1D. Setting fan_in_fan_out to False.
  warnings.warn(
trainable params: 3670016 || all params: 6258876416 || trainable%: 0.05863697820615348
[2024-10-28 17:58:08,531] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.0, git-hash=unknown, git-branch=unknown
[2024-10-28 17:58:08,532] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-28 17:58:16,468] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.0, git-hash=unknown, git-branch=unknown
[2024-10-28 17:58:16,469] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-28 17:58:16,469] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
trainable params: 3670016 || all params: 6258876416 || trainable%: 0.05863697820615348
trainable params: 3670016 || all params: 6258876416 || trainable%: 0.05863697820615348
[2024-10-28 17:58:25,785] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.0, git-hash=unknown, git-branch=unknown
[2024-10-28 17:58:25,786] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-28 17:58:26,914] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.0, git-hash=unknown, git-branch=unknown
[2024-10-28 17:58:26,915] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-28 17:58:51,663] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 13.038025379180908 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 13.995056629180908 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000500, adam_w=1
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000500, adam_w=1
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 15.653298377990723 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 15.602773189544678 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000500, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000500, adam_w=1
[2024-10-28 17:59:21,789] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2024-10-28 17:59:21,790] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-10-28 17:59:21,928] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-10-28 17:59:21,928] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-10-28 17:59:21,929] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-10-28 17:59:21,929] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2024-10-28 17:59:22,053] [INFO] [utils.py:800:see_memory_usage] Stage 3 initialize beginning
[2024-10-28 17:59:22,056] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:22,057] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 22.38 GB, percent = 8.9%
[2024-10-28 17:59:22,067] [INFO] [stage3.py:130:__init__] Reduce bucket size 200000000
[2024-10-28 17:59:22,068] [INFO] [stage3.py:131:__init__] Prefetch bucket size 50,000,000
[2024-10-28 17:59:22,154] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-10-28 17:59:22,157] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:22,158] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 22.38 GB, percent = 8.9%
Parameter Offload: Total persistent parameters: 5169152 in 282 params
[2024-10-28 17:59:29,547] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-10-28 17:59:29,550] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:29,550] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.35 GB, percent = 13.7%
[2024-10-28 17:59:29,644] [INFO] [utils.py:800:see_memory_usage] Before creating fp16 partitions
[2024-10-28 17:59:29,647] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:29,647] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.35 GB, percent = 13.7%
[2024-10-28 17:59:29,801] [INFO] [utils.py:800:see_memory_usage] After creating fp16 partitions: 1
[2024-10-28 17:59:29,804] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:29,805] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.36 GB, percent = 13.7%
[2024-10-28 17:59:29,897] [INFO] [utils.py:800:see_memory_usage] Before creating fp32 partitions
[2024-10-28 17:59:29,900] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:29,901] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.36 GB, percent = 13.7%
[2024-10-28 17:59:30,002] [INFO] [utils.py:800:see_memory_usage] After creating fp32 partitions
[2024-10-28 17:59:30,005] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:30,006] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.36 GB, percent = 13.7%
[2024-10-28 17:59:30,116] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-10-28 17:59:30,119] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:30,119] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.37 GB, percent = 13.7%
[2024-10-28 17:59:30,212] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-10-28 17:59:30,215] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:30,215] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.37 GB, percent = 13.7%
[2024-10-28 17:59:30,216] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[2024-10-28 17:59:30,488] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-10-28 17:59:30,491] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2024-10-28 17:59:30,492] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 34.38 GB, percent = 13.7%
[2024-10-28 17:59:30,492] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2024-10-28 17:59:30,493] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-10-28 17:59:30,493] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-10-28 17:59:30,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05], mom=[[0.9, 0.95]]
[2024-10-28 17:59:30,533] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-10-28 17:59:30,535] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": true, 
    "contiguous_memory_optimization": true, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": true
}
[2024-10-28 17:59:30,535] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-10-28 17:59:30,536] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2024-10-28 17:59:30,536] [INFO] [config.py:1000:print]   amp_params ................... False
[2024-10-28 17:59:30,537] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-10-28 17:59:30,537] [INFO] [config.py:1000:print]   bfloat16_enabled ............. False
[2024-10-28 17:59:30,538] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2024-10-28 17:59:30,538] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2024-10-28 17:59:30,538] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2024-10-28 17:59:30,538] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2024-10-28 17:59:30,539] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f6edfd7a5b0>
[2024-10-28 17:59:30,539] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-10-28 17:59:30,539] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-10-28 17:59:30,540] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-10-28 17:59:30,540] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-10-28 17:59:30,540] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-10-28 17:59:30,541] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-10-28 17:59:30,541] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-10-28 17:59:30,541] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-10-28 17:59:30,541] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-10-28 17:59:30,542] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-10-28 17:59:30,542] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-10-28 17:59:30,542] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-10-28 17:59:30,542] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2024-10-28 17:59:30,543] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-10-28 17:59:30,543] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-10-28 17:59:30,543] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-10-28 17:59:30,543] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-10-28 17:59:30,544] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-10-28 17:59:30,544] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-10-28 17:59:30,544] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-10-28 17:59:30,545] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-10-28 17:59:30,545] [INFO] [config.py:1000:print]   fp16_auto_cast ............... False
[2024-10-28 17:59:30,545] [INFO] [config.py:1000:print]   fp16_enabled ................. True
[2024-10-28 17:59:30,545] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2024-10-28 17:59:30,546] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-10-28 17:59:30,546] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-10-28 17:59:30,546] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 1
[2024-10-28 17:59:30,546] [INFO] [config.py:1000:print]   gradient_clipping ............ 0.0
[2024-10-28 17:59:30,547] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-10-28 17:59:30,547] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-10-28 17:59:30,547] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-10-28 17:59:30,548] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 65536
[2024-10-28 17:59:30,548] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-10-28 17:59:30,548] [INFO] [config.py:1000:print]   loss_scale ................... 0
[2024-10-28 17:59:30,548] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-10-28 17:59:30,549] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2024-10-28 17:59:30,549] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-10-28 17:59:30,549] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-10-28 17:59:30,550] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-10-28 17:59:30,550] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-10-28 17:59:30,550] [INFO] [config.py:1000:print]   optimizer_name ............... adam
[2024-10-28 17:59:30,551] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 1e-05, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.0005}
[2024-10-28 17:59:30,551] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-10-28 17:59:30,551] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-10-28 17:59:30,552] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-10-28 17:59:30,552] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-10-28 17:59:30,552] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2024-10-28 17:59:30,552] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2024-10-28 17:59:30,553] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2024-10-28 17:59:30,553] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-10-28 17:59:30,553] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-10-28 17:59:30,553] [INFO] [config.py:1000:print]   steps_per_print .............. 10
[2024-10-28 17:59:30,554] [INFO] [config.py:1000:print]   train_batch_size ............. 4
[2024-10-28 17:59:30,554] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  1
[2024-10-28 17:59:30,554] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2024-10-28 17:59:30,554] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-10-28 17:59:30,555] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... False
[2024-10-28 17:59:30,555] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2024-10-28 17:59:30,555] [INFO] [config.py:1000:print]   world_size ................... 4
[2024-10-28 17:59:30,555] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2024-10-28 17:59:30,556] [INFO] [config.py:1000:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-10-28 17:59:30,557] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-10-28 17:59:30,557] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-10-28 17:59:30,557] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 3
[2024-10-28 17:59:30,558] [INFO] [config.py:986:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 1, 
    "optimizer": {
        "type": "Adam", 
        "params": {
            "lr": 1e-05, 
            "betas": [0.9, 0.95], 
            "eps": 1e-08, 
            "weight_decay": 0.0005
        }
    }, 
    "fp16": {
        "enabled": true
    }, 
    "activation_checkpointing": {
        "partition_activations": true, 
        "contiguous_memory_optimization": true, 
        "profile": true
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "offload_param": {
            "device": "cpu"
        }, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true
    }, 
    "steps_per_print": 10
}
Traceback (most recent call last):
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3713, in <module>
    main()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3706, in main
    globals = debugger.run(setup["file"], None, None, is_module)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2704, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2712, in _exec
    globals = pydevd_runpy.run_path(file, globals, "__main__")
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/data/finetuning_chatglm/finetuning_lora.py", line 168, in <module>
    main()
  File "/data/finetuning_chatglm/finetuning_lora.py", line 150, in main
    model_engine.backward(loss)
  File "/root/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/root/anaconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/root/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: out of memory


xuanhua commented Oct 28, 2024

And here is the core logic of the model initialization and training:

# Load the pretrained model into CPU memory only
model = ChatGLMForConditionalGeneration.from_pretrained(
        args.model_dir,
        device_map="cpu")
tokenizer = ChatGLMTokenizer.from_pretrained(args.model_dir)

config = LoraConfig(r=args.lora_r,
                        lora_alpha=32,
                        target_modules=["query_key_value"],
                        lora_dropout=0.1,
                        bias="none",
                        task_type="CAUSAL_LM",
                        inference_mode=False,
                        )
# create lora model
model = get_peft_model(model, config).half()

# Dataset and dataloader setup omitted here
...

# Initialize the deepspeed engine.
model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                         model=model,
                                                         model_parameters=model.parameters())

# The forward and backward parts are omitted; they look like `outputs = model_engine.forward(input_ids=input_ids, labels=labels)` followed by `model_engine.backward(loss)` (a fuller sketch follows below)
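
For completeness, here is a minimal sketch of that elided loop (the `train_dataloader` and the batch keys are placeholders, not the exact code from my script):

for batch in train_dataloader:
    input_ids = batch["input_ids"].to(model_engine.device)
    labels = batch["labels"].to(model_engine.device)

    outputs = model_engine(input_ids=input_ids, labels=labels)  # forward pass
    loss = outputs.loss

    model_engine.backward(loss)  # ZeRO stage 3 partitions gradients during backward
    model_engine.step()          # optimizer step and gradient zeroing are handled by the engine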

@jomayeri jomayeri self-assigned this Oct 28, 2024
@jomayeri
Contributor

96GB of VRAM should be plenty for a 6B model. I would start with a simpler ds_config (no offload, and leave the communication values at their defaults) and see the result. Use the DeepSpeed function see_memory_usage to monitor GPU memory usage in code, or watch nvidia-smi manually to observe GPU memory usage.
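
For illustration, a minimal sketch of both suggestions (assumptions only, reusing the optimizer settings from the original config; not a verified fix):

from deepspeed.runtime.utils import see_memory_usage

# Stripped-down config: ZeRO stage 3, no offload, communication values left at defaults.
conf = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-5, "betas": [0.9, 0.95], "eps": 1e-8, "weight_decay": 5e-4}
    },
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "steps_per_print": 10
}

# Inside the training loop: log memory around the failing backward pass.
# force=True enables the printout; it is emitted from rank 0 only.
see_memory_usage("before forward", force=True)
outputs = model_engine(input_ids=input_ids, labels=labels)
see_memory_usage("after forward, before backward", force=True)
model_engine.backward(outputs.loss)
see_memory_usage("after backward", force=True)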
