Llmblog2 #16

Merged · 36 commits · Feb 26, 2024
Changes from 26 commits

Commits (36)
c7bfdba
working deepspeed
aciborowska Jan 29, 2024
67a9654
Trying out mistral
KevinMusgrave Feb 7, 2024
047b607
Move some chat formatting logic to chat_format
KevinMusgrave Feb 7, 2024
7a86d96
minor changes
KevinMusgrave Feb 9, 2024
34ff373
adding scrap.py
KevinMusgrave Feb 9, 2024
9ba7e6b
get max_length
KevinMusgrave Feb 9, 2024
788b170
Get rid of max_length function
KevinMusgrave Feb 12, 2024
8608815
test with batch size
KevinMusgrave Feb 12, 2024
a90a835
Test with a batch
KevinMusgrave Feb 12, 2024
341542e
Add get_tokenize_fn
KevinMusgrave Feb 12, 2024
5ea0939
Added test_model back in
KevinMusgrave Feb 12, 2024
c614bbe
Add max_length back in
KevinMusgrave Feb 12, 2024
b80fa15
lora seems to work
KevinMusgrave Feb 14, 2024
cbde3df
include system prompt in user prompt for mistral
KevinMusgrave Feb 14, 2024
96abc27
check if response is in decoded
KevinMusgrave Feb 14, 2024
b20e987
Use WarmupDecayLR
KevinMusgrave Feb 15, 2024
2f2d38f
Add lora flag to configs
KevinMusgrave Feb 15, 2024
0032b48
delete test_model, move into inference.py
KevinMusgrave Feb 15, 2024
bd45e48
plot token histogram
KevinMusgrave Feb 16, 2024
7df117a
plot for num tokens before response
KevinMusgrave Feb 16, 2024
ed845ce
right-side padding and truncation. max_length 2048. Plot token histog…
KevinMusgrave Feb 16, 2024
7696c20
profiling
aciborowska Feb 16, 2024
f700eda
scrap->validate_tokenizer. Remove profiler stuff. Remove unnecessary …
KevinMusgrave Feb 21, 2024
e9329d5
Separate folder for new blog post
KevinMusgrave Feb 21, 2024
70a8d06
Updated readme
KevinMusgrave Feb 21, 2024
1d0c924
Manually set title
KevinMusgrave Feb 22, 2024
d4656da
Apply suggestions from code review
KevinMusgrave Feb 26, 2024
31c8dfb
Rename distributed.yaml -> lora.yaml. Remove image field in both configs
KevinMusgrave Feb 26, 2024
353939e
Use pytorch 2 image
KevinMusgrave Feb 26, 2024
562ef91
readme: distributed.yaml -> lora.yaml
KevinMusgrave Feb 26, 2024
e87ba2b
inference works with lora too
KevinMusgrave Feb 26, 2024
8919b58
minor changes to the readmes
KevinMusgrave Feb 26, 2024
d411b1f
Updated docker image
KevinMusgrave Feb 26, 2024
6757ee6
Mention the --lora flag
KevinMusgrave Feb 26, 2024
55166fa
Add --device argument to the inference script
KevinMusgrave Feb 26, 2024
955c0e0
Minor change to the readme
KevinMusgrave Feb 26, 2024
1 change: 1 addition & 0 deletions README.md
@@ -12,6 +12,7 @@ This repository contains a variety of Determined examples that are not actively
| Example | Description |
|:---------------------------------------:|:----------------------------------------------------------------------------:|
| [LLM Finetuning](blog/llm-finetuning) | Finetuning the TinyLlama-1.1B Model on Text-to-SQL. |
| [LLM Finetuning 2](blog/llm-finetuning-2) | Finetuning the Mistral-7B Model on Text-to-SQL using LoRA and DeepSpeed. |
| [Python SDK demo](blog/python_sdk_demo) | Example usage of the Determined Python SDK to run and administer experiments. |

## Computer Vision
2 changes: 2 additions & 0 deletions blog/llm-finetuning-2/.detignore
@@ -0,0 +1,2 @@
text-to-sql*
checkpoints
5 changes: 5 additions & 0 deletions blog/llm-finetuning-2/.gitignore
@@ -0,0 +1,5 @@
__pycache__
.DS_STORE
text-to-sql*
checkpoints
*.png
60 changes: 60 additions & 0 deletions blog/llm-finetuning-2/README.md
@@ -0,0 +1,60 @@
# Finetuning Mistral-7B using LoRA and DeepSpeed

In this demo, we finetune the [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using [LoRA](https://arxiv.org/abs/2106.09685) and [DeepSpeed](https://github.com/microsoft/DeepSpeed). We ran the LoRA configuration on two 80 GB A100 GPUs, and the DeepSpeed configuration on two, four, and eight 80 GB A100 GPUs.

To get started, first install Determined on your local machine:
```bash
pip install determined
```

Finetune with LoRA:
```bash
det e create distributed.yaml .
```

Finetune with DeepSpeed:
```bash
det e create deepspeed.yaml .
```

## Configuration

Change configuration options in `distributed.yaml`. Some important options are:
- `slots_per_trial`: the number of GPUs to use.
- `dataset_subset`: the difficulty subset to train on.
- `per_device_train_batch_size`: the batch size per GPU.
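For example, to train the easy subset on four GPUs with a larger per-GPU batch, the relevant fields in `distributed.yaml` would look like this (a minimal sketch; the values are illustrative, not from this diff):

```yaml
resources:
  slots_per_trial: 4        # number of GPUs
hyperparameters:
  dataset_subset: "easy"    # one of "easy", "medium", "hard"
  training_args:
    per_device_train_batch_size: 4
```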


DeepSpeed configuration options are in the `ds_configs` folder.

## Testing

Test your model's generation capabilities:

```bash
python test_model.py --exp_id <exp_id> --dataset_subset <dataset_subset>
```

where:
- `<exp_id>` is the ID of your finetuning experiment in the Determined UI.
- `<dataset_subset>` is one of "easy", "medium", or "hard".

To test the pretrained model (not finetuned), leave out `--exp_id`. For example:

```bash
python test_model.py --dataset_subset easy
```

## Validating the tokenizer

Plot the distribution of dataset sample lengths, and see how many samples will be truncated by the tokenizer:

```bash
python validate_tokenizer.py
```
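For reference, here is a hypothetical sketch of the kind of check `validate_tokenizer.py` performs; the actual script is not shown in this diff, and the model name and 2048-token cutoff are assumptions taken from the configs and commit messages:

```python
# Hypothetical sketch -- the real validate_tokenizer.py is not part of this diff.
from transformers import AutoTokenizer

from dataset_utils import load_or_create_dataset

MAX_LENGTH = 2048  # assumed truncation cutoff, per commit ed845ce

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
dataset = load_or_create_dataset("easy")

# Count samples whose tokenized length exceeds the truncation cutoff.
lengths = [
    len(tokenizer(x["instruction"] + x["input"] + x["response"])["input_ids"])
    for x in dataset["train"]
]
print(f"{sum(l > MAX_LENGTH for l in lengths)} of {len(lengths)} samples would be truncated")
```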


## Contributors

- [Kevin Musgrave](https://github.com/KevinMusgrave)
- [Agnieszka Ciborowska](https://github.com/aciborowska)
67 changes: 67 additions & 0 deletions blog/llm-finetuning-2/chat_format.py
@@ -0,0 +1,67 @@
CHAT_ML_TEMPLATE = """
{% for message in messages %}
{% if message['role'] == 'user' %}
{{'<|im_start|>user\n' + message['content'].strip() + '<|im_end|>' }}
{% elif message['role'] == 'system' %}
{{'<|im_start|>system\n' + message['content'].strip() + '<|im_end|>' }}
{% elif message['role'] == 'assistant' %}
{{'<|im_start|>assistant\n' + message['content'] + '<|im_end|>' }}
{% endif %}
{% endfor %}
"""


CHAT_ML_EOS_TOKEN = "<|im_end|>"


def get_chat_format(element, model_name, with_assistant_response=True):
system_prompt = (
"You are a helpful programmer assistant that excels at SQL. "
"When prompted with a task and a definition of an SQL table, you "
"respond with a SQL query to retrieve information from the table. "
"Don't explain your reasoning, only provide the SQL query."
)

user_prompt = "Task: {instruction}\nSQL table: {input}\nSQL query: "

if model_name == "mistralai/Mistral-7B-Instruct-v0.2":
user_prompt = f"{system_prompt}\n{user_prompt}"
output = [
{"role": "user", "content": user_prompt.format_map(element)},
]
else:
output = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt.format_map(element)},
]

if with_assistant_response:
output.append({"role": "assistant", "content": element["response"]})

return output


def set_special_tokens(tokenizer, model_name):
if model_name == "TinyLlama/TinyLlama-1.1B-Chat-v0.4":
tokenizer.chat_template = CHAT_ML_TEMPLATE
tokenizer.eos_token = CHAT_ML_EOS_TOKEN
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id


def get_assistant_prompt(model_name):
if model_name == "TinyLlama/TinyLlama-1.1B-Chat-v0.4":
return "<|im_start|>assistant\n"
else:
return "[/INST]"


def get_response_template_ids(tokenizer, model_name):
return tokenizer.encode(get_assistant_prompt(model_name), add_special_tokens=False)


def maybe_add_generation_prompt(x, model_name):
if model_name == "TinyLlama/TinyLlama-1.1B-Chat-v0.4":
return x + get_assistant_prompt(model_name)
else:
return x
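For context, a minimal usage sketch (not part of this diff) showing how these helpers might fit together with a Hugging Face tokenizer; the `transformers` usage and the sample element are assumptions:

```python
# Hypothetical usage sketch -- not part of this PR's diff.
from transformers import AutoTokenizer

from chat_format import get_chat_format, maybe_add_generation_prompt, set_special_tokens

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
set_special_tokens(tokenizer, model_name)

element = {
    "instruction": "Count the users.",
    "input": "CREATE TABLE users (id INT)",
}

# Build the prompt without the assistant turn, render it with the model's
# chat template, then append the generation prompt where the model expects it.
messages = get_chat_format(element, model_name, with_assistant_response=False)
text = tokenizer.apply_chat_template(messages, tokenize=False)
text = maybe_add_generation_prompt(text, model_name)
print(text)
```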
69 changes: 69 additions & 0 deletions blog/llm-finetuning-2/dataset_utils.py
@@ -0,0 +1,69 @@
import datasets
import pandas as pd


def add_length_column(dataset) -> pd.DataFrame:
df = dataset.to_pandas()
df["total_length"] = 0
for column_name in ["instruction", "input", "response"]:
num_words = df[column_name].astype(str).str.split().apply(len)
df["total_length"] += num_words

return df


def filter_by_total_length(df, difficulty, number_of_samples):
if difficulty == "easy":
return df[df["total_length"].between(10, 100)].iloc[:number_of_samples]
elif difficulty == "medium":
return df[df["total_length"].between(101, 200)].iloc[:number_of_samples]
elif difficulty == "hard":
return df[df["total_length"].between(201, 800)].iloc[:number_of_samples]


def get_dataset_subset_name(difficulty: str) -> str:
return f"text-to-sql-v1-{difficulty}"


def create_and_save_datasets(
df, difficulty, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1
):
seed = 123
# remove total_length column because we don't need it anymore
df = df.drop(columns=["total_length"])
dataset = datasets.Dataset.from_pandas(df, preserve_index=False)

# split into training and "the rest"
train_valtest = dataset.train_test_split(train_size=train_ratio, seed=seed)

# split "the rest" into validation and testing
val_test = train_valtest["test"].train_test_split(
test_size=test_ratio / (test_ratio + val_ratio), seed=seed
)

dataset = datasets.DatasetDict(
{
"train": train_valtest["train"],
"valid": val_test["train"],
"test": val_test["test"],
}
)
dataset_name = get_dataset_subset_name(difficulty)
dataset.save_to_disk(dataset_name)
return dataset


def load_dataset(difficulty):
return datasets.load_from_disk(get_dataset_subset_name(difficulty))


def load_or_create_dataset(difficulty, num_samples=10000):
try:
return load_dataset(difficulty)
except FileNotFoundError:
dataset = datasets.load_dataset("Clinton/Text-to-sql-v1")
dataset = dataset["train"]
dataset = dataset.remove_columns(["text", "source"])
df = add_length_column(dataset)
df = filter_by_total_length(df, difficulty, num_samples)
return create_and_save_datasets(df, difficulty)
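A brief usage note (assumed, not in the diff): the first call to `load_or_create_dataset` downloads and filters `Clinton/Text-to-sql-v1`, saves the splits to disk, and subsequent calls reload them:

```python
# Hypothetical usage sketch -- not part of this PR's diff.
from dataset_utils import load_or_create_dataset

dataset = load_or_create_dataset("easy")  # creates text-to-sql-v1-easy on first run
print(dataset)               # DatasetDict with "train", "valid", and "test" splits
print(dataset["train"][0])   # one example with instruction/input/response fields
```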
36 changes: 36 additions & 0 deletions blog/llm-finetuning-2/deepspeed.yaml
@@ -0,0 +1,36 @@
name: mistral deepspeed easy
debug: false
environment:
environment_variables:
- NCCL_DEBUG=INFO
image: determinedai/genai-train:latest
resources:
slots_per_trial: 2
searcher:
name: single
max_length:
batches: 5000
metric: eval_accuracy
smaller_is_better: false
hyperparameters:
model: "mistralai/Mistral-7B-Instruct-v0.2"
dataset_subset: "easy"
lora: false
training_args:
output_dir: "/tmp/llm_finetuning"
max_steps: 5000
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
bf16: true
evaluation_strategy: "steps"
eval_steps: 1000
logging_strategy: "steps"
logging_steps: 100
save_strategy: "steps"
save_steps: 5000
learning_rate: 1e-5
deepspeed: "ds_configs/ds_config_stage_3.json"
entrypoint: >-
python -m determined.launch.deepspeed
python finetune.py
max_restarts: 0
35 changes: 35 additions & 0 deletions blog/llm-finetuning-2/distributed.yaml
@@ -0,0 +1,35 @@
name: mistral lora easy
debug: false
environment:
environment_variables:
- NCCL_DEBUG=INFO
image: determinedai/environments-dev:python-3.10-pytorch-2.0-deepspeed-0.10.0-smartsim
resources:
slots_per_trial: 2
searcher:
name: single
max_length:
batches: 5000
metric: eval_accuracy
smaller_is_better: false
hyperparameters:
model: "mistralai/Mistral-7B-Instruct-v0.2"
dataset_subset: "easy"
lora: true
training_args:
output_dir: "/tmp/llm_finetuning"
max_steps: 5000
per_device_train_batch_size: 8
per_device_eval_batch_size: 4
bf16: true
evaluation_strategy: "steps"
eval_steps: 1000
logging_strategy: "steps"
logging_steps: 100
save_strategy: "steps"
save_steps: 1000
learning_rate: 1e-5
entrypoint: >-
python -m determined.launch.torch_distributed
python finetune.py
max_restarts: 0
48 changes: 48 additions & 0 deletions blog/llm-finetuning-2/ds_configs/ds_config_stage_1.json
@@ -0,0 +1,48 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
48 changes: 48 additions & 0 deletions blog/llm-finetuning-2/ds_configs/ds_config_stage_2.json
@@ -0,0 +1,48 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
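For context, a minimal sketch (not part of this diff) of how a config file like these is typically wired into a Hugging Face `Trainer`: the `deepspeed` training argument points at the JSON file, and the Trainer's DeepSpeed integration resolves the `"auto"` entries (learning rate, batch sizes, warmup) from the other training arguments. The exact wiring in `finetune.py` is not shown here, so treat this as an assumption:

```python
# Hypothetical sketch -- finetune.py is not shown in this diff.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/llm_finetuning",
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    bf16=True,
    deepspeed="ds_configs/ds_config_stage_1.json",  # swap in the stage-2/3 file as needed
)
```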