This repository allows you to generate a dataset of well-documented Swift code. I created it in an attempt to improve DoccGPT by fine-tuning one of OpenAI's base models.
I am sad to report that it doesn't look like it is immediately worth it. It might be after significantly more fine-tuning (and significantly more dollars), but even then, given the context windows of the soon-to-land GPT-4 models, there is likely no point in paying so much for a good base model that still has a context window of only 2048 tokens. Hopefully we can fine-tune GPT-4 in the future, or perhaps we won't even need to.
Here's an overview of what I did:
- Ran through the directions in the README to generate the data.jsonl file. This results in a little over 200 viable prompt/completion pairs, which is the minimum number of examples that OpenAI suggests for fine-tuning.
- Then, I fine-tuned ada for $0.40. It was able to put comments in the right place for a simple enum, but the comments failed to describe the code well.
- Lastly, I fine-tuned davinci for about $30. It left better comments, also in the right places, but at the end of the day the fine-tuned model's performance is still not even remotely close to what I was seeing with the more advanced out-of-the-box models. It struggled to document all of the fields in a simple User struct with 4 properties and a function.
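To check whether a generated data.jsonl clears the roughly 200-example mark mentioned above, something like the following could be used. This is only a sketch: the helper name `count_viable_pairs` and the `MIN_EXAMPLES` constant are hypothetical, and it assumes each line of the file is a JSON object with `prompt` and `completion` keys.

```python
import json

# Threshold taken from the overview above; this constant name is hypothetical.
MIN_EXAMPLES = 200


def count_viable_pairs(jsonl_path):
    """Count lines that parse as JSON objects with a non-empty prompt and completion."""
    count = 0
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore lines that are not valid JSON
            if (
                isinstance(record, dict)
                and record.get("prompt")
                and record.get("completion")
            ):
                count += 1
    return count
```

Comparing `count_viable_pairs("data.jsonl")` against `MIN_EXAMPLES` before uploading is a cheap way to avoid paying for a run on too little data.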
- Clone the repository and its submodules:
git clone --recurse-submodules git@github.com:gonzalonunez/docc-gpt-data.git
- If you have already cloned it, you can also update the submodules:
git submodule update --init --recursive
- Ensure that you have yonaskolb/Mint installed, and use it to install ross.
mint install gonzalonunez/ross
- Ensure that you have apple/swift-format installed. We use this in the next step in order to format files after cleaning them up.
brew install swift-format
- Run python generate.py to generate prompt/completion pairs in /files. This will generate hundreds of folders based on the .swift files in the repository, which come from the repository's submodules. Each folder contains a Prompt.swift file and a Completion.swift file, which are both modified copies of an original source file. Prompt.swift is identical to Completion.swift, except that ross has removed all of its DocC comments.
- Run python data.py. This takes all of the prompt/completion pairs in /files and formats them so that they can later be used to fine-tune an OpenAI model. The formatted data is saved into a JSON Lines file named data.jsonl. You can now delete the /files directory if you'd like to, but I find that it's nice to inspect it manually before spending the time and money to fine-tune a model.
- This data.jsonl file is what you will pass over to OpenAI for fine-tuning. Follow OpenAI's fine-tuning instructions to fine-tune your model. OpenAI's data preparation tool will take care of removing examples that are too long.
OPENAI_API_KEY=<YOUR_KEY> openai tools fine_tunes.prepare_data -f data.jsonl
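As a rough illustration of what the data.py step described above amounts to, here is a minimal sketch that turns the folders in /files into prompt/completion records, one JSON object per line. It is not the repository's actual data.py: the function name `write_jsonl` and its arguments are hypothetical, and it assumes every usable folder contains both a Prompt.swift and a Completion.swift file.

```python
import json
from pathlib import Path


def write_jsonl(files_dir, output_path):
    """Write one {"prompt", "completion"} JSON object per folder; return the count."""
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for folder in sorted(Path(files_dir).iterdir()):
            if not folder.is_dir():
                continue  # skip stray files next to the pair folders
            prompt_file = folder / "Prompt.swift"
            completion_file = folder / "Completion.swift"
            if not (prompt_file.is_file() and completion_file.is_file()):
                continue  # skip folders that do not hold a complete pair
            record = {
                "prompt": prompt_file.read_text(encoding="utf-8"),
                "completion": completion_file.read_text(encoding="utf-8"),
            }
            # json.dumps escapes newlines, so each record stays on a single line.
            out.write(json.dumps(record) + "\n")
            written += 1
    return written
```

Skipping incomplete folders rather than raising keeps the sketch forgiving. A stricter script might prefer to fail loudly so that missing files are noticed before any money is spent.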
After using OpenAI's CLI preparation tool, you should see the following message:
After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string \n\n###\n\n for the model to start generating completions, rather than continuing with the prompt. Make sure to include stop=[" <END>"] so that the generated texts ends at the expected place.
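The message above boils down to two things at inference time: append the indicator string to the prompt, and pass the stop sequence with the request. A minimal sketch of that, assuming the legacy completions request parameters (`model`, `prompt`, `stop`, `max_tokens`), might look like this. The helper names, the placeholder model name, and the `max_tokens` default are all hypothetical, and no API call is made here.

```python
SEPARATOR = "\n\n###\n\n"  # indicator string the prompt has to end with
STOP = " <END>"            # stop sequence marking the end of a completion


def build_request(swift_source, model, max_tokens=1024):
    """Return keyword arguments for a completions request (nothing is sent)."""
    return {
        "model": model,
        "prompt": swift_source + SEPARATOR,
        "stop": [STOP],
        # Arbitrary default; prompt and completion must fit the context window.
        "max_tokens": max_tokens,
    }


def trim_at_stop(generated_text):
    """Cut generated text at the stop sequence, in case it was returned anyway."""
    index = generated_text.find(STOP)
    return generated_text if index == -1 else generated_text[:index]


# Example usage with a placeholder model name.
request = build_request(
    "enum Direction { case north, south }",
    model="<YOUR_FINE_TUNED_MODEL>",
)
```

Keeping the separator and stop sequence in named constants makes it harder for the inference code to drift from what the training data used.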