New Notebook to Fine-tune with OpenAI

Get the most out of Gen AI with fine-tuning

It was brought to my attention that Chapter 4 of my latest book uses a dataset that Amazon has since revoked from HuggingFace (always keeping me on my toes). Because of this, I rewrote the notebook on GitHub to update the example with a working dataset and, at the same time, updated the code to use OpenAI’s latest fine-tuning API. I figured I would share some of the takeaways of the case study here.

Our data is App Reviews from HuggingFace (original GitHub here). The dataset is 288,065 reviews extracted from the Google Play Store. I split the data into training, validation, and testing sets. I used training and validation to fine-tune on OpenAI and held out testing to compare the final 4 models.
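If you want to follow along, here’s a minimal sketch of that split. The `app_reviews` dataset id should be the right one on HuggingFace, but the 80/10/10 proportions and seed are my stand-ins for the notebook’s exact setup:

```python
# A sketch of loading the data and carving out train/validation/test splits.
# The 80/10/10 proportions and seed here are assumptions, not the notebook's.
from datasets import load_dataset

dataset = load_dataset("app_reviews", split="train")  # 288,065 Google Play reviews

# Hold out 20%, then split that holdout in half for validation vs. testing
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_data = split["train"]    # used for fine-tuning
val_data = holdout["train"]    # used for validation during fine-tuning
test_data = holdout["test"]    # held out to compare the final 4 models
```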

Our model options are (a sketch of launching these jobs via the API comes after the list):

  1. Babbage trained for 1 epoch (3B model)

  2. Babbage trained for 4 epochs (3B model)

  3. GPT 3.5 trained for 1 epoch and no system prompt (175B model)

  4. GPT 3.5 trained for 1 epoch with a system prompt (175B model)
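For reference, kicking off those jobs with OpenAI’s current fine-tuning API looks roughly like this. The file IDs are placeholders, and I’m assuming the training data was already uploaded (completion-style JSONL for Babbage, chat-style JSONL for GPT 3.5); the notebook has the full version:

```python
# A sketch of launching the four fine-tuning jobs (file IDs are placeholders).
from openai import OpenAI

client = OpenAI()

# Babbage: same data, 1 epoch vs. 4 epochs
for n_epochs in (1, 4):
    client.fine_tuning.jobs.create(
        model="babbage-002",
        training_file="file-TRAIN",       # completion-formatted JSONL
        validation_file="file-VAL",
        hyperparameters={"n_epochs": n_epochs},
    )

# GPT 3.5: with vs. without a system prompt, the job call is identical;
# the difference lives in the training JSONL (whether each example's
# messages list includes a system message)
client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file="file-TRAIN-CHAT",      # chat-formatted JSONL
    validation_file="file-VAL-CHAT",
    hyperparameters={"n_epochs": 1},
)
```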

1. Cost-project early and cost-project often

If this model is going to be used a lot, make sure to keep an eye on OpenAI’s pricing page to estimate how much money you’re about to spend on fine-tuning. Here is a breakdown of how much it cost me to train and run evaluation on all four models on the training dataset. Obviously 3.5 was much more expensive, but were the performance gains worth it?!
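A rough way to cost-project up front: count the tokens you’re about to train on with tiktoken and multiply by the posted per-token training price. A sketch with a made-up price (again, check the pricing page for real numbers):

```python
# Rough cost projection: training tokens x price x epochs.
# PRICE_PER_1K_TOKENS is a made-up placeholder -- check OpenAI's pricing page.
import tiktoken

PRICE_PER_1K_TOKENS = 0.008   # hypothetical USD per 1K training tokens
N_EPOCHS = 4

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def estimate_training_cost(examples: list[str]) -> float:
    total_tokens = sum(len(enc.encode(text)) for text in examples)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS * N_EPOCHS

print(estimate_training_cost(["This app is great!\n\n###\n\n4"]))
```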

Our performance increase from GPT 3.5 comes at a cost - literally. Fine-tuning GPT 3.5 was up to 75x more expensive than fine-tuning Babbage and inference with GPT 3.5 was up to 26x more expensive than Babbage! Worth it? Ehhh..

You could consider writing a batch prompt, which is exactly what it sounds like: a prompt that predicts ratings for multiple reviews at a time.
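A batch prompt could look something like this (the formatting and instructions here are purely illustrative):

```python
# A sketch of a batch prompt: several reviews rated in one completion call
# instead of one API call per review.
reviews = [
    "Crashes every time I open it.",
    "Love the new update, super fast!",
    "It's fine, nothing special.",
]

numbered = "\n".join(f"{i + 1}. {review}" for i, review in enumerate(reviews))
batch_prompt = (
    "Rate each app review from 0 (worst) to 4 (best). "
    "Reply with one rating per line, in order.\n\n" + numbered
)
```

One call amortizes the instruction overhead across several reviews; the trade-off is that you now have to parse multiple answers, and one malformed response can spoil the whole batch.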

2. Consider simplifying the task

Btw, the answer to the question “was the extra money to fine-tune GPT 3.5 worth it?” is NO. In testing accuracy, GPT 3.5 was only about 3% better than Babbage. For being about 60x bigger, GPT 3.5 can sometimes just not be worth the money.

This is true even if you consider simplifying the task and defining new metrics. For example, we have raw accuracy (simply # correct / # items), but we could also treat this as a binary classifier by collapsing the classes into “Good” (4 or 5 stars) or “Bad” (1, 2, or 3 stars). Of course, you can do whatever you want. You could also use “one-off accuracy,” where if the model predicts “3” and the answer was “2” or “4,” it still counts as right. It’s all up to you what matters 🙂. On these 3 metrics, GPT 3.5 still only does up to 3% better than Babbage.
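Here’s a quick sketch of those three metrics, assuming the model’s 0–4 labels map onto 1–5 stars (so “Good” = labels 3 and 4):

```python
# Three accuracy flavors over 0-4 labels (assuming label = stars - 1).
def raw_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_accuracy(y_true, y_pred, good_threshold=3):
    # Collapse to "Good" (>= threshold) vs. "Bad" before comparing
    return sum(
        (t >= good_threshold) == (p >= good_threshold)
        for t, p in zip(y_true, y_pred)
    ) / len(y_true)

def one_off_accuracy(y_true, y_pred):
    # A prediction within 1 of the true label counts as correct
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)
```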

Fine-tuning GPT 3.5 (ChatGPT) performs a bit better than the much smaller Babbage models, even on the simplified tasks, but is it worth the extra $$$?

3. Generative models generating nonsense

If you let a model blabber on, it will eventually say something unhelpful. In our fine-tuned GPT 3.5 model with no system prompt, even with a temperature of 0.1 (to make the outputs more deterministic), I saw some instances of the model not predicting 0, 1, 2, 3, or 4. It seems like the system prompt helps prevent this, and Babbage doesn’t need to be told this as much 🙂. It’s annoying, but hey, generative models gonna generate.

Only our fine-tuned 3.5 model with no system prompt sometimes generated predictions outside the 0–4 range on our testing set (even with our temperature turned down low). Both Babbage models and GPT 3.5 with a system prompt never did this.
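The simple guard is to validate every completion before trusting it, and retry (or count as wrong) anything outside the label set. A minimal sketch:

```python
# Validate a completion against the 0-4 label set before using it.
VALID_LABELS = {"0", "1", "2", "3", "4"}

def parse_rating(completion: str) -> int | None:
    answer = completion.strip()
    return int(answer) if answer in VALID_LABELS else None

print(parse_rating(" 3 "))         # 3
print(parse_rating("I'd say 5!"))  # None -> retry or score as incorrect
```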

Until next time!

I have more takeaways than that but I’ll leave it there for now. If you want to see more, check out the notebook. Happy coding!