Harnessing LLM Alignment

Making AI More Accessible

Hey everyone! I’m giving an alignment workshop next week at ODSC, and they had me write a blog post to introduce the work we’ll be doing. I wanted to share this intro with you all as well!

Back in 2020, the world was introduced to OpenAI’s GPT-3, a marvel in the AI domain to many. However, it wasn’t until two years later, in 2022, when OpenAI unveiled its instruction-aligned version of GPT-3, aptly named “InstructGPT,” that its full potential came into the spotlight, and the world started really paying attention. That innovation wasn’t just a technological leap for AI alignment; it was a demonstration of the power of reinforcement learning to make AI more accessible to everyone.

Aligning Our Expectations

Alignment, broadly defined, is the process of making an AI system that behaves in accordance with what a human wants. Alignment isn’t just about training AI to follow instructions; it’s about designing a system to sculpt an already powerful AI model into something more usable and beneficial to both technically inclined users and to someone who just needs help planning a birthday party. It’s this very aspect of alignment that has democratized the magic of Large Language Models (LLMs), enabling a broader audience to extract value from them.

If alignment is the heart of LLMs’ usability, what keeps this heart pumping? That’s where the intricate dance of Reinforcement Learning (RL) comes into play. While the term ‘alignment’ might be synonymous with reinforcement learning for some, there’s a lot more under the hood. Capturing the multifaceted dimensions of human emotions, ethics, or humor within the confines of next-token prediction is a colossal – and potentially impossible – task. How do you effectively program ‘neutrality’ or ‘ethical behavior’ into a loss function? Arguably, you can’t. It’s here that RL rises as a dynamic way to model these intricate nuances without strictly encoding them.

RLHF, which stands for Reinforcement Learning from Human Feedback, is the technique OpenAI originally used to align their InstructGPT model. It is frequently discussed among AI enthusiasts as the main way to align LLMs, but it’s merely one tool among many for alignment. The core principle of RLHF revolves around collecting high-quality human feedback and using it to score LLMs on their task performance, in the hopes of having the AI speak in a more user-friendly manner by the end of the loop.

In our own day-to-day work with LLMs, however, we often don’t need the AI to answer everything; we need it to solve the tasks relevant to us, our businesses, or our projects. In our journey with RL, we’ll explore alternatives to RLHF that use other feedback mechanisms that do not rely on human preferences.

Case Study – Aligning FLAN-T5 to make more neutral summaries

Let’s look at an example of using two classifiers from Hugging Face to enhance the FLAN-T5 model’s ability to write summaries of news articles that are both grammatically polished and consistently neutral in style.

The code below defines one such reward signal. It uses a pre-fine-tuned sentiment classifier to obtain the logit for the neutral class, rewarding FLAN-T5 for speaking in a neutral tone and punishing it otherwise:

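from transformers import pipeline

# Sentiment model named in the reward list later in this post;
# its labels are LABEL_0 (negative), LABEL_1 (neutral), LABEL_2 (positive)
sentiment_pipeline = pipeline(
  'text-classification',
  model='cardiffnlp/twitter-roberta-base-sentiment'
)

def get_neutral_scores(texts):
  scores = []
  # function_to_apply='none' returns logits which can be negative
  results = sentiment_pipeline(texts, function_to_apply='none', top_k=None)
  for result in results:
    for label in result:
      if label['label'] == 'LABEL_1': # logit for neutral class
        scores.append(label['score'])
  # return only after every text has been scored
  return scores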

>> get_neutral_scores(['hello', 'I love you!', 'I hate you']) 

>> [0.85, -0.75, -0.57]
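
To see why these scores can be negative, here is a small sketch in plain Python (no model download) that compares raw logits with the softmax probabilities a classifier would otherwise report. The logit values are made up for illustration and are not output from the real sentiment model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (negative, neutral, positive); illustrative values only
example_logits = [-1.2, -0.75, 2.1]
probs = softmax(example_logits)

# A negative neutral logit acts as a punishment when used as a reward,
# while the matching probability is always squeezed into (0, 1)
neutral_logit, neutral_prob = example_logits[1], probs[1]
print(f"neutral logit: {neutral_logit:.2f}, neutral probability: {neutral_prob:.3f}")
```

Using the logit rather than the probability gives the reward a sign, so clearly non-neutral text is pushed away from instead of merely being rewarded less.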

We can use this classifier, along with a second one that scores a piece of text’s grammatical correctness, to align our FLAN-T5 model to generate summaries the way we want them.
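
The grammar reward can follow the same pattern as the neutrality one. Below is a minimal sketch, not the workshop code: the `classifier` argument, the `fake_cola_classifier` stand-in, and its scores are hypothetical so the example runs offline, and treating `LABEL_1` as the "grammatically acceptable" class is an assumption worth checking against the CoLA model card.

```python
def get_cola_scores(texts, classifier):
    """Return one grammaticality logit per text.

    `classifier` is any callable that mimics a Hugging Face text-classification
    pipeline called with top_k=None: for each text it returns a list of
    {'label': ..., 'score': ...} dicts.
    """
    scores = []
    results = classifier(texts, function_to_apply='none', top_k=None)
    for result in results:
        for label in result:
            # Assumption: LABEL_1 is the "grammatically acceptable" class
            if label['label'] == 'LABEL_1':
                scores.append(label['score'])
    return scores


def fake_cola_classifier(texts, **kwargs):
    """Offline stand-in for the real pipeline; the scores are invented."""
    acceptable = [{'label': 'LABEL_0', 'score': -1.0}, {'label': 'LABEL_1', 'score': 1.0}]
    unacceptable = [{'label': 'LABEL_0', 'score': 1.0}, {'label': 'LABEL_1', 'score': -1.0}]
    # Toy rule: a text ending in a period counts as well formed
    return [acceptable if text.endswith('.') else unacceptable for text in texts]


cola_demo = get_cola_scores(['This is a sentence.', 'sentence this is a'], fake_cola_classifier)
print(cola_demo)  # [1.0, -1.0]
```

With the real model, the stand-in would be replaced by a `transformers` pipeline loaded from the CoLA checkpoint used for the grammar reward.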

The Reinforcement Learning from Feedback loop looks something like this:

  1. Give FLAN-T5 a batch of news articles to summarize (taken from https://huggingface.co/datasets/argilla/news-summary only using the raw articles)

  2. Assign a weighted sum of rewards from:

    1. A CoLA model (judging grammatical correctness) from textattack/roberta-base-CoLA

    2. A sentiment model (judging neutrality) from cardiffnlp/twitter-roberta-base-sentiment

  3. Use the rewards to update the FLAN-T5 model using the TRL package, taking into consideration how far the updated model has deviated from the original parameters
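
Steps 2 and 3 above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not TRL’s implementation: the function names, the weights (1.0 and 0.5), and the `kl_coef` value are all hypothetical, and the drift penalty is a basic log-probability difference summed over the tokens of one sampled summary.

```python
def combine_rewards(cola_scores, neutral_scores, cola_weight=1.0, neutral_weight=0.5):
    """Weighted sum of the two classifier scores, one reward per summary."""
    return [
        cola_weight * cola + neutral_weight * neutral
        for cola, neutral in zip(cola_scores, neutral_scores)
    ]


def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, kl_coef=0.2):
    """Subtract a penalty that grows as the updated model drifts from the original.

    Summing (policy log-prob - reference log-prob) over the sampled tokens is a
    simple estimate of the KL divergence between the two models on that summary.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - kl_coef * kl_estimate


# Illustrative scores for two summaries (made-up numbers)
rewards = combine_rewards(cola_scores=[2.0, -1.0], neutral_scores=[1.0, 0.5])
print(rewards)  # [2.5, -0.75]

# Same token log-probs as the original model: no penalty is applied
same = kl_penalized_reward(2.5, [-1.0, -2.0], [-1.0, -2.0])
# Updated model is more confident than the original on these tokens: reward shrinks
drifted = kl_penalized_reward(2.5, [-0.5, -1.0], [-1.0, -2.0])
print(same, drifted)
```

The penalty is what keeps the model from chasing classifier scores at the expense of its original summarization behavior.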

Here is a sample of the training loop we will build at the workshop I’m giving next week:

for epoch in tqdm(range(2)):
  for batch in tqdm(ppo_trainer.dataloader):
    game_data = dict()

    #### prepend the summarize token
    game_data["query"] = ['summarize: ' + b for b in batch["text"]]

    #### get response from reference + current flan-t5
    input_tensors = [_.squeeze() for _ in batch["input_ids"]]
    # ....
    response_tensors = []
    for query in input_tensors:
      response = ppo_trainer.generate(query.squeeze(), **generation_kwargs)
      response_tensors.append(response.squeeze())

    #### Reward system
    # the decode arguments and the loop over responses are elided in this excerpt
    game_data["response"] = [flan_t5_tokenizer.decode(...)]
    game_data['cola_scores'] = get_cola_scores(...)  # arguments truncated in this excerpt
    game_data['neutral_scores'] = get_neutral_scores(...)  # arguments truncated in this excerpt
    # .... (omitted: combine the two scores into `rewards`, one tensor per summary)

    #### Run PPO training and log stats
    stats = ppo_trainer.step(input_tensors, response_tensors, rewards)
    stats['env/reward'] = np.mean([r.cpu().numpy() for r in rewards])
    ppo_trainer.log_stats(stats, game_data, rewards)

I omitted several lines of this loop to save space, but you can of course come to my workshop to see the loop in its entirety!

The Results

After a few epochs of training, our FLAN-T5 starts to show signs of enhanced alignment towards our goal of more grammatically correct and neutral summaries. Here’s a sample of what the different summaries look like using the validation data from the dataset:

A sample of FLAN-T5 before and after RL. We can see the RL fine-tuned version of the model is using words like “announced” over terms like “scrapped”.

Running both our models (the unaligned base FLAN-T5 and our aligned version) over the entire validation set shows an increase (albeit a subtle one) in both rewards from our CoLA model and our sentiment model!
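
A comparison like this can be sketched as averaging each reward over the summaries from both models and looking at the difference. The helper names, the toy scorer, and the example summaries below are hypothetical and exist only so the sketch runs without downloading a model; they are not the workshop’s evaluation code or its results.

```python
from statistics import mean

def average_rewards(summaries, scorers):
    """Average each reward function over a list of summaries.

    `scorers` maps a reward name to a function that takes a list of texts
    and returns one score per text.
    """
    return {name: mean(score_fn(summaries)) for name, score_fn in scorers.items()}


def compare_models(base_summaries, aligned_summaries, scorers):
    """Return the aligned-minus-base average reward for each scorer."""
    base = average_rewards(base_summaries, scorers)
    aligned = average_rewards(aligned_summaries, scorers)
    return {name: aligned[name] - base[name] for name in scorers}


def toy_neutral_scorer(texts):
    """Toy stand-in for a real classifier: penalizes one loaded word."""
    return [-1.0 if 'scrapped' in text else 1.0 for text in texts]


base_outputs = ['The plan was scrapped.', 'The mayor spoke.']
aligned_outputs = ['The change was announced.', 'The mayor spoke.']
reward_gain = compare_models(base_outputs, aligned_outputs, {'neutral': toy_neutral_scorer})
print(reward_gain)  # {'neutral': 1.0}
```

A positive value means the aligned model earned a higher average reward than the base model on that scorer.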

The model is garnering increased rewards from our system, and upon inspection, there’s a nuanced shift in its summary generation. However, its core summarization abilities remain largely consistent with the base model.


Alignment isn’t just about the tools or methodologies of collecting data and making LLMs answer any and all questions. It’s also about understanding what we actually want from our LLMs. The goal of alignment, however, remains unwavering: fashion LLMs whose outputs resonate with human sensibilities, making AI not just a tool for the engineer but a companion for all. Whether you’re an AI enthusiast or someone looking to dip your toes into this world, there’s something here for everyone. Join me at ODSC this year as we traverse the landscape of LLM alignment together!

I will have a GitHub repo for ODSC soon, but until then, you can see the source notebook from my book here: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/7_rl_flan_t5_summaries.ipynb