Evaluating AI Agent Tool Selection

When you have a hammer (in the first position), everything looks like a nail

AI Agents are all the rage, and I get it. The promise of letting an LLM just pick the right tool for the job is very appealing, if somewhat “magical”. But of course, if I’m writing this, it means there’s a lot more to it than meets the eye.

At its most basic, an AI Agent is an auto-regressive LLM (virtually any commercial Generative AI model like GPT, Llama, Mistral, Claude, Command-R, etc.) with a prompt telling the LLM how to reason through tasks by selecting and running tools. Those tools can be APIs, image generation models (like how ChatGPT uses DALL-E to make images), code execution, really anything that can be distilled down into a simple run function.
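To make that concrete, here is a minimal sketch of what “a tool” usually boils down to: a name, a description the LLM reads, and a run function. The class and tool bodies below are illustrative, not the API of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str            # what the LLM reads when deciding which tool to pick
    run: Callable[[str], str]   # anything that can be distilled into a run function

def python_eval(code: str) -> str:
    """Toy tool body: evaluate a short Python expression and return the result."""
    return str(eval(code))

tools = [
    Tool("Python Tool", "Execute short Python snippets.", python_eval),
    Tool("Crypto Lookup Tool", "Check crypto/NFT listing status.", lambda q: "lookup result"),
    Tool("Google Spreadsheet Tool", "Read and write spreadsheet rows.", lambda q: "row added"),
]
```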

Your Basic AI Agent relies on an auto-regressive LLM’s ability to think through a task and select the right tool at the right time.

Agents are useful in theory, but in practice can often fall short. Evaluating agents can be done on several levels including:

  1. Making sure the final answer is accurate and helpful

  2. Ensuring the latency/speed of the system is good enough

  3. Mitigating failures of the LLM to reason through a complex task

One of the more underrated evaluation criteria is quantifying the ability of the LLM to select the right tool at the right time. On its face it’s obvious that we have to measure this, but many people dismiss it as just part of the overall system: if the answer is right at the end, that would imply the agent selected the right tools, right? Well, not always. Perhaps the agent selected the wrong tool twice before fumbling its way into the right one, and that would impact both the latency and the accuracy overall.

Moreover, there are underlying issues with the deep learning architecture that virtually every LLM is based on: the Transformer. While there’s no doubt that the invention of the Transformer was one of the greatest advancements in NLP in the last several decades, there’s one particular type of bias it falls prey to quite often: positional bias.

Depending on where the tools sit in the agent prompt, tools later in the list can end up towards the middle of the prompt, where information can get ignored due to positional bias

Positional bias essentially means the LLM has a tendency to pay more attention to tokens at the start or end of the prompt while glossing over tokens in the middle. You may have heard this called the "lost-in-the-middle" problem. This can be a big deal when it comes to agents, especially if the LLM favors tools that appear earlier in the prompt, glossing over later tools, which often land towards the middle of the overall prompt. As a result, the LLM could pick the wrong tool.
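To see why tool position matters, here is a simplified sketch of how an agent prompt is commonly assembled (not necessarily the exact template my framework uses): the tool descriptions are concatenated between the instructions and the task, so tools later in the list land closer to the middle of the final prompt.

```python
def build_agent_prompt(instructions: str, tools: list, task: str) -> str:
    # Tool descriptions sit between the instructions and the task, so the later
    # a tool appears in the list, the closer it sits to the middle of the prompt.
    tool_block = "\n".join(
        f"{i + 1}. {t.name}: {t.description}" for i, t in enumerate(tools)
    )
    return f"{instructions}\n\nAvailable tools:\n{tool_block}\n\nTask: {task}"
```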

Testing Tool Selection

To properly investigate tool selection in agents, let’s run a simple test. Our test will run in 3 stages:

Stage 1 - Setup

  1. Choose an agent framework to test. I made my own (see references)

  2. Write a test set of fairly simple tasks, each of which (mostly) obviously maps to a single tool. I have 5 tools total (the format is sketched just after this list). Examples include:

    • Check the status of my NFT listings → 'Crypto Lookup Tool'

    • Add a new row and just write "To do" in it → 'Google Spreadsheet Tool'

    • Convert 98 degrees Fahrenheit to Celsius using Python → 'Python Tool'

  3. Define several LLMs to test against. I tested models from OpenAI and Anthropic, a Mistral model, a few Llama models, and a Gemini model
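Here is roughly what those Stage 1 artifacts look like as data. The structure is illustrative, and the model identifiers are placeholders, not necessarily the exact models I tested.

```python
# Each datapoint pairs a simple task with the single tool that should handle it.
test_set = [
    {"task": "Check the status of my NFT listings",
     "expected_tool": "Crypto Lookup Tool"},
    {"task": 'Add a new row and just write "To do" in it',
     "expected_tool": "Google Spreadsheet Tool"},
    {"task": "Convert 98 degrees Fahrenheit to Celsius using Python",
     "expected_tool": "Python Tool"},
]

# Placeholder identifiers; swap in whichever models/providers you want to compare.
llms_to_test = ["an-openai-model", "an-anthropic-model", "a-mistral-model"]
```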

Stage 2 - Run the Agent + Log results

  1. Choose an n (I chose n=10)

  2. For each test datapoint and each LLM, shuffle the tool order and pass it into the agent framework n times (a fresh shuffle each run).

    • For each run, log the correct tool index, the chosen tool index, and whether the agent was correct (see the sketch after this list).
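A sketch of that Stage 2 loop, reusing the test_set, tools, and llms_to_test from above. run_agent() is a hypothetical helper standing in for whatever your framework exposes; assume it returns the index of the tool the agent chose.

```python
import random

n = 10
logs = []
for llm in llms_to_test:
    for datapoint in test_set:
        for _ in range(n):
            # Fresh tool ordering for every run
            shuffled = random.sample(tools, k=len(tools))
            correct_idx = next(
                i for i, t in enumerate(shuffled)
                if t.name == datapoint["expected_tool"]
            )
            # Hypothetical call into your agent framework of choice
            chosen_idx = run_agent(llm, shuffled, datapoint["task"])
            logs.append({
                "llm": llm,
                "correct_idx": correct_idx,
                "chosen_idx": chosen_idx,
                "correct": chosen_idx == correct_idx,
            })
```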

Stage 3 - Calculate Results

  1. Calculate the accuracy, precision, recall, and F1 for each LLM on its tool selection, along with broken-down metrics (see the final results in the GitHub)

  2. Calculate the % difference between how often each tool index was chosen and how often that index was actually correct, to see whether the LLMs favored any particular tool positions (a sketch of both calculations follows this list).
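Both Stage 3 calculations can be done in a few lines over the logs collected above; here is a sketch using pandas. Precision, recall, and F1 follow the same pattern (e.g. sklearn's classification_report on correct_idx vs. chosen_idx).

```python
import pandas as pd

df = pd.DataFrame(logs)

# Per-LLM tool selection accuracy
accuracy_by_llm = df.groupby("llm")["correct"].mean()

# % difference between how often each tool index was chosen and how often it was correct
chosen_counts = df["chosen_idx"].value_counts().sort_index()
correct_counts = df["correct_idx"].value_counts().sort_index()
pct_diff = (chosen_counts - correct_counts) / correct_counts * 100

print(accuracy_by_llm)
print(pct_diff)  # positive values mean that tool position was over-selected
```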

The Results

The notebook with the test & results (see references) has about a dozen graphs in it but here are a few key takeaways:

Tool Selection Accuracy can vary greatly between LLMs

Depending on which LLM I tried, there were pretty stark differences in tool selection accuracy. It’s tempting to look at this and say “Oh ok, so Anthropic’s Claude 3.5 Haiku is clearly the best LLM for agents”. Incorrect 🙂 this is a test of my agent framework, on my tools, and on my data. I hope you will take this post/notebook as a framework to follow when testing your own LLMs!

To no one’s surprise, the choice of LLM impacted overall tool selection accuracy

Positional Bias is Real

The graph below shows the average % difference between how often the agent chose a particular tool index (there are 5 bars because I had 5 tools) and how often that tool index was actually correct. So a 9.51% in the first bar means that, on average, the LLMs chose the first tool in the list 9.51% more often than that index was correct. For example, if that tool index was the correct tool index 95 times during the test, the LLMs actually chose that tool index roughly 104 times. Meanwhile the later tools are under-chosen, showing evidence of a positional bias.
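If you want to sanity-check that arithmetic:

```python
# A position that was correct 95 times and over-selected by 9.51% gets chosen
# roughly 95 * 1.0951 ≈ 104 times.
print(round(95 * (1 + 0.0951)))  # -> 104
```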

On average, the chosen LLMs tended to over-select tools in earlier indexes

You might be thinking that it was the smaller open-source models that really skewed the results, but if you look at the results broken down by model provider, even OpenAI models fall victim to positional bias:

Even the “gold standard” OpenAI LLMs fall victim to positional biases

Conclusion

The allure of AI agents lies in their potential to solve a freeform task by choosing the right tools at the right moment, but the reality is far more nuanced. This experiment highlights that evaluating an agent's performance goes beyond simply checking the final answer and how long it took to get there. Even when an agent ultimately reaches the correct solution, inefficient tool selection driven by inherent biases can impact accuracy, latency, and consistency.

Moreover, even the most advanced LLMs from top providers like OpenAI and Google are not immune to these challenges. The over-selection of tools appearing earlier in the list underscores the need for robust testing frameworks and deeper investigations into the LLM’s decision-making process.

The takeaway? Don’t assume a strong final output implies flawless tool selection. Use testing frameworks like the one shared here to rigorously test, iterate, and refine your agents for better real-world performance. And remember, the right tool at the right time isn’t just magical—it’s measurable.

References

This work came from my lecture & video on agents, which you can find on O’Reilly. Here is the GitHub for both the lecture and the video, as well as the GitHub for my agent framework that I used to perform this test.