
The Quiet Rise of Tool-Use Benchmarks

BFCL, t-bench, Tau-bench. Why function-calling evals matter more than MMLU now, and how to read them.

By LLMDex Editorial

Two years ago the benchmark conversation was about MMLU. The leaderboard was MMLU, the model card lead was MMLU, the analyst takes were on MMLU. Today, MMLU is a footnote on most flagship releases. The benchmarks people argue about are tool-use evals: Berkeley Function Calling Leaderboard (BFCL), t-bench, Tau-bench, GAIA, and a handful of agent-specific subsets.

This shift mirrors how production AI is actually used in 2026. Most LLM workloads are not pure text generation; they're agents calling tools, RAG pipelines retrieving sources, and structured-output pipelines extracting data. The benchmark that predicts how well your model handles those workloads matters more than the benchmark that predicts academic knowledge.

This article is a practical guide to the four tool-use benchmarks worth tracking, and how to read them.

Why tool-use benchmarks matter

Three reasons:

1. Production AI is mostly tool use

Every chat product worth using is a tool-using agent. ChatGPT searches the web. Claude reads files. Cursor edits code. The pure-text use case (write me a poem, summarize this article) is a small slice of real usage. The rest is "use these tools to accomplish a task."

If a model is bad at tool use, it's bad at the workloads users actually run. MMLU doesn't test this. Tool-use benchmarks do.

2. Failure modes are non-obvious

Tool use can fail in subtle ways: wrong function signature, hallucinated arguments, parallel calls that should be sequential, sequential calls that should be parallel, refusing to call when a call is needed, and calling when it should refuse. None of these failures show up on standard text benchmarks, but all of them show up in production.

Tool-use benchmarks specifically probe these failure modes.
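
To make those failure modes concrete, here is a minimal sketch of ours (not taken from any benchmark) of the kind of check that catches two of them: hallucinated argument names and missing required arguments. The get_weather tool and its fields are invented for illustration.

```python
# Minimal sketch: validate a proposed tool call against the declared schema.
# The tool name and fields below are invented for illustration.

TOOL_SCHEMA = {
    "name": "get_weather",
    "required": {"city"},
    "allowed": {"city", "unit"},
}

def check_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems with a proposed tool call."""
    problems = []
    if call.get("name") != schema["name"]:
        problems.append(f"wrong function: {call.get('name')}")
    args = set(call.get("arguments", {}))
    missing = schema["required"] - args
    hallucinated = args - schema["allowed"]
    if missing:
        problems.append(f"missing required args: {sorted(missing)}")
    if hallucinated:
        problems.append(f"hallucinated args: {sorted(hallucinated)}")
    return problems

# A call that invents a "country" argument is flagged.
print(check_call(
    {"name": "get_weather", "arguments": {"city": "Oslo", "country": "NO"}},
    TOOL_SCHEMA,
))
```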

3. Improvements are real and measurable

Frontier models doubled their BFCL scores between 2023 and 2026. Agent reliability in production tracks the leaderboard movement closely. This isn't true for all benchmarks (MMLU saturated years ago and stopped meaningfully predicting product quality). Tool-use benchmarks are still differentiating.

The four benchmarks worth tracking

Berkeley Function Calling Leaderboard (BFCL)

BFCL is the de facto standard for measuring how well a model calls functions. It tests four things (a rough sketch of what the eval items look like follows the list):

  • Simple function calling. Pick the right function with the right arguments.
  • Parallel function calling. Call multiple functions when the user request needs it.
  • Multiple function calling. Sequential chains of dependent calls.
  • Function calling with relevance detection. Don't call when the user's request doesn't need a tool.
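
Roughly what those items look like, for illustration only; this is not BFCL's actual data format, and the weather tool is invented:

```python
# Illustrative only -- not BFCL's actual data format. The rough shape of what
# the simple, parallel, and relevance categories each ask for.

simple_item = {
    "category": "simple",
    "user": "What's the weather in Oslo?",
    "tools": ["get_weather(city, unit)"],
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Oslo"}},
    ],
}

parallel_item = {
    "category": "parallel",
    "user": "Compare the weather in Oslo and Bergen.",
    "tools": ["get_weather(city, unit)"],
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Oslo"}},
        {"name": "get_weather", "arguments": {"city": "Bergen"}},
    ],
}

relevance_item = {
    "category": "relevance",
    "user": "Write me a haiku about autumn.",
    "tools": ["get_weather(city, unit)"],
    "expected_calls": [],  # the right answer is to make no call at all
}
```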

Top of the leaderboard at the time of writing: GPT-5.5, Claude Opus 4.7, and Gemini 3 Pro, all within 1-2 percentage points of each other. Below that, there's a noticeable gap to mid-tier models.

How to read BFCL: the headline score is "average accuracy" across the four subsets. The subset breakdown is more informative than the average. If your workload is heavy on parallel calls (e.g., a research agent fetching multiple URLs simultaneously), look at the parallel score specifically.
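
One way to act on that: re-weight the subset scores by your own traffic mix instead of trusting the headline average. The numbers below are made-up placeholders, not real leaderboard scores.

```python
# Sketch of weighting BFCL subset scores by workload. All numbers are
# made-up placeholders -- substitute the published per-subset accuracies.

subset_scores = {"simple": 0.95, "parallel": 0.82, "multiple": 0.88, "relevance": 0.90}

# Traffic mix for a hypothetical research agent that fans out many parallel
# fetches and rarely sees tool-free requests.
workload_mix = {"simple": 0.2, "parallel": 0.6, "multiple": 0.1, "relevance": 0.1}

headline = sum(subset_scores.values()) / len(subset_scores)
weighted = sum(subset_scores[k] * workload_mix[k] for k in subset_scores)

print(f"headline average:  {headline:.3f}")   # what the leaderboard shows
print(f"workload-weighted: {weighted:.3f}")   # closer to what your agent will feel
```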

t-bench

t-bench is an academic benchmark that tests tool use with a strong emphasis on following the function specification exactly. Hallucinated arguments are penalized heavily. The benchmark is harder than BFCL and tends to stratify models more visibly.

For production deployments where strict-mode JSON validity matters (and it almost always does), t-bench is a better predictor than BFCL.
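
As a sketch of what strict-mode validity means in practice, here's argument validation with additionalProperties set to false, using the jsonschema package. The refund tool and its fields are invented; this is not t-bench's harness.

```python
# Sketch of strict-mode argument validation. The schema and the proposed call
# are invented for illustration; this is not t-bench's actual harness.
from jsonschema import ValidationError, validate

ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "reason": {"type": "string", "enum": ["damaged", "late", "other"]},
    },
    "required": ["order_id"],
    "additionalProperties": False,  # hallucinated keys are hard failures
}

proposed_args = {"order_id": "A-1042", "refund_amount": 19.99}  # extra key

try:
    validate(instance=proposed_args, schema=ARGS_SCHEMA)
    print("arguments conform to the spec")
except ValidationError as err:
    print(f"strict validation failed: {err.message}")
```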

Tau-bench

Tau-bench is a more application-flavored eval. It simulates customer-service scenarios where a model uses a small set of tools (search the database, update a record, send an email) to complete user requests. The eval scores both task completion and tool-use efficiency.

Tau-bench is the most predictive of "will this model work as a customer support agent" in our experience. If you're building anything in the support automation category, look here first.
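
A toy sketch of the two things that get scored per episode, task completion and tool-use efficiency. The scoring formula is ours, for illustration, not Tau-bench's actual metric.

```python
# Toy per-episode scorer: completion is binary, efficiency compares the number
# of tool calls made to the minimum needed. Our formula, not Tau-bench's.

def score_episode(completed: bool, calls_made: int, min_calls: int) -> dict:
    efficiency = min_calls / calls_made if calls_made else 0.0
    return {"completed": completed, "efficiency": round(efficiency, 2)}

# An agent that resolved the ticket but took 6 tool calls where 3 would do.
print(score_episode(completed=True, calls_made=6, min_calls=3))
```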

GAIA

GAIA is a research-flavored benchmark that tests multi-tool reasoning across web search, file reading, calculation, and synthesis. It's the closest public eval to the "Deep Research" workflow OpenAI and Anthropic ship.

GAIA scores improved dramatically through 2025, from 30% on top models in early 2024 to 75%+ on the best agents today. It's still hard. Anthropic's research workflows and OpenAI's Deep Research dominate GAIA among publicly-evaluated systems.

What to look at

For each model you're evaluating, we'd look at:

  1. BFCL average + parallel + relevance subset scores. A three-number summary is enough.
  2. t-bench score. A single quality floor.
  3. Tau-bench if you're in customer support. Otherwise skip.
  4. GAIA if you're building a research agent. Otherwise skip.

Use these in addition to MMLU and HumanEval, not instead. MMLU still measures something real (broad knowledge). HumanEval still measures coding. Tool-use benchmarks measure agent reliability, which is a separate axis.

How tool-use scores translate to production

Three rules of thumb from our experience:

  1. A 5% gap in BFCL is real in production. Two models within 1% are probably indistinguishable. Two models 5% apart will produce visibly different agent reliability (a quick compounding example follows this list).
  2. The relevance subset matters most. A model that calls tools when it should and declines when it shouldn't is a useful agent. A model that calls aggressively is a failed agent.
  3. The benchmarks lag the models. New models often don't hit the BFCL leaderboard until weeks after release, so the model card numbers are usually the best available signal in the interim.
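
The compounding example promised above, with made-up numbers: a modest per-call gap becomes a large end-to-end gap once calls are chained.

```python
# Back-of-envelope only: made-up per-call accuracies, not benchmark data.
# Over a 10-step tool chain, a 5-point per-call gap turns into roughly a
# 25-point gap in end-to-end success.

steps = 10
for per_call_accuracy in (0.90, 0.95):
    end_to_end = per_call_accuracy ** steps
    print(f"{per_call_accuracy:.0%} per call -> {end_to_end:.0%} success over {steps} steps")
```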

Why MMLU isn't dead

MMLU still matters for:

  • General-knowledge chat use cases.
  • Routing models that need broad understanding.
  • Comparing across model families on a stable historical metric.

But MMLU is now saturated for frontier models: the top three or four cluster within 1-2 points of each other, and small differences don't predict product quality. For the production decisions most engineers care about (which model to deploy in an agent, which model to use for RAG, which model to fine-tune), tool-use benchmarks are the better signal.

Where to find the numbers

  • BFCL, gorilla.cs.berkeley.edu/leaderboard.html
  • t-bench, Tau-bench, published by their respective research labs; check Hugging Face.
  • GAIA, the original paper publishes scores; updated leaderboards live on Hugging Face.

LLMDex spec sheets currently track MMLU, HumanEval, GPQA, and SWE-bench Verified by default. We're adding BFCL coverage as we verify per-model scores. As a working principle: where we can source it, we'll publish it. Where we can't, we don't fabricate.

Verdict

The benchmark conversation in 2026 is dominated by tool-use evals because production AI is dominated by tool-using agents. Track BFCL and t-bench for any agent-shaped workload. Track Tau-bench for support automation. Track GAIA if you're building research workflows. Treat MMLU as a historical reference, not a primary metric.

If you're a buyer comparing models, the tool-use scores tell you more about real-world reliability than the academic scores. If you're a builder, design your evals around the benchmarks that mirror your workload.
