
How LLMDex Tracks Benchmark Honesty

The internal process for sourcing, verifying, and de-listing benchmark numbers, and why we leave fields blank.

By LLMDex Editorial

If you've spent five minutes on AI Twitter, you've seen the screenshot: someone pastes a leaderboard, somebody else points out the numbers don't match the model card, a third person posts a different leaderboard with completely different rankings. Benchmark hygiene in the LLM space is bad, and it's been getting worse as the field grows.

LLMDex tracks 80 models. Roughly 60% of them have at least one benchmark number that's plausibly available somewhere on the internet. Of those numbers, we publish maybe 40% on our spec pages. The rest we leave blank, intentionally, and this article explains why.

The problem

Benchmark numbers in LLM evaluation aren't like benchmark numbers for SQL databases or compilers. They're harder to verify, easier to manipulate, and much more sensitive to evaluation context. Five things go wrong:

1. Different versions of the same benchmark

"MMLU" is at least three different evals depending on who ran it: original MMLU (Jan 2021), MMLU-Pro (May 2024), and various lab-specific cleanups. Scores aren't directly comparable across versions, but they're frequently reported with a single label.

2. Different scoring methodologies

Pass@1, pass@10, and pass@100 on HumanEval are dramatically different numbers. Some labs report best-of-N, some report majority vote. A 90 under one methodology can be a 70 under another; the sketch below shows how far apart the same generations can land.
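To make the gap concrete: pass@k is typically computed with the unbiased estimator from the HumanEval paper, which generates n samples per problem, counts the c that pass, and estimates pass@k = 1 - C(n-c, k) / C(n, k). A minimal TypeScript sketch; the sample counts in the usage lines are invented for illustration, not taken from any real model:

```ts
// Unbiased pass@k estimator (Chen et al., 2021):
// pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems,
// where n = samples generated per problem and c = samples that passed.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k draw contains at least one pass
  let failAll = 1.0;
  // C(n - c, k) / C(n, k) = product over i of (n - c - i) / (n - i)
  for (let i = 0; i < k; i++) {
    failAll *= (n - c - i) / (n - i);
  }
  return 1.0 - failAll;
}

// Hypothetical model that passes 40 of 200 samples per problem:
console.log(passAtK(200, 40, 1).toFixed(2));   // ≈ 0.20
console.log(passAtK(200, 40, 10).toFixed(2));  // ≈ 0.90
console.log(passAtK(200, 40, 100).toFixed(2)); // ≈ 1.00
```

The same underlying capability reads as 20, 90, or nearly 100 depending on which k gets reported, which is why a score without its methodology label is close to meaningless.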

3. Self-reported vs independent

Provider model cards are the primary source. They're also the source most likely to be optimistic. Artificial Analysis, Vellum LLM Leaderboard, and the original benchmark authors run independent evaluations that often disagree with the model card by 2-5 percentage points.

4. Eval contamination

Models can memorize benchmark questions if they leak into training data. This is widely acknowledged and partially mitigated by held-out subsets like SWE-bench Verified, MMLU-Pro, and HumanEval+. It's still real, and it's why benchmark scores tend to drift down as evals get cleaned up.

5. Cherry-picking

Labs pick which benchmarks to report. If a model is bad at math, you won't see GSM8K on the model card. The absence of a benchmark is informative.

Our policy

We publish a benchmark number for a model if and only if all four conditions hold:

  1. The number was reported by either the model's official lab, the original benchmark's maintainers, or one of three trusted independent leaderboards (Artificial Analysis, Vellum, LMSYS Chatbot Arena).
  2. The benchmark version is unambiguous (e.g., "MMLU 5-shot" not just "MMLU").
  3. The methodology is unambiguous (e.g., "HumanEval pass@1 zero-shot" not "HumanEval").
  4. We can find at least one corroborating source within ±2 percentage points.

Numbers that don't clear all four conditions get left blank. The UI renders "Benchmark not yet available" and the spec page is honest about it. We don't pad the table with our best guess.
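In code terms, the policy is a gate on each candidate number. The sketch below is illustrative only; the field names are hypothetical and not the actual LLMDex schema:

```ts
// Condition 1 is encoded in the type: only accepted sources are representable.
type TrustedSource =
  | "official-lab"
  | "benchmark-maintainer"
  | "artificial-analysis"
  | "vellum"
  | "lmsys-chatbot-arena";

interface BenchmarkClaim {
  score: number;               // e.g. 88.7
  source: TrustedSource;
  benchmarkVersion?: string;   // e.g. "MMLU 5-shot"        (condition 2)
  methodology?: string;        // e.g. "pass@1, zero-shot"  (condition 3)
  corroboratingScore?: number; // from a second source       (condition 4)
}

function isPublishable(claim: BenchmarkClaim): boolean {
  return (
    Boolean(claim.benchmarkVersion) &&
    Boolean(claim.methodology) &&
    claim.corroboratingScore !== undefined &&
    Math.abs(claim.corroboratingScore - claim.score) <= 2
  );
}
```

Encoding condition 1 in the type means an untrusted source can't even be entered, which is cheaper than checking it at publish time.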

This sounds like a lot of work. It is. The cross-reference validator (run via pnpm check) enforces structure: kebab-case slugs and valid links between models and tasks. The editorial decision about which numbers get published is human work, and we re-review it on every model launch.
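For a sense of what the mechanical half covers, a stripped-down structural check might look like the following. The data shapes are assumptions for illustration, not the real validator:

```ts
// Sketch of the kind of structural checks a `pnpm check` script could run.
// The shapes here are assumed; the real validator will differ.
interface ModelEntry {
  slug: string;        // expected kebab-case, e.g. "example-model-7b"
  taskSlugs: string[]; // must reference tasks that actually exist
}

const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function validateStructure(models: ModelEntry[], knownTasks: Set<string>): string[] {
  const errors: string[] = [];
  for (const model of models) {
    if (!KEBAB_CASE.test(model.slug)) {
      errors.push(`invalid slug: ${model.slug}`);
    }
    for (const task of model.taskSlugs) {
      if (!knownTasks.has(task)) {
        errors.push(`${model.slug} links to unknown task: ${task}`);
      }
    }
  }
  return errors; // empty array means the structural checks pass
}
```

Checks like these catch broken links and malformed slugs automatically; they can't catch a plausible-looking number with no source, which is why the publishing decision stays editorial.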

Why we leave fields blank

Three reasons leaving blanks is better than guessing:

Trust compounds

A site that fabricates one benchmark erodes credibility for every other claim it makes. A site that's honest about uncertainty earns durable trust. Google's E-E-A-T signals reward this; users reward it more. We've watched competitor sites publish plausible-looking numbers for models that don't have public benchmarks, and we've watched their search rankings stagnate while ours climb.

Benchmark coverage is uneven

Some labs publish on every benchmark; others publish on three or four. A spec sheet that pretends every model has every score is implicitly making up data for the labs with sparse reporting. The pattern of which benchmarks a lab does and doesn't report is often more informative than the scores themselves.

Models without public scores are real

Frontier models from OpenAI, Anthropic, and Google ship with detailed model cards. Mid-tier and open-weight models often don't. We publish 80 models, and for many of them, particularly older or smaller models, there genuinely isn't a verifiable MMLU score. Faking one isn't honest.

What we cite when there's a number

Every published benchmark on a model spec page traces back to one of these sources, in order of preference:

  1. The provider's official model card (e.g., the GPT-5 system card, the Claude 4 release blog).
  2. The model's research paper (more common for open-weight releases).
  3. The original benchmark's leaderboard (e.g., the SWE-bench leaderboard at swebench.com).
  4. Artificial Analysis for cross-model benchmarks reported on a consistent methodology.
  5. Vellum LLM Leaderboard for fast-moving cases where Artificial Analysis hasn't caught up.

When sources conflict (and they often do), we publish the model card's number with a small note. We considered publishing both numbers, but the UX clutter wasn't worth the marginal gain in completeness.
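Once every candidate score is tagged with its source, picking the published number is mechanical. A hedged sketch with illustrative names:

```ts
// Source kinds in order of preference; names are illustrative.
const SOURCE_PRIORITY = [
  "model-card",
  "research-paper",
  "benchmark-leaderboard",
  "artificial-analysis",
  "vellum",
] as const;

type SourceKind = (typeof SOURCE_PRIORITY)[number];

interface SourcedScore {
  kind: SourceKind;
  score: number;
  url: string;
}

// Publish the highest-priority source (the model card wins) and keep the
// rest available for the conflict note on the spec page.
function pickPublished(scores: SourcedScore[]): SourcedScore | undefined {
  return [...scores].sort(
    (a, b) => SOURCE_PRIORITY.indexOf(a.kind) - SOURCE_PRIORITY.indexOf(b.kind)
  )[0];
}
```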

What we don't cite

A short list of sources we explicitly don't accept:

  • Twitter screenshots. Even from researchers we respect.
  • Reddit comments. Even when they look authoritative.
  • Marketing comparison charts. Provider-supplied "we're better than X" graphics get filtered out.
  • Benchmark numbers without methodology. A score with no shot-count or temperature isn't reproducible and doesn't get included.

If a number only exists in places we won't cite, we leave the field blank.

How you can verify

Every model spec page on LLMDex is reproducible. The dataset lives in data/models.ts in our public git repo, and every benchmark field is either populated with a sourced number or left undefined. There's no hidden pipeline that backfills numbers we don't have.
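The exact shape of data/models.ts isn't reproduced here, but the principle is that a benchmark that didn't clear the policy is an absent field, not a guessed value. An illustrative sketch:

```ts
// Illustrative only; the real data/models.ts schema may differ.
interface ModelSpec {
  slug: string;
  mmlu?: number;             // sourced number, or omitted entirely
  humanEval?: number;
  sweBenchVerified?: number;
}

const exampleModel: ModelSpec = {
  slug: "example-model-7b",  // hypothetical entry
  humanEval: 72.1,           // sourced and corroborated
  // mmlu and sweBenchVerified are intentionally absent: no verifiable number
  // cleared the policy, so the UI renders "Benchmark not yet available".
};
```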

If you want to audit a number:

  1. Open the model's spec page.
  2. Visit the official provider's model card (linked in the page).
  3. Compare.

If you find a discrepancy, email corrections@llmdex.com or open an issue. We treat verifiable corrections the way newspapers treat printed corrections: prompt, public, and on the page where the error appeared.

Why this matters for SEO

A side benefit of this policy that we didn't fully appreciate when we set it: search engines are getting much better at penalizing AI-generated content with fabricated stats. Google's helpful-content updates over the last two cycles have specifically targeted "AI-spammy" content with plausible-sounding but unverifiable data.

A site that's visibly careful about what it publishes, that says "Benchmark not yet available" instead of guessing, signals editorial discipline. We can't prove it's why our pages rank well, but the correlation is suggestive.

The hard part

The hardest part of this policy isn't the work. It's accepting that some pages will look less complete than they could. A spec sheet with three blank benchmark fields looks worse than one with three plausible numbers. The decision to ship the version with the blanks is editorial discipline, not laziness.

We think it's the right trade. Long-term, the site that's right earns more trust than the site that's complete. We'd rather be right.
