
The Two Rules of Honest AI Data

Don't fabricate. Don't omit context. The full editorial standard behind LLMDex's data and how to apply it to your own work.

By LLMDex Editorial

If you publish data about AI models (a leaderboard, a comparison page, a buyer's guide, a Twitter thread, a slide in a pitch deck), you're making editorial choices whether you realize it or not. The rate of fabrication and selective omission in the AI data space is high. The rate at which it's called out is low. This article is the full editorial standard we use at LLMDex, distilled to the two rules we actually apply.

Rule 1: Don't fabricate

This rule is shorter to state than to follow. Don't publish a number you can't source. Don't fill a benchmark cell because the table looks ugly with blanks. Don't approximate a release date because "it was sometime that quarter."

Three places fabrication sneaks in:

1. Plausible-looking placeholders

The temptation is to put a number that looks right into a blank cell, especially if the rest of the row is populated. "GPT-X scored around 88 on MMLU based on what we'd expect" is fabrication. "Benchmark not yet available" is honesty. The cell with the right answer might be the empty one.

2. Composite scores

A composite that says "GPT-5 scored 91 on coding" might be averaging HumanEval, SWE-bench, and a third number. That average is a derived statistic, not a benchmark; it should be labeled as such. Slapping the composite in a "score" cell as if it were a single benchmark is misleading.
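One way to keep a composite honest in a dataset is to make its derived nature part of the data shape, so a renderer can't accidentally present it as a single benchmark. A minimal sketch (the type names, field names, and scores below are hypothetical, not LLMDex's actual schema):

```typescript
// Hypothetical schema: a composite is a distinct type from a raw
// benchmark score, and it records what was averaged and how.
interface BenchmarkScore {
  benchmark: string; // e.g. "HumanEval" (illustrative name)
  score: number;     // on a 0-100 scale
}

interface CompositeScore {
  kind: "composite";        // never rendered as a bare benchmark cell
  components: string[];     // which benchmarks went in, so readers can check
  method: "unweighted-mean";
  value: number;
}

function composite(scores: BenchmarkScore[]): CompositeScore {
  const mean =
    scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  return {
    kind: "composite",
    components: scores.map((s) => s.benchmark),
    method: "unweighted-mean",
    value: Math.round(mean * 10) / 10, // one decimal place
  };
}
```

A cell built this way can be labeled "composite of HumanEval, SWE-bench (unweighted mean)" rather than shown as a single score.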

3. Out-of-date numbers presented as current

A benchmark from 2023 published with a 2026 timestamp implies the model performs that way today. Many models have been updated, fine-tuned, or had their post-training redone. The honest thing is to date-stamp the benchmark or update it.

How to actually not fabricate

Three operational habits:

1. Source every number

Before publishing, write down where the number came from. If you can't, don't publish. We literally maintain a comment in our model dataset for cells that took non-obvious sourcing; it's a habit that prevents the "I forgot where this came from" drift.
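The habit can be enforced structurally: store the source alongside the value, and refuse to publish a number that lacks one. A sketch, assuming a hypothetical cell shape (not LLMDex's actual dataset format):

```typescript
// Hypothetical cell shape: every published number carries its source.
interface SourcedValue {
  value: number;
  source: string; // URL or citation for where the number came from
  note?: string;  // optional comment for non-obvious sourcing
}

// A blank cell is fine (it renders as "not yet available");
// an unsourced number is not.
function publishable(cell: SourcedValue | null): boolean {
  if (cell === null) return true;
  return cell.source.trim().length > 0;
}
```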

2. Make blanks visible

Don't hide unknowns. Make the UI render "Benchmark not yet available" or "Pricing not published." Visible gaps are stronger trust signals than full tables. Our spec pages are full of explicit blanks; our analytics say users find this more trustworthy, not less.
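Rendering the gap explicitly is a small amount of code. A sketch of the idea, with a hypothetical cell type (the union shape and reason strings are illustrative):

```typescript
// Hypothetical cell type: "missing" is a first-class state,
// not an empty string the UI quietly skips.
type Cell =
  | { kind: "benchmark"; value: number }
  | { kind: "missing"; reason: "benchmark" | "pricing" };

function renderCell(cell: Cell): string {
  switch (cell.kind) {
    case "benchmark":
      return cell.value.toFixed(1);
    case "missing":
      return cell.reason === "pricing"
        ? "Pricing not published"
        : "Benchmark not yet available";
  }
}
```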

3. Build a verification step

We have a script (pnpm check) that asserts cross-references and bench-value sanity. It can't verify whether a number is correct, but it can flag inconsistencies and impossible values. The discipline of running it before commit catches a meaningful fraction of errors.
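The kinds of checks such a script can run are simple: range bounds, duplicates, cross-reference integrity. A sketch of two of them, with hypothetical row and field names (this is not the actual `pnpm check` implementation):

```typescript
// Hypothetical sanity checks: they can't prove a score is correct,
// only flag impossible or inconsistent values.
interface BenchRow {
  model: string;
  benchmark: string;
  score: number; // expected on a 0-100 scale
}

function sanityProblems(rows: BenchRow[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const r of rows) {
    // Impossible values: scores outside the benchmark's scale.
    if (r.score < 0 || r.score > 100) {
      problems.push(`${r.model}/${r.benchmark}: impossible score ${r.score}`);
    }
    // Inconsistencies: the same model/benchmark pair listed twice.
    const key = `${r.model}/${r.benchmark}`;
    if (seen.has(key)) {
      problems.push(`${key}: duplicate entry`);
    }
    seen.add(key);
  }
  return problems;
}
```

Run before commit, a check like this catches the mechanical errors so human review can focus on the substantive ones.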

Rule 2: Don't omit context

The harder rule. A number can be correctly sourced, accurately reported, and still misleading if the context is missing.

Three classic context-omissions:

1. Methodology

"GPT-X scored 92 on HumanEval" without "pass@1, zero-shot, temp=0" is missing essential context. The same model under different methodology gets different numbers. Without the suffix, the number is non-comparable.

2. Time

"Claude is the leader on coding": was that true last week, last quarter, or two years ago? The leaderboard moves. A claim without a date implicitly means "as of now," and "now" rolls forward without the claim updating. Date-stamp every claim.

3. Benchmark version

MMLU and MMLU-Pro are not the same benchmark. HumanEval and HumanEval+ are not the same benchmark. SWE-bench and SWE-bench Verified are not the same benchmark. A score against the older version doesn't tell you anything about performance on the harder version.

The harder editorial calls

Two cases where the rules conflict and require human judgment:

When sources disagree

The model card says 89.3 on GPQA. Artificial Analysis says 87.1. Which do you publish?

Our policy: prefer the source closest to the test. For a benchmark with a well-maintained leaderboard (SWE-bench, ARC-AGI), prefer the leaderboard. For a benchmark that depends on the model's setup (sampling temperature, prompt format), prefer the model card. When in doubt, publish a range or note the discrepancy.
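The policy above can be sketched as a resolution rule: prefer the reading from the source closest to the test, and fall back to a range when the preferred source has no reading. The function and field names are hypothetical, a sketch of the policy rather than production code:

```typescript
// Hypothetical resolution of disagreeing sources.
type SourceKind = "leaderboard" | "model-card";

interface Reading {
  kind: SourceKind;
  score: number;
}

function resolve(
  benchmarkHasLeaderboard: boolean,
  readings: Reading[]
): { score: number } | { range: [number, number] } {
  // Leaderboard-maintained benchmarks: trust the leaderboard.
  // Setup-sensitive benchmarks: trust the model card.
  const preferred: SourceKind = benchmarkHasLeaderboard
    ? "leaderboard"
    : "model-card";
  const match = readings.find((r) => r.kind === preferred);
  if (match) return { score: match.score };
  // No reading from the preferred source: publish the range instead.
  const scores = readings.map((r) => r.score);
  return { range: [Math.min(...scores), Math.max(...scores)] };
}
```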

When the methodology is unclear

A provider publishes a number with no methodology section. Do you include it?

Our policy: include with explicit caveat ("provider-reported, methodology unspecified") or omit. Both are defensible. Including with caveat is better for completeness; omitting is better for purity. We mostly include-with-caveat.

Why these rules matter

Three reasons honest data is a long-term moat:

1. Search engines reward it

Google's helpful-content updates over the last two years have increasingly penalized AI-generated reference content with fabricated stats. Sites that visibly source, admit unknowns, and don't pad their tables are doing better in rankings. This is a strong signal that honest publishing and optimized publishing point the same way.

2. Sophisticated readers verify

The audience for serious AI reference content (engineers, founders, analysts) checks. Cite a number that doesn't trace back to a primary source, and someone will email you. Publish a fabricated number, and someone will catch it. The cost of being caught is high; the cost of being honest is low.

3. Trust compounds

A site that has been right for two years builds a reputation that's hard to dislodge. A site that's been caught fabricating once never recovers, even if subsequent data is accurate. The asymmetry rewards the careful and punishes the lazy.

Apply this to your own work

If you're publishing AI data, even just a Twitter thread, three quick checks:

  1. Can you cite every number? If not, edit it out or qualify it.
  2. Can a reader reproduce your methodology? If not, add it or note that you can't.
  3. Are you presenting unknowns as knowns? If yes, change the framing.

The bar isn't "be flawless." It's "be the kind of source other sources cite." That's a much lower bar than perfection and a much higher bar than "post and forget."

A short list of places to be especially careful

Three specific traps:

  • Cost-per-quality charts. It's almost impossible to plot model quality vs cost without a value-laden choice of "quality." Be explicit about what you're plotting.
  • Composite leaderboards. If you average across benchmarks, label the composite. Different averages produce different rankings.
  • Marketing comparisons. "Model X is 47% better than Y" almost always hides a methodology that's not falsifiable. Skeptical sources outperform definitive ones over time.

The honest endpoint

We've been at this for over a year. Two years ago we'd have said "data quality is a feature." Today we'd say it's the feature. The site that has the right number when others have the wrong number is the site readers come back to. That's the whole position.

If your editorial standard isn't already these two rules, retrofit them. The work is real and the payoff is durable.
