The Two Rules of Honest AI Data
Don't fabricate. Don't omit context. The full editorial standard behind LLMDex's data and how to apply it to your own work.
If you publish data about AI models (a leaderboard, a comparison page, a buyer's guide, a Twitter thread, a slide in a pitch deck), you're making editorial choices whether you realize it or not. The rate of fabrication and selective omission in the AI data space is high. The rate at which it's called out is low. This article is the full editorial standard we use at LLMDex, distilled to the two rules we actually apply.
Rule 1: Don't fabricate
This rule is shorter to state than to follow. Don't publish a number you can't source. Don't fill a benchmark cell because the table looks ugly with blanks. Don't approximate a release date because "it was sometime that quarter."
Three places fabrication sneaks in:
1. Plausible-looking placeholders
The temptation is to put a number that looks right into a blank cell, especially if the rest of the row is populated. "GPT-X scored around 88 on MMLU based on what we'd expect" is fabrication. "Benchmark not yet available" is honesty. The cell with the right answer might be the empty one.
2. Composite scores
A composite that says "GPT-5 scored 91 on coding" might be averaging HumanEval, SWE-bench, and a third number. That average is a derived statistic, not a benchmark; it should be labeled as such. Slapping the composite in a "score" cell as if it were a single benchmark is misleading.
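If you do publish a composite, one option is to make "derived" a first-class label in the data rather than a presentation detail. A minimal sketch in TypeScript, with hypothetical benchmark names and values:

```ts
// Sketch: "derived" is a property of the data, not a footnote.
// Benchmark names and values are hypothetical.
type Score =
  | { kind: "benchmark"; name: string; value: number; source: string }
  | { kind: "composite"; name: string; value: number; derivedFrom: string[] };

const coding: Score = {
  kind: "composite",
  name: "Coding (unweighted mean)",
  value: (88.0 + 74.5 + 91.2) / 3, // three separate benchmark scores
  derivedFrom: ["HumanEval", "SWE-bench", "LiveCodeBench"],
};
```

A renderer can then badge composites differently from single benchmarks, so the label survives any redesign.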
3. Out-of-date numbers presented as current
A benchmark from 2023 published with a 2026 timestamp implies the model performs that way today. Many models have been updated, fine-tuned, or had their post-training redone. The honest thing is to date-stamp the benchmark or update it.
How to actually not fabricate
Three operational habits:
1. Source every number
Before publishing, write down where the number came from. If you can't, don't publish. We literally maintain a comment in our model dataset for cells that took non-obvious sourcing; it's a habit that prevents the "I forgot where this came from" drift.
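One way to enforce the habit structurally is to make the source field required in the dataset schema, so an unsourced number can't ship. A sketch, assuming a TypeScript dataset (field names are illustrative, not LLMDex's actual schema):

```ts
// Sketch: an unsourced number fails to type-check, so it can't ship.
// Field names are illustrative, not LLMDex's actual schema.
interface SourcedValue {
  value: number;
  source: string;    // URL or citation; required, never optional
  retrieved: string; // ISO date the number was pulled
  note?: string;     // the comment habit, for non-obvious sourcing
}

const gpqa: SourcedValue = {
  value: 87.1, // hypothetical
  source: "https://example.com/model-card", // hypothetical URL
  retrieved: "2026-01-15",
};
```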
2. Make blanks visible
Don't hide unknowns. Make the UI render "Benchmark not yet available" or "Pricing not published." Visible gaps are stronger trust signals than full tables. Our spec pages are full of explicit blanks; our analytics say users find this more trustworthy, not less.
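In code, this is as simple as refusing to coerce a null into anything that looks like a value. A hedged sketch:

```ts
// Sketch: a null score renders as explicit text, never as a guess or a dash.
function renderBenchmark(score: number | null): string {
  if (score === null) return "Benchmark not yet available";
  return score.toFixed(1);
}

renderBenchmark(null); // "Benchmark not yet available"
renderBenchmark(88.4); // "88.4" (hypothetical value)
```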
3. Build a verification step
We have a script (pnpm check) that asserts cross-references and sanity-checks benchmark values. It can't verify whether a number is correct, but it can flag inconsistencies and impossible values. The discipline of running it before every commit catches a meaningful fraction of errors.
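For illustration, here's the kind of assertion such a script might run. This is a sketch of the idea, not the actual pnpm check implementation:

```ts
// Sketch of the kind of checks a pre-commit script can assert.
// Illustrative only; not the actual pnpm check implementation.
interface BenchCell {
  model: string;
  bench: string;
  value: number;
  source?: string;
}

function check(cells: BenchCell[]): string[] {
  const errors: string[] = [];
  for (const c of cells) {
    // Percentage-style benchmarks must land in [0, 100].
    if (c.value < 0 || c.value > 100) {
      errors.push(`${c.model}/${c.bench}: impossible value ${c.value}`);
    }
    // Every published number needs a recorded source.
    if (!c.source) {
      errors.push(`${c.model}/${c.bench}: missing source`);
    }
  }
  return errors;
}
```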
Rule 2: Don't omit context
The harder rule. A number can be correctly sourced, accurately reported, and still misleading if the context is missing.
Three classic context-omissions:
1. Methodology
"GPT-X scored 92 on HumanEval" without "pass@1, zero-shot, temp=0" is missing essential context. The same model under different methodology gets different numbers. Without the suffix, the number is non-comparable.
2. Time
"Claude is the leader on coding", was that true last week, last quarter, or two years ago? The leaderboard moves. A claim without a date is implicit: "as of now," and "now" rolls forward without the claim updating. Date-stamp every claim.
3. Benchmark version
MMLU and MMLU-Pro are not the same benchmark. HumanEval and HumanEval+ are not the same benchmark. SWE-bench and SWE-bench Verified are not the same benchmark. A score against the older version doesn't tell you anything about performance on the harder version.
The harder editorial calls
Two cases where the rules conflict and require human judgment:
When sources disagree
The model card says 89.3 on GPQA. Artificial Analysis says 87.1. Which do you publish?
Our policy: prefer the source closest to the test. For a benchmark with a well-maintained leaderboard (SWE-bench, ARC-AGI), prefer the leaderboard. For a benchmark that depends on the model's setup (sampling temperature, prompt format), prefer the model card. When in doubt, publish a range or note the discrepancy.
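The "publish a range" fallback can also live in the data model. A sketch, using the numbers from the example above (field names are illustrative):

```ts
// Sketch: when two credible sources disagree, publish a range instead of
// silently picking one. Field names are illustrative.
type Published =
  | { kind: "single"; value: number; source: string }
  | { kind: "range"; low: number; high: number; sources: string[] };

function reconcile(
  a: { value: number; source: string },
  b: { value: number; source: string },
): Published {
  if (a.value === b.value) {
    return { kind: "single", value: a.value, source: a.source };
  }
  return {
    kind: "range",
    low: Math.min(a.value, b.value),
    high: Math.max(a.value, b.value),
    sources: [a.source, b.source],
  };
}

reconcile(
  { value: 89.3, source: "model card" },
  { value: 87.1, source: "Artificial Analysis" },
);
// => { kind: "range", low: 87.1, high: 89.3, sources: [...] }
```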
When the methodology is unclear
A provider publishes a number with no methodology section. Do you include it?
Our policy: include with an explicit caveat ("provider-reported, methodology unspecified") or omit. Both are defensible. Including with a caveat is better for completeness; omitting is better for purity. We mostly include with the caveat.
Why these rules matter
Three reasons honest data is a long-term moat:
1. Search engines reward it
Google's helpful-content updates over the last two years have increasingly penalized AI-generated reference content with fabricated stats. Sites that visibly source, that admit unknowns, that don't pad their tables are doing better in rankings. This is a strong signal that honesty and optimization are aligned.
2. Sophisticated readers verify
The audience for serious AI reference content (engineers, founders, analysts) checks. Cite a number that doesn't trace back to a primary source, and someone will email you. Publish a fabricated number, and someone will catch it. The cost of being caught is high; the cost of being honest is low.
3. Trust compounds
A site that has been right for two years builds a reputation that's hard to dislodge. A site that's been caught fabricating once never recovers, even if subsequent data is accurate. The asymmetry rewards the careful and punishes the lazy.
Apply this to your own work
If you're publishing AI data, even just a Twitter thread, run three quick checks:
- Can you cite every number? If not, edit it out or qualify it.
- Can a reader reproduce your methodology? If not, add it or note that you can't.
- Are you presenting unknowns as knowns? If yes, change the framing.
The bar isn't "be flawless." It's "be the kind of source other sources cite." That's a much lower bar than perfection and a much higher bar than "post and forget."
A short list of places to be especially careful
Three specific traps:
- Cost-per-quality charts. It's almost impossible to plot model quality vs cost without a value-laden choice of "quality." Be explicit about what you're plotting.
- Composite leaderboards. If you average across benchmarks, label the composite; different averages produce different rankings (see the sketch after this list).
- Marketing comparisons. "Model X is 47% better than Y" almost always hides a methodology that's not falsifiable. Skeptical sources outperform definitive ones over time.
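To see how much the averaging choice matters, here's a toy example with two hypothetical models and entirely made-up scores:

```ts
// Toy example: the same two models, two defensible averages, two rankings.
// All scores are made up for illustration.
const scores = {
  modelA: { humanEval: 95, sweBench: 50 },
  modelB: { humanEval: 70, sweBench: 65 },
};

type Row = { humanEval: number; sweBench: number };

// Unweighted mean of the two benchmarks.
const mean = (s: Row) => (s.humanEval + s.sweBench) / 2;
// Weighted toward the harder, agentic benchmark.
const weighted = (s: Row) => 0.3 * s.humanEval + 0.7 * s.sweBench;

mean(scores.modelA);     // 72.5 -> model A ranks first on the unweighted mean
mean(scores.modelB);     // 67.5
weighted(scores.modelA); // 63.5
weighted(scores.modelB); // 66.5 -> model B ranks first on the weighted mean
```

Neither ranking is wrong; they answer different questions, which is exactly why the weighting has to be labeled.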
The honest endpoint
We've been at this for over a year. Two years ago we'd have said "data quality is a feature." Today we'd say it's the feature. The site that has the right number when others have the wrong number is the site readers come back to. That's the whole position.
If your editorial standard isn't already these two rules, retrofit them. The work is real and the payoff is durable.
Further reading
- The Five LLM Myths That Won't Die
Reasoning models hallucinate too. Open-weight is not always cheaper. And three more myths the AI Twitter consensus needs to retire.
- Why We Built LLMDex
A short story about how an internal model-tracking spreadsheet became a public site, and what we learned along the way.
- How to Read an AI Benchmark: A Skeptical Reader's Guide
MMLU, HumanEval, SWE-bench, GPQA, what they actually measure, how providers game them, and how to think about benchmark numbers in 2026.