Methodology
How LLMDex collects, verifies, and presents data. Read this before you cite us, or correct us.
1. Data sources
Every model and tool in LLMDex is documented from one or more of these primary sources:
- The provider's official model card or product page.
- The provider's public pricing or API documentation.
- Independent leaderboards: Artificial Analysis, LMSYS Chatbot Arena, Vellum LLM Leaderboard, SWE-bench, GPQA, ARC-AGI public sets.
- Provider blog posts announcing new releases.
Where two sources report conflicting values for the same number, we prefer the model card. Where a number isn't reported anywhere we can verify, we leave the field blank. Always.
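A minimal sketch of that precedence rule, assuming each field carries candidate values tagged by source (the type names and source labels are illustrative, not the actual LLMDex schema):

```ts
// Hypothetical sketch: resolve one numeric field from multiple sources.
// Source labels and the SourcedValue shape are assumptions for illustration.
type Source = "model_card" | "pricing_docs" | "leaderboard" | "blog_post";

interface SourcedValue {
  source: Source;
  value: number;
}

// Precedence: the model card wins on conflict; no verifiable source means blank (null).
function resolveField(values: SourcedValue[]): number | null {
  if (values.length === 0) return null; // never guess
  const card = values.find((v) => v.source === "model_card");
  return card ? card.value : values[0].value;
}
```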
2. Our anti-fabrication policy
The single most important rule of LLMDex is this: we do not invent benchmark numbers. If a model doesn't have a published MMLU score, the page reads "Benchmark not yet available", not a plausible-looking guess. This costs us superficial polish and earns us trust. We think the trade is correct.
The same rule applies to pricing. If a provider's pricing is gated behind a sales conversation, we say so rather than imputing a number from a competitor.
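A minimal sketch of how that rule might look at render time, assuming nullable fields (the function names here are illustrative):

```ts
// Hypothetical sketch: missing data renders as an explicit placeholder,
// never as an imputed or estimated number.
function renderBenchmark(score: number | null): string {
  return score === null ? "Benchmark not yet available" : score.toFixed(1);
}

function renderPrice(perMTokUsd: number | null): string {
  return perMTokUsd === null
    ? "Pricing available on request only" // gated behind a sales conversation
    : `$${perMTokUsd.toFixed(2)} / 1M tokens`;
}
```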
3. Comparison-page verdicts
Every /compare/[a-vs-b] page renders a multi-paragraph verdict block. These verdicts are generated programmatically from the underlying data; they are not LLM-written prose, and they are not the same template across pages. The same function applied to two different models produces meaningfully different output because the input data differs.
Concretely: the verdict reasons about price ratios, context-window deltas, benchmark deltas across each shared metric, modality overlap, openness/license differences, and release-date proximity. You can audit the verdict generator at lib/verdict.ts in the source repo.
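For a sense of the shape of that logic, here is a heavily simplified sketch; the types, field names, and thresholds are illustrative, and the authoritative implementation is the one in lib/verdict.ts:

```ts
// Hypothetical, heavily simplified sketch of a verdict generator.
// Field names and thresholds are illustrative; see lib/verdict.ts for the real logic.
interface ModelFacts {
  name: string;
  inputPricePerMTok: number | null;   // USD per 1M input tokens, null if not public
  contextWindow: number;              // tokens
  benchmarks: Record<string, number>; // e.g. { "MMLU": 88.7 }
}

function verdict(a: ModelFacts, b: ModelFacts): string[] {
  const lines: string[] = [];

  // Price ratio, only when both prices are public.
  if (a.inputPricePerMTok !== null && b.inputPricePerMTok !== null) {
    const ratio = a.inputPricePerMTok / b.inputPricePerMTok;
    if (ratio < 0.8) lines.push(`${a.name} is roughly ${(1 / ratio).toFixed(1)}x cheaper on input tokens.`);
    else if (ratio > 1.25) lines.push(`${b.name} is roughly ${ratio.toFixed(1)}x cheaper on input tokens.`);
  }

  // Context-window delta.
  if (a.contextWindow !== b.contextWindow) {
    const bigger = a.contextWindow > b.contextWindow ? a : b;
    lines.push(`${bigger.name} offers the larger context window (${bigger.contextWindow.toLocaleString()} tokens).`);
  }

  // Benchmark deltas, only across metrics both models actually report.
  for (const metric of Object.keys(a.benchmarks).filter((m) => m in b.benchmarks)) {
    const delta = a.benchmarks[metric] - b.benchmarks[metric];
    if (Math.abs(delta) >= 1) {
      const leader = delta > 0 ? a.name : b.name;
      lines.push(`${leader} leads on ${metric} by ${Math.abs(delta).toFixed(1)} points.`);
    }
  }

  return lines;
}
```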
4. "Best for X" rankings
Each /best/[task] page ranks 5 to 10 models. Rankings are curated by hand based on the criteria listed at the top of each page. For example, the "Best LLM for Coding" page lists SWE-bench Verified, HumanEval, long-context support, and tool-calling reliability as its ranking factors.
We don't use a single composite score. Different tasks reward different traits, and a generic average tends to rank everything identically. Where two models are close, we prefer the one with stronger production deployments at the time of writing.
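One way to picture the data behind such a page, purely as an illustration; this shape is an assumption, not the site's actual schema:

```ts
// Hypothetical sketch of a hand-curated "Best for X" page definition.
// Field names are illustrative, not the actual LLMDex data model.
interface BestPage {
  task: string;                               // e.g. "coding"
  criteria: string[];                         // listed at the top of the page
  ranking: { model: string; note: string }[]; // hand-ordered, each with a short rationale
}

const bestForCoding: BestPage = {
  task: "coding",
  criteria: ["SWE-bench Verified", "HumanEval", "long-context support", "tool-calling reliability"],
  ranking: [
    // Entries are curated by hand against the criteria above, not computed from a composite score.
  ],
};
```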
5. Affiliate disclosure
LLMDex links to provider pages. Some of those links carry a referral code that pays a fraction of any signup back to LLMDex. These links are visibly marked as "sponsored" on every page and use rel="sponsored noopener" for transparency to search engines and users. We never adjust rankings, ratings, or comparisons based on whether a partner pays us. Where we have an affiliate relationship with one option but not its alternative, we link to both, and the partner is listed first only when it genuinely ranks first.
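As a sketch of the markup that policy implies (the helper is hypothetical; only the rel value comes from the policy above):

```ts
// Hypothetical sketch of how a partner link could be marked up.
// Function and class names are illustrative; the rel value matches the stated policy.
function partnerLink(href: string, label: string, sponsored: boolean): string {
  const rel = sponsored ? "sponsored noopener" : "noopener";
  const badge = sponsored ? ' <span class="badge">sponsored</span>' : "";
  return `<a href="${href}" rel="${rel}">${label}</a>${badge}`;
}
```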
6. Update cadence
The dataset gets reviewed weekly. New model launches typically go live within 7 days. Pricing changes are caught on the next review. The footer displays a "last updated" date that reflects the most recent dataset commit.
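One way a build step could derive that date is from the dataset's git history; the path and wiring below are assumptions, not necessarily how the site does it:

```ts
// Hypothetical sketch: read the footer's "last updated" date from the most
// recent commit touching the dataset. The dataset path is an assumption.
import { execSync } from "node:child_process";

function lastUpdated(datasetPath = "data/models.json"): string {
  const iso = execSync(`git log -1 --format=%cI -- ${datasetPath}`, {
    encoding: "utf8",
  }).trim();
  return iso.slice(0, 10); // YYYY-MM-DD
}
```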
7. Errors and corrections
We get things wrong. When you spot one, email corrections@llmdex.com or open an issue against the public repo. We treat verifiable corrections the way newspapers treat masthead corrections: prompt, public, and on the page where the error happened.
8. What we don't do
- No paid placements. Money does not move ranks.
- No AI-generated filler prose. Comparisons are programmatic, not stochastic. Long-form content (this page, blog articles) is written by humans.
- No private benchmark sets. If we can't cite it, we don't use it.