When we set out to measure how AI represents brands, the first question we faced wasn't a product question — it was a sampling question. Which models do you ask? Pick too few and you miss the disagreement that makes consensus meaningful. Pick too many and you inflate cost without adding signal. The right answer is somewhere in the middle, but "the middle" turned out to be more contested than we expected.
Every brand-monitoring tool on the market makes this decision implicitly. SEMRush AI Visibility tracks four models. Otterly tracks six. Profound tracks eight. None of them publish why. We wanted to publish why.
So we ran a bake-off. Eleven models, sixty brand questions, three independent runs each. The goal: find the smallest set of models that captures the most signal about how AI represents brands today — and admit, in public, the trade-offs we made.
The setup
We picked sixty questions across six categories — pricing, products, founding, competitors, support, and reputation. Each question got fed to all eleven models, three times, with temperature held constant at 0.2 to dampen but not eliminate stochasticity. That gave us 1,980 model responses to grade.
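The response count follows directly from the grid of questions, models, and runs. A minimal sketch of that grid, assuming the sixty questions split evenly across the six categories (the post does not say how they were divided) and using placeholder model and question identifiers:

```python
from itertools import product

CATEGORIES = ["pricing", "products", "founding", "competitors", "support", "reputation"]
QUESTIONS_PER_CATEGORY = 10  # assumption: 60 questions split evenly over 6 categories
N_MODELS = 11
N_RUNS = 3
TEMPERATURE = 0.2


def build_run_grid():
    """Enumerate every (question, model, run) combination to execute."""
    # Placeholder identifiers; real question text and model names would go here.
    question_ids = [
        f"{category}-{i:02d}"
        for category in CATEGORIES
        for i in range(1, QUESTIONS_PER_CATEGORY + 1)
    ]
    model_ids = [f"model-{m:02d}" for m in range(1, N_MODELS + 1)]
    return [
        {"question": q, "model": m, "run": r, "temperature": TEMPERATURE}
        for q, m, r in product(question_ids, model_ids, range(1, N_RUNS + 1))
    ]


grid = build_run_grid()
print(len(grid))  # 1980 = 60 questions x 11 models x 3 runs
```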
Grading was the hard part. We needed an inter-model agreement score that treated "the answer is wrong" and "I don't know" differently from "the models gave three slightly different correct answers." So we built a three-axis rubric:
- Factual accuracy — Was the claim true, according to the brand's own site?
- Inter-model agreement — Cohen's kappa across model pairs, adapted for multi-class verdicts.
- Confidence calibration — Did the model hedge appropriately when it didn't know?
Cohen's kappa was designed for human raters. Adapting it to stochastic LLMs is non-trivial — we treat each model-question pair as a single rater, then compute pairwise kappa across all three runs. Full code and dataset are open-sourced. See the methodology repo for the gnarly details.
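For readers who want the shape of the agreement axis without opening the repo, here is a rough sketch of pairwise Cohen's kappa over multi-class verdicts. It is not the code from the methodology repo. It assumes each model's verdicts are flattened into one list with one entry per (question, run) item, and the labels "correct", "wrong", and "refused" as well as the model names are illustrative.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical verdicts."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and the same length")
    n = len(a)
    agreements = sum(x == y for x, y in zip(a, b))
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement, kept as an integer numerator over n * n.
    chance = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys())
    if chance == n * n:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports full agreement.
        return 1.0
    observed = agreements / n
    expected = chance / (n * n)
    return (observed - expected) / (1 - expected)


def pairwise_kappa(verdicts):
    """Kappa for every pair of models; verdicts maps model name -> verdict list."""
    return {
        (m1, m2): cohen_kappa(verdicts[m1], verdicts[m2])
        for m1, m2 in combinations(sorted(verdicts), 2)
    }


# Hypothetical verdicts for four (question, run) items.
verdicts = {
    "model_a": ["correct", "correct", "wrong", "refused"],
    "model_b": ["correct", "wrong", "wrong", "refused"],
    "model_c": ["correct", "correct", "wrong", "refused"],
}
kappas = pairwise_kappa(verdicts)
for pair, value in kappas.items():
    print(pair, round(value, 3))
```

Keeping "refused" as its own class is what lets the score separate "I don't know" from a wrong answer, instead of collapsing both into a single miss.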
The eleven models we tested
The lineup: ChatGPT (GPT-4o, GPT-4o-mini), Claude (3.5 Sonnet, 3 Haiku), Gemini (1.5 Pro, 1.5 Flash), Perplexity (sonar-pro), Mistral (Large 2), Cohere (Command-R+), Llama (3.1 405B), and Grok (2). We left out smaller fine-tunes and open-weight derivatives — not because they don't matter, but because their answers tended to track their parent model closely enough to not add signal.
We ran each question through each model three times. We graded each response. Then we asked: which subset of models, taken together, explains the most variance in the full eleven-model agreement matrix?
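To make the subset question concrete, here is a simplified sketch. It is not the principal-component analysis used in the study. It swaps in a plainer proxy: an item counts as "disagreement" when the models do not all return the same verdict, and a subset is scored by the share of the full panel's disagreement items on which the subset also disagrees. The model names and verdicts are made up.

```python
from itertools import combinations


def disagreement_items(verdicts, models):
    """Indices of items where the given models do not all return the same verdict."""
    n_items = len(next(iter(verdicts.values())))
    return {
        i for i in range(n_items)
        if len({verdicts[m][i] for m in models}) > 1
    }


def best_subset(verdicts, k):
    """Exhaustively find the k-model subset covering the most panel disagreement."""
    all_models = sorted(verdicts)
    full = disagreement_items(verdicts, all_models)
    if not full:
        # No disagreement anywhere: every subset is trivially complete.
        return tuple(all_models[:k]), 1.0
    best, best_score = None, -1.0
    for subset in combinations(all_models, k):
        score = len(disagreement_items(verdicts, subset)) / len(full)
        if score > best_score:
            best, best_score = subset, score
    return best, best_score


# Hypothetical verdicts: m1 and m2 are near-duplicates, m3 and m4 each dissent once.
verdicts = {
    "m1": ["correct", "correct", "wrong", "correct"],
    "m2": ["correct", "correct", "wrong", "correct"],
    "m3": ["correct", "wrong", "wrong", "correct"],
    "m4": ["correct", "correct", "correct", "correct"],
}
subset, coverage = best_subset(verdicts, 2)
print(subset, coverage)
```

With eleven models and subsets of three there are only 165 combinations, so an exhaustive search like this stays cheap. The toy data also shows why near-duplicate models add nothing: pairing m1 with m2 covers no disagreement at all.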
The answer, when we ran a principal component analysis on that matrix, was surprising.
"Three models captured 94% of the disagreement signal. Five models captured 97%. Eleven models captured 100%, at four times the cost. Diminishing returns came fast."
From the methodology paper, draft 0.4
The three we picked
ChatGPT, Gemini, and Perplexity. Here's why each earned its spot — and why two obvious candidates didn't.
ChatGPT (GPT-4o)
The market-share argument is enough on its own. ChatGPT serves roughly 60% of consumer AI queries. If a brand is misrepresented in ChatGPT, it's misrepresented to most of the people asking. But there's a methodological reason too: GPT-4o has an idiosyncratic answering pattern — confidently terse, often web-grounded — that disagrees with every other model in distinctive ways. Drop ChatGPT and you lose more signal than dropping any other single model.
Gemini (1.5 Pro)
Gemini's grounding is fundamentally different. Where GPT pulls heavily from web search, Gemini integrates Google's knowledge graph and AI Overview infrastructure. This makes Gemini disagree with ChatGPT on exactly the queries where the difference matters: structured facts. Founding dates, employee counts, product categories. If you only run ChatGPT, you'll never notice the disagreements that Gemini surfaces.
Perplexity (sonar-pro)
Perplexity is the outlier — and that's the point. It's the only model in our set that refuses to answer at meaningful rates. About 18% of our pricing-question runs got an "I don't know" from Perplexity. That refusal pattern is a signal: it tells you which brand facts are genuinely hard to find on the public web. ChatGPT and Gemini will confabulate before they'll refuse. Perplexity won't.
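A refusal rate like the 18% above is a per-category tally over graded runs. A minimal sketch, assuming each graded run is a record with a category and a verdict, and that refusals are labeled "refused" (both the field names and the label are illustrative, not the rubric's actual schema):

```python
from collections import defaultdict


def refusal_rate_by_category(graded_runs):
    """Fraction of runs per category whose verdict is a refusal."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for run in graded_runs:
        totals[run["category"]] += 1
        if run["verdict"] == "refused":
            refusals[run["category"]] += 1
    return {category: refusals[category] / totals[category] for category in totals}


# Hypothetical graded runs for one model.
graded_runs = [
    {"category": "pricing", "verdict": "refused"},
    {"category": "pricing", "verdict": "correct"},
    {"category": "pricing", "verdict": "wrong"},
    {"category": "pricing", "verdict": "correct"},
    {"category": "pricing", "verdict": "correct"},
    {"category": "founding", "verdict": "correct"},
    {"category": "founding", "verdict": "correct"},
]
rates = refusal_rate_by_category(graded_runs)
print(rates)
```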
The omission: why Claude didn't make the cut
Claude 3.5 Sonnet is, by most measures, the best raw reasoner of the eleven. It scored highest on factual accuracy when given good context. So why isn't it one of our three?
Because Claude agrees with ChatGPT 81% of the time. The two models share enough training data and human-feedback signal that they make remarkably similar mistakes — and reach remarkably similar correct answers — on brand questions. For a tool that measures disagreement, two models that mostly agree are redundant.
This was the most counterintuitive finding of the entire study. We expected Claude to be the natural third pick. The data said otherwise. Diversity of method matters more than quality of method when you're sampling for consensus.
We will revisit this as Claude's web-search and grounding capabilities evolve. For now, Claude sits in our "watch list" — we run it monthly as a sanity check, but not as a primary signal source.
What this means for monitoring
If you're building a brand-monitoring system — or evaluating one — here are the operational takeaways:
- Three is enough. The marginal signal from the fourth model is small. The cost is not. For monthly scans at scale, a three-provider panel sits on the efficient frontier.
- Pick for diversity, not quality. The best three models are the three that disagree with each other most often, not the three with the highest accuracy scores.
- Hold out a watchdog. Run a fourth model monthly to detect ecosystem drift. If your watchdog starts agreeing with your panel, your panel has drifted toward consensus on its own and you're losing signal.
That's the framework we ship in misquoted today. Three primary models, monthly watchdog runs, and a public methodology we'll keep updating as the model ecosystem moves.
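The watchdog check can be sketched in a few lines. This is an illustration under stated assumptions, not the shipped implementation: agreement is measured against the panel's majority answer per question, and drift is flagged when the latest month's agreement rate exceeds the mean of earlier months by a margin. The 0.10 margin and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def panel_majority(panel_answers):
    """Most common answer among the panel; ties go to the answer seen first."""
    return Counter(panel_answers).most_common(1)[0][0]


def watchdog_agreement(panel_by_question, watchdog_by_question):
    """Share of questions where the watchdog matches the panel's majority answer."""
    matches = [
        panel_majority(panel_by_question[q]) == watchdog_by_question[q]
        for q in panel_by_question
    ]
    return sum(matches) / len(matches)


def drift_alert(monthly_rates, margin=0.10):
    """True when the latest agreement rate beats the prior mean by more than margin."""
    if len(monthly_rates) < 2:
        return False  # not enough history to compare against
    baseline = mean(monthly_rates[:-1])
    return monthly_rates[-1] - baseline > margin


# Hypothetical month: three panel answers per question, one watchdog answer.
panel = {"q1": ["A", "A", "B"], "q2": ["X", "Y", "Y"]}
watchdog = {"q1": "A", "q2": "X"}
rate = watchdog_agreement(panel, watchdog)
print(rate)
print(drift_alert([0.60, 0.62, 0.61, 0.80]))
```

Comparing against a rolling baseline rather than a fixed cutoff is a design choice: some baseline agreement between any two models is expected, so the signal is the change, not the level.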
What this doesn't mean
It doesn't mean the other eight models are bad. It means they're redundant with each other for our specific use case. If your job is to assess raw reasoning quality, you'd pick differently. If your job is to build a recommender, you'd pick differently. If your job is to measure how AI represents your brand to actual end-users — three models, diverse providers, run regularly — that's the answer the data gave us.
We'll keep publishing this kind of work because the AI monitoring space is still soft on methodology. Most tools won't tell you how they pick their model panel. Many of them just pick whatever's easy to integrate. That's a defensible engineering choice and an indefensible product choice, and the difference matters when the output is a number you're charging brands to act on.
If you've gotten this far: thanks for reading. The full dataset, scoring rubric, and analysis code are on GitHub under MIT. Run your own bake-off. Tell us where we got it wrong.