How we ask, how we score, and what we won't do.
Trust comes from showing the work. This page documents the query construction, model selection, scoring formulas, and the things we explicitly refuse to do. If you can't audit it, we won't ship it.
How we scan.
Every scan is a sequence of three steps — site crawl, query generation, model interrogation — run in parallel against three models, with deterministic prompt templates that we publish below.
Query types
We construct queries the way a real customer would phrase them. Three categories, weighted by what buyers actually ask:
What it does, who it is.
Foundational facts: pricing, products, founding year, headquarters, leadership. The information a customer needs before they'll consider buying.
How it stacks up.
Positioning relative to competitors and alternatives. The questions a buyer asks once they know the category exists.
Should I pick this?
Trust signals, customer experience, integrations, support quality, ROI. The questions that gate a purchase.
Query budget per scan
| Tier | Cat. A (Factual) | Cat. B (Competitive) | Cat. C (Buying) | Total queries | × 3 models |
|---|---|---|---|---|---|
| Free ScanNo account | 6 | 0 | 0 | 6 | 18 |
| $49 AI Readiness Report | 10 | 4 | 4 | 18 | 54 |
| $199 Full Accuracy ReportDynamically generated, with site cross-checking | 20–28 | 14–22 | 16–30 | 50–80 | 150–240 |
| Monitor / Monitor ProSame as $49, every month or week | 10 | 4 | 4 | 18 / scan | 54 / scan |
Prompt format
Every query goes to every model as a clean, single-turn user message. No system prompts, no role-play, no chain-of-thought scaffolding. The same words a buyer would type. The exact template is:
Each model receives identical prompts in the same session, with web search enabled where the model supports it. Responses are captured verbatim — including refusals — and stored for the report.
How we score.
Three separate numbers, each with a published formula. We never combine them into a single proprietary "AI score" that hides its arithmetic.
Do the models agree?
For each question, we compare model answers pairwise using a semantic-similarity score (cosine on embeddings). The question's consensus is the average pairwise agreement. The brand's consensus is the average of all question consensus scores, weighted by question category.
Are they actually right?
$199 tier only. Every claim extracted from a model answer is checked against ground truth — your site, structured data, and (where available) verified business records. Accuracy is the fraction of claims that pass verification, with a confidence interval on each.
Which models can you trust?
For each model independently, we compute the share of its answers that agreed with the majority and (on $199) were also factually correct. Reliability is published per-model so you can see, for instance, that Gemini is right 71% of the time and Perplexity 59%.
Score thresholds
| Range | Verdict | Color | Plain English |
|---|---|---|---|
| 70 – 100 | Verified | Green | Models agree, claims check out. AI is representing the brand accurately. |
| 40 – 69 | Mixed | Yellow | Partial disagreement or noticeable factual gaps. Some answers are right, some aren't. |
| 0 – 39 | Critical | Red | Models broadly disagree or get foundational facts wrong. Fix this before next quarter. |
Why these three models.
We picked the smallest set that covers the most-used AI surfaces a buyer might actually consult. Each model exists in the scan for a specific reason, listed below. We'll add Claude and Grok when we can score them as carefully.
What we'll add next, and when.
A model joins the scan only when we can score it with the same rigor as the rest. That means consistent prompt handling, a public reliability number, and a citation surface we can verify against.
Freshness and methodology versioning.
Every scan is a point-in-time snapshot. We tell you exactly when it ran, which methodology version produced it, and when its data goes read-only.
Methodology changelog
| Version | Released | Status | Key change |
|---|---|---|---|
| v2.4 | 2026-04-01 | Current | Pairwise semantic agreement for Cat. C; reissue path for older scans. |
| v2.3 | 2026-02-15 | Archived | Reliability weighting moved to 0.6/0.4 (consensus/accuracy) on $199 tier. |
| v2.2 | 2026-01-04 | Archived | Dynamic 50–80 question generation introduced for Full Accuracy. |
| v2.1 | 2025-11-20 | Archived | Perplexity Sonar replaces Perplexity online; web search enabled on ChatGPT. |
| v2.0 | 2025-09-08 | Archived | Three-pillar split (Consensus / Accuracy / Visibility) replaces single composite. |
What we don't do.
An honest methodology is defined as much by its refusals as by its formulas. Here is the work we will not do, and why.
No system prompts.
We never preface a query with a system message — no "you are an expert," no "respond in this format," no hidden persona. The buyer doesn't get one. Neither do we.
No role-play or persona injection.
No "pretend you're a customer," no "act as an analyst." The query is exactly what a real customer would type. Anything else is staged performance, not measurement.
No chain-of-thought scaffolding.
We don't ask the model to "think step by step" or "reason carefully." Those tricks inflate apparent quality. The customer never asks for them, so we don't measure what they produce.
No few-shot examples.
No "here's an example of a good answer" before the actual question. That contaminates the response. We measure what the model says cold.
No retries until we like the answer.
One call per question per model per scan. Refusals count. "I don't know" counts. Re-rolling until the model gives a flattering response would defeat the entire purpose of the measurement.
No proprietary composite scores.
Every number we publish comes from a formula on this page. There is no "misquoted score" produced by a secret weighting. If you want to recompute it yourself, you have everything you need.
The fastest way to audit our work is to run a scan.
Every report shows you the verbatim model answers next to the scores they produced. Click any question to see how the math arrived where it did.