Methodology v2.4 · Current
Last updated 2026-05-10 · Reviewed quarterly
The Methodology

How we ask, how we score, and what we won't do.

Trust comes from showing the work. This page documents the query construction, model selection, scoring formulas, and the things we explicitly refuse to do. If you can't audit it, we won't ship it.

Models queried3
Queries per scan50–80
Avg. wall-clock~90s
API cost per scan$0.50–1.50
Freshness window14d
Methodology versionv2.4
01

How we scan.

Every scan is a sequence of three steps — site crawl, query generation, model interrogation — run in parallel against three models, with deterministic prompt templates that we publish below.

Query types

We construct queries the way a real customer would phrase them. Three categories, weighted by what buyers actually ask:

Cat. A · Factual

What it does, who it is.

Foundational facts: pricing, products, founding year, headquarters, leadership. The information a customer needs before they'll consider buying.

"What is christmasornot.com's pricing model?"
Cat. B · Competitive

How it stacks up.

Positioning relative to competitors and alternatives. The questions a buyer asks once they know the category exists.

"How does christmasornot.com compare to its main competitors?"
Cat. C · Buying

Should I pick this?

Trust signals, customer experience, integrations, support quality, ROI. The questions that gate a purchase.

"Is christmasornot.com worth using for a small business?"

Query budget per scan

Tier Cat. A (Factual) Cat. B (Competitive) Cat. C (Buying) Total queries × 3 models
Free ScanNo account 6 0 0 6 18
$49 AI Readiness Report 10 4 4 18 54
$199 Full Accuracy ReportDynamically generated, with site cross-checking 20–28 14–22 16–30 50–80 150–240
Monitor / Monitor ProSame as $49, every month or week 10 4 4 18 / scan 54 / scan

Prompt format

Every query goes to every model as a clean, single-turn user message. No system prompts, no role-play, no chain-of-thought scaffolding. The same words a buyer would type. The exact template is:

// Prompt template
{question}
// That's it. No prefix. No suffix.

Each model receives identical prompts in the same session, with web search enabled where the model supports it. Responses are captured verbatim — including refusals — and stored for the report.

02

How we score.

Three separate numbers, each with a published formula. We never combine them into a single proprietary "AI score" that hides its arithmetic.

Formula 01 · Consensus

Do the models agree?

For each question, we compare model answers pairwise using a semantic-similarity score (cosine on embeddings). The question's consensus is the average pairwise agreement. The brand's consensus is the average of all question consensus scores, weighted by question category.

C= avg( simi,j ) ·w
where simi,j = pairwise semantic agreement, w = category weight (A:0.5, B:0.25, C:0.25)
Formula 02 · Accuracy

Are they actually right?

$199 tier only. Every claim extracted from a model answer is checked against ground truth — your site, structured data, and (where available) verified business records. Accuracy is the fraction of claims that pass verification, with a confidence interval on each.

A= verified / claims × 100
verified = claims passing source check at p ≥ 0.85; claims = total atomic claims extracted by GPT-4o
Formula 03 · Reliability (per-model)

Which models can you trust?

For each model independently, we compute the share of its answers that agreed with the majority and (on $199) were also factually correct. Reliability is published per-model so you can see, for instance, that Gemini is right 71% of the time and Perplexity 59%.

Rm= 0.6·Cm + 0.4·Am
Cm = consensus participation, Am = per-claim accuracy; 0.4/0.6 weighting reverts to consensus-only on Free/$49

Score thresholds

Range Verdict Color Plain English
70 – 100 Verified Green Models agree, claims check out. AI is representing the brand accurately.
40 – 69 Mixed Yellow Partial disagreement or noticeable factual gaps. Some answers are right, some aren't.
0 – 39 Critical Red Models broadly disagree or get foundational facts wrong. Fix this before next quarter.
03

Why these three models.

We picked the smallest set that covers the most-used AI surfaces a buyer might actually consult. Each model exists in the scan for a specific reason, listed below. We'll add Claude and Grok when we can score them as carefully.

G
ChatGPT
GPT-4o · web-enabled
"The default." Largest consumer footprint by an order of magnitude. If your brand is misrepresented here, it's the misrepresentation most customers will encounter first.
Coverage~62%
Avg. latency1.2s
G
Gemini
Gemini 1.5 · Google Search grounded
"The search-grounded one." Tightly integrated with Google's index, including AI Overview. The model your customers most often encounter without choosing to.
Coverage~22%
Avg. latency0.9s
P
Perplexity
Sonar · citation-first
"The research one." Citation-first interface. The model power-users — analysts, journalists, B2B buyers — increasingly default to. Different traffic, but disproportionately influential.
Coverage~9%
Avg. latency1.4s
Roadmap · Model expansion

What we'll add next, and when.

A model joins the scan only when we can score it with the same rigor as the rest. That means consistent prompt handling, a public reliability number, and a citation surface we can verify against.

Q3 2026 · Next
Claude (Sonnet)
Largest gap. Strong reasoning, growing share. Adding once Anthropic's web search reaches stable public availability.
Q4 2026
Grok
X-grounded surface, distinct training. Adding once an API with verifiable answers is available outside the X interface.
2027
Copilot, Meta AI
Microsoft and Meta surfaces. Adding once the buyer-question response is decoupled from the host-app context.
04

Freshness and methodology versioning.

Every scan is a point-in-time snapshot. We tell you exactly when it ran, which methodology version produced it, and when its data goes read-only.

14-day freshness window
A scan stays "fresh" for fourteen days.
Day 0 · Scan run Today · Day 5 Day 14 · Read-only
After 14 days the report stays accessible, but the live "data fresh" banner switches to "archived." Re-scan to refresh. Monitor subscribers re-scan automatically.
Methodology version · current
v2.4
"Drift-aware consensus"
What changed in v2.4 Added pairwise semantic agreement (replacing exact-string matching) for Cat. C buying questions. Improves recall on paraphrase. Reports issued before 2026-04-01 use v2.3 — original score and v2.4-reissued score are both shown on those reports for comparability.

Methodology changelog

Version Released Status Key change
v2.4 2026-04-01 Current Pairwise semantic agreement for Cat. C; reissue path for older scans.
v2.3 2026-02-15 Archived Reliability weighting moved to 0.6/0.4 (consensus/accuracy) on $199 tier.
v2.2 2026-01-04 Archived Dynamic 50–80 question generation introduced for Full Accuracy.
v2.1 2025-11-20 Archived Perplexity Sonar replaces Perplexity online; web search enabled on ChatGPT.
v2.0 2025-09-08 Archived Three-pillar split (Consensus / Accuracy / Visibility) replaces single composite.
05

What we don't do.

An honest methodology is defined as much by its refusals as by its formulas. Here is the work we will not do, and why.

Don't · 01

No system prompts.

We never preface a query with a system message — no "you are an expert," no "respond in this format," no hidden persona. The buyer doesn't get one. Neither do we.

Don't · 02

No role-play or persona injection.

No "pretend you're a customer," no "act as an analyst." The query is exactly what a real customer would type. Anything else is staged performance, not measurement.

Don't · 03

No chain-of-thought scaffolding.

We don't ask the model to "think step by step" or "reason carefully." Those tricks inflate apparent quality. The customer never asks for them, so we don't measure what they produce.

Don't · 04

No few-shot examples.

No "here's an example of a good answer" before the actual question. That contaminates the response. We measure what the model says cold.

Don't · 05

No retries until we like the answer.

One call per question per model per scan. Refusals count. "I don't know" counts. Re-rolling until the model gives a flattering response would defeat the entire purpose of the measurement.

Don't · 06

No proprietary composite scores.

Every number we publish comes from a formula on this page. There is no "misquoted score" produced by a secret weighting. If you want to recompute it yourself, you have everything you need.

See the methodology in action

The fastest way to audit our work is to run a scan.

Every report shows you the verbatim model answers next to the scores they produced. Click any question to see how the math arrived where it did.

Run a free scan →