Methodology v2.4 · Current

Last updated 2026-05-10 · Reviewed quarterly

The Methodology

How we ask, how we score, and what we won't do.

Trust comes from showing the work. This page documents the query construction, model selection, scoring formulas, and the things we explicitly refuse to do. If you can't audit it, we won't ship it.

Models queried3

Queries per scan50–80

Avg. wall-clock~90s

Freshness window14d

Methodology versionv2.4

How we scan.

Every scan is a sequence of three steps — site crawl, query generation, model interrogation — run in parallel against three models, with deterministic prompt templates that we publish below.

Query types

We construct queries the way a real customer would phrase them. Three categories, weighted by what buyers actually ask:

Cat. A · Factual

What it does, who it is.

Foundational facts: pricing, products, founding year, headquarters, leadership. The information a customer needs before they'll consider buying.

"What is christmasornot.com's pricing model?"

Cat. B · Competitive

How it stacks up.

Positioning relative to competitors and alternatives. The questions a buyer asks once they know the category exists.

"How does christmasornot.com compare to its main competitors?"

Cat. C · Buying

Should I pick this?

Trust signals, customer experience, integrations, support quality, ROI. The questions that gate a purchase.

"Is christmasornot.com worth using for a small business?"

Query budget per scan

Tier	Cat. A (Factual)	Cat. B (Competitive)	Cat. C (Buying)	Total queries	× 3 models
Free ScanNo account	6	0	0	6	18
$49 AI Readiness Report	10	4	4	18	54
$199 Full Accuracy ReportDynamically generated, with site cross-checking	20–28	14–22	16–30	50–80	150–240
Monitor / Monitor ProSame as $49, every month or week	10	4	4	18 / scan	54 / scan

Prompt format

Every query goes to every model as a clean, single-turn user message. No system prompts, no role-play, no chain-of-thought scaffolding. The same words a buyer would type. The exact template is:

// Prompt template

{question}

// That's it. No prefix. No suffix.

Each model receives identical prompts in the same session, with web search enabled where the model supports it. Responses are captured verbatim — including refusals — and stored for the report.

The accuracy loop

A scan isn't a single shot — it's a loop. The ground-truth fact sheet you maintain inside the product becomes the scoring baseline for every future scan. When you correct what AI got wrong, the next scan checks the models against your edits, not the old answer.

01
You correct a fact.
Update your founders, pricing, products — anything AI gets wrong about you.
02
Next scan picks it up.
Your edits become the scoring baseline for the next run.
03
AI scored against truth.
Each model response is compared claim-by-claim against the version you maintain.
04
Drift alerts when AI is wrong.
When a model contradicts the truth, you get notified — that's the alert that lands in your inbox.

The loop repeats every scan. Your corrections accumulate as canonical truth — and the gap between what the models say and what's actually true is what we measure.

How we score.

Three separate numbers, each with a published formula. We never combine them into a single proprietary “AI score” that hides its arithmetic.

Formula 01 · Consensus

Do the models agree?

For each question, we compare model answers pairwise using a semantic-similarity score (cosine on embeddings). The question's consensus is the average pairwise agreement. The brand's consensus is the average of all question consensus scores, weighted by question category.

C=avg(sim_i,j)·w

where sim_i,j = pairwise semantic agreement, w = category weight (A:0.5, B:0.25, C:0.25)

Formula 02 · Accuracy

Are they actually right?

$199 tier only. Every claim extracted from a model answer is checked against ground truth — your site, structured data, and (where available) verified business records. Accuracy is the fraction of claims that pass verification, with a confidence interval on each.

A=verified/claims×100

verified = claims passing source check at p ≥ 0.85; claims = total atomic claims extracted by GPT-4o

Formula 03 · Reliability (per-model)

Which models can you trust?

For each model independently, we compute the share of its answers that agreed with the majority and (on $199) were also factually correct. Reliability is published per-model so you can see, for instance, that Gemini is right 71% of the time and Perplexity 59%.

R_m=0.6·C_m+0.4·A_m

C_m = consensus participation, A_m = per-claim accuracy; 0.4/0.6 weighting reverts to consensus-only on Free/$49

Score thresholds

Range	Verdict	Color	Plain English
70 – 100	Verified	Green	Models agree, claims check out. AI is representing the brand accurately.
40 – 69	Mixed	Yellow	Partial disagreement or noticeable factual gaps. Some answers are right, some aren't.
0 – 39	Critical	Red	Models broadly disagree or get foundational facts wrong. Fix this before next quarter.

Why these three models.

We picked the smallest set that covers the most-used AI surfaces a buyer might actually consult. Each model exists in the scan for a specific reason, listed below. We'll add Claude and Grok when we can score them as carefully.

ChatGPT

GPT-4o · web-enabled

"The default." Largest consumer footprint by an order of magnitude. If your brand is misrepresented here, it's the misrepresentation most customers will encounter first.

Coverage~62%

Avg. latency1.2s

Gemini

Gemini 1.5 · Google Search grounded

"The search-grounded one." Tightly integrated with Google's index, including AI Overview. The model your customers most often encounter without choosing to.

Coverage~22%

Avg. latency0.9s

Perplexity

Sonar · citation-first

"The research one." Citation-first interface. The model power-users — analysts, journalists, B2B buyers — increasingly default to. Different traffic, but disproportionately influential.

Coverage~9%

Avg. latency1.4s

Roadmap · Model expansion

What we'll add next, and when.

A model joins the scan only when we can score it with the same rigor as the rest. That means consistent prompt handling, a public reliability number, and a citation surface we can verify against.

Q3 2026 · Next

Claude (Sonnet)

Largest gap. Strong reasoning, growing share. Adding once Anthropic's web search reaches stable public availability.

Q4 2026

Grok

X-grounded surface, distinct training. Adding once an API with verifiable answers is available outside the X interface.

2027

Copilot, Meta AI

Microsoft and Meta surfaces. Adding once the buyer-question response is decoupled from the host-app context.

Freshness and methodology versioning.

Every scan is a point-in-time snapshot. We tell you exactly when it ran, which methodology version produced it, and when its data goes read-only.

14-day freshness window

A scan stays "fresh" for fourteen days.

Day 0 · Scan runToday · Day 5Day 14 · Read-only

After 14 days the report stays accessible, but the live "data fresh" banner switches to "archived." Re-scan to refresh. Monitor subscribers re-scan automatically.

Methodology version · current

v2.4

"Drift-aware consensus"

What changed in v2.4Added pairwise semantic agreement (replacing exact-string matching) for Cat. C buying questions. Improves recall on paraphrase. Reports issued before 2026-04-01 use v2.3 — original score and v2.4-reissued score are both shown on those reports for comparability.

Methodology changelog

Version	Released	Status	Key change
v2.4	2026-04-01	Current	Pairwise semantic agreement for Cat. C; reissue path for older scans.
v2.3	2026-02-15	Archived	Reliability weighting moved to 0.6/0.4 (consensus/accuracy) on $199 tier.
v2.2	2026-01-04	Archived	Dynamic 50–80 question generation introduced for Full Accuracy.
v2.1	2025-11-20	Archived	Perplexity Sonar replaces Perplexity online; web search enabled on ChatGPT.
v2.0	2025-09-08	Archived	Three-pillar split (Consensus / Accuracy / Visibility) replaces single composite.

What we don't do.

An honest methodology is defined as much by its refusals as by its formulas. Here is the work we will not do, and why.

✗Don't · 01

No system prompts.

We never preface a query with a system message — no "you are an expert," no "respond in this format," no hidden persona. The buyer doesn't get one. Neither do we.

✗Don't · 02

No role-play or persona injection.

No "pretend you're a customer," no "act as an analyst." The query is exactly what a real customer would type. Anything else is staged performance, not measurement.

✗Don't · 03

No chain-of-thought scaffolding.

We don't ask the model to "think step by step" or "reason carefully." Those tricks inflate apparent quality. The customer never asks for them, so we don't measure what they produce.

✗Don't · 04

No few-shot examples.

No "here's an example of a good answer" before the actual question. That contaminates the response. We measure what the model says cold.

✗Don't · 05

No retries until we like the answer.

One call per question per model per scan. Refusals count. "I don't know" counts. Re-rolling until the model gives a flattering response would defeat the entire purpose of the measurement.

✗Don't · 06

No proprietary composite scores.

Every number we publish comes from a formula on this page. There is no "misquoted score" produced by a secret weighting. If you want to recompute it yourself, you have everything you need.

See the methodology in action

The fastest way to audit our work is to run a scan.

Every report shows you the verbatim model answers next to the scores they produced. Click any question to see how the math arrived where it did.

Run a free scan →