
Why Your AI Visibility Score Is Lying to You

There’s less than a 1 in 100 chance that ChatGPT or Google’s AI will give you the same list of brands in any two responses to the same prompt (SparkToro, 2024). Your GEO monitoring tool checked once, got a result, and called it your "visibility score." That number is a snapshot of randomness, not a measure of performance.

Most AI visibility tools report a single data point as truth. They run your prompt, see whether you showed up, and assign a score. But AI search is non-deterministic by design. The same question, asked twice, returns different brands, different citations, and different rankings. A visibility score based on one check is like measuring ocean depth by dipping your toe in once.

This article explains why most AI visibility metrics are unreliable, what the research actually shows about citation consistency, and how to build a measurement framework that tells you something real.

AI Responses Are Non-Deterministic by Design

AI search engines are not databases. They don’t return the same result for the same query. Every response involves randomness in the model’s generation process, variation in which documents get retrieved, and differences in how the model synthesizes and attributes information.

Practitioners confirmed what the math implies: ranking on AI platforms is probabilistic, not deterministic. Instead of fixed positions, brands have a probability of appearing based on multiple overlapping factors.

| Factor | How It Introduces Variance |
| --- | --- |
| Temperature setting | Controls randomness in the model’s token selection. Higher values produce more varied responses. |
| Retrieval timing | The documents retrieved depend on index freshness. Content published yesterday may appear today but not tomorrow. |
| Query fan-out | AI engines expand user queries into multiple sub-queries internally. Different sub-query expansions produce different source sets. |
| Model updates | Google made Gemini 3 the global default for AI Overviews in January 2026. 42.4% of previously cited domains (37,870 of 89,262) no longer appeared, replaced by 46,182 new domains (SE Ranking, 2026). |
| Session context | Prior conversation history and user location can influence which sources surface. |
| Prompt phrasing | Humans asking the same underlying question rarely phrase their prompts the same way (SparkToro, 2024). |

The implication is fundamental: any single check of your AI visibility is measuring one possible outcome out of hundreds. Reporting it as “your score” is misleading.

We confirmed this with first-party data. We ran 1,000 queries through Perplexity’s Sonar API (100 distinct queries, each run 10 times), and only 38% of brands appeared consistently across all runs of the same query. A query surfaced 8.2 unique brands on average, but only 3.1 of those appeared in every run. The Jaccard similarity between any two runs averaged 0.72, meaning roughly 28% of the brand list changed each time.

The one exception: the #1 recommendation was stable 75% of the time. Position one holds. Everything below it shuffles.

8.2 brands mentioned per query. Only 3.1 appear every time. Any tool reporting a single check as your “score” is measuring noise.
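If you want to run the same sanity check on your own prompts, the math is simple. Here is a minimal Python sketch of the consistency calculation, assuming each run’s response has already been parsed into a set of brand names (the brands and run data below are placeholders, not our study data):

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: overlap between two sets of brand names."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder data: brand sets extracted from repeated runs of one prompt.
runs = [
    {"BrandA", "BrandB", "BrandC", "BrandD"},
    {"BrandA", "BrandB", "BrandE"},
    {"BrandA", "BrandC", "BrandD", "BrandF"},
]

stable = set.intersection(*runs)   # brands that appear in every run
seen = set.union(*runs)            # brands that appear in at least one run

pairs = list(combinations(runs, 2))
avg_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

print(f"{len(seen)} unique brands, {len(stable)} stable across all runs")
print(f"average pairwise Jaccard similarity: {avg_jaccard:.2f}")
```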

The Research: How Inconsistent Are AI Citations?

SparkToro’s public study of AI recommendation consistency prompted ChatGPT and Google’s AI 100 times each across multiple categories and found that the probability of receiving the identical brand list in any two responses was less than 1 in 100 (SparkToro, 2024). Claude was slightly more consistent, but the odds of an identical list were still under 1%. The inconsistency is an architectural feature, not a bug.

| Finding | Data Point | Source |
| --- | --- | --- |
| Pairwise consistency | Less than 1 in 100 chance of getting the same brand list twice | SparkToro, 2024 |
| Category effect | Narrow categories showed higher consistency than broad categories | SparkToro, 2024 |
| Ranking position validity | Position in AI response list shifts with every query | SparkToro, 2024 |
| Brand stability (Res AI) | Only 38% of brands appeared consistently across 10 runs of the same Perplexity query; 3.1 of 8.2 unique brands | Res AI, 1,000-query Perplexity study, 2026 |
| #1 position stability (Res AI) | Same brand held #1 in 75% of queries at 70%+ consistency | Res AI, 1,000-query Perplexity study, 2026 |

Less than 1 in 100 chance of getting the same brand list twice. Your single-check “visibility score” is a snapshot of randomness.

What Your Monitoring Tool Gets Wrong

Most GEO monitoring tools make three measurement errors that inflate or deflate your score in ways that don’t reflect reality.

Error 1: Single-run scoring. The tool runs each prompt once and reports whether you appeared. Given the less-than-1% pairwise consistency rate, a single run tells you almost nothing. Your brand could appear in 70% of runs for that prompt, but the one time the tool checked, you didn’t show up. Score: zero. That zero is a lie.

Error 2: Tracking position instead of frequency. Some tools report that you’re “ranked #3” in an AI response. SparkToro’s research showed that position in AI response lists shifts with every query. Any tool claiming to track where your brand ranks in AI recommendation lists is providing random data points that change on the next run.

Error 3: Single-platform reporting. ChatGPT, Perplexity, and Google AI Overviews cite different sources for the same query. Only 11% of cited domains overlap between ChatGPT and Perplexity (Averi, 2026). A visibility score from one platform is not a visibility score. It’s a platform-specific data point.

| Measurement Error | What the Tool Reports | What’s Actually Happening |
| --- | --- | --- |
| Single-run scoring | “You’re visible” or “You’re not visible” | You appear in 70% of runs, but the tool checked once and missed |
| Position tracking | “You’re ranked #3” | Position shifts every query. #3 today, #7 tomorrow, absent on the third run. |
| Single-platform | “Your visibility is 64%” | That’s 64% on Perplexity. ChatGPT might be 30%. Google AI might be 80%. |
| Point-in-time | “Score: 72 this week” | The score could be 65 if checked on a different day with different model state |

How to Measure AI Visibility Without Lying to Yourself

The fix is not better tools. It’s better methodology. AI visibility measurement requires statistical thinking, not dashboard thinking.

Run each prompt multiple times. Meaningful visibility measurement requires 60 to 100 prompt runs per query to establish a stable frequency. Anything less is noise. At 100 runs per prompt across 30 prompts, that’s 3,000 checks per measurement cycle. This is why daily monitoring matters more than weekly snapshots.

Measure frequency, not position. The valid metric is: "Out of 100 runs, your brand appeared in 73 responses." That’s a frequency. It’s stable, it’s comparable over time, and it tells you something real. "You’re ranked #3" tells you nothing.

Monitor across platforms. OpenAI reported ChatGPT receives 2.5 billion prompts daily, more than doubling from 1 billion daily queries in December (TechCrunch, 2025). Perplexity drives niche but high-intent queries, and Google AI Overviews reach hundreds of millions of users. Each platform cites different sources, with only 11% of cited domains overlapping between ChatGPT and Perplexity (Averi, 2026). Cross-platform consensus is the signal. Single-platform data is the noise.
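In code terms, frequency and cross-platform consensus both reduce to counting. The sketch below shows one way to compute them, with a simulated `query_engine` standing in for the real API calls and response parsing, which differ per platform:

```python
import random
from collections import Counter

# Placeholder for a real API call: in practice this would query the given
# engine (ChatGPT, Perplexity, Google AI) and parse brand names out of the
# response. Simulated here with random draws so the sketch runs end to end.
BRAND_POOL = ["BrandA", "BrandB", "BrandC", "BrandD", "BrandE", "BrandF"]

def query_engine(engine: str, prompt: str) -> set[str]:
    return set(random.sample(BRAND_POOL, k=random.randint(2, 5)))

def visibility_frequency(engine: str, prompt: str, brand: str, runs: int = 100) -> float:
    """Share of runs in which the brand appears: the 'X of 100 runs' metric."""
    hits = sum(brand in query_engine(engine, prompt) for _ in range(runs))
    return hits / runs

def cross_platform_consensus(prompt: str, engines: list[str], runs: int = 100,
                             threshold: float = 0.5) -> set[str]:
    """Brands that clear the frequency threshold on every engine."""
    per_engine = []
    for engine in engines:
        counts = Counter()
        for _ in range(runs):
            counts.update(query_engine(engine, prompt))
        per_engine.append({b for b, c in counts.items() if c / runs >= threshold})
    return set.intersection(*per_engine)

print(visibility_frequency("perplexity", "best crm for startups", "BrandA"))
print(cross_platform_consensus("best crm for startups", ["chatgpt", "perplexity", "google-ai"]))
```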

Track over 30+ day windows. AI models update, retrieval indexes refresh, and competitor content changes. A weekly score captures one moment. A 30-day rolling average captures a trend. Trends are actionable. Moments are not.
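The 30-day window is the easiest piece to implement. A minimal pandas sketch, using simulated daily values purely to illustrate the smoothing:

```python
import numpy as np
import pandas as pd

# Simulated daily visibility frequencies (one value per day, each notionally
# computed from that day's batch of prompt runs). Values are illustrative only.
rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"frequency": rng.normal(loc=0.70, scale=0.08, size=60).clip(0, 1)},
    index=pd.date_range("2026-01-01", periods=60, freq="D"),
)

# 30-day rolling average: smooths day-to-day noise into a trend line.
daily["rolling_30d"] = daily["frequency"].rolling(window=30, min_periods=30).mean()

print(daily["rolling_30d"].dropna().tail())
```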

| Metric | How to Measure It | Why It Works |
| --- | --- | --- |
| Visibility frequency | Run each prompt 60-100x per cycle. Report “appeared in X% of runs.” | Statistically stable. Comparable over time. Accounts for non-determinism. |
| Cross-platform consensus | Run prompts across ChatGPT, Perplexity, and Google AI. Report brands that appear across all three. | Filters noise. If a brand shows up on all platforms, the signal is real. |
| 30-day rolling average | Aggregate daily runs into a monthly trend. Compare month-over-month. | Smooths daily variance. Reveals directional movement. |
| Citation rate (not mention rate) | Track whether you’re linked, not just named. | Being mentioned is awareness. Being cited with a link is traffic. Different metrics, different value. |
| Competitor consistency | Track how often each competitor appears across runs. | Separates real threats (appear in 80% of runs) from noise (appeared once). |

The Concentration Problem Nobody Talks About

Even if you measure correctly, the competitive landscape is tilting toward a small number of dominant domains. The top 5 most-cited domains across ChatGPT, Perplexity, and Google AI (Wikipedia, YouTube, Reddit, Google properties, LinkedIn) capture 38% of all citations, with the top 20 capturing 66% (trydecoding.com, 2025). AI engines develop trust in sources that consistently appear, which makes them appear more often, which builds more trust.

For brands measuring AI visibility, this means a static score is especially misleading. If the top five domains control 38% of the citation pool, your visibility number has to be benchmarked against a moving concentration baseline, not reported as if the field were flat.

The top 5 domains capture 38% of citations. A static visibility score masks the concentration baseline under it.

What a Reliable AI Visibility Report Looks Like

A trustworthy AI visibility report does not show a single number and call it your score. It shows a distribution.

| Report Element | What It Shows | Why It Matters |
| --- | --- | --- |
| Frequency histogram | For each prompt: “You appeared in X of 100 runs” | Shows which prompts you consistently win and which are contested |
| Platform breakdown | Frequency per prompt per platform (ChatGPT, Perplexity, Google AI) | Reveals platform-specific gaps |
| Competitor consistency map | For each competitor: “Appeared in X% of runs across all platforms” | Separates real threats from random appearances |
| 30-day trend line | Rolling average of visibility frequency | Shows whether you’re gaining or losing ground |
| Citation vs mention split | “Mentioned in 73% of runs. Cited with a link in 12%.” | Being named is vanity. Being linked is value. |
| Confidence interval | Statistical range for your true visibility | Acknowledges the non-deterministic reality instead of hiding it |

If your current monitoring tool can’t produce this, it’s giving you a number without context. That number will make you feel good when it’s high and bad when it’s low, but it won’t tell you what’s actually happening or what to do about it.
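The confidence interval is the element most tools skip, and it is the easiest to add. A Wilson score interval for a binomial proportion is one standard way to compute that range; here is what it looks like for a brand that appeared in 73 of 100 runs:

```python
from math import sqrt

def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion: the range the true
    visibility frequency plausibly falls in, given hits out of runs."""
    if runs == 0:
        return (0.0, 1.0)
    p = hits / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = (z * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# Appeared in 73 of 100 runs -> roughly (0.64, 0.81) at 95% confidence.
print(wilson_interval(73, 100))
```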

Why Monitoring Alone Does Not Solve the Problem

Knowing your score is not the same as changing it. Most GEO platforms stop at monitoring: they tell you where you stand, show you a dashboard, and leave your team to figure out the rest. A dashboard that says “your visibility is 34%” does not tell you how to make it 60%.

The gap between monitoring and execution is where most GEO programs stall. Our Perplexity data showed that comparison and evaluation content had backfire rates of 2.9% and 0%, while listicles backfired 25.7% of the time. The type of content you publish matters more than how precisely you measure the result. A team that publishes the right content format with lightweight monitoring will outperform a team with perfect monitoring data that publishes the wrong format.

Monitoring without execution is a report. Execution informed by monitoring is a strategy.

How to Choose an AI Visibility Measurement Approach

The choice is not between vendors. It is between methodologies that survive non-determinism and methodologies that hide it. Pick your approach based on what you actually need the number to do.

  • If you need a defensible metric for a board deck, prioritize frequency over position. A “73 of 100 runs” frequency is reproducible and survives the next model update. A “ranked #3” snapshot does not.

  • If you can only monitor one AI engine, weigh ChatGPT’s volume (2.5 billion prompts daily, per TechCrunch, 2025) against Perplexity’s depth on high-intent queries, and treat the result as a stopgap: single-engine tracking cannot substitute for cross-platform coverage.

  • If your visibility today is under 15%, skip monitoring and optimize for publish velocity. There is no score to protect. The priority is getting structured content into the citation pool.

  • If you already rank in the top 3 for a category, invest in 60-to-100-run baselines and 30-day rolling averages. You have a position worth defending against drift.

  • If your category has a stable #1 already locked in, redirect budget to the 25% of queries with no stable leader. The Res AI 1,000-query Perplexity study (2026) found that 25% of B2B queries have no consistent top recommendation. Those are the open positions.

Any methodology that reports a single number without a confidence interval is optimizing for dashboard comfort, not decision support.

Frequently Asked Questions

Why does position tracking fail on AI search if it worked for Google?

Google’s ranking algorithm was deterministic: the same query returned the same page order most of the time. AI search runs through a generation step with built-in randomness, query fan-out, and retrieval variance. SparkToro found less than a 1 in 100 chance of two runs producing the same brand list, which makes any single-run position inherently a sample of one (SparkToro, 2024).

How many prompt runs per query are enough to trust the number?

60 to 100 runs per prompt is the threshold where frequency stabilizes in published studies. Fewer than 20 runs and the confidence interval is wider than the metric itself. A visibility score based on one check is not wrong because the tool is wrong. It is wrong because the sample size is too small to mean anything.
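To see why the run count matters, compute the half-width of a simple normal-approximation interval at different sample sizes (a rough rule of thumb; exact intervals differ slightly):

```python
from math import sqrt

def ci_halfwidth(p: float, runs: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a measured frequency."""
    return z * sqrt(p * (1 - p) / runs)

# Uncertainty around a measured frequency of 50% at different run counts.
for runs in (10, 20, 60, 100):
    print(f"{runs} runs: ±{ci_halfwidth(0.5, runs):.3f}")
# 10 runs: ±0.310, 20 runs: ±0.219, 60 runs: ±0.127, 100 runs: ±0.098
```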

Why does the #1 position stay stable when everything below it shuffles?

The Res AI 1,000-query Perplexity study (2026) found the top recommendation was stable 75% of the time while positions 2 through 5 shuffled on every run. The lead position gets reinforced by citation momentum: an engine that already trusts a source keeps returning it. Position 2 and below are effectively a lottery for the remaining open slots.

Can I just monitor one AI engine and extrapolate to the rest?

No. Only 11% of domains appear in both ChatGPT and Perplexity citations across 680 million citations analyzed (Averi, 2026). A brand cited on one engine can be absent on another for the same query. Single-platform scores are platform-specific data points, not visibility scores.

Does domain authority still matter if AI citations are non-deterministic?

Authority is a weak predictor. Res AI’s 1,000-query Perplexity study (2026) found that non-giant domains hold the stable #1 citation position on 93 of 100 B2B queries, with giants winning only 4 (all of them review aggregators). Structural features of the page itself, such as comparison tables and attributed stats, carry more citation weight than the site’s backlink profile.

Why do a small number of domains dominate the citation pool?

AI engines trust sources that have already been cited, which creates a compounding loop. The top 5 most-cited domains across ChatGPT, Perplexity, and Google AI (Wikipedia, YouTube, Reddit, Google properties, LinkedIn) capture 38% of all citations, with the top 20 capturing 66% (trydecoding.com, 2025). The field expands, but the top pulls away faster than the bottom can catch up.

What should a Series B SaaS company track instead of a vanity score?

Track three things: frequency per prompt (out of 100 runs), cross-platform consensus (brands that appear on ChatGPT, Perplexity, and Google AI for the same query), and 30-day rolling averages per platform. A weekly score captures one moment and masks drift. A 30-day average reveals direction.

How do structural features beat domain authority in AI citation?

Structural features give the AI something extractable: a bold label block, a comparison row, a definition. The Res AI 852-article B2B citation structure study (2026) found that six structural features appear in 80% or more of the top 50 cited B2B pages and in 0% of the bottom 50. A page with these elements outperforms a higher-authority page without them.

Why does a 30-day rolling average work when a weekly score does not?

A weekly score is a point estimate inside a noisy system. A 30-day rolling average smooths daily variance and reveals the trend underneath the noise. The underlying reality (your true visibility frequency) changes slowly. The measurement window needs to be long enough for the signal to outweigh the randomness.

Res AI runs your prompts daily across multiple AI platforms, tracks citation frequency over 30-day rolling windows, and separates real competitive threats from random noise. No single-snapshot scores. No fake precision. Just the data you need to know if you’re winning or losing.

See how it works →

Your content is invisible to AI. Res fixes that.

Get cited by ChatGPT, Perplexity, and Google AI Overviews.