
Why Your AI Visibility Score Is Lying to You

There’s less than a 1 in 100 chance that ChatGPT or Google’s AI will give you the same list of brands in any two responses to the same prompt (SparkToro, 2024). Your GEO monitoring tool checked once, got a result, and called it your "visibility score." That number is a snapshot of randomness, not a measure of performance.

Most AI visibility tools report a single data point as truth. They run your prompt, see whether you showed up, and assign a score. But AI search is non-deterministic by design. The same question, asked twice, returns different brands, different citations, and different rankings. A visibility score based on one check is like measuring ocean depth by dipping your toe in once.

This article explains why most AI visibility metrics are unreliable, what the research actually shows about citation consistency, and how to build a measurement framework that tells you something real.

AI Responses Are Non-Deterministic by Design

AI search engines are not databases. They don’t return the same result for the same query. Every response involves randomness in the model’s generation process, variation in which documents get retrieved, and differences in how the model synthesizes and attributes information.

Practitioners confirmed what the math implies: ranking on AI platforms is probabilistic, not deterministic. Instead of fixed positions, brands have a probability of appearing based on multiple overlapping factors.

| Factor | How It Introduces Variance |
| --- | --- |
| Temperature setting | Controls randomness in the model’s token selection. Higher values produce more varied responses. |
| Retrieval timing | The documents retrieved depend on index freshness. Content published yesterday may appear today but not tomorrow. |
| Query fan-out | AI engines expand user queries into multiple sub-queries internally. Different sub-query expansions produce different source sets. |
| Model updates | Google made Gemini 3 the global default for AI Overviews in January 2026. 42.4% of previously cited domains (37,870 of 89,262) no longer appeared, replaced by 46,182 new domains (SE Ranking, 2026). |
| Session context | Prior conversation history and user location can influence which sources surface. |
| Prompt phrasing | Humans asking the same underlying question rarely phrase their prompts the same way (SparkToro, 2024). |

The implication is fundamental: any single check of your AI visibility is measuring one possible outcome out of hundreds. Reporting it as “your score” is misleading.

We confirmed this with first-party data. We ran 1,000 queries through Perplexity’s Sonar API (100 distinct queries, each run 10 times), and only 38% of brands appeared consistently across all runs of the same query. A query surfaced 8.2 unique brands on average, but only 3.1 of those appeared in every run. The Jaccard similarity between any two runs averaged 0.72, meaning roughly 28% of the brand list changed each time.

The one exception: the #1 recommendation was stable 75% of the time. Position one holds. Everything below it shuffles.

8.2 brands mentioned per query. Only 3.1 appear every time. Any tool reporting a single check as your “score” is measuring noise.
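If you want to run the same sanity check on your own prompts, the math is simple. Here is a minimal Python sketch of the consistency calculation, assuming each run’s response has already been parsed into a set of brand names (the brands and run data below are placeholders, not our study data):

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: overlap between two sets of brand names."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder data: brand sets extracted from repeated runs of one prompt.
runs = [
    {"BrandA", "BrandB", "BrandC", "BrandD"},
    {"BrandA", "BrandB", "BrandE"},
    {"BrandA", "BrandC", "BrandD", "BrandF"},
]

stable = set.intersection(*runs)   # brands that appear in every run
seen = set.union(*runs)            # brands that appear in at least one run

pairs = list(combinations(runs, 2))
avg_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

print(f"{len(seen)} unique brands, {len(stable)} stable across all runs")
print(f"average pairwise Jaccard similarity: {avg_jaccard:.2f}")
```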

The Research: How Inconsistent Are AI Citations?

SparkToro’s public study of AI recommendation consistency prompted ChatGPT and Google’s AI 100 times each across multiple categories and found that the probability of receiving the identical brand list in any two responses was less than 1 in 100 (SparkToro, 2024). Claude was slightly more consistent, but the odds of an identical list were still under 1%. The inconsistency is an architectural feature, not a bug.

| Finding | Data Point | Source |
| --- | --- | --- |
| Pairwise consistency | Less than 1 in 100 chance of getting the same brand list twice | SparkToro, 2024 |
| Category effect | Narrow categories showed higher consistency than broad categories | SparkToro, 2024 |
| Ranking position validity | Position in AI response list shifts with every query | SparkToro, 2024 |
| Brand stability (Res AI) | Only 38% of brands appeared consistently across 10 runs of the same Perplexity query; 3.1 of 8.2 unique brands | Res AI, 1,000-query Perplexity study, 2026 |
| #1 position stability (Res AI) | Same brand held #1 in 75% of queries at 70%+ consistency | Res AI, 1,000-query Perplexity study, 2026 |

Less than 1 in 100 chance of getting the same brand list twice. Your single-check “visibility score” is a snapshot of randomness.

What Your Monitoring Tool Gets Wrong

Most GEO monitoring tools make three measurement errors that inflate or deflate your score in ways that don’t reflect reality.

Error 1: Single-run scoring. The tool runs each prompt once and reports whether you appeared. Given the less-than-1% pairwise consistency rate, a single run tells you almost nothing. Your brand could appear in 70% of runs for that prompt, but the one time the tool checked, you didn’t show up. Score: zero. That zero is a lie.

Error 2: Tracking position instead of frequency. Some tools report that you’re “ranked #3” in an AI response. SparkToro’s research showed that position in AI response lists shifts with every query. Any tool claiming to track where your brand ranks in AI recommendation lists is providing random data points that change on the next run.

Error 3: Single-platform reporting. ChatGPT, Perplexity, and Google AI Overviews cite different sources for the same query. Only 11% of cited domains overlap between ChatGPT and Perplexity (Averi, 2026). A visibility score from one platform is not a visibility score. It’s a platform-specific data point.

| Measurement Error | What the Tool Reports | What’s Actually Happening |
| --- | --- | --- |
| Single-run scoring | “You’re visible” or “You’re not visible” | You appear in 70% of runs, but the tool checked once and missed |
| Position tracking | “You’re ranked #3” | Position shifts every query. #3 today, #7 tomorrow, absent on the third run. |
| Single-platform | “Your visibility is 64%” | That’s 64% on Perplexity. ChatGPT might be 30%. Google AI might be 80%. |
| Point-in-time | “Score: 72 this week” | The score could be 65 if checked on a different day with different model state |

How to Measure AI Visibility Without Lying to Yourself

The fix is not better tools. It’s better methodology. AI visibility measurement requires statistical thinking, not dashboard thinking.

Run each prompt multiple times. Meaningful visibility measurement requires 60 to 100 prompt runs per query to establish a stable frequency. Anything less is noise. At 100 runs per prompt across 30 prompts, that’s 3,000 checks per measurement cycle. This is why daily monitoring matters more than weekly snapshots.

Measure frequency, not position. The valid metric is: "Out of 100 runs, your brand appeared in 73 responses." That’s a frequency. It’s stable, it’s comparable over time, and it tells you something real. "You’re ranked #3" tells you nothing.

Monitor across platforms. OpenAI reported ChatGPT receives 2.5 billion prompts daily, more than doubling from 1 billion daily queries in December (TechCrunch, 2025). Perplexity drives niche but high-intent queries, and Google AI Overviews reach hundreds of millions of users. Each platform cites different sources, with only 11% of cited domains overlapping between ChatGPT and Perplexity (Averi, 2026). Cross-platform consensus is the signal. Single-platform data is the noise.
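In code terms, frequency and cross-platform consensus both reduce to counting. The sketch below shows one way to compute them, with a simulated `query_engine` standing in for the real API calls and response parsing, which differ per platform:

```python
import random
from collections import Counter

# Placeholder for a real API call: in practice this would query the given
# engine (ChatGPT, Perplexity, Google AI) and parse brand names out of the
# response. Simulated here with random draws so the sketch runs end to end.
BRAND_POOL = ["BrandA", "BrandB", "BrandC", "BrandD", "BrandE", "BrandF"]

def query_engine(engine: str, prompt: str) -> set[str]:
    return set(random.sample(BRAND_POOL, k=random.randint(2, 5)))

def visibility_frequency(engine: str, prompt: str, brand: str, runs: int = 100) -> float:
    """Share of runs in which the brand appears: the 'X of 100 runs' metric."""
    hits = sum(brand in query_engine(engine, prompt) for _ in range(runs))
    return hits / runs

def cross_platform_consensus(prompt: str, engines: list[str], runs: int = 100,
                             threshold: float = 0.5) -> set[str]:
    """Brands that clear the frequency threshold on every engine."""
    per_engine = []
    for engine in engines:
        counts = Counter()
        for _ in range(runs):
            counts.update(query_engine(engine, prompt))
        per_engine.append({b for b, c in counts.items() if c / runs >= threshold})
    return set.intersection(*per_engine)

print(visibility_frequency("perplexity", "best crm for startups", "BrandA"))
print(cross_platform_consensus("best crm for startups", ["chatgpt", "perplexity", "google-ai"]))
```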

Track over 30+ day windows. AI models update, retrieval indexes refresh, and competitor content changes. A weekly score captures one moment. A 30-day rolling average captures a trend. Trends are actionable. Moments are not.
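The 30-day window is the easiest piece to implement. A minimal pandas sketch, using simulated daily values purely to illustrate the smoothing:

```python
import numpy as np
import pandas as pd

# Simulated daily visibility frequencies (one value per day, each notionally
# computed from that day's batch of prompt runs). Values are illustrative only.
rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"frequency": rng.normal(loc=0.70, scale=0.08, size=60).clip(0, 1)},
    index=pd.date_range("2026-01-01", periods=60, freq="D"),
)

# 30-day rolling average: smooths day-to-day noise into a trend line.
daily["rolling_30d"] = daily["frequency"].rolling(window=30, min_periods=30).mean()

print(daily["rolling_30d"].dropna().tail())
```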

| Metric | How to Measure It | Why It Works |
| --- | --- | --- |
| Visibility frequency | Run each prompt 60-100x per cycle. Report “appeared in X% of runs.” | Statistically stable. Comparable over time. Accounts for non-determinism. |
| Cross-platform consensus | Run prompts across ChatGPT, Perplexity, and Google AI. Report brands that appear across all three. | Filters noise. If a brand shows up on all platforms, the signal is real. |
| 30-day rolling average | Aggregate daily runs into a monthly trend. Compare month-over-month. | Smooths daily variance. Reveals directional movement. |
| Citation rate (not mention rate) | Track whether you’re linked, not just named. | Being mentioned is awareness. Being cited with a link is traffic. Different metrics, different value. |
| Competitor consistency | Track how often each competitor appears across runs. | Separates real threats (appear in 80% of runs) from noise (appeared once). |

The Concentration Problem Nobody Talks About

Even if you measure correctly, the competitive landscape is tilting toward a small number of dominant domains. The top 5 most-cited domains across ChatGPT, Perplexity, and Google AI (Wikipedia, YouTube, Reddit, Google properties, LinkedIn) capture 38% of all citations, with the top 20 capturing 66% (trydecoding.com, 2025). AI engines develop trust in sources that consistently appear, which makes them appear more often, which builds more trust.

For brands measuring AI visibility, this means a static score is especially misleading. If the top five domains control 38% of the citation pool, your visibility number has to be benchmarked against a moving concentration baseline, not reported as if the field were flat.

The top 5 domains capture 38% of citations. A static visibility score masks the concentration baseline under it.

What a Reliable AI Visibility Report Looks Like

A trustworthy AI visibility report does not show a single number and call it your score. It shows a distribution.

| Report Element | What It Shows | Why It Matters |
| --- | --- | --- |
| Frequency histogram | For each prompt: “You appeared in X of 100 runs” | Shows which prompts you consistently win and which are contested |
| Platform breakdown | Frequency per prompt per platform (ChatGPT, Perplexity, Google AI) | Reveals platform-specific gaps |
| Competitor consistency map | For each competitor: “Appeared in X% of runs across all platforms” | Separates real threats from random appearances |
| 30-day trend line | Rolling average of visibility frequency | Shows whether you’re gaining or losing ground |
| Citation vs mention split | “Mentioned in 73% of runs. Cited with a link in 12%.” | Being named is vanity. Being linked is value. |
| Confidence interval | Statistical range for your true visibility | Acknowledges the non-deterministic reality instead of hiding it |

If your current monitoring tool can’t produce this, it’s giving you a number without context. That number will make you feel good when it’s high and bad when it’s low, but it won’t tell you what’s actually happening or what to do about it.
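The confidence interval is the element most tools skip, and it is the easiest to add. A Wilson score interval for a binomial proportion is one standard way to compute that range; here is what it looks like for a brand that appeared in 73 of 100 runs:

```python
from math import sqrt

def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion: the range the true
    visibility frequency plausibly falls in, given hits out of runs."""
    if runs == 0:
        return (0.0, 1.0)
    p = hits / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = (z * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# Appeared in 73 of 100 runs -> roughly (0.64, 0.81) at 95% confidence.
print(wilson_interval(73, 100))
```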

Why Monitoring Alone Does Not Solve the Problem

Knowing your score is not the same as changing it. Most GEO platforms stop at monitoring: they tell you where you stand, show you a dashboard, and leave your team to figure out the rest. A dashboard that says “your visibility is 34%” does not tell you how to make it 60%.

The gap between monitoring and execution is where most GEO programs stall. Our Perplexity data showed that comparison and evaluation content had backfire rates of 2.9% and 0%, while listicles backfired 25.7% of the time. The type of content you publish matters more than how precisely you measure the result. A team that publishes the right content format with lightweight monitoring will outperform a team with perfect monitoring data that publishes the wrong format.

Monitoring without execution is a report. Execution informed by monitoring is a strategy.

How to Choose an AI Visibility Measurement Approach

The choice is not between vendors. It is between methodologies that survive non-determinism and methodologies that hide it. Pick your approach based on what you actually need the number to do.

  • If you need a defensible metric for a board deck, prioritize frequency over position. A “73 of 100 runs” frequency is reproducible and survives the next model update. A “ranked #3” snapshot does not.

  • If you can only monitor one AI engine, weigh ChatGPT’s volume (2.5 billion prompts daily, per TechCrunch, 2025) against Perplexity’s depth on high-intent queries, and treat the result as a stopgap: single-engine tracking cannot substitute for cross-platform coverage.

  • If your visibility today is under 15%, skip monitoring and optimize for publish velocity. There is no score to protect. The priority is getting structured content into the citation pool.

  • If you already rank in the top 3 for a category, invest in 60-to-100-run baselines and 30-day rolling averages. You have a position worth defending against drift.

  • If your category has a stable #1 already locked in, redirect budget to the 25% of queries with no stable leader. The Res AI 1,000-query Perplexity study (2026) found that 25% of B2B queries have no consistent top recommendation. Those are the open positions.

Any methodology that reports a single number without a confidence interval is optimizing for dashboard comfort, not decision support.

Frequently Asked Questions

Why does position tracking fail on AI search if it worked for Google?

Google’s ranking algorithm was deterministic: the same query returned the same page order most of the time. AI search runs through a generation step with built-in randomness, query fan-out, and retrieval variance. SparkToro found less than a 1 in 100 chance of two runs producing the same brand list, which makes any single-run position inherently a sample of one (SparkToro, 2024).

How many prompt runs per query are enough to trust the number?

60 to 100 runs per prompt is the threshold where frequency stabilizes in published studies. Fewer than 20 runs and the confidence interval is wider than the metric itself. A visibility score based on one check is not wrong because the tool is wrong. It is wrong because the sample size is too small to mean anything.
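To see why the run count matters, compute the half-width of a simple normal-approximation interval at different sample sizes (a rough rule of thumb; exact intervals differ slightly):

```python
from math import sqrt

def ci_halfwidth(p: float, runs: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a measured frequency."""
    return z * sqrt(p * (1 - p) / runs)

# Uncertainty around a measured frequency of 50% at different run counts.
for runs in (10, 20, 60, 100):
    print(f"{runs} runs: ±{ci_halfwidth(0.5, runs):.3f}")
# 10 runs: ±0.310, 20 runs: ±0.219, 60 runs: ±0.127, 100 runs: ±0.098
```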

Why does the #1 position stay stable when everything below it shuffles?

The Res AI 1,000-query Perplexity study (2026) found the top recommendation was stable 75% of the time while positions 2 through 5 shuffled on every run. The lead position gets reinforced by citation momentum: an engine that already trusts a source keeps returning it. Position 2 and below are effectively a lottery for the remaining open slots.

Can I just monitor one AI engine and extrapolate to the rest?

No. Only 11% of domains appear in both ChatGPT and Perplexity citations across 680 million citations analyzed (Averi, 2026). A brand cited on one engine can be absent on another for the same query. Single-platform scores are platform-specific data points, not visibility scores.

Does domain authority still matter if AI citations are non-deterministic?

Authority is a weak predictor. Res AI’s 1,000-query Perplexity study (2026) found that non-giant domains hold the stable #1 citation position on 93 of 100 B2B queries, with giants winning only 4 (all of them review aggregators). Structural features of the page itself, such as comparison tables and attributed stats, carry more citation weight than the site’s backlink profile.

Why do a small number of domains dominate the citation pool?

AI engines trust sources that have already been cited, which creates a compounding loop. The top 5 most-cited domains across ChatGPT, Perplexity, and Google AI (Wikipedia, YouTube, Reddit, Google properties, LinkedIn) capture 38% of all citations, with the top 20 capturing 66% (trydecoding.com, 2025). The field expands, but the top pulls away faster than the bottom can catch up.

What should a Series B SaaS company track instead of a vanity score?

Track three things: frequency per prompt (out of 100 runs), cross-platform consensus (brands that appear on ChatGPT, Perplexity, and Google AI for the same query), and 30-day rolling averages per platform. A weekly score captures one moment and masks drift. A 30-day average reveals direction.

How do structural features beat domain authority in AI citation?

Structural features give the AI something extractable: a bold label block, a comparison row, a definition. The Res AI 852-article B2B citation structure study (2026) found that six structural features appear in 80% or more of the top 50 cited B2B pages and in 0% of the bottom 50. A page with these elements outperforms a higher-authority page without them.

Why does a 30-day rolling average work when a weekly score does not?

A weekly score is a point estimate inside a noisy system. A 30-day rolling average smooths daily variance and reveals the trend underneath the noise. The underlying reality (your true visibility frequency) changes slowly. The measurement window needs to be long enough for the signal to outweigh the randomness.

Res AI runs your prompts daily across multiple AI platforms, tracks citation frequency over 30-day rolling windows, and separates real competitive threats from random noise. No single-snapshot scores. No fake precision. Just the data you need to know if you’re winning or losing.

See how it works →

Your content is invisible to AI. Res fixes that.

Get cited by ChatGPT, Perplexity, and Google AI Overviews.