
GEO Doesn’t Have Keyword Research. It Has a Testing Loop.

For 20 years, content planning started the same way. You opened Ahrefs or Semrush, typed a seed term, sorted by volume, and wrote the brief. The keyword was the unit of work. The tool gave you a number that told you what to write.
That workflow does not exist for GEO, and the tools that sound like they replace it don't. They replace the measurement layer. The planning step is still missing, and it will not catch up, because the unit of work changed. SparkToro analyzed 2,961 real human prompts asking AI engines the same questions and found that any two phrasings of the same question had a semantic similarity of just 0.081 (SparkToro / Gumshoe.ai, 2026). That is not a tooling gap. It is the keyword ceasing to exist as a primitive.
What replaces it is not a different kind of search box. It is a testing loop run continuously against the engines you want to be cited by.
SEO Had a Clean Primitive, and GEO Doesn't
Only 12% of links cited in AI answers also rank in Google's top 10 organic results (Ahrefs, 2025). SEO worked because the keyword was a string, the string had a number attached, and Google ranked pages against that string deterministically enough that you could reverse engineer it. Ahrefs and Semrush built billion-dollar businesses on indexing the strings. The whole discipline rests on that one primitive: a string with volume.
AI engines do not retrieve against strings. They retrieve against vectors. "ZoomInfo alternatives" and "how to find prospect emails for cold outbound" share zero keywords, but an embedding model treats them as the same buyer asking the same question in two different vocabularies. Apollo holds a top citation on both.
The 88% of cited content that lives outside the SEO leaderboard is not getting picked because it ranks. It is getting picked because it sits in the right semantic neighborhood for the right cluster of buyer questions.
| Retrieval Model | What It Matches | Planning Input | Overlap With AI Citations |
|---|---|---|---|
| Google (string-based) | Exact and partial keyword match | Keyword volume spreadsheet | 12% of AI-cited links |
| AI engines (vector-based) | Semantic similarity of meaning | Prompt family clusters | 88% of AI-cited links outside SEO top 10 |
Keyword research surfaces strings. Vector retrieval surfaces meaning. The tools that index strings cannot see this. They were never designed to.
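To make "semantic neighborhood" concrete, here is a minimal sketch using the open-source sentence-transformers library. The model below is a public stand-in; the embedding models inside the commercial engines are proprietary, so treat the number it prints as illustrative, not as what ChatGPT or Perplexity computes.

```python
# Minimal sketch: why two phrasings with zero shared keywords can land in
# the same retrieval neighborhood. all-MiniLM-L6-v2 is a public stand-in
# for the proprietary embedding models inside the engines.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "ZoomInfo alternatives"
b = "how to find prospect emails for cold outbound"
vec_a, vec_b = model.encode([a, b])

# Cosine similarity: 1.0 = identical meaning, ~0.0 = unrelated.
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"shared keywords: 0, cosine similarity: {cosine:.2f}")
```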
The New Unit Is a Prompt Family
SparkToro's 0.081 semantic similarity score between two phrasings of the same buyer question means the same buyer asks the same question seven different ways depending on what they read most recently, what their team calls the problem, and which competitor they are mad at (SparkToro / Gumshoe.ai, 2026). A prompt family is a cluster of semantically related questions that all express the same underlying buyer intent. The questions share an embedding neighborhood. They share a citation surface. And they almost never share keywords.
Take a single intent: a Series B sales team wants outbound contact data. Here is what that looks like as a prompt family:
| Phrasing | Keyword Overlap With Siblings |
|---|---|
| "ZoomInfo alternatives" | Low |
| "best B2B contact database 2026" | Low |
| "how to find prospect emails for cold outbound" | Zero |
| "Apollo vs ZoomInfo vs Lusha pricing" | Low |
| "B2B lead enrichment tools that don't bankrupt a startup" | Zero |
| "where do SDRs get email addresses" | Zero |
| "GDPR-safe alternatives to scraping LinkedIn" | Zero |
Seven phrasings, one intent, almost no shared vocabulary. The 0.081 similarity finding is not a curiosity. It is the rule.
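The near-zero overlap in the table is easy to verify with a few lines of plain Python, using token-set (Jaccard) similarity as a stand-in for what a string-indexing tool can see:

```python
# Token-level (Jaccard) overlap between each phrasing and its closest
# sibling. Near-zero scores are what make these phrasings invisible to
# string-based keyword tools, even though the buyer intent is identical.
family = [
    "ZoomInfo alternatives",
    "best B2B contact database 2026",
    "how to find prospect emails for cold outbound",
    "Apollo vs ZoomInfo vs Lusha pricing",
    "B2B lead enrichment tools that don't bankrupt a startup",
    "where do SDRs get email addresses",
    "GDPR-safe alternatives to scraping LinkedIn",
]

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

for i, phrasing in enumerate(family):
    siblings = family[:i] + family[i + 1:]
    best = max(jaccard(phrasing, s) for s in siblings)
    print(f"{best:.2f}  {phrasing}")
```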
You cannot find a prompt family by typing into a search box. You have to know the buyer well enough to enumerate it, then validate the enumeration empirically. Buying intent, not search visibility, is the metric that matters, and that is exactly what makes prompt families harder than keyword lists: no spreadsheet sort will surface them.
This is the part that sounds like marketing 101 ("understand your buyer") and is actually a structural break from how content planning has worked since 2005. SEO let you skip the buyer because the search bar was a confession booth. GEO does not give you that shortcut.
Three Steps Replace Keyword Research
The Res AI 1,000-query Perplexity study found that 25 of 100 B2B queries had no stable #1 brand at all (Res AI, 2026). Those open positions are the direct output of the replacement workflow: a three-step loop where each step maps cleanly to a step the SEO workflow used to handle.
| SEO Step | What It Did | GEO Replacement |
|---|---|---|
| Keyword ideation | Generate a list of strings to target | Prompt family discovery: enumerate the natural language questions buyers actually ask |
| SERP analysis | Look at who ranks, copy the format | Cross-engine consensus testing: run the family across ChatGPT, Perplexity, Gemini and note who is cited |
| Rank tracking | Watch your position over time | Citation tracking: watch which passages of which articles get pulled into answers, on which engines, in which sentences |
Prompt family discovery starts in the head, not in a tool. List the 5 to 15 ways a buyer in your category phrases the same problem. Pull from sales call transcripts, support tickets, Reddit threads in your category, and the actual prompts your customers paste into ChatGPT. The output is not a spreadsheet of strings. It is a map of intent clusters, each containing several phrasings.
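In practice, the discovery output might take a shape like the one below: a map from a named intent to its phrasings and the sources they were mined from. The labels and fields here are illustrative examples, not a prescribed schema.

```python
# Illustrative shape of the discovery output: intent clusters, not strings.
# Cluster names, phrasings, and source tags are examples only.
prompt_families: dict[str, dict] = {
    "outbound-contact-data": {
        "buyer": "Series B sales team",
        "phrasings": [
            "ZoomInfo alternatives",
            "where do SDRs get email addresses",
            "GDPR-safe alternatives to scraping LinkedIn",
        ],
        "sources": ["sales-call transcripts", "support tickets", "Reddit"],
    },
    # ...one entry per intent cluster, 5 to 15 phrasings each
}
```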
Cross-engine consensus testing replaces the SERP audit. Run the family across at least three engines, ten runs each. Saturated intent shows up as the same brands cited every run. Open intent shows up as a shuffling list. The 25 unstable queries from the Res AI study are the prompt families with the lowest cost of entry.
Citation tracking is what most monitoring tools already do, but it is not the deliverable. It is the feedback signal that closes the loop. What got cited, what sentence got pulled, which competitor showed up instead of you, and what their cited passage said that yours did not. Every loop iteration produces a refinement target.
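As code, the consensus test and its feedback signal look something like the sketch below. It assumes a hypothetical ask_engine(engine, prompt) helper that returns the cited domains for one answer run; every engine exposes its answers differently, so that helper is the part you would wire up yourself.

```python
# Cross-engine consensus test: run every phrasing in the family N times per
# engine and measure how often the single most-cited domain wins. A high
# score marks a saturated intent; a shuffling winner marks an open one.
from collections import Counter

ENGINES = ["chatgpt", "perplexity", "gemini"]
RUNS = 10

def ask_engine(engine: str, prompt: str) -> list[str]:
    """Hypothetical helper: returns the cited domains for one answer run."""
    raise NotImplementedError  # wire up each engine's API here

def consensus_scores(family: list[str]) -> dict[str, float]:
    scores: dict[str, float] = {}
    for engine in ENGINES:
        winners: Counter[str] = Counter()
        total = 0
        for prompt in family:
            for _ in range(RUNS):
                cited = ask_engine(engine, prompt)
                if cited:
                    winners[cited[0]] += 1  # top citation for this run
                    total += 1
        # Share of runs won by the most-cited domain on this engine.
        scores[engine] = winners.most_common(1)[0][1] / total if total else 0.0
    return scores
```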
Keyword research outputs a list. Prompt family research outputs a hypothesis. The hypothesis only has value if you test it against the engines you want to be cited by.
Monitoring Vendors See the Score but Don't Move It
82.0% of citations in B2B AI answers come from independent blogs and publications, not vendor sites (Res AI, 2026). Monitoring vendors see the citation gap clearly. Their dashboards surface it: you are not cited for the queries that matter, your competitors are, and the cited list shuffles every run. What the dashboards do not do is tell you what to write next, publish it, and re-test. The dashboard is the deliverable.
A small number of teams are actually solving this in production right now, and they are not the ones with the prettiest dashboards. They are the ones who treat the monitoring data as input to a writing loop, not as a report. They publish a candidate article, test the prompt family, look at what got cited instead, and revise. Then they do it again the next week.
The unit of progress is not "score went up." It is "we now hold a stable #1 on three more prompt families than last month." Most GEO platforms stop at dashboards. The teams winning are running an active testing loop, not reading a quarterly report.
The platforms below illustrate the pattern. All of them track citations in some form; only one closes the loop from diagnosis to publication to retest.
| Platform | Primary Function | Publishes Content | Retests After Edit | Starting Price |
|---|---|---|---|---|
| Res AI | Continuous test loop: discover, draft, publish, retest | Yes, direct CMS publish | Yes, daily across 4 engines | $250/mo |
| Otterly.AI | AI search monitoring and citation tracking | No | No | $29/mo |
| GetCito | AI discoverability diagnostics and optimization support | No | No | $299/mo |
| Gauge | AI visibility tracking with content action layer | Limited | Limited | $99/mo |
| Writesonic GEO | Monitoring plus in-platform content optimization | Yes, in-platform | Partial | $199/mo |
Monitoring tells you which prompts you lost. It does not tell you what to write to win them. That answer only exists on the other side of an experiment.
Reverse Engineering the Models Is a Losing Game
In January 2026, SE Ranking observed that the rollout of Gemini 3 replaced 42% of previously cited domains overnight (SE Ranking, 2026). Trying to reverse engineer how each AI engine works is a losing game because the models reshuffle faster than anyone can model them. You can read the patents, study the RAG papers, build a mental model of the retrieval pipeline, and optimize the content to match. By the time you finish, the model has been retrained.
42% of the citation surface, gone in a single update. Whatever mental model anyone had of Gemini's retrieval logic on January 14 was wrong on January 15. The domains that survived were not the ones that reverse engineered better. They were the ones whose content was structurally optimized in ways that survived the model swap: statistics with attribution, self-contained answer capsules, comparison tables with named entities.
The Princeton GEO study reaches the same conclusion from the other direction. Across nine optimization tactics tested on eight LLMs, the ones that worked were the ones any well-edited article would already use:
| Optimization Tactic | Visibility Impact | Model-Specific? |
|---|---|---|
| Adding a statistic with attribution | +41% | No |
| Adding a quotation from a named source | +28% | No |
| Using authoritative language | +25% | No |
| Tightening prose for fluency | +15% | No |
| Keyword stuffing | -3% | No |
Source: Princeton KDD, 2024
None of these are model-specific. They are content-quality signals that any retrieval system rewards. The lesson is not "model the model." It is test against the model and let the citations be the ground truth. You do not need to know why a passage was cited. You need to know that it was, run a variant, and see if the variant gets cited too.
The Loop Runs in 6 Steps per Prompt Family
Here is what the loop looks like in practice for one prompt family, from first draft to validated citation. A code sketch of the full iteration follows the six steps.
1. Write. Pick one prompt family. Draft a single article aimed at it, structured for extraction: answer capsules, attributed stats, comparison table, self-contained H2s.
2. Publish. Get it on a real domain. Do not test against unpublished drafts. AI engines retrieve from the live web.
3. Test. Run the full prompt family across 3 or more engines, 10 or more runs each. Record citations, brand mentions, and which passage of which article got pulled.
4. Diagnose. For runs where you were not cited, look at who was. Read the cited passage. The delta between their passage and yours is the refinement target. It is almost always one of three things: a missing statistic, a missing comparison entry, or an unsourced claim.
5. Refine. Edit the article. Add the stat. Fill the comparison row. Source the claim. Republish.
6. Re-test. Same prompt family, same engines, same number of runs. If the citation rate moved, lock it in. If not, the gap was somewhere else.
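As orchestration code, one iteration is a short function. All five helpers are hypothetical stand-ins for the manual steps just described; each stays unimplemented until wired to an editor, a CMS, and the engines you test against.

```python
# One loop iteration for one prompt family. The five helpers are
# hypothetical stand-ins for the manual steps above.
def draft(family: list[str]) -> str:       # step 1: write the article
    raise NotImplementedError

def publish(article: str) -> str:          # step 2: live domain, return URL
    raise NotImplementedError

def test(family: list[str]) -> float:      # steps 3 and 6: 3+ engines,
    raise NotImplementedError              # 10+ runs, returns citation rate

def diagnose(rate: float, url: str) -> list[str]:  # step 4: gaps vs winners
    raise NotImplementedError

def revise(url: str, gaps: list[str]) -> None:     # step 5: stat, row, source
    raise NotImplementedError

def run_iteration(family: list[str]) -> bool:
    url = publish(draft(family))
    baseline = test(family)                 # citation rate before revision
    revise(url, diagnose(baseline, url))
    return test(family) > baseline          # same family, engines, run count
```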
That is one loop iteration on one prompt family. To run this manually for one article takes a content marketer most of a week. To run it across the 100 prompt families that drive a B2B SaaS pipeline, every quarter, against four engines, with the engines themselves changing under you, is a math problem with no manual solution.
Lean Teams Face 8,000 Queries per Quarter
The math is the argument for autonomy. A content team running the loop manually faces 100 prompt families times 4 engines times 10 runs per test times 2 tests per iteration (baseline and post-revision), or 8,000 individual queries per quarter before writing a single word.
| Workload Component | Manual Cost per Quarter |
|---|---|
| Prompt family discovery (100 families) | 40 to 60 hours |
| Cross-engine baseline tests (8,000 queries) | 80 to 120 hours |
| Diagnosis of losing runs | 40 to 80 hours |
| Article writing and revision | 200 to 400 hours |
| Post-revision retests (8,000 queries) | 80 to 120 hours |
| Total | 440 to 780 hours |
That is 2 to 4 full-time content marketers doing nothing else. Lean teams of 1 to 3 people end up with one person doing all of this plus four other things. The loop does not get run, and the team falls back on whatever they did last year, which was keyword research, which does not work for the reasons above.
The structural answer is autonomy. Not "AI to write the post," which produces uncitable filler, but autonomous workflow: an agent that runs the testing loop continuously, treats citation data as the planning input, drafts to the prompt family, publishes, retests, and feeds the result back to the planner. The same loop a senior content strategist would run, executed against the volume of queries the manual workflow cannot reach.
The replacement for keyword research is not a better tool. It is a testing loop run at machine speed against engines that change weekly.
How to Choose Your First Prompt Family to Test
The testing loop does not scale until you pick the right first prompt family. These decision rules are ordered by priority; work through them from top to bottom and stop at the first one that applies to your situation.
| Priority | Buyer Situation | What to Evaluate |
|---|---|---|
| 1 | A competitor is already cited on a query your sales team hears weekly | Whether your content structurally matches the cited competitor's passage (stats, tables, named entities) |
| 2 | The query has no stable #1 in the current citation pool | Whether you can publish a structurally complete article faster than competitors notice the open slot |
| 3 | Two candidates tie on priority | Whether the prompt family maps to higher buyer intent or higher search volume (pick intent) |
| 4 | You are unsure how to classify a prompt family's intent | Whether the engine's citation output reveals commercial or informational intent (run the test and let it tell you) |
| 5 | None of the above apply | Whether the prompt family is close to your product's #1 category, the one you already credibly serve |
The competitor citation is the ground truth that the query is winnable, and the weekly mention from sales is the ground truth that it is commercially relevant. A query where the cited brand shuffles every run is a query where the engine has not yet locked in a preference, and a structurally complete article can take the slot.
The output is a single prompt family to start the loop with, not a ranked list. Run one family all the way through the loop before starting a second. The goal of the first test is to learn how the loop works, not to win a hard query.
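Because the rules are first-match-wins, they translate directly into code. The sketch below is a simplification: the table's conditions become boolean flags on a hypothetical PromptFamily record that you would fill in from your own citation tests and sales-call notes.

```python
# First-match-wins selection over the priority table above. The flags are
# hypothetical; priority 4 (unclear intent) has no flag because its answer
# is to run the test itself.
from dataclasses import dataclass

@dataclass
class PromptFamily:
    name: str
    competitor_cited_and_sales_hears_weekly: bool = False  # priority 1
    no_stable_number_one: bool = False                     # priority 2
    high_buyer_intent: bool = False                        # priority 3
    near_core_category: bool = False                       # priority 5

def pick_first(candidates: list[PromptFamily]) -> PromptFamily | None:
    rules = [  # ordered by priority; stop at the first rule that matches
        lambda f: f.competitor_cited_and_sales_hears_weekly,
        lambda f: f.no_stable_number_one,
        lambda f: f.high_buyer_intent,
        lambda f: f.near_core_category,
    ]
    for rule in rules:
        for family in candidates:
            if rule(family):
                return family
    return None  # priority 4: run the test and let the engine classify intent
```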
Frequently Asked Questions
How do content teams currently conduct keyword research?
Most teams paste a seed keyword into Ahrefs or Semrush, export related keywords sorted by volume and difficulty, cluster by topic, and assign articles to the highest-volume clusters. The entire workflow assumes matching the string wins the traffic, which is the deterministic logic Google rewards and vector retrieval does not.
Can Ahrefs or Semrush approximate prompt families?
Not reliably, because these tools group keywords by lexical similarity while prompt families group by semantic similarity, which SparkToro found is only 0.081 between two phrasings of the same question (SparkToro / Gumshoe.ai, 2026). "ZoomInfo alternatives" and "how to find prospect emails for cold outbound" land in completely different clusters despite having the same buyer behind them.
How do you build a prompt family from scratch?
Start with the 3 to 5 buyer problems your product solves, then mine sales call transcripts, support tickets, and Reddit threads for the vocabulary your buyers actually use. Cluster the outputs by shared buyer intent rather than shared keywords; a finished family is 8 to 15 phrasings of one underlying question with almost no vocabulary overlap.
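One way to automate the clustering step is to embed the mined phrasings and group them by cosine distance. A sketch: the model and the distance threshold below are illustrative starting points, not calibrated values, and the sample phrasings are invented.

```python
# Group raw buyer phrasings into candidate families by embedding distance.
# Tune distance_threshold against families you already trust.
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

phrasings = [
    "ZoomInfo alternatives",
    "where do SDRs get email addresses",
    "best way to warm up a cold email domain",
    "email warmup tools compared",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(phrasings)
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6,
    metric="cosine", linkage="average",  # sklearn >= 1.2; older uses affinity=
).fit_predict(embeddings)

families = defaultdict(list)
for phrase, label in zip(phrasings, labels):
    families[label].append(phrase)
print(dict(families))
```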
Which AI engines should be in the testing loop first?
ChatGPT and Perplexity together drive 90.4% of AI referral traffic to B2B SaaS (PipeRocket Digital, 2025). Add Claude next because it converts at 16.8% per click, the highest of the major engines (Exposure Ninja / Loganix, 2026), then add Gemini last because its 42% domain reshuffle rate adds hard-to-attribute variance.
How long does one full iteration of the loop take?
A realistic cadence is 3 to 5 weeks per iteration: 1 week to write and publish, 2 to 4 weeks for AI engines to index, and 1 to 2 days to re-run the prompt family across engines. The indexing delay is the variable; ChatGPT and Perplexity tend to pick up new content within 2 weeks if the page is linked from a known source.
How do you measure whether an iteration actually worked?
The unit of measurement is position stability across runs. Compare how many of the 10 runs per phrasing cited your content before and after the revision. Treat single-run changes as noise and look for movement across at least 5 runs before locking in a finding.
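One way to apply that noise rule in code, assuming you record a True/False cited flag per run per phrasing before and after the revision (the five-run threshold mirrors the rule above; this is one interpretation, not the only one):

```python
# Noise filter for one iteration: count cited runs per phrasing before and
# after the revision; only trust a win backed by movement across >= 5 runs.
def iteration_verdict(before: dict[str, list[bool]],
                      after: dict[str, list[bool]]) -> str:
    gained = 0
    for phrasing, pre in before.items():
        post = after.get(phrasing, [])
        gained += max(sum(post) - sum(pre), 0)  # newly cited runs only
    return "locked in" if gained >= 5 else "noise, keep iterating"

# Example: 0/10 cited runs before, 6/10 after -> movement across 6 runs.
before = {"ZoomInfo alternatives": [False] * 10}
after = {"ZoomInfo alternatives": [True] * 6 + [False] * 4}
print(iteration_verdict(before, after))  # locked in
```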
What structural gaps explain most citation losses?
The delta between a cited passage and a losing passage is almost always one of three things: a missing attributed statistic, a missing comparison table entry, or an unsourced claim. The Princeton GEO study found adding a statistic alone improved visibility by +41% (Princeton KDD, 2024).
When should you abandon a prompt family instead of iterating?
Abandon when three conditions hold at once: you have completed at least 2 full iterations with no measurable citation movement, the cited competitors have domain-level advantages you cannot replicate, and the commercial value of the query is lower than the next-priority family you could test instead. Revisit the abandoned family in 6 months when the engine's preferences may have shifted.
Can you run the testing loop on guest posts you do not control?
You can test whether a guest post is being cited because the query side of the loop works regardless of who owns the page. You cannot refine it in the write-to-refine step, so treat the test as monitoring rather than optimization: it tells you which external placements produce citations and which do not.
Why does keyword stuffing hurt AI citation rates?
The Princeton GEO study measured keyword stuffing at -3% visibility impact across eight LLMs (Princeton KDD, 2024). AI engines retrieve by semantic similarity, not keyword density, so stuffing a term adds noise without adding meaning, and the retrieval model downgrades the passage relative to a cleaner version of the same claim.
How Res AI Runs the Testing Loop Daily Across 4 Engines
The testing loop this article describes is the operational shape of Res AI. Instead of treating prompt family discovery as a one-time research project, Res AI runs it as a daily loop across ChatGPT, Perplexity, Gemini, and AI Overviews. Every day it queries the prompt families that drive your buyer's decision, records which competitors were cited where you were not, and surfaces the structural gaps that explain the citation difference: missing comparison row, missing pricing entry, unsourced product claim, absent how-to-choose framework.
Drafts get written to fix the structural gaps and published directly into WordPress, Webflow, Framer, or Contentful. Then the loop tests again. The unit of progress is the number of prompt families where you now hold a stable #1 against competitors who held it before. That metric only exists for teams running an autonomous loop, because the 8,000 individual queries per quarter the article spells out is not a workload a 1 to 3 person content team can absorb by hand.
Res AI's pricing starts at $250/mo for 50 pages and 10 monitored prompts, scaling to $1,500/mo for 1,000 pages and 30 monitored prompts. Enterprise plans offer unlimited prompt monitoring with a dedicated CSM.
Res AI turns the prompt-family testing loop from a 440 to 780 hour quarterly project into a daily automated workflow, discovering which prompt families you are losing, drafting structured content against them, publishing through your CMS, and retesting until the citation rate moves across ChatGPT, Perplexity, Gemini, and AI Overviews.