
Every Brand That Won AI Search Tested Their Way There

Dollar Shave Club spent $4,500 on a launch video in 2012. Not $4.5 million. Not a six-month brand campaign with a media agency. A single video, shot in a day, uploaded to YouTube and pushed through Facebook. It acquired 12,000 customers in 48 hours (Optimonk, 2022). Four years later, Unilever bought the company for $1 billion in cash.
Gymshark started the same year with a different test. Instead of one viral video, Ben Francis ran dozens of small influencer collaborations on Facebook and Instagram, measured which ones drove traffic, killed the losers, and doubled the budget on the winners. No brand playbook. No creative agency. Just rapid iteration on a new platform nobody had figured out yet. Gymshark hit a $1 billion valuation without ever running a traditional advertising campaign (TrendTrack, 2026).
Warby Parker did the same thing in 2010 with Facebook ads for its home try-on program. Small budgets, fast tests, constant measurement. By the time legacy eyewear companies started buying Facebook ads, Warby Parker had already tested hundreds of creative variations and built a $3 billion brand on the learnings.
These three brands had something in common: they treated a new channel as a testing laboratory, not a broadcast medium.
The Lesson From Early Facebook Advertising
The first wave of DTC brands between 2010 and 2015 rewrote the rules of customer acquisition. Warby Parker, Dollar Shave Club, Casper, Glossier, and Allbirds all bypassed traditional retail and traditional advertising to sell directly through digital channels (TrendTrack, 2026). The brands that won were not the ones with the biggest budgets. They were the ones that ran the most tests per dollar.
Facebook’s own advertising history tells the same story. Early adopters that tested new features found outsized wins, with better targeting leading to lower acquisition costs, more engaged followers, and higher lifetime values. The tactic that drives the best returns six months from now is often different from what works today (Matchnode, 2020). The advertisers who treated the platform as a testing environment adapted to each change. The advertisers who planned quarterly campaigns based on last quarter’s data fell behind.
The pattern was consistent: micro-campaigns that ran for 7 days outperformed brand campaigns that ran for 7 months. A $500 test that proved a concept in a week was worth more than a $50,000 campaign that took three months to plan and three months to measure.
Here is the question for every marketing team investing in GEO right now: are you approaching AI search visibility like those early Facebook advertisers, launching multiple tests a week and measuring what sticks? Or are you approaching it like a splashy brand campaign, spending months on monitoring dashboards and committing your entire budget to six articles you hope will work?
Micro-campaigns that ran for 7 days outperformed brand campaigns that ran for 7 months. GEO follows the same pattern.
Vercel Tested Its Way to 10% of Signups From ChatGPT
Vercel grew ChatGPT referrals from less than 1% of signups in October 2024 to 4.8% by March 2025 to 10% by April 2025 (Chirag Garg analysis of Guillermo Rauch data, 2025). That growth happened over six months through systematic iteration on content structure, documentation clarity, and semantic richness. Vercel’s CEO described their approach as owning a concept clearly, consistently, and with the right structure so models understand it well.
Tally, a form builder, followed the same playbook. The company grew from $2M to $3M ARR in four months after ChatGPT and Perplexity became its biggest acquisition channels. Tally’s founder Marie Martens described the approach as years of showing up, answering questions, and being human. When ChatGPT launched web search, that accumulated community presence became training data for AI recommendations (Vercel, 2025).
Neither company described their approach as a monitoring-first strategy. They described it as a testing and iteration strategy: publish content, see what the models pick up, adjust, repeat.
The results hold up in live data. When we ran 1,000 queries through Perplexity’s Sonar API, Vercel held the #1 recommendation against Netlify and Cloudflare in 8 out of 10 runs. Not because Vercel is bigger than Cloudflare. By every traditional metric, it is not. Vercel wins because its documentation is structured for extraction. It tested its way into that position, and now the position compounds.
Vercel is smaller than Cloudflare by every traditional metric. It holds #1 on the deployment platform comparison 8 out of 10 times. Structure beats size.
Why Testing Beats Planning in Non-Deterministic Systems
GEO is a non-deterministic system. 40 to 60% of domains cited in AI responses change month-to-month, with drift reaching 70 to 90% over a six-month window (Profound, 2026). Only 11% of cited domains overlap between ChatGPT and Perplexity (Averi, 2026). Why specific content gets cited on any given query remains opaque. AI platforms do not share selection criteria, and current tracking tools remain immature.
Our own data confirms this. When we ran the same query 10 times on Perplexity, only 38% of brands appeared consistently across all runs. The #1 recommendation was stable 75% of the time, but positions 2 through 5 shuffled on every run. The system is partially deterministic: it rewards the top position and randomizes everything below it.
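Here is a minimal sketch of that repeated-run measurement, assuming Perplexity's OpenAI-compatible Sonar chat endpoint; the query and brand list are hypothetical placeholders, and a production monitor would add retries, rate limiting, and logging:

```python
import os
import requests

# Hypothetical setup: one buyer query, the brands we track, 10 runs.
QUERY = "What is the best deployment platform: Vercel, Netlify, or Cloudflare?"
BRANDS = ["Vercel", "Netlify", "Cloudflare"]
RUNS = 10

def ask_sonar(query: str) -> str:
    """One Perplexity Sonar call via the OpenAI-compatible chat endpoint."""
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
        json={"model": "sonar", "messages": [{"role": "user", "content": query}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Tally how often each brand appears at all, and which brand leads each answer.
appearances = {b: 0 for b in BRANDS}
led_answer = {b: 0 for b in BRANDS}
for _ in range(RUNS):
    answer = ask_sonar(QUERY)
    positions = {b: answer.find(b) for b in BRANDS if b in answer}
    for brand in positions:
        appearances[brand] += 1
    if positions:
        # Crude proxy: the earliest-mentioned brand is the top recommendation.
        led_answer[min(positions, key=positions.get)] += 1

for brand in BRANDS:
    print(f"{brand}: appeared {appearances[brand]}/{RUNS}, led {led_answer[brand]}/{RUNS}")
```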
The Princeton GEO study found that adding statistics to content improved AI visibility by 41%, the highest single-tactic improvement in the study, while keyword stuffing decreased it by 3% (Princeton KDD, 2024). Those improvements are category-specific and format-dependent. What works for a fintech comparison page does not necessarily work for a developer tools tutorial. The only way to know what works for your category, your product, and your buyer’s prompts is to test.
This is the same insight that made paid advertising on Facebook so effective for early DTC brands. In a system where you cannot predict which creative will perform, the team that tests the most variations per dollar spent wins.
| Environment | What Won | What Lost |
|---|---|---|
| Facebook ads 2012–2015 | $500 micro-tests, 7-day cycles, kill losers, scale winners | $50K brand campaigns planned quarterly, measured after 90 days |
| Google Ads 2005–2010 | Rapid keyword testing, daily bid adjustments, automated rules | Annual keyword strategies, manual bidding, monthly reviews |
| GEO 2025–2026 | Publish structured content, measure citations, iterate weekly | Spend months on dashboards, publish 6 articles, wait a quarter to evaluate |
The pattern repeats. New channels reward speed. Mature channels reward optimization. GEO is a new channel.
In a system where you cannot predict which creative will perform, the team that tests the most variations per dollar spent wins.
Focus Testing on the Prompts That Drive Revenue
There is a direct parallel between GEO prompt selection and paid advertising keyword selection. In paid search, every practitioner knows the difference between vanity keywords and money keywords. “Top 5 cars owned by Jay Leno” might get 50,000 searches. It drives zero conversions for a car dealership. “Best used SUV under $30,000 near me” gets a fraction of that volume and drives test drives.
GEO has the same distribution. You need 5–10 prompts that map directly to your buyer’s decision process. Not 10,000 prompts across 10 AI platforms.
| Prompt Type | Example | Revenue Signal |
|---|---|---|
| Category evaluation | “Best [your category] tools for [use case]” | High. Buyer is comparing solutions. |
| Direct comparison | “[Your brand] vs [competitor]” | High. Buyer is in final evaluation. |
| Problem-solution | “How to solve [problem your product addresses]” | Medium-high. Buyer is identifying solutions. |
| General education | “What is [broad industry term]” | Low. Reader is learning, not buying. |
| Tangential | “History of [related topic]” | None. Informational traffic only. |
Monitor your core 5 to 10 daily. Fan out to 30 for conquest opportunities and check those monthly. The rest is noise. Spending thousands per month to monitor every possible prompt variation across every AI platform is the GEO equivalent of buying a Bloomberg terminal to trade three stocks.
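The whole portfolio fits in a config small enough to read at a glance. A sketch, with hypothetical prompts for a form-builder brand:

```python
# Hypothetical prompt portfolio for a B2B SaaS form builder.
PROMPT_PORTFOLIO = {
    "core": {
        # 5-10 revenue-adjacent prompts, checked daily
        "cadence": "daily",
        "prompts": [
            "best online form builder for startups",      # category evaluation
            "Tally vs Typeform",                           # direct comparison
            "how to collect payments through a web form",  # problem-solution
        ],
    },
    "conquest": {
        # 20-30 fan-out prompts, checked monthly
        "cadence": "monthly",
        "prompts": [
            "best free survey tools",
            "alternatives to Google Forms",
        ],
    },
    # Everything else ("what is a web form", "history of surveys") is noise:
    # low or no buying intent, so it is not monitored at all.
}
```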
The prompts that drive revenue for a B2B SaaS company can fit on an index card. Everything else is vanity monitoring.
If Testing Is the Strategy, Unit Economics Are the Constraint
Once you accept that iteration speed determines GEO outcomes, the question becomes: how many iterations can you afford per month?
The cost-per-article across GEO platforms varies by an order of magnitude. Some platforms allocate most of their compute to monitoring billions of prompts across 10 AI engines. That monitoring infrastructure is expensive, and the cost flows through to every article. Other platforms run lightweight monitoring on the prompts that matter and allocate the rest of their compute to content production.
Neither approach is wrong. They optimize for different constraints. If your constraint is data depth, monitoring-heavy platforms serve you well. If your constraint is learning speed, the execution-heavy approach wins.
For most B2B SaaS teams with 50–200 published pages and a content team of one to three people, learning speed is the binding constraint. The team that publishes 160 articles in a quarter and iterates on what works will learn more than the team that publishes 18 articles and reads dashboards about the 47 gaps they identified but cannot fill.
GEO is not a “set it and forget it” task. It is an iterative loop. You test, you refine your answer passages, you build more authority. Conductor achieved a 448% increase in AI citations over several months by leveraging its enterprise AEO platform, with a parallel 185% rise in AI mentions (Conductor, 2025). The window is open. The brands testing now are building the compound citation momentum that will be expensive to displace later.
The question is not how much data you have. It is how many chances you get.
160 articles in a quarter gives you 160 chances to learn. 18 articles gives you 18 chances and 12 reports about the gaps you cannot fill.
Why the Unit Economics Work: Writing Is Not Video
The instinct when you hear “$1.56 per article” is to assume low quality. That instinct is wrong, and it comes from applying the economics of other media to text.
A 15-second video ad requires a script, a shoot, lighting, talent, editing, color grading, sound mixing, format adaptation for placements, and review cycles. A single product photograph requires staging, lighting rigs, a photographer, retouching, and format exports. These are expensive because the production process has dozens of steps that require human judgment on aesthetic and emotional dimensions. Quality is subjective, hard to specify, and impossible to automate fully.
GEO content is the opposite. The output that earns AI citations is not prose. It is structured information: a 40–80 word answer capsule that front-loads a claim with a source and a year. A comparison table with five columns and eight rows. An attributed stat with a named source, a specific number, and a date. These are not creative writing challenges. They are information assembly tasks with clear constraints and verifiable outputs.
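A hypothetical capsule of that shape, assembled from a stat this article already cites:

```
Structured content wins AI citations. In the Princeton GEO study, adding
statistics to content improved AI visibility by 41%, the largest
single-tactic gain measured, while keyword stuffing decreased visibility
by 3% (Princeton KDD, 2024). The brands winning AI search are publishing
attributable numbers, not adjectives.
```

The claim comes first, the number carries a named source and a year, and the block stands alone, so a retrieval chunk can lift it whole.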
Consider how you interact with a well-designed AI model. When you ask Claude a direct question, it gives you the answer. It does not open with a paragraph of context you did not ask for. It does not restate your question back to you. It does not waste 200 tokens on preamble before reaching the point. The efficiency comes from constraint: a clear input produces a clear output with minimal waste.
GEO content works the same way. The constraints that make content extractable by AI engines also make it efficient to produce by AI agents.
| Content Element | What It Contains | Why It Is Cheap to Produce |
|---|---|---|
| Answer capsule | 40–80 word claim with source attribution | Structured lookup: claim + source + year. No creative judgment. |
| Comparison table | 5–8 columns of factual product data | Database retrieval. Values are public and verifiable. |
| Attributed stat | Named source, specific number, publication year | Research task with clear success criteria. The stat exists or it does not. |
| H2 section structure | Self-contained block that answers one question | Template-driven. Same structural rules apply to every section. |
| Prose connector | 2–3 sentences linking sections | Minimal creative output. Clarity and accuracy, not voice or style. |
Nobody scrutinizes a comparison table for the texture of its prose. Nobody evaluates an answer capsule for emotional resonance. The quality bar for GEO content is clarity and accuracy, not craft and voice. A stat is either correctly attributed or it is not. A table either contains current pricing or it does not. These are binary quality checks, not subjective editorial judgments.
This is why the cost per article can be low without the quality being low. The writing constraint is a business decision, not a creative limitation. An article built from answer capsules, tables, and attributed stats is optimized for two things simultaneously: LLM extraction and LLM generation. The same structural rules that make content easy for AI engines to cite also make it cheap for AI agents to produce. The constraint is the advantage.
What This Looks Like in Practice
Here are two prompts that produce the same output: an H2 section about why B2B sales teams lose winnable deals. One is engineered for taste, tone, knowledge, structure, bans, and mandatories. The other trusts the model to reason within clean constraints.
The expensive prompt (~850 tokens input). An illustrative, abridged sketch; the real thing runs hundreds of tokens longer in the same vein:
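```
You are a senior B2B content strategist writing for VP-level sales leaders.
Maintain an authoritative yet conversational tone. Be analytical but
emotionally resonant. Use active voice exclusively, but vary sentence rhythm
to avoid monotony. Never open with a question. Never use the words
"leverage," "utilize," or "delve." Every paragraph must run 2-4 sentences.
Weave at least one statistic with a named source and year into the narrative
flow without interrupting it. Match the brand voice guide: confident,
direct, never salesy. [...roughly 700 more tokens of persona notes, banned
phrases, structural mandates, and formatting rules...]

Now write an H2 section explaining why B2B sales teams lose winnable deals.
```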
The cheap prompt (~120 tokens input), sketched in the same illustrative spirit:
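```
Write an H2 section answering: why do B2B sales teams lose winnable deals?

Constraints:
- Front-load the answer in the first sentence.
- Include one statistic with a named source and year.
- Keep the section under 150 words.
- Plain, direct prose. No preamble, no restating the question.
```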
The outputs are nearly identical in quality. Both produce a clear, structured section with an attributed stat. But the expensive prompt spends 850 tokens on instructions before a single word of content is generated. The cheap prompt spends 120 tokens. At scale, across 160 articles per month with multiple sections each, that difference compounds:
| Metric | Expensive Prompt | Cheap Prompt |
|---|---|---|
| Input tokens per section | ~850 | ~120 |
| Per article (8 H2 sections) | 6,800 input tokens | 960 input tokens |
| 160 articles per month | 1,088,000 input tokens | 153,600 input tokens |
| Token reduction | Baseline | 86% fewer input tokens |
| Output quality for LLM extraction | Good, but over-styled prose fights extractability | Clean, structured, optimized for retrieval chunks |
The expensive prompt also creates a subtler problem: it forces the model to juggle dozens of competing instructions simultaneously. Stay authoritative but conversational. Be analytical but emotionally resonant. Use active voice exclusively but vary sentence rhythm. Include data but maintain narrative flow. Every additional constraint increases the chance the model satisfies one rule by violating another. The output gets longer, hedgier, and more generic as the model tries to navigate conflicting requirements.
The cheap prompt lets the model reason. Front-load the answer. Include a stat. Keep it short. These are not competing instructions. They point in the same direction. The model spends its compute on content, not on navigating a maze of style rules. The result is tighter, more direct, and more extractable, which is exactly what AI engines cite.
This is the architectural reason cheap unit economics produce better GEO content, not worse. The prompts that cost less to run also produce output that is more likely to earn citations. Over-engineered prompts optimize for human editorial taste. Constrained prompts optimize for machine extraction. In GEO, the machine is the audience.
The economics are not a compromise. They are the point. The same constraints that make content cheap to produce make it easy for AI engines to cite.
How to Choose a GEO Testing Cadence
The right cadence depends on your team’s content throughput and how much of your budget you want exposed to a single bet. These rules map the team’s situation to the cadence shape that usually wins, not to a specific platform.
If you publish fewer than 5 articles per month, prioritize iteration speed over coverage breadth. One focused test a week beats five shallow bets.
If your content team is 1 to 3 people, push execution ahead of monitoring depth. Learning speed is the binding constraint, not dashboard resolution.
If your core buyer prompts are fewer than 15, monitor those daily and ignore everything else. Res AI’s 113-keyword ChatGPT validation found 40% of queries in mature B2B categories have low or no buying intent (Res AI, 2026).
If you have 50+ published pages, treat every existing page as a testbed. Restructure and republish before writing net-new content.
If you are the #1 cited brand for a query, protect the position. Res AI’s 1,000-query Perplexity study found #1 is stable 75% of the time while positions 2 to 5 shuffle on every run (Res AI, 2026).
If you have no stable #1 in your category, conquest it. The same study found 25% of queries have no stable #1 (Res AI, 2026).
Pick the rule that matches the tightest constraint on the list, and let it set the rhythm.
Frequently Asked Questions
How many tests per month counts as a real GEO testing loop?
Most teams underestimate this by an order of magnitude. A real iteration loop publishes at least 20 to 40 structural variants per month, tracking which structures earn citations and which get skipped. A team shipping 4 articles a month has 4 data points, which is not enough signal to separate a working structure from noise in a non-deterministic system.
Why does Vercel’s 6-month iteration curve matter for smaller brands?
Vercel grew ChatGPT referrals from less than 1% to 10% of signups in six months by iterating on content structure, not by outspending competitors. The curve matters because it shows citation growth scales with test volume, not with domain authority. A smaller brand running the same number of structural tests can ride the same curve, starting from a lower base.
What exactly is a GEO “test”?
A GEO test is one structural variant of an answer capsule, comparison table, or H2 section shipped to production and then measured against citations across AI engines. It is not an A/B test in the paid-search sense. There are no split audiences. The test is whether the AI engines pick the new structure up on the next retrieval cycle.
How do I know a test worked without tracking pixels?
Citation monitoring across ChatGPT and Perplexity replaces the pixel. Run the query 10 times, count how often your page is cited before and after the change, and compare the rate. Res AI’s 1,000-query Perplexity study found only 38% of brands appear consistently across 10 runs (Res AI, 2026), so run counts matter more than single-query checks.
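A sketch of that check, with placeholder run logs standing in for real answers; the domain and sample data are hypothetical:

```python
# Hypothetical run logs: the answer text from 10 runs of the same query,
# collected before and after restructuring the page.
PAGE = "yourdomain.com"  # placeholder for the domain you are tracking

runs_before = ["cites competitor.com"] * 8 + [f"cites {PAGE}"] * 2
runs_after = ["cites competitor.com"] * 3 + [f"cites {PAGE}"] * 7

def citation_rate(answers: list[str], domain: str) -> float:
    """Share of runs whose answer cites the domain."""
    return sum(domain in a for a in answers) / len(answers)

print(f"before: cited in {citation_rate(runs_before, PAGE):.0%} of runs")
print(f"after:  cited in {citation_rate(runs_after, PAGE):.0%} of runs")
```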
Does the testing approach work on Perplexity and ChatGPT equally?
No. Perplexity rewards listicle and comparison structure; ChatGPT spreads citations across opinion essays and listicles. The 852-article B2B citation structure study found Perplexity returns 26% more structured pages than ChatGPT on average (Res AI, 2026). A test that wins on one engine can be invisible on the other, which is why the iteration loop runs on both.
Why not just monitor and plan instead of publishing fast?
Monitoring tells you the score in a game you are not currently playing. Without publishes, monitoring data describes a static snapshot of competitor content, not a testable signal about what your own structural decisions do to citations. The execution-first loop generates new signals every week.
How many prompts should the testing loop actually cover?
5 to 10 revenue-adjacent prompts monitored daily, with a fan-out of 20 to 30 conquest opportunities checked monthly. Res AI’s 113-keyword ChatGPT validation found 100% product recommendation rates on high- and medium-intent queries, dropping to 39% on non-capturable queries (Res AI, 2026). The rest is vanity monitoring.
What does “kill a loser” mean when the test is a blog article?
It means pulling the article from the citation rotation (unpublish, merge, or restructure) once it has failed to earn citations after 3 to 4 retrieval cycles. The concept mirrors killing a creative in paid ads. The cost of leaving a losing article live is the slot it occupies in your own internal link graph and the opportunity cost of not replacing it with a structural variant.
What is the minimum tech stack to run this loop?
A CMS with API access (WordPress, Webflow, Framer, Contentful), a citation monitor that queries ChatGPT and Perplexity on a fixed schedule, and a content production pipeline that can turn a structural hypothesis into a published article within 48 hours. The daily loop is the unit. Anything slower than 48 hours is planning, not testing.
Res AI is the autonomous GEO engine for your CMS. It connects to WordPress, Webflow, Framer, or Contentful with a simple login, monitors your core prompts daily, and deploys content where you are not being cited. The architecture is built for iteration: lightweight monitoring on the prompts that drive revenue, maximum compute allocated to content production. Every publish is a test. Every test is a chance to learn what the model favors.