
You Can't A/B Test Your Way to AI Visibility

Every optimization discipline in marketing runs on experimentation. You change something, measure the result against a control group, and decide whether the change worked. SEO, paid media, email, landing pages, pricing: the feedback loop is the same everywhere. Generative engine optimization (GEO) breaks it. You cannot split an LLM's responses by user segment. You cannot create a holdout group. And the brands that need answers most are the ones with the least data to work with.

TL;DR

A/B testing requires control groups, randomization, and impression data. AI search engines provide none of these. The brands with the lowest AI visibility face a cold-start problem: 47 known GEO tactics, limited resources, and no way to isolate which intervention moved the needle. Before/after comparisons are contaminated by model updates, competitor changes, and platform-specific propagation timelines. Honest GEO measurement requires statistical rigor the industry has not yet built.


Why A/B Testing Does Not Apply to AI Search

A/B testing works because you control the delivery environment. You serve version A to half your visitors and version B to the other half. You measure the difference. The randomization eliminates confounders. The sample size gives you statistical power.
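
Here is what that loop looks like in code. A minimal sketch using statsmodels' two-proportion z-test; the traffic and conversion numbers are invented for illustration:

```python
# The classic A/B feedback loop: randomize, count, test.
# Illustrative numbers only; the test itself comes from statsmodels.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 152]   # variant A, variant B
visitors = [5000, 5000]    # randomized 50/50 split

stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# p < 0.05 -> ship B; otherwise keep collecting data. Every input here
# (a randomized split, per-variant counts) is unavailable in AI search.
```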

AI search engines do not work this way. When a buyer asks ChatGPT "What is the best project management tool for remote teams?", there is one model generating one response. You cannot serve a different version of reality to a control group. There is no split. There is no randomization. There is no impression data telling you how many times the model considered your brand and chose not to mention it.

Traditional search at least gives you Search Console data. You can see impressions, clicks, and position for every query. AI platforms give you nothing. No impression logs. No click-through rates. No audience segmentation. Perplexity shows source links, but you cannot see how many users received a response that mentioned you versus one that did not.

The foundational GEO paper (Aggarwal et al., KDD 2024) tested nine optimization methods across 10,000 queries and found that adding citations and statistics increased visibility by 30-40%. That research was possible because the authors controlled the generative engine. Practitioners do not have that access. You cannot inject a modified version of your content into ChatGPT's retrieval pipeline and compare it against the unmodified version. The experiment that proved GEO works is the one experiment you cannot replicate in production.

The Brands That Need Answers Most Have the Least Data

If your brand is already being cited consistently by AI engines, you have a baseline. You can make a change, monitor your Share of Voice (SOV), and observe whether it moves. The measurement is noisy, but at least there is a signal to measure.

Most brands are not in that position. Across 77 brands we analyzed, the median AI SOV score was 3.8 out of 100. A large share of brands have zero or near-zero citation rates on most queries. Their baseline is flat. Zero today, zero tomorrow. Every optimization playbook tells you to "test and iterate." There is nothing to iterate on when the signal is absent.
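
The zero baseline is not just frustrating; it is statistically uninformative. A minimal sketch with statsmodels' Wilson interval, using invented monitoring counts, shows how little a month of zeros actually pins down:

```python
# What a flat zero baseline actually tells you.
# Hypothetical scenario: 30 daily checks of one query, zero citations seen.
from statsmodels.stats.proportion import proportion_confint

citations, checks = 0, 30
low, high = proportion_confint(citations, checks, alpha=0.05, method="wilson")
print(f"95% interval for citation probability: [{low:.3f}, {high:.3f}]")
# Roughly [0, 0.11]: a month of zeros cannot distinguish a true 0% citation
# rate from a 10% one, so the effect of any single tactic is invisible.
```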

These brands face a cold-start problem. They know the research says that adding statistics to content improves visibility by 30-40% (Aggarwal et al., KDD 2024). They know that off-site brand mentions correlate with AI citations (Ahrefs, 75K brand study). They know that content freshness increases citations by 67% within 90 days. But they have 47 known GEO tactics to choose from, limited resources, and no way to determine which tactic will move the needle for their specific brand, in their specific category, against their specific competitors.

So they pick one. They invest weeks of effort. They wait. And then they cannot tell whether the tactic failed, whether they did not wait long enough, or whether a confounding variable obscured the result.

What Marketers Can Control in Traditional Search vs. AI Search

The gap between traditional and AI experimentation is not a matter of degree. It is structural. The tools, data, and methods that make SEO experimentation possible do not exist in GEO.

| Capability | Traditional SEO | AI Visibility (GEO) |
| --- | --- | --- |
| Randomized control groups | Yes (page-level split testing, geo-splits) | No. One model, one response per query. |
| Impression data | Google Search Console provides impressions per query | No impression data from any AI platform |
| Click attribution | Direct click tracking via GA4 | Most AI-influenced visits appear as direct or branded organic |
| Feedback latency | Hours to days (indexing + ranking update) | Days to months, varies by platform |
| Output determinism | Same query returns same SERP (mostly stable) | Only 30% of brands maintain visibility across consecutive identical queries |
| Variable isolation | Change one page element, hold others constant | Model updates, competitor changes, and retrieval shifts happen simultaneously |

SEO practitioners have spent two decades building experimentation infrastructure on top of Search Console data, crawl analytics, and page-level split testing tools like SearchPilot. GEO practitioners are starting from zero. The measurement primitives that make experimentation possible have not been built yet.
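
The output-determinism row above also changes what a "measurement" even is. Because one query returns different responses run to run, the basic primitive is a citation rate estimated by repeated sampling, not a rank. A minimal sketch, assuming a hypothetical `query_engine()` helper wired to whatever platform you monitor:

```python
# Estimate citation probability for one query via repeated sampling.
def query_engine(prompt: str) -> str:
    # Hypothetical helper: returns one response from the AI platform
    # you monitor. Wire this to your own monitoring stack.
    raise NotImplementedError

def citation_rate(prompt: str, brand: str, runs: int = 20) -> float:
    """Fraction of runs whose response mentions the brand."""
    hits = sum(brand.lower() in query_engine(prompt).lower()
               for _ in range(runs))
    return hits / runs

# A single check is a coin flip, not a measurement; only the rate across
# repeated runs is stable enough to compare before and after a change.
```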

Every Before/After Comparison Has Confounders

Even if you can observe a change in your AI visibility after an intervention, you cannot attribute it to your action without controlling for everything else that changed at the same time. In practice, many things change at the same time.

| Confounder | What It Does | How Often It Happens |
| --- | --- | --- |
| Model updates | Retraining or architecture changes shift citation behavior globally | Monthly or more frequently. Google's Gemini 3 upgrade (Jan 2026) changed citation patterns overnight. |
| Competitor content changes | A competitor publishes a strong comparison page and captures citations you previously held | Continuous. 40-60% of cited sources rotate monthly. |
| Retrieval index updates | Bing re-indexes pages, changing what ChatGPT Search can retrieve | Daily. ChatGPT's cached index retains content 30+ days but refreshes aggressively for popular pages. |
| Third-party mentions | A Reddit thread or YouTube review mentioning your brand changes the retrieval pool | Unpredictable. 85% of AI brand mentions originate from third-party pages. |
| Seasonal and news cycles | Query intent shifts with industry events, product launches, or news coverage | Quarterly at minimum. Product launch season, industry conferences, regulatory changes. |
| Citation freshness decay | Content that earned citations last week loses citation priority as newer content appears | Continuous. Citation performance begins declining after 4-5 days without updates. |

Any one of these confounders can produce a change in your AI visibility that looks like the effect of your intervention. Your SOV goes up the week after you add statistics to your product pages. Was it the statistics? Or was it the model update that happened three days later? Or the Reddit thread a customer posted? Or the competitor who quietly took down their comparison page?

Without a control group, you cannot distinguish signal from noise. And in AI search, there is no natural control group.

Each AI Platform Runs on Its Own Clock

The experimentation problem is compounded by the fact that each AI platform discovers and processes content differently. A content change that shows up on Perplexity within hours might take weeks to appear on ChatGPT, or never surface on Gemini at all. This means a single intervention produces different outcomes on different timelines across different platforms, making before/after comparisons even harder to interpret.

| Platform | Content Discovery | Propagation Speed | Primary Citation Sources |
| --- | --- | --- | --- |
| Perplexity | Live web fetch at query time | Hours. Fetches current page content per query. | Reddit (46.7%), industry directories |
| ChatGPT Search | Bing-powered cached index + live fetch | Minutes with IndexNow, days to weeks otherwise. Cache retains content 30+ days. | Wikipedia (47.9%), third-party sites (48.7%) |
| Google AI Overviews | Google's search index | Hours to days for already-indexed pages. Dependent on Googlebot crawl frequency. | YouTube (18.2%), only 38% from top-10 organic results |
| Gemini | Google's infrastructure + direct retrieval | Days to weeks. | Brand-owned websites (52.15%) |
| Claude | Training data (no web search in base model) | Months. Requires retraining cycle. | Training corpus only |

This divergence is not a minor complication. It means that the same content change produces five different natural experiments with five different timelines, five different confounding structures, and five different measurement challenges. A tactic that "works" on Perplexity in a week (where you can see the effect quickly due to live fetching) might take months to register on ChatGPT's base model, or might never show up because the chat layer and the API layer are fundamentally different products.

Any honest experimentation framework must account for these platform-specific timelines. Measuring "did my visibility improve?" without specifying on which platform, over what timeframe, and against what baseline is not an experiment. It is a guess with a dashboard attached.
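
In other words, every measurement needs a platform, a window, and a baseline attached to it before it means anything. A minimal sketch of what such a spec could look like; the fields are assumptions, not an industry standard:

```python
# Minimal per-platform measurement spec. The field choices are
# illustrative assumptions, not a standard.
from dataclasses import dataclass
from datetime import date

@dataclass
class MeasurementSpec:
    platform: str          # e.g. "perplexity", "chatgpt-search", "claude"
    window_days: int       # matched to that platform's propagation speed
    baseline_start: date   # pre-intervention period to compare against
    baseline_end: date

perplexity = MeasurementSpec("perplexity", 7, date(2026, 1, 1), date(2026, 1, 28))
claude = MeasurementSpec("claude", 180, date(2026, 1, 1), date(2026, 1, 28))
# The same intervention gets a different spec per platform; judging a
# Claude effect on a Perplexity timeline guarantees a false negative.
```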

What the Industry Is Doing Instead (and Why It Is Not Enough)

The GEO industry has not ignored the experimentation problem. Several approaches have emerged. None of them fully solve it.

1. Before/After Case Studies

The most common approach. A brand makes GEO optimizations, waits weeks or months, and measures the change. Published case studies show dramatic results: one agency reported a 2,300% increase in monthly AI traffic after optimization. Another documented a 753% surge in LLM traffic over five months.

These numbers are real. The problem is that before/after comparisons without controls do not isolate causation. Did the optimization drive the improvement? Or did a model update coincide with it? Or was it a rising tide that lifted all boats in that category? The case studies cannot say, because no control group existed to answer the question.

2. Correlation Analysis

Some practitioners correlate content features with citation rates. Pages with FAQ schema achieve a median 22% increase in AI citations versus pages without (Relixir, 2025). Content updated within 30 days receives 3.2x more citations than content older than 90 days. These findings are valuable directionally. But correlational evidence cannot tell any individual brand whether implementing FAQ schema will improve their citations, because the correlation may be driven by a third variable (e.g., the type of brand that implements schema also tends to have better content).

3. Switchback Testing

The strongest approach available today. Optimize content for GEO in one product category while leaving a comparable category untouched. Compare the trajectories. If the optimized category improves while the control does not, the causal inference is stronger. We discussed this in our attribution framework.

The limitation is practical. Most brands do not have clean category splits. Their content changes affect multiple product lines simultaneously. A pricing page update touches every product. A new case study references multiple offerings. The "untouched control" gets contaminated. And even when the split is clean, the non-determinism of LLM outputs means you need weeks of daily sampling to distinguish a real effect from noise.
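
How much sampling is "weeks"? A rough power calculation makes the requirement concrete. A sketch using statsmodels, with an assumed (and generous) lift in citation rate from 10% to 20%:

```python
# Rough power analysis: samples per arm needed to detect a citation-rate
# lift from 10% to 20% at alpha = 0.05 with 80% power. The lift is an
# assumption for illustration; real effects are usually smaller.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.20, 0.10)   # Cohen's h for the assumed lift
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{n:.0f} samples per arm")
# On the order of 100 samples per arm even for this large a lift. At daily
# sampling across a modest query set, that is weeks of data per category,
# and a more realistic lift multiplies the requirement several times over.
```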

The Question GEO Cannot Answer Yet

Every GEO practitioner is trying to answer one question: "After I made this change, did my citation probability move beyond what background trends would have predicted?"

That question is harder than it looks. "Background trends" includes model updates, competitor behavior, retrieval index changes, third-party mentions, and seasonal shifts. "Citation probability" is noisy by nature. "This change" is hard to isolate when content changes tend to cluster. The question is well-defined. The methods to answer it rigorously are not yet standard in the industry.

What this question requires is not a dashboard. It requires a statistical model that can separate the intervention effect from everything else that moved at the same time. It requires a control series. It requires enough data points to produce credible intervals on the estimate, and honest uncertainty quantification when the evidence is inconclusive.
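
The shape of that model is well known from other fields: fit the pre-intervention relationship between your series and a control series, project the counterfactual forward, and read the effect as actual minus predicted. A minimal CausalImpact-style sketch with statsmodels and simulated data; a production version would add proper intervals and a carefully chosen control:

```python
# CausalImpact-style counterfactual sketch. The treated series is your
# daily SOV; the control is a series sharing the same confounders (e.g. a
# competitor basket's SOV). All data below is simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
days, t0 = 90, 60                                   # t0 = intervention day

control = 10 + np.cumsum(rng.normal(0, 0.3, days))  # shared background trend
treated = 0.8 * control + rng.normal(0, 0.5, days)
treated[t0:] += 1.5                                 # simulated true effect

X = sm.add_constant(control)
fit = sm.OLS(treated[:t0], X[:t0]).fit()            # pre-period relationship
counterfactual = fit.predict(X[t0:])                # what "no change" predicts

print(f"estimated lift: {(treated[t0:] - counterfactual).mean():.2f} SOV points")
# Without the control series, the background trend would be booked as the
# intervention effect; with it, the estimate lands near the simulated +1.5.
```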

The industry is not there yet. The current toolkit is: make a change, watch the dashboard, and form an opinion. Opinions are better than nothing. They are not better than evidence.

Solving this is the next frontier in GEO. The brands and platforms that figure out how to measure intervention effects with statistical rigor will have an enormous advantage over those still relying on intuition. It is a hard problem. It is also the most important problem in the space.

What You Can Do Today

The experimentation problem does not mean GEO is not worth pursuing. It means you should be deliberate about how you approach it. Here is what is actionable right now:

  1. Establish a baseline before you change anything. Start daily monitoring of your AI visibility across platforms. You need weeks of stable data before you can detect whether an intervention moved the needle. Without a baseline, every change is a shot in the dark.
  2. Make one change at a time. If you restructure your pricing page, add schema markup, and publish three blog posts in the same week, you will never know which one mattered. Space interventions apart. Give each one a measurement window.
  3. Log everything. Record every content change with a timestamp and a description of what changed. Record every external event you notice: model updates, competitor launches, third-party mentions. When your SOV moves, you will want this context to interpret the movement (a minimal logging sketch follows this list).
  4. Prioritize high-evidence tactics. Start with interventions backed by peer-reviewed research: add specific statistics with citations (30-40% improvement, Aggarwal et al.), structure content in answer-capsule format, and update stale content within 90 days. These have the strongest evidence base and the highest probability of producing a detectable effect.
  5. Accept uncertainty honestly. If your SOV improves after an intervention, the right conclusion is "this is consistent with the change working" not "the change worked." If it does not improve, the right conclusion is "no detectable effect in this window" not "the tactic failed." Honest framing protects you from over-investing in false positives and abandoning tactics that need more time.
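
For item 3, the log does not need to be sophisticated; it needs to exist and be dated. A minimal sketch; the file layout and event types are assumptions, not a standard:

```python
# Minimal intervention/event log: enough dated context to interpret a
# future SOV move. Field names and event types are illustrative.
import csv
from datetime import date

LOG_PATH = "geo_event_log.csv"
FIELDS = ["date", "event_type", "description"]

def log_event(event_type: str, description: str) -> None:
    """Append one dated row. Suggested event types: 'content_change',
    'model_update', 'competitor_change', 'third_party_mention'."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.DictWriter(f, fieldnames=FIELDS).writerow(
            {"date": date.today().isoformat(),
             "event_type": event_type,
             "description": description})

log_event("content_change", "Added cited statistics to /pricing")
log_event("model_update", "Gemini upgrade; citation patterns shifted")
```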

You cannot experiment without a baseline

Start monitoring your AI visibility daily across ChatGPT, Perplexity, Gemini, and more. Build the measurement foundation that makes every future intervention interpretable.

References

  1. Aggarwal, P., et al. "GEO: Generative Engine Optimization." KDD 2024, Princeton/Georgia Tech/IIT Delhi. arxiv.org/abs/2311.09735
  2. Ahrefs. "LLM Brand Visibility Study: 75K Brands Analyzed." 2025. ahrefs.com
  3. Superlines. "AI Search Statistics 2026: 60+ Data Points on Visibility, Citations, and Traffic." 2026. superlines.io
  4. Yext. "AI Visibility in 2025: How Gemini, ChatGPT, and Perplexity Cite Brands." 2025. yext.com
  5. Relixir. "Structured Data and AI Citations: Analysis of 50 Domains." 2025. (Cited by name; no stable public URL verified.)
  6. ALM Corp. "Google AI Overview Citations From Top-Ranking Pages Drop Sharply in 2026." 2026. almcorp.com
  7. LLMRefs. "OpenAI Has a Cached Index for ChatGPT Search." 2025. llmrefs.com
  8. G2. "Buyer Behavior in 2025." company.g2.com
  9. Gartner. "Gartner Predicts Search Engine Volume Will Drop 25% by 2026, Due to AI Chatbots." Feb 2024. gartner.com

Get Your Report

Request your first analysis today to see where you stand.

Daniel Wang

Founder · UC Berkeley MIDS

Previously at Nordstrom, Bloomberg, Hexagon (now Octave)
