The academic GEO evidence base is now stronger than that of any other emerging marketing channel. Ten papers. Fifteen industry studies. 680 million citations analyzed. The foundational KDD 2024 paper reports 30-40% relative visibility improvement from statistics addition. SE Ranking reports 93% more citations for pages with 19+ data points. Ahrefs reports a 0.664 correlation between branded mentions and AI visibility across 75,000 brands. All valuable; none of them answer the question a marketing director actually needs answered: "How many percentage points will my SOV move if I implement this?" Between the lab and the boardroom lies a gap that no correlation study can bridge.
This post examines what happens when GEO tactics leave the research paper and enter production. We draw on the first controlled practitioner experiments published in early 2026, synthesize the best available estimates of real-world SOV impact, and map the implementation paths that platform-level data supports. It is the companion to our evidence-ranked GEO tactics list, which covers the 12 strongest tactics by measured effect size.
TL;DR
- The first controlled GEO experiments (Otterly.AI, March 2026) found listicle inclusion and footer text repetition highly effective, while llms.txt and author pages showed no measurable impact.
- Realistic SOV estimates: moderate on-site changes (statistics, answer capsules) produce 1-3pp; combined on-site and off-site overhauls produce 5-15pp. All estimates carry low-to-medium confidence because no controlled before-and-after studies with absolute pp effect sizes exist.
- The footer text discovery exploits how LLMs treat repeated content as confirmed fact.
- Two distinct time horizons operate: RAG retrieval reflects changes in days to weeks; training data updates take months to years.
- Only 11% of domains are cited by both ChatGPT and Perplexity; engine-specific optimization outperforms generic strategies (CMU AutoGEO, 35.99% improvement).
- Sill's experimentation platform addresses the measurement gap with hierarchical Bayesian estimation, 10x prompt sampling, and affected/comparison query controls to produce per-platform SOV effect sizes.

No published GEO study reports absolute percentage-point SOV changes from controlled before-and-after experiments as of March 2026.
The foundational GEO paper (Aggarwal et al., KDD 2024) reported relative improvements: 30-40% visibility gains from statistics addition and citation. SE Ranking's study of 129,000 domains reported cross-sectional correlations: pages with 19+ data points averaged 5.4 citations versus 2.8 for pages without. The Ahrefs study of 75,000 brands reported correlation coefficients: branded mentions at r = 0.664, YouTube at r = 0.737. All valuable. None of them answer the question a marketing director actually needs answered: "If I implement tactic X on my site, how many percentage points will my SOV move?"
Amos Weiskopf articulated the fundamental limitation: "There is no Search Console for LLMs. There is no index you can query. There is no crawl report." The systems are stochastic. SparkToro found less than a 1-in-100 chance that ChatGPT gives the same brand recommendations twice for the same prompt. Single observations are unreliable; rigorous measurement requires statistical sampling across many prompts run multiple times.
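To make the sampling point concrete, here is a minimal back-of-the-envelope sketch (plain Python; the 20% mention rate and the prompt counts are illustrative assumptions, not measurements) showing how the margin of error on an observed brand-mention rate shrinks as you add prompts and repeat runs:

```python
import math

def mention_rate_margin(mention_rate: float, prompts: int, runs_per_prompt: int) -> float:
    """Approximate 95% margin of error for an observed brand-mention rate,
    treating each prompt run as an independent Bernoulli observation."""
    n = prompts * runs_per_prompt
    se = math.sqrt(mention_rate * (1 - mention_rate) / n)
    return 1.96 * se  # normal approximation to the binomial

# Illustrative numbers only: a brand mentioned in roughly 20% of runs.
for prompts, runs in [(1, 1), (25, 1), (25, 10), (100, 10)]:
    moe = mention_rate_margin(0.20, prompts, runs)
    print(f"{prompts:>3} prompts x {runs:>2} runs -> ±{moe * 100:.1f}pp at 95% confidence")
```

A single observation tells you almost nothing (±78pp); 25 prompts run 10 times each narrows the band to roughly ±5pp. In practice, repeated runs of the same prompt are correlated, so real uncertainty is somewhat larger than this independent-runs approximation suggests, which is one reason the hierarchical models discussed later matter.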
This does not mean the research is wrong. It means the field has strong directional evidence with weak precision. Knowing that statistics addition correlates with 93% more citations is useful. Knowing whether that translates to 1pp or 5pp of SOV movement for your specific brand in your specific category is the question nobody has yet answered at scale.
Otterly.AI's March 2026 experiments found listicle inclusion and footer text repetition highly effective; llms.txt and author pages showed no measurable impact.
Otterly.AI ran the first published series of controlled GEO experiments in March 2026, testing specific tactics in isolation with before-and-after measurement. The results are qualitative rather than quantitative (no percentage-point numbers published), but the relative rankings are the closest thing the field has to experimental evidence.
| Tactic | Type | Difficulty | Experimental Result |
|---|---|---|---|
| Adding brand to listicles/rankings | Off-page | Medium | Highly effective |
| Footer text with unique factual info | On-page | Very easy | Highly effective |
| YouTube long-form video | Social/UGC | Hard | Confirmed effective |
| Product/service directory listing | Off-page | Easy | Moderate impact |
| AI-written content (structured) | On-page | Medium | Outperformed human-written |
| Reddit thread replies | Social/UGC | Medium | Effectiveness varies by industry |
| Author pages and bios | On-page | Very easy | Limited impact |
| llms.txt implementation | Technical | Very easy | No measurable impact |
Two findings stand out. Listicle inclusion (off-page) and footer text repetition (on-page) were the only tactics rated "highly effective." Both confirm patterns visible in the broader research: off-page mentions drive discovery, while structured on-site content drives extraction. The AI-written content result is notable as well; when optimized for structure and relevance, AI-generated content outperformed human-written content. Raw AI output did not. Structure is the variable, not authorship.
The llms.txt result deserves emphasis. SE Ranking had already found no correlation across 300,000 domains. Otterly's controlled experiment confirmed it: zero measurable impact on AI traffic. LLMs crawl these files, but crawling is not citing. Philipp Götza (Search Engine Land, January 2026) called this evidence a "ladder of misinference." Brands implementing llms.txt as a GEO strategy are investing in a tactic with two independent negative results and no positive ones.
LLMs treat repeated footer text as confirmed site-wide fact, making footer statements a low-effort, high-yield GEO tactic confirmed by Otterly's experiments.
Otterly's most surprising finding is also the one with the clearest mechanism. Adding unique factual information to a site footer was rated "highly effective" with "very easy" implementation difficulty. The reason is architectural: footer content appears on every page of a site. When an LLM crawls or indexes multiple pages from the same domain, the footer text appears repeatedly across the corpus. LLMs interpret this repetition as confirmed, site-wide factual information.
A footer stating "Founded in 2019. Serving 4,200 customers across 38 countries" gets indexed not once but hundreds or thousands of times, once per crawled page. The model treats this as high-confidence factual data about the entity. The tactic exploits a genuine property of how transformer models aggregate information across training and retrieval contexts.
This is not well-covered in academic GEO literature; no published paper tests footer content specifically. It emerged from practitioner experimentation. The implication is straightforward: every brand should audit what factual claims their footer makes. A footer that says only "Copyright 2026" is a missed opportunity. A footer with specific, verifiable facts about the company gives AI engines extractable brand signals on every indexed page.
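A minimal audit along these lines can be scripted. The sketch below (Python standard library only; the URLs and the fact pattern are hypothetical placeholders, not a real site) fetches a handful of pages and checks whether a specific factual footer claim actually appears on each of them:

```python
import re
import urllib.request

# Hypothetical pages and footer claim; substitute your own site and wording.
PAGES = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/blog/some-post",
]
FOOTER_FACT = re.compile(r"Serving [\d,]+ customers across \d+ countries")

def page_has_fact(url: str) -> bool:
    """Download a page and check whether the footer fact appears in its HTML."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return bool(FOOTER_FACT.search(html))

hits = sum(page_has_fact(url) for url in PAGES)
print(f"Footer fact found on {hits}/{len(PAGES)} sampled pages")
```

If the count comes back lower than the number of pages sampled, the "repeated on every page" mechanism is not actually working, usually because templates differ across sections of the site.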
Off-site brand mentions (r = 0.664) are the strongest AI visibility predictor, but 87% of the GEO recommendations Sill generates for brands are on-site fixes.
The most important factors for AI visibility are overwhelmingly off-site. Brand web mentions show a correlation of 0.664 with AI citation (Ahrefs, 75,000 brands); that is three times stronger than backlinks at 0.218. AirOps found that 85% of brand mentions in AI answers come from third-party pages, not owned domains. Chen et al. (University of Toronto, 2025) documented that 69-82% of AI search citations are earned media, dwarfing brand-owned content. Pavel Israelsky (Search Engine Land, January 2026) stated it directly: "On-site optimization is the factor with least impact on the most important GEO KPI" for whether your brand appears at all.
And yet on-site changes are where teams should start. Across 748 GEO recommendations generated for 62 brands through Sill's recommendation engine, 87% were on-site structural fixes. The reason is not that on-site tactics have a higher impact ceiling; they do not. The reason is controllability and speed. A content team can add answer capsules, restructure headings, and implement schema markup in a single sprint. Building branded mentions across YouTube, Reddit, Wikipedia, and press outlets takes 6-12 months of sustained effort.
Think of it as a candidate pool problem. Off-site factors determine whether your brand enters the pool of entities an AI engine considers for a given query. On-site factors determine whether your pages are selected from that pool when retrieved. Both layers matter, but they operate on different timescales. On-site optimization in weeks; off-site authority-building over quarters. A brand that invests only in on-site work will have well-structured pages that no AI engine retrieves. A brand that invests only in off-site mentions will get retrieved but lose the citation to a competitor with better content structure.
Moderate on-site changes are estimated to produce 1-3pp of SOV improvement; combined on-site and off-site overhauls, 5-15pp. All estimates carry low-to-medium confidence.
Nobody publishes reliable before-and-after SOV measurements. But by combining the academic research (relative improvements), observational cross-sectional data (correlations), and the few practitioner experiments available, we can synthesize directional estimates. These should be treated as planning ranges, not predictions. Every brand operates in a different competitive context; a tactic that moves SOV by 5pp in a niche with thin competition might produce less than 1pp in a saturated category.
| Change Type | Est. SOV Impact | Confidence | Basis |
|---|---|---|---|
| Minor on-site edit (metadata, title, readability) | 0-1pp | Low | SE Ranking FAQ data (+11% relative); Otterly: llms.txt = no impact |
| Moderate on-site (statistics, capsules, restructure) | 1-3pp | Low | KDD 2024 30-40% relative; baseline-dependent |
| Major on-site overhaul (multi-page, full GEO) | 2-5pp | Low | Combination effects (+5.5% over single tactic) |
| Footer text optimization | 1-3pp | Low | Otterly: "highly effective"; no pp numbers published |
| Listicle/ranking inclusion (off-site) | 3-8pp | Medium | Otterly: "highly effective"; 43.8% of ChatGPT page types are listicles |
| Press in notable outlets (5+ mentions in 6 months) | 3-10pp | Medium | 4.2x sustained citation rate; 61% of reputation responses from earned media |
| YouTube long-form presence | 2-7pp | Medium | r = 0.737 (strongest factor); 31.8% of social citations |
| Combined on-site + multiple off-site | 5-15pp | Medium | Radiant Elephant: 59pp (extreme case, empty niche) |
The Radiant Elephant case study deserves specific context. Within 60 days of publishing their first data study, they appeared in 67% of AI responses on key topics versus 8% before, a 59pp increase. That number is real but unreproducible at scale. They filled a content vacuum in a low-competition niche. Established brands in saturated categories will see far smaller absolute movements. The 5-15pp range for combined efforts is a more realistic planning target for competitive markets.
We are transparent about a hard truth: GEO measurement is fundamentally more difficult than traditional SEO measurement. All estimates in this table carry "Low" or "Medium" confidence because no controlled before-and-after studies with absolute pp effect sizes have been published. Sill's data across 139 brands and 86 industries shows the scale of variance: 23% of brands score zero SOV across all platforms, while the median sits at 15 out of 100. A 3pp gain means something very different at a starting SOV of 5 versus 45.
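The baseline dependence is simple arithmetic, and it is worth making explicit. The sketch below is illustrative only: the 35% relative lift is the midpoint of the KDD 2024 range, the baselines are assumed, and treating a relative visibility lift as if it applied directly to SOV is itself an assumption.

```python
def absolute_gain_pp(baseline_sov: float, relative_lift: float) -> float:
    """Convert a relative visibility lift into an absolute SOV change in pp."""
    return baseline_sov * relative_lift

for baseline in (5.0, 15.0, 45.0):  # SOV expressed in percentage points
    gain = absolute_gain_pp(baseline, 0.35)  # ~35% relative lift (KDD 2024 midpoint)
    print(f"baseline {baseline:>4.0f} SOV -> +{gain:.1f}pp (new SOV ~{baseline + gain:.1f})")
```

The same relative lift is worth under 2pp at a baseline SOV of 5 and over 15pp at a baseline of 45, which is exactly why relative study findings cannot be quoted as planning numbers without knowing where a brand starts.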
RAG-based retrieval reflects content changes within days to weeks; training data updates operate on cycles spanning months to years.
GEO practitioners frequently conflate two distinct mechanisms through which content changes reach AI engines, and this confusion leads to misaligned expectations. The first mechanism is real-time retrieval (RAG). Perplexity, ChatGPT with search, and Google AI Overviews perform web searches for each query and retrieve fresh content. Content updated today can appear in Perplexity responses within days. Pages refreshed within 90 days are significantly more likely to appear in results across these platforms (SE Ranking: 67% more citations for recently updated content).
The second mechanism is training data incorporation. Base model knowledge comes from periodic training runs that happen on release cycles spanning months to years. SearchPilot noted that "the slow feedback loop and batch update nature of changes to models' learned information makes it largely impossible to run statistical tests" on this pathway. A Wikipedia edit or a new press mention may take months to influence a model's base knowledge layer.
The practical implication: on-site content optimizations (answer capsules, statistics, freshness) influence the RAG pathway quickly. Off-site reputation signals (brand mentions, press coverage, Wikipedia presence) influence both pathways but the training data pathway dominates their long-term effect. Teams should expect on-site changes to show impact in 2-6 weeks through RAG, while off-site authority-building requires 3-6 months for sustained visibility shifts.
Only 11% of domains are cited by both ChatGPT and Perplexity; engine-specific optimization outperforms generic strategies per CMU's AutoGEO research.
Wu et al. (CMU, 2025) demonstrated with AutoGEO that engine-specific optimization rules consistently outperform generic strategies, achieving 35.99% average improvement with tailored approaches. Sill's own monitoring data confirms the fragmentation: 55% of brands show a 10+ point SOV spread between their best and worst performing platforms, and 91.6% of cited URLs appear on only one platform. A brand visible on ChatGPT may be invisible on Perplexity, and the tactics that fix one do not automatically fix the other.
| Platform | Dominant Citation Sources | High-Priority Tactics | Implementation Notes |
|---|---|---|---|
| ChatGPT | Wikipedia (47.9%), Forbes, G2, TechRadar | Wikipedia page, review platforms, Bing optimization | Matches Bing top-10 results 87% of the time; 60.5% of cited pages from last 2 years |
| Perplexity | Reddit (46.7%), YouTube (13.9%), Gartner | Reddit participation, content freshness, YouTube | Real-time web search; 50% of citations from current-year content |
| Google AI Overviews | Reddit (21%), YouTube (18.8%), Quora (14.3%) | Organic rankings, YouTube, answer capsules | 93.67% from top-10 organic; overlap grew from 32.3% to 54.5% |
| Claude | Databases/directories (68%), awards (19%) | Directory presence, Wikidata, longevity signals | Strongest big-brand bias; skews toward businesses 50+ years old |
| Gemini | Authoritative lists (49%), Google authority (23%) | List inclusions, Google Business Profile, local reviews | Local reviews dominate at 38% for local searches |
YouTube deserves special emphasis. It is the single most impactful channel for AI visibility, holding the #1 cited domain position in Google AI Overviews at 29.5% (BrightEdge, September 2025) and showing the strongest correlation with AI visibility of any factor at r = 0.737 (Ahrefs). As of January 2026, YouTube content appears in 16% of all LLM answers versus Reddit's 10% (Profound, 680M citations). Long-form videos account for approximately 94% of YouTube citations; Shorts receive minimal citation.
Review platforms present an important tactical nuance. Yelp and Trustpilot block AI crawlers entirely, while GetApp, Clutch, and SourceForge allow full access. This explains why GetApp captures 47.6% of B2B software citations in ChatGPT despite being less well-known than Yelp or Trustpilot. For B2B brands, GetApp and Capterra listings are more valuable for AI visibility than Trustpilot reviews, regardless of review volume. The crawler access policy of the platform matters more than the platform's reputation with human buyers.
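One quick way to verify access policies yourself is to read a platform's robots.txt. A minimal sketch (Python standard library; the crawler names listed are commonly published AI user agents and the domains are the ones discussed above, but check each vendor's current documentation before relying on the result):

```python
import urllib.robotparser

# Commonly published AI crawler user agents; verify against each vendor's docs.
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]

def crawler_access(domain: str) -> dict[str, bool]:
    """Check a site's robots.txt to see which AI crawlers may fetch its homepage."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()
    return {bot: parser.can_fetch(bot, f"https://{domain}/") for bot in AI_CRAWLERS}

for domain in ["www.getapp.com", "www.trustpilot.com"]:  # domains from the comparison above
    print(domain, crawler_access(domain))
```

Some platforms block crawlers at the network or CDN level rather than in robots.txt, so treat this check as a first pass, not a definitive answer.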
SparkToro found less than a 1-in-100 chance ChatGPT gives the same brand recommendations twice; GPU batching causes cascading output differences.
Traditional SEO has SearchPilot. GEO has nothing comparable. The structural challenge is that LLM outputs are non-deterministic. Thinking Machines Lab traced this to GPU batching: batch size changes floating-point calculation order, causing cascading output differences even for identical prompts. SparkToro's January 2026 research confirmed the practical consequence: less than a 1-in-100 chance that ChatGPT produces the same brand recommendation list twice.
Five specific factors make controlled GEO experimentation difficult. First, stochastic outputs require large sample sizes to achieve statistical power. Second, content changes often coincide with other brand activity (PR campaigns, product launches, competitor moves) that confounds attribution. Third, platform fragmentation means effects may appear on one engine and not others. Fourth, no standard methodology exists. Fifth, most practitioner experiments test one brand on one set of prompts, producing sample sizes too small for reliable inference.
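On the first point, the required sample sizes can be estimated with a standard two-proportion power calculation. A minimal sketch (plain Python with hard-coded z-values for 95% confidence and 80% power; the baseline SOV and target effects are assumptions):

```python
import math

def runs_needed(p_before: float, p_after: float) -> int:
    """Prompt runs needed per period to detect a change from p_before to p_after
    with ~95% confidence and ~80% power (two-proportion z-test approximation)."""
    z_alpha, z_beta = 1.96, 0.84
    p_bar = (p_before + p_after) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_before * (1 - p_before) + p_after * (1 - p_after))
    ) ** 2
    return math.ceil(numerator / (p_after - p_before) ** 2)

print(runs_needed(0.15, 0.18))  # 3pp lift from a 15% baseline -> roughly 2,400 runs
print(runs_needed(0.15, 0.25))  # 10pp lift from the same baseline -> roughly 250 runs
```

Detecting the small effects most on-site changes produce takes thousands of runs per period, which is far beyond what spot-checking a few prompts by hand can provide.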
SearchPilot has developed a preliminary framework using control and variant page groups, but even they acknowledge the fundamental limitation: LLM outputs are probabilistic, and the measurement infrastructure the field needs does not yet exist in the open market. Citation patterns drift 40-60% month over month (Profound data), meaning baseline instability compounds the measurement challenge.
Sill's experimentation platform uses hierarchical Bayesian estimation with affected and comparison query controls to produce per-platform SOV effect sizes.
The measurement vacuum is the defining problem of GEO in 2026. Every observation-only dashboard can tell you your current SOV. None can tell you which change caused it to move. Sill's experimentation platform is designed specifically to close this gap. The approach uses three mechanisms. First, affected and comparison query groups: when a content change targets specific topics, we monitor both the affected queries and unrelated queries from the same brand as controls. Second, 10x sampling per prompt per platform: where most tools run each prompt once, we run it ten times to account for stochastic variance. Third, hierarchical Bayesian estimation: a statistical model that produces posterior distributions of treatment effects, stated with credible intervals rather than point estimates.
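A heavily simplified sketch of the statistical idea (numpy only; a non-hierarchical Beta-Binomial stand-in for the production model, with made-up counts) shows how posterior draws for affected versus comparison queries combine into a difference-in-differences effect estimate with a credible interval:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = 100_000

def posterior_rate(mentions: int, runs: int) -> np.ndarray:
    """Posterior samples of a mention rate under a Beta(1, 1) prior."""
    return rng.beta(1 + mentions, 1 + runs - mentions, size=draws)

# Hypothetical counts: mentions / total runs, before and after the content change.
affected_pre    = posterior_rate(30, 250)   # queries the change targets
affected_post   = posterior_rate(48, 250)
comparison_pre  = posterior_rate(55, 250)   # unrelated queries from the same brand
comparison_post = posterior_rate(58, 250)

# Difference-in-differences: change in affected queries minus change in controls.
effect_pp = 100 * ((affected_post - affected_pre) - (comparison_post - comparison_pre))
low, mid, high = np.percentile(effect_pp, [5, 50, 95])
print(f"estimated effect: {mid:.1f}pp (90% credible interval {low:.1f} to {high:.1f})")
```

The production model replaces these independent Beta-Binomial estimates with a hierarchical structure that pools information across prompts and platforms, but the output has the same shape: a per-platform effect size stated as a credible interval rather than a point estimate.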
The goal is to produce the first rigorous per-platform pp effect estimates in production GEO. Combined with CMS content change detection, which timestamps on-site changes against the SOV timeline, the platform creates a closed loop: detect a change, measure the effect, feed the result back into the recommendation engine, and generate the next highest-impact suggestion based on what actually worked for that specific brand.
This matters because GEO is not a set of universal truths. A tactic that moves SOV for a B2B SaaS company may do nothing for a consumer electronics brand. Engine-specific, category-specific, and brand-specific effects are the norm. The only way to build reliable GEO knowledge is to run controlled experiments at scale and accumulate the evidence base the field currently lacks. Every brand that runs an experiment contributes to a growing body of production GEO evidence that benefits the entire ecosystem.
Effective GEO implementation follows five principles: start on-site, build off-site, optimize per-platform, measure rigorously, and refresh continuously.
The evidence, from both academic research and the first wave of practitioner experiments, converges on a clear implementation framework. First, start with on-site structural changes. Answer capsules, statistics density, schema markup, comparison tables, and footer text optimization can all be deployed within weeks, and the adoption gaps remain enormous (0% for answer capsules and schema, 1% for comparison tables per our content audit data). Second, invest in off-site authority systematically. Listicle inclusion and press coverage are the highest-impact off-site tactics confirmed by controlled experiments; they compound over 6-12 months.
Third, optimize per-platform, not generically. A brand with strong ChatGPT visibility but weak Perplexity presence needs Reddit participation and content freshness, not more Wikipedia optimization. Fourth, measure with appropriate rigor. Single-query spot checks are unreliable; statistical sampling across prompt panels with control groups is the minimum standard for actionable conclusions. Fifth, refresh continuously. Content updated within 90 days earns 67% more AI citations. The tactics that are safe for SEO and beneficial for GEO simultaneously are the ones that earn compounding returns over time.
The GEO landscape shifts monthly. Citation patterns drift 40-60% month over month. YouTube overtook Reddit as the top social citation source in January 2026. But the structural principles remain consistent: earn mentions, structure for extraction, publish original data, stay fresh, and measure what you do. The brands that build this discipline now will compound their advantage as the measurement infrastructure matures and the competitive landscape intensifies.
Sill monitors your AI visibility across six platforms, scores your content against the evidence-ranked tactics, and runs controlled experiments to measure what actually moves your Share of Voice.
Request your first analysis today to see where you stand.