RankScience stopped offering A/B testing services in 2024. As of January 2026, ChatGPT, Gemini, and Perplexity all still recommend them for exactly those services. The company's own team documented the phenomenon: a brand identity they had deliberately abandoned persisted across every major AI platform for over a year, driven by training data, third-party mentions, and cached retrieval sources that no single team controls. This is the content-to-recommendation pipeline in action: the sequence of stages between what a brand publishes and what an AI model tells a buyer, a sequence most marketing teams have never mapped and cannot directly observe. Understanding how this pipeline works is the first step toward changing what comes out the other end.
TL;DR
- When a consumer asks an AI model about your brand, the response assembles itself from two layers: parametric memory encoded during training and retrieval-augmented generation (RAG) from live web sources.
- The University of Toronto found 69-82% of AI citations come from earned media; social media contributes zero.
- Ahrefs' 75,000-brand study shows entity-level signals drive recommendations: YouTube mentions correlate at 0.737 and branded web mentions at 0.664, while Domain Rating falls far behind.
- The pipeline is slow: AI-cited pages average 1,064 days old, and RankScience documented its discontinued services being recommended for over a year.
- GenAI chatbots are now the #1 source influencing B2B vendor shortlists at 17.1%, ahead of review sites and salespeople, yet Kodec AI found 62% of simulated buyer queries return incorrect information.
- The levers that shift AI perception differ from traditional SEO: the GEO paper found Statistics Addition boosts visibility by 32% while Keyword Stuffing decreases it by 8%.

Google pays Reddit $60M/year and OpenAI ~$70M/year for training data. Reddit is also the #1 cited domain on Perplexity (6.6% of citations) and Google AI Overviews.
When a consumer asks ChatGPT "what's the best CRM for small businesses," the response draws from two distinct knowledge layers, and understanding how each one works explains most of what brands get wrong about AI visibility. The first is parametric memory: knowledge encoded during pretraining on massive text corpora. This is where the data licensing deals matter. Google pays Reddit $60 million per year for access to its content API (CBS News, February 2024); OpenAI pays an estimated $70 million per year for similar access (TechCrunch, May 2024). Reddit's 22+ billion posts and comments, YouTube's transcripts (which correlate at 0.737 with AI visibility per Ahrefs), and the broader web crawl all feed the models' base knowledge during training. Once encoded, this knowledge persists for months or years regardless of whether the original content changes, which is why RankScience's discontinued services still get recommended over a year after they were shut down.
The second layer is retrieval-augmented generation (RAG), where the model queries live web sources at inference time to supplement or update its parametric knowledge. Here, Reddit's dominance is even more direct. Profound's analysis of 680 million citations found Reddit is the #1 cited domain on both Perplexity (6.6% of all citations) and Google AI Overviews (2.2%), and the #2 cited domain on ChatGPT behind Wikipedia. Perplexity cited Reddit in over 20% of responses during early 2026 (Evertune). The University of Toronto's study found 69-82% of citations come from earned media (Britopian, October 2025), and their methodology categorized community discussion platforms like Reddit as earned media rather than social media. Traditional social platforms like Twitter and Instagram contribute minimally, but Reddit and YouTube function as primary content infrastructure for both the pretraining and retrieval layers.
AirOps confirmed the broader pattern: 85% of brand mentions in AI responses originate from third-party pages. The pipeline that constructs your brand's AI identity draws overwhelmingly from sources you do not author, with Reddit discussions, YouTube reviews, and industry publications forming the backbone of both what the model knows from pretraining and what it retrieves at query time.
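To make the two-layer structure concrete, here is a minimal sketch in Python of how a response might blend a parametric prior with retrieval-time evidence. It is an illustration of the concept, not any platform's actual implementation: the brand names, scores, and the `alpha` blending weight are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDoc:
    url: str
    brand_weights: dict[str, float] = field(default_factory=dict)  # brand -> relevance in this doc

# Stand-in for parametric memory: associations frozen at training time.
# In a real model these live as distributed patterns across billions of
# parameters, not an explicit lookup table.
PARAMETRIC_PRIOR = {"IncumbentCRM": 0.8, "ChallengerCRM": 0.1}

def rank_brands(docs: list[RetrievedDoc], alpha: float = 0.7) -> list[str]:
    """Blend the stale training-time prior with fresh retrieval evidence.

    alpha sets how strongly the parametric layer dominates; the value
    here is an assumption for illustration.
    """
    scores = dict(PARAMETRIC_PRIOR)
    for doc in docs:
        for brand, w in doc.brand_weights.items():
            scores[brand] = alpha * scores.get(brand, 0.0) + (1 - alpha) * w
    return sorted(scores, key=scores.get, reverse=True)

docs = [RetrievedDoc("https://example.com/review", {"ChallengerCRM": 0.9})]
print(rank_brands(docs))  # the prior keeps IncumbentCRM on top despite fresh evidence
```

The design choice worth noticing is that retrieval can only adjust the prior, not replace it, which is exactly why a discontinued service can keep getting recommended long after its web presence changes.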
| Pipeline Stage | What Happens | Brand Control |
|---|---|---|
| Content Creation | Brand publishes pages, blog posts, product info | High (18-31% of citations) |
| Third-Party Coverage | Reddit, YouTube, press, reviews, forums mention the brand | Low (69-82% of citations) |
| Pretraining | Content encoded into parametric memory via data licensing ($60M Google-Reddit, ~$70M OpenAI-Reddit) and web crawls | None (persists months-years) |
| RAG Retrieval | Live web queries at inference; Reddit is #1 cited domain on Perplexity and AI Overviews | Low-Moderate (freshness, structured data) |
| Inference and Response | Model synthesizes answer, selects brands, attaches citations | None |
Meta's Llama 3.1 trained on 15 trillion tokens across 16,384 GPUs for 54 days; the content that enters this process shapes AI brand perception for months or years afterward.
A frontier language model's understanding of your brand begins in a pretraining run that processes trillions of tokens of text. Meta is the most transparent lab about these numbers: Llama 3.1 405B trained on 15 trillion tokens across 16,384 H100 GPUs over 54 days, at an estimated compute of 3.8 × 10^25 FLOPs (Meta, arXiv:2407.21783, July 2024). The training corpus for a model of this scale includes filtered Common Crawl data (an archive of over 250 billion web pages that adds 3-5 billion new pages in each monthly crawl), Reddit posts and comments accessed via data licensing agreements, YouTube transcripts (the New York Times reported in April 2024 that OpenAI used Whisper to transcribe over one million hours of YouTube video for GPT-4 training), Wikipedia, books, and code. During training, the model learns to predict the next token in a sequence, which means it encodes statistical associations between words, concepts, and entities. If "brand X" frequently co-occurs with "enterprise," "expensive," and "reliable" across millions of documents, those associations become the model's understanding of brand X. Brands exist inside the model as distributed patterns across billions of parameters, shaped entirely by what the training corpus contained, with no explicit brand database or structured representation.
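A toy example shows what "statistical associations" means in practice. The sketch below counts attribute words that co-occur with a brand across documents; next-token training encodes something loosely analogous, just at vastly larger scale and inside model weights rather than a Counter. The corpus, brand names, and attribute list are all invented.

```python
from collections import Counter

corpus = [
    "brand X is expensive but reliable for enterprise teams",
    "enterprise buyers call brand X reliable",
    "brand Y is cheap and popular with startups",
]

ATTRIBUTES = {"expensive", "reliable", "enterprise", "cheap", "popular"}

def cooccurrence(brand: str, docs: list[str]) -> Counter:
    """Count attribute words appearing in the same document as the brand:
    a toy stand-in for the associations next-token training encodes."""
    counts = Counter()
    for doc in docs:
        if brand.lower() in doc.lower():
            counts.update(set(doc.lower().split()) & ATTRIBUTES)
    return counts

print(cooccurrence("brand X", corpus))
# Counter({'reliable': 2, 'enterprise': 2, 'expensive': 1})
```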
Every pretrained model has a knowledge cutoff: the date after which no information from the training corpus was collected. Any content published, brand changes made, or products launched after this date do not exist in the model's parametric memory. The deployment gap compounds the issue: after pretraining concludes, models undergo months of post-training alignment (RLHF, safety evaluation, red-teaming) before reaching users. A model with a January 2025 knowledge cutoff might not be deployed until mid-2025, creating a window where the model's knowledge is already stale on arrival. Full pretraining runs happen roughly annually for each major lab and cost $50-200+ million per run, which means the parametric memory of the model a buyer is talking to right now was likely built on data collected 6-18 months ago. This is the slow layer of the pipeline: deep, persistent, and largely outside any brand's direct control.
Labs update deployed models through two mechanisms that operate on very different timescales. Post-training (fine-tuning, RLHF, direct preference optimization) is relatively cheap and frequent; it adjusts the model's tone, instruction-following, and safety behavior but does not change the knowledge cutoff or add new factual knowledge about brands. Continued pretraining extends the cutoff by training on newer data without starting from scratch, but it is more expensive and risks "catastrophic forgetting" of older knowledge. The retrieval layer is the fast path: Google AI Overviews uses Google's own search index, which crawls billions of pages daily and can surface content within hours of publication. ChatGPT Search queries Bing's index. Perplexity operates its own web crawler alongside Bing and claims near-real-time indexing for frequently crawled domains. The practical implication is that brands have two distinct channels to influence AI, each with radically different timescales, and a complete strategy requires investment in both.
| Dimension | Pretraining Path | Retrieval Path |
|---|---|---|
| Time to influence | 6-24 months | Minutes to days |
| Persistence | Months-years (encoded in parameters) | Ephemeral (re-retrieved per query) |
| What it affects | Brand identity, entity associations | Specific facts, current pricing, recent news |
| Cost to shift | Ecosystem-wide changes (web mentions, YouTube, Reddit) | Content updates, structured data, freshness |
| Brand control | Very low | Low-moderate |
| Data sources | Common Crawl, Reddit/YouTube data deals, books | Google/Bing search index, Perplexity crawler |
GPT-3's training filtered 45TB of Common Crawl down to 570GB, discarding 98.7%; 79% of top news sites now block AI training bots, but 95.4% of citations come from sites that blocked them.
Not all web content makes it into training. Labs filter aggressively for quality, deduplicate against older versions, and strip toxic or low-value material. When OpenAI built GPT-3's training set from Common Crawl, they reduced 45TB of compressed text to 570GB, discarding approximately 98.7% of the crawled web (Brown et al., NeurIPS 2020). Modern datasets are even stricter: FineWeb-Edu and DCLM use model-based quality scoring that removes roughly 90% of candidate data, and DeepSeek's preprocessing eliminated nearly 90% of repeated content across 91 Common Crawl dumps (Mozilla Foundation). The content your brand publishes may never reach the training corpus because it falls below a quality threshold, gets deduplicated against a more authoritative version of the same information, or sits on a domain the crawler never reached. Common Crawl itself captures only 3-5 billion URLs per monthly crawl, and its own staff reject claims of comprehensiveness: "Often it is claimed that Common Crawl contains the entire web, but that's absolutely not true."
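A minimal sketch of what such a filter looks like, assuming a learned quality classifier behind `score_fn` (FineWeb-Edu-style scoring) and a crude hash standing in for real near-duplicate detection such as MinHash; the threshold is illustrative:

```python
def filter_corpus(pages: list[str], score_fn, threshold: float = 0.9) -> list[str]:
    """Toy version of the aggressive pretraining filter: keep only pages
    whose quality score clears the bar and that aren't duplicates.
    The 0.9 threshold mirrors the ~90% discard rates cited above."""
    seen: set[int] = set()
    kept = []
    for page in pages:
        fingerprint = hash(page.strip().lower())  # crude dedup stand-in
        if fingerprint in seen:
            continue  # a more authoritative copy already made it in
        seen.add(fingerprint)
        if score_fn(page) >= threshold:
            kept.append(page)
    return kept
```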
Opt-outs add another exclusion layer, but they are largely ineffective. 79% of top news sites now block AI training bots via robots.txt, with 75% blocking Common Crawl's CCBot and 62% blocking GPTBot (BuzzStream). Yet 95.4% of GPTBot citations come from sites that blocked it, because models can still access content through historical crawls, cached Common Crawl archives, and non-compliant scrapers. A Duke University study found a 400% growth in robots.txt bypass rates between Q2 and Q4 2025, and overall non-compliance rose from 3.3% to 13.26% of requests. The New York Times lawsuit against OpenAI, filed in December 2023 and still in active litigation, alleges unauthorized use of millions of copyrighted articles; in January 2026, the judge ordered OpenAI to produce 20 million ChatGPT conversation logs for discovery. The legal and technical boundaries of what content enters training are contested in courts and ignored by crawlers in roughly equal measure.
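Checking your own domain's posture takes a few lines with Python's standard-library robots.txt parser. The crawler names below are the real user-agent tokens the major labs publish; the caveat above still applies, since a disallow does not touch content already captured in historical crawls:

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "PerplexityBot"]

def crawler_access(domain: str, path: str = "/") -> dict[str, bool]:
    """Report which AI training/retrieval bots a site's robots.txt permits."""
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()
    return {bot: parser.can_fetch(bot, f"https://{domain}{path}") for bot in AI_CRAWLERS}

print(crawler_access("example.com"))  # e.g. {'GPTBot': True, 'CCBot': True, ...}
```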
Ahrefs' 75,000-brand study found YouTube mentions correlate at 0.737 with AI visibility; Domain Rating, the traditional SEO authority metric, correlates far lower.
The signals that feed AI recommendations are entity-level, not page-level. Ahrefs analyzed 75,000 brands across ChatGPT, AI Mode, and AI Overviews and found that YouTube mentions are the single strongest visibility factor (Spearman correlation of 0.737), followed by branded web mentions at 0.664 and brand search volume at 0.392 (Ahrefs, December 2025). Content volume, measured by number of site pages, showed almost no relationship with AI visibility. Domain Rating, the metric traditional SEO has spent a decade optimizing, falls meaningfully behind these entity-level signals.
Kevin Indig's study of 1.2 million ChatGPT responses and 18,012 verified citations adds a content-level dimension to these entity findings: 44.2% of all citations reference material from the first third of a page, cited text contains 20.6% proper nouns compared to the typical 5-8%, and definitive constructions ("X is," "X refers to") appear nearly twice as often in cited passages (Search Engine Land, February 2026). The pattern this reveals is a two-stage selection process where the AI decides which brands to recommend based on entity salience, then selects supporting content based on how definitively that content states its claims.
| Signal | Spearman Correlation | Signal Type |
|---|---|---|
| YouTube mentions | 0.737 | Entity-level |
| Branded web mentions | 0.664 | Entity-level |
| Brand search volume | 0.392 | Entity-level |
| Organic traffic | 0.274 | Behavioral |
| Content volume (pages) | ~0 | Page-level |
Source: Ahrefs, "Top Brand Visibility Factors in ChatGPT, AI Mode, and AI Overviews (75k Brands Studied)," December 2025.
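The content-level markers from the Indig study lend themselves to a rough self-audit. The sketch below approximates proper-noun density with a capitalization heuristic (not real named-entity recognition) and counts definitive constructions with a simple pattern; both are crude proxies for the study's measurements, not its methodology:

```python
import re

DEFINITIVE = re.compile(r"\b(?:is|are|refers to|means)\b", re.IGNORECASE)

def passage_stats(text: str) -> dict[str, float]:
    """Approximate the two markers cited passages over-index on:
    proper-noun share (~20.6% in cited text vs a typical 5-8%) and
    definitive constructions ("X is", "X refers to")."""
    words = text.split()
    # Crude proper-noun proxy: capitalized words not starting a sentence.
    proper = [
        w for i, w in enumerate(words)
        if w[:1].isupper() and i > 0 and not words[i - 1].endswith((".", "!", "?"))
    ]
    return {
        "proper_noun_share": round(len(proper) / max(len(words), 1), 3),
        "definitive_claims": len(DEFINITIVE.findall(text)),
    }

print(passage_stats("Sill is an AI visibility platform. It tracks ChatGPT and Gemini daily."))
```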
Wikipedia accounts for 47.9% of ChatGPT's top-10 citations and an estimated 22% of training data; brands with Wikipedia pages rank 1.84 positions higher in AI recommendations.
Wikipedia occupies a unique position in the pipeline because it feeds both the pretraining and retrieval layers at massive scale. It comprises an estimated 22% of major models' training data and is the #1 most-cited domain in Google AI Mode, accounting for 11.22% of all tracked citations (ALM Corp). In ChatGPT, Wikipedia represents 47.9% of the top-10 most-cited domains (Profound, 680 million citations). The Wikimedia Foundation has formalized this role through licensing deals with Amazon, Meta, Microsoft, Mistral AI, Perplexity, and others; Google's deal dates to 2022. AI crawlers have caused a 50% surge in Wikimedia bandwidth since January 2024 (Index Lab).
The impact on brand visibility is measurable and blunt. Quoleady's study found 78.8% of tools recommended by ChatGPT had Wikipedia pages, and brands with Wikipedia pages ranked at an average position of 5.07 versus 6.91 for those without, a 1.84-position advantage. Among top marketing agencies cited by LLMs, 50% had Wikipedia pages. For brands below a certain scale, notability thresholds may prevent a Wikipedia article from existing at all, and that absence creates a structural gap in both parametric memory and retrieval that no amount of on-site GEO can fully compensate for.
Seer Interactive's analysis of 362,188 LLM responses suggests the model recommends brands from parametric memory first, then retrieves citations to justify those choices afterward.
The pipeline's most opaque stage is inference: the moment when the model generates a response and decides which brands to name. Seer Interactive ran six behavioral tests across 362,188 LLM responses and concluded that citations are post-hoc (Seer Interactive, February 2026). The model generates its brand recommendation from parametric memory first, built from the pretraining and fine-tuning stages described above, and then searches for citations to support the choice after the fact. If this hypothesis is correct, and Seer is careful to note they cannot observe token generation logs directly, then the common assumption that "earning a citation earns a recommendation" is backwards. A brand can produce the most authoritative content in its category, earn consistent retrieval, and still never be recommended because the model's parametric memory does not associate the brand strongly enough with the query topic.
RankScience describes the same mechanism using slightly different terminology: AI platforms apply an "evidence check" (is this content trustworthy enough to cite?) and a separate "recommendation check" (does this brand have sufficient entity salience to name?). Passing one does not guarantee passing the other. A brand's blog post can be the best evidence source on a topic while the brand itself lacks the entity-level presence to be named in the answer. This is the mechanism behind ghost citations: your content gets cited, your competitor gets recommended. The inference stage amplifies whatever entity associations were encoded during pretraining and reinforced through retrieval, which is why the upstream layers of the pipeline matter so much.
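The two-check model is easy to express as a sketch. Everything here is hypothetical (scores, thresholds, brand names), but it shows how a brand can pass the evidence check and still fail the recommendation check:

```python
def ai_answer(brand_salience: dict[str, float],
              evidence_quality: dict[str, float],
              rec_threshold: float = 0.6,
              cite_threshold: float = 0.5) -> dict[str, list[str]]:
    """Toy model of the 'recommendation check' vs 'evidence check' split.
    brand_salience: entity strength in parametric memory (0-1).
    evidence_quality: trustworthiness of each brand's retrievable content (0-1).
    Thresholds are assumptions for illustration."""
    return {
        "recommended": [b for b, s in brand_salience.items() if s >= rec_threshold],
        "cited": [b for b, s in evidence_quality.items() if s >= cite_threshold],
    }

result = ai_answer(
    brand_salience={"Incumbent": 0.9, "Challenger": 0.3},
    evidence_quality={"Incumbent": 0.4, "Challenger": 0.95},
)
print(result)  # Challenger is cited as evidence; Incumbent gets the recommendation
```

This is the ghost-citation case in miniature: the two lists are computed independently, so appearing on one says nothing about appearing on the other.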
SparkToro found less than a 1-in-100 chance that the same prompt returns the same brand list across two runs; ChatGPT now rewrites search queries using stored user memories.
Even if you could observe the model's inference process directly, the output varies by user. ChatGPT's memory feature, upgraded in April 2025, rewrites search queries using stored user preferences: a vegan in San Francisco asking for restaurant recommendations gets different search queries and different results than a carnivore in Dallas, before the model even begins selecting brands (TechCrunch, April 2025). Google launched Personal Intelligence in AI Mode in January 2026, connecting Gmail, Photos, YouTube history, and Search history to personalize responses. Perplexity stores structured preferences (favorite brands, dietary needs, keywords) that persist across conversations and work across all models.
SparkToro quantified the baseline variability directly, even without personalization. Testing 2,961 prompts across 600 volunteers, they found less than a 1-in-100 chance that two runs of the same prompt return the same brand list, and less than a 1-in-1,000 chance of the same list in the same order. Even dominant brands in well-defined categories like headphones (Bose, Sony, Sennheiser) appeared in only 55-77% of responses. AirOps found only 30% of brands persist between consecutive answers and just 20% across five consecutive runs. What AI thinks about your brand is a probability distribution shaped by the model's training, the retrieval context, and the specific user asking. The brands that score highest in Sill's monitoring are those that appear consistently across these variable conditions rather than those that dominate a single snapshot.
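The practical consequence is that measurement has to happen over repeated runs rather than single snapshots. A minimal sketch of the persistence metric this implies, using invented run data:

```python
def persistence(runs: list[set[str]]) -> dict[str, float]:
    """Share of runs in which each brand appears: the consistency
    metric the SparkToro and AirOps findings above suggest tracking
    instead of a single-snapshot ranking."""
    all_brands = set().union(*runs)
    return {b: round(sum(b in run for run in runs) / len(runs), 2) for b in all_brands}

runs = [
    {"Bose", "Sony", "Sennheiser"},
    {"Sony", "Apple", "Bose"},
    {"Sony", "Sennheiser", "Anker"},
]
print(persistence(runs))  # Sony: 1.0, Bose: 0.67, Apple: 0.33, ...
```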
Reddit threads feed AI models directly, but AI platforms reframe that content: BrightEdge found they flag different brands as negative 73% of the time on identical queries.
Traditional sentiment analysis tools measure what people say about your brand on Twitter and Instagram; some now include Reddit. AI model sentiment measures something different: how the model characterizes your brand when synthesizing and reframing that information for a user. Reddit threads feed AI models directly through both pretraining and retrieval, so what people say on Reddit does shape AI perception. But the model does not simply repeat Reddit's consensus: it synthesizes across hundreds of sources, weights them according to opaque relevance criteria, and produces a characterization that may diverge substantially from what any single platform's users actually said. Monitoring Reddit sentiment is necessary but insufficient; what matters is the specific characterization each AI model constructs after processing that input alongside everything else it knows.
BrightEdge's analysis illustrates how much the synthesis layer transforms the underlying data. ChatGPT and Google AI Overviews recommend the same brands 76% of the time but frame them in materially different ways: ChatGPT uses action-oriented language ("offers," "provides," "enables") while AI Overviews leans toward descriptive framing. The divergence grows sharper on negative sentiment. Google AI Overviews is 44% more likely to surface negative brand sentiment than ChatGPT (2.3% vs 1.6% of mentions), and the platforms flag different brands as negative 73% of the time on identical queries (BrightEdge, March 2026). Two models read the same Reddit threads and the same web crawl and produce opposite conclusions about whether a brand is trustworthy. Sill's own data confirms the scope of this platform divergence: 55% of brands have a 10+ point SOV spread between platforms, and 91.6% of cited URLs appear on only one platform.
AI-cited pages average 1,064 days old (2.9 years); content updated within 90 days earns 67% more AI citations than stale pages.
Ahrefs analyzed 16.975 million cited URLs across seven AI platforms and found the average age of AI-cited content is 1,064 days, nearly three years. ChatGPT shows the strongest freshness preference among platforms, citing URLs 393-458 days newer than what appears in organic Google results, but even its preferred content is over a year old on average. RankScience's experience illustrates the lag at its most concrete: their team discontinued A/B testing services in 2024, yet ChatGPT, Gemini, and Perplexity all still recommended them for those exact services more than a year later (RankScience, January 2026). The model had no mechanism to learn about the change because the third-party content referencing RankScience as an A/B testing provider still existed across the web, feeding the same outdated association into every retrieval cycle.
The pipeline carries a temporal paradox, however. AirOps found that 35.2% of pages cited by ChatGPT were updated in the last three months, and 53.4% within six months; for commercial queries specifically, 83% of citations come from content refreshed within the past year (AirOps, August 2025). Pages not updated within 12 months are more than 2x less likely to be cited. The AI draws from old knowledge but favors recently updated sources when retrieval surfaces them, which means brands that consistently refresh their content gain a compounding advantage over those that publish once and move on. SE Ranking's 129,000-domain study found the same pattern: content updated within 90 days earns 67% more citations than stale pages.
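A content team can operationalize these thresholds with a simple freshness audit. The bucket boundaries below come straight from the figures above; the page data is invented:

```python
from datetime import date
from typing import Optional

def freshness_buckets(pages: dict[str, date], today: Optional[date] = None) -> dict[str, str]:
    """Bucket pages by last-updated age using the thresholds above:
    <=90 days (SE Ranking's +67% citation band), <=365 days (the band
    AirOps found dominates commercial-query citations), and the
    >12-month zone that is over 2x less likely to be cited."""
    today = today or date.today()

    def bucket(updated: date) -> str:
        days = (today - updated).days
        if days <= 90:
            return "fresh (<=90 days)"
        if days <= 365:
            return "aging (<=12 months)"
        return "stale (>12 months, 2x less likely to be cited)"

    return {url: bucket(updated) for url, updated in pages.items()}

print(freshness_buckets(
    {"/pricing": date(2026, 1, 10), "/blog/launch": date(2024, 3, 2)},
    today=date(2026, 3, 1),
))
```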
GenAI chatbots are now the #1 source influencing B2B vendor shortlists at 17.1%, surpassing review sites, vendor websites, and salespeople (G2, 2025).
G2 surveyed 1,169 B2B decision-makers in April 2025 and found that GenAI chatbots have overtaken software review sites (15.1%), vendor websites (12.8%), market research firms (10.6%), and salespeople (8.8%) as the single most influential source for vendor shortlists; 29% now start research via ChatGPT more often than Google. McKinsey's consumer survey found 44% of AI search users call it their primary source for buying decisions, ahead of traditional search at 31% and brand websites at 9% (McKinsey, August 2025). Capgemini puts adoption even higher: 58% of 12,000 surveyed consumers across 12 countries have replaced traditional search engines with GenAI tools for product and service recommendations, up from 25% in 2023.
The downstream effect is already measurable in retail. Adobe Analytics tracked a 1,300% year-over-year increase in AI referral traffic to U.S. shopping sites during the 2024 holiday season, with Cyber Monday specifically up 1,950%. These are consumers who form purchase intent inside an AI interface, never visit a search engine, and arrive at a brand's site (if they arrive at all) with a recommendation already formed. Google AI Mode compounds the pattern: Semrush found 93% of AI Mode sessions end without a single click to any external website (Semrush, September 2025). For an increasing share of purchase journeys, the AI recommendation is the entire research phase.
ChatGPT's shopping feature uses a specialized model trained with RL for product queries, achieving 52% accuracy vs. 37% for standard search; Google's Shopping Graph feeds 50B+ listings into AI Mode.
Beyond the general recommendation pipeline, each major platform has built specialized commerce infrastructure that creates additional filters for brand visibility. ChatGPT launched Shopping Research in November 2025 with a GPT-5 mini variant trained specifically with reinforcement learning for product queries, achieving 52% accuracy on multi-constraint shopping tasks versus 37% for standard ChatGPT Search (OpenAI). Google's Shopping Graph contains 50+ billion product listings interpreted by Gemini models, and Shopping Ads now appear within AI Mode responses, reaching 75+ million daily active users (Search Engine Land, February 2026). Perplexity partnered with PayPal for in-app checkout with merchants including Abercrombie & Fitch, Ashley Furniture, and NewEgg, and shopping queries jumped 5x since launch.
These integrations create a structural layer that sits on top of the knowledge pipeline. Brands feeding structured product data into Google Merchant Center, Shopify's Agentic Storefronts (which placed 5.6 million stores inside ChatGPT, Copilot, AI Mode, and Gemini in March 2026), or Stripe's Agentic Commerce Protocol get surfaced in AI shopping experiences. The protocols differ per platform: OpenAI's ACP is closed and partner-gated, while Google's Universal Commerce Protocol is an open standard co-developed with Shopify, Etsy, Wayfair, and 20+ retailers. Product-based brands that lack presence in these merchant feeds are structurally invisible for an entire category of high-intent, purchase-ready queries, regardless of their content quality.
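On-page structured data is the piece a brand controls directly. Below is a minimal Schema.org Product record of the kind that merchant feeds and JSON-LD markup carry into these shopping surfaces; the field values are placeholders, and each platform's feed spec defines its own required fields:

```python
import json

# Minimal Schema.org Product markup, expressed as a Python dict and
# serialized to the JSON-LD a page would embed. Values are placeholders.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget Pro",
    "sku": "EX-123",
    "brand": {"@type": "Brand", "name": "ExampleCo"},
    "offers": {
        "@type": "Offer",
        "price": "49.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}
print(json.dumps(product, indent=2))
```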
Kodec AI found 62% of simulated buyer queries return incorrect pricing or feature information for B2B SaaS brands.
Kodec AI ran 200+ simulated buyer query cycles across Series B+ SaaS companies and found 62% of queries returned incorrect pricing, discontinued features listed as current, or competitor capabilities attributed to the wrong company (GlobeNewsWire, December 2025). These are not edge cases or adversarial prompts; they are the kinds of questions an actual buyer would ask, and the errors include quoting prices for plans that no longer exist and attributing a competitor's integration to the wrong vendor. NP Digital's broader accuracy audit across 600 prompts found ChatGPT was fully correct only 59.7% of the time, with accuracy declining to 39.6% for Grok alongside a 21.8% outright error rate (NP Digital, February 2026). We covered the full accuracy crisis across platforms in an earlier analysis.
The Columbia Journalism Review tested 1,600 queries across eight AI search engines and found ChatGPT Search incorrect in 67% of its test set, while over 50% of Grok and Gemini citations linked to fabricated or broken URLs. The legal exposure is no longer theoretical: 729+ documented legal filings in U.S. courts cite AI-generated hallucinations, with sanctions reaching $30,000 (Sixth Circuit, March 2026), and Air Canada was ordered to pay damages after its chatbot fabricated a bereavement discount policy. The pipeline actively misrepresents brands, and the correction mechanism is slow because it requires the underlying content ecosystem to change before the models will update their outputs.
| AI Platform | Fully Correct | Outright Error Rate |
|---|---|---|
| ChatGPT | 59.7% | 7.6% |
| Claude | 55.1% | 6.2% |
| Gemini | 51.3% | 8.0% |
| Perplexity | 49.3% | 12.2% |
| Copilot | 45.8% | 13.6% |
| Grok | 39.6% | 21.8% |
Source: NP Digital, "AI Hallucinations and Accuracy Report," February 2026. Tested 600 prompts across six platforms.
Arcalea found the #1 entity averaged 62% AI Share of Voice across five industries; the typical gap between #1 and #3 was 5x.
The pipeline creates a self-reinforcing cycle. AI recommends a brand; consumers act on that recommendation by buying, reviewing, and discussing it; those actions generate new content that feeds back into both pretraining corpora and retrieval indices; the model recommends the brand more strongly in the next cycle. Arcalea analyzed 1,200+ AI responses across five industries and found the #1 entity averaged 62% AI Share of Voice (Arcalea, March 2026). In commercial debt collection, the leader captured 58% SOV while #2 got 19% and everyone else split 23%. The typical gap between #1 and #3 was 5x. Brands that appeared first maintained that position 70-80% of the time across repeated runs, suggesting that early advantage in AI visibility is self-reinforcing rather than transient.
AirOps found brands with both mentions and citations show a 40% higher likelihood of reappearing across consecutive answers versus citation-only brands. Springer Nature's 2025 research on AI recommendation systems describes the feedback mechanism directly: "user choices, influenced by algorithmic suggestions, are fed back into the system as new data, perpetuating and reinforcing specific behavioral patterns." Omniscient Digital frames this as the Matthew Effect in AI search: "systems that systematically reward early traction and visibility, making it harder for latecomers to catch up, regardless of merit." The window to establish AI visibility is narrowing as compounding effects lock in early movers and raise the cost for latecomers. Citation patterns still drift 40-60% month over month according to Profound's longitudinal data, meaning positions are contestable now, but the brands building entity salience today will carry a structural advantage into the next training cycle.
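The compounding dynamic is easy to demonstrate with a toy preferential-attachment simulation: each round, a recommendation goes to a brand with probability proportional to its current visibility, and the win feeds back in as new visibility. All parameters are invented; the point is that a runaway leader emerges even from equal starting weights:

```python
import random

def simulate_sov(brands: list[str], rounds: int = 1000, seed: int = 42) -> dict[str, float]:
    """Toy Matthew-effect loop: visibility begets recommendations,
    which beget visibility. Returns final share-of-voice per brand."""
    rng = random.Random(seed)
    weights = {b: 1.0 for b in brands}
    for _ in range(rounds):
        winner = rng.choices(list(weights), weights=list(weights.values()))[0]
        weights[winner] += 1.0  # the recommendation feeds back as new content
    total = sum(weights.values())
    return {b: round(w / total, 3) for b, w in weights.items()}

print(simulate_sov(["A", "B", "C", "D"]))  # one brand pulls far ahead of an even split
```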
The foundational GEO paper tested 10,000 queries: Statistics Addition boosts AI visibility by 32%; Keyword Stuffing decreases it by 8%.
The foundational GEO paper (Aggarwal et al., KDD 2024) tested nine optimization methods across 10,000 queries and 25 domains. The highest-performing tactics were Quotation Addition (+41%), Statistics Addition (+32%), and Source Citation (+30%). Keyword Stuffing, the reflex most SEO teams reach for first, decreased visibility by 8%, confirming that LLMs process meaning through embeddings rather than keyword frequency. SE Ranking's 129,000-domain study adds structural detail: sections of 120-180 words between headings average 4.6 citations versus 2.7 for sections under 50 words, content with 19+ statistical data points averages 5.4 citations versus 2.8, and pages with expert quotes average 4.1 versus 2.4 without (SE Ranking, November 2025).
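Several of these structural findings can be checked against your own pages with a rough audit. The sketch below flags heading-delimited sections against the SE Ranking word-count band and counts numeric data points with a simple pattern; both are heuristics of my own construction, not the study's methodology:

```python
import re

STAT_PATTERN = re.compile(r"\d+(?:\.\d+)?%?")  # crude proxy for a statistical data point

def audit_sections(sections: dict[str, str]) -> dict[str, dict]:
    """Compare each section against the citation-correlated bands above:
    120-180 words between headings and a high count of statistical
    data points (19+ was the top band in the SE Ranking study)."""
    report = {}
    for heading, body in sections.items():
        words = len(body.split())
        report[heading] = {
            "word_count": words,
            "in_120_180_band": 120 <= words <= 180,
            "stat_count": sum(1 for _ in STAT_PATTERN.finditer(body)),
        }
    return report
```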
The most striking finding on response speed comes from Seer Interactive: Wil Reynolds changed Seer's website footer and observed ChatGPT's brand narrative shift within 36 hours. The pipeline is slow at the training-data level but responsive to entity-level signals that reach the retrieval layer quickly. Structured data (Schema.org markup) increases AI citation rates 3.1x according to BrightEdge, with a 73% higher selection rate. For brands whose content earns citations but not recommendations, Dubois, Dawson, and Jaiswal offer a useful framework in the Harvard Business Review called "Share of Model," which measures three dimensions of AI brand presence: mention rate, the gap between human and AI awareness, and brand-level sentiment (HBR, June 2025). The levers exist; they are different from the ones most marketing teams have been pulling, and the brands that map their pipeline first will compound the advantage.
Sill monitors the actual responses of six AI platforms daily, tracking how each model characterizes your brand across both the retrieval and recommendation layers. See where your pipeline is broken.
Request your first analysis today to see where you stand.