The Anatomy of a Page That AI Cites
When ChatGPT or Perplexity recommends a product, it links to specific pages. We analyzed 22,785 cited pages from Sill's monitoring pipeline, spanning 26,257 total citations across 11,405 unique domains. The data reveals which pages get cited, which get ignored, and what separates the two. Most of the distinguishing factors have nothing to do with domain authority.
TL;DR
We analyzed 22,785 cited pages from Sill's monitoring pipeline spanning 26,257 citations across 11,405 domains. 91.5% of pages are cited by only one AI platform. Only 8.5% achieve multi-platform citation, and just 27 pages (0.1%) are cited by all four platforms. Subcategory coverage is a 4.1x citation multiplier. YouTube leads all domains in citation efficiency at 3.4 citations per page. Domain authority shows near-zero or negative correlation with AI citation. The structural traits that predict citation have nothing to do with traditional SEO metrics.
The Dataset: 22,785 Pages Across Four AI Platforms
Sill's monitoring pipeline queries actual chat interfaces daily across ChatGPT, Perplexity, Gemini, and Google AI Overviews. For every page cited in an AI response, we record provenance metadata: which platforms cited it, how frequently, in response to which queries, and across which product subcategories.
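To make the analyses below concrete, here is a minimal sketch of what a single provenance record could look like. The schema is hypothetical, assembled from the fields described above; it is not Sill's actual data model.

```python
from dataclasses import dataclass

@dataclass
class CitationRecord:
    """One observed citation of a page in an AI response (hypothetical schema)."""
    url: str          # the cited page
    domain: str       # e.g. "youtube.com"
    platform: str     # "chatgpt" | "perplexity" | "gemini" | "google_aio"
    query: str        # the prompt that triggered the citation
    subcategory: str  # e.g. "RV Parts & Accessories"
    observed_at: str  # ISO-8601 date of the monitoring run
```

A dataset is then just a list of such records, one per (page, platform, query) hit; the code sketches later in this post assume this shape.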
The foundational GEO paper (Aggarwal et al., KDD 2024) established the principle that content-level optimizations affect AI citation rates, finding 30-40% improvements from adding statistics. The question our data answers is practical: across 22,785 real-world pages being cited by real AI platforms, what specific patterns emerge?
| Metric | Value |
|---|---|
| Total unique pages cited | 22,785 |
| Total citations | 26,257 |
| Unique domains | 11,405 |
| Brands monitored | 141 |
ChatGPT Dominates Citation Volume. The Edges Reveal More.
Across 22,785 cited pages, ChatGPT cites 60.6% of all pages in the dataset. Perplexity and Google AI Overviews each account for roughly 22%, while Gemini trails at 4.8%. (The percentages sum to more than 100% because some pages are cited by multiple platforms.) This distribution reflects platform market share (ChatGPT holds roughly 79% of global generative AI web traffic, per Similarweb's GenAI Index), but the absolute numbers mask the more important finding.
| Platform | Pages Cited | % of All Pages |
|---|---|---|
| ChatGPT | 13,806 | 60.6% |
| Perplexity | 5,167 | 22.7% |
| Google AI Overviews | 4,961 | 21.8% |
| Gemini | 1,100 | 4.8% |
The important finding is not which platform cites the most pages. It is which pages get cited by multiple platforms. That is where the real quality signal lives.
Only 8.5% of Pages Get Cited by More Than One Platform
Platform diversity is the number of distinct AI engines that independently cite a given page. In our dataset of 22,785 pages, 91.5% are cited by exactly one platform. Only 1,945 pages (8.5%) achieve citation from two or more platforms. Just 277 pages (1.2%) reach three or more. And only 27 pages out of 22,785 (0.1%) are cited by all four platforms we track.
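Computed from records like the sketch above, platform diversity is a small grouping operation. This is an illustrative implementation against the hypothetical schema, not Sill's pipeline code:

```python
from collections import Counter, defaultdict

def diversity_distribution(records):
    """Count how many pages are cited by exactly 1, 2, 3, or 4 platforms."""
    platforms_by_url = defaultdict(set)
    for r in records:
        platforms_by_url[r.url].add(r.platform)
    # Histogram over the number of distinct citing platforms per page
    return Counter(len(p) for p in platforms_by_url.values())

# On the dataset described here, this would yield
# Counter({1: 20840, 2: 1668, 3: 250, 4: 27}).
```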
This matters because each AI platform has different citation DNA. ChatGPT relies on Bing's index and favors Wikipedia. Perplexity runs real-time web searches and favors Reddit. Google AI Overviews pull from organic top-10 results. Gemini favors authoritative lists and local reviews. When a page crosses all of these different retrieval systems, it has demonstrated a quality that transcends any single platform's bias.
| Platform Diversity | Pages | % of Total | Interpretation |
|---|---|---|---|
| 1 platform | 20,840 | 91.5% | Single-platform retrieval match |
| 2 platforms | 1,668 | 7.3% | Cross-platform authority signal |
| 3 platforms | 250 | 1.1% | Strong universal citation signal |
| 4 platforms | 27 | 0.1% | Elite cross-platform authority |
The most common combination among pages cited by three or more platforms is ChatGPT + Google AI Overviews + Perplexity (28 pages), followed by the full four-platform set of ChatGPT + Gemini + Google AI Overviews + Perplexity (27 pages). These pages are worth studying because they represent the content that passes every major AI retrieval system's quality filter independently.
If your goal is to understand what makes a page AI-citable, platform diversity is the metric to optimize for. A page cited by four platforms is more informative than a page cited ten times by one platform, because the former has demonstrated universal relevance.
89% of Pages Are Cited Exactly Once
The citation frequency distribution follows a steep power law. Out of 22,785 pages, 20,315 (89.2%) are cited exactly once. Only 2,470 pages (10.8%) are cited two or more times. Just 579 pages (2.5%) reach three or more citations. At the far end, only 10 pages in the entire dataset achieved 11 or more citations.
This distribution has a direct strategic implication. Most pages that get cited by AI are one-time retrievals for a specific query. The pages that get cited repeatedly are the ones that answer multiple related queries. Breadth of relevance within a topic area is what drives repeat citation, and our subcategory data confirms this.
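The frequency distribution comes from the same records, counted along the other axis; again a sketch assuming the hypothetical schema above:

```python
from collections import Counter

def citation_frequency(records):
    """Histogram mapping citation count -> number of pages with that count."""
    citations_per_url = Counter(r.url for r in records)
    return Counter(citations_per_url.values())

# Here: 20,315 pages at count 1, a long thin tail above it,
# and only 10 pages at 11 or more citations.
```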
Subcategory Coverage Is a 4x Citation Multiplier
For each cited page, Sill tracks which product subcategories triggered the citation. A page cited only for "Contact Center as a Service" has a subcategory count of 1. A page cited for "RV Maintenance & Repair Services," "RV Parts & Accessories," and "Recreational Vehicles" has a subcategory count of 3. The relationship between subcategory coverage and citation frequency is the strongest signal in our data.
| Subcategory Coverage | Pages | Avg Citations | Multiplier vs. Baseline |
|---|---|---|---|
| 1 subcategory | 18,095 | 1.1 | 1.0x (baseline) |
| 2 subcategories | 181 | 2.6 | 2.4x |
| 3 subcategories | 18 | 4.5 | 4.1x |
| 4 subcategories | 3 | 5.3 | 4.8x |
Pages relevant to three subcategories average 4.5 citations, compared to 1.1 for single-subcategory pages. That is a 4.1x multiplier. The mechanism is straightforward: a comprehensive comparison page that covers multiple product angles provides relevant answer fragments for a wider range of buyer queries. AI retrieval systems pull up the same page for "best X for Y," "X vs Z," and "how to choose X" queries alike.
This is the content strategy takeaway. Narrow pages that answer a single query get cited once and forgotten. Comprehensive pages that span adjacent subcategories become citation magnets. The data shows a near-linear relationship: each additional subcategory a page is relevant to adds roughly 1.4 citations on average.
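As a sketch of how the multiplier table falls out of raw records (same hypothetical schema as above): collect each URL's distinct subcategories and total citations, then average within coverage buckets.

```python
from collections import defaultdict

def subcategory_multipliers(records):
    """Average citations per page, bucketed by distinct-subcategory count."""
    subcats, citations = defaultdict(set), defaultdict(int)
    for r in records:
        subcats[r.url].add(r.subcategory)
        citations[r.url] += 1

    # Group total citations by how many subcategories each page spans
    buckets = defaultdict(list)
    for url, cats in subcats.items():
        buckets[len(cats)].append(citations[url])

    averages = {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
    baseline = averages.get(1) or 1.0
    return {k: (round(avg, 1), round(avg / baseline, 1))
            for k, avg in averages.items()}

# Here: {1: (1.1, 1.0), 2: (2.6, 2.4), 3: (4.5, 4.1), 4: (5.3, 4.8)}
```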
The Domains AI Cites Most
Across our full dataset, Reddit leads in raw page count with 408 cited pages. YouTube sits well behind on page count (104 pages) but leads in a more important metric: citations per page. YouTube pages average 3.4 citations each, compared to 1.0-1.1 for most other domains. A single YouTube video review generates more total citations than three average Reddit threads.
| Domain | Pages Cited | Total Citations | Citations/Page |
|---|---|---|---|
| reddit.com | 408 | 448 | 1.1 |
| youtube.com | 104 | 352 | 3.4 |
| en.wikipedia.org | 212 | 221 | 1.0 |
| linkedin.com | 160 | 170 | 1.1 |
| rtings.com | 85 | 135 | 1.6 |
| forbes.com | 68 | 96 | 1.4 |
| tomsguide.com | 84 | 94 | 1.1 |
| pcgamer.com | 59 | 90 | 1.5 |
| gartner.com | 63 | 76 | 1.2 |
| tomshardware.com | 38 | 73 | 1.9 |
YouTube's 3.4x citation efficiency stands out. A YouTube video review is cited across multiple queries and often by multiple platforms (YouTube dominates the multi-platform citation list, with 26 of the 100 highest-diversity pages). The Ahrefs study reported a 0.737 correlation between YouTube presence and AI visibility. Our data confirms this: YouTube pages are cited more frequently, across more platforms, than pages from any other domain.
The data also shows that specialized review sites (rtings.com at 1.6, tomshardware.com at 1.9, pcgamer.com at 1.5) outperform general-purpose platforms in citations per page. Depth of expertise within a category correlates with citation efficiency. Chen et al. (University of Toronto, 2025) found that 69-82% of AI search citations come from earned media. Our domain distribution confirms this: the top cited domains are overwhelmingly third-party sites, not brand-owned properties.
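Citations per page by domain is the same aggregation keyed on domain. In this sketch, the min_pages floor is our illustrative threshold for filtering out tiny domains, not something from the study:

```python
from collections import defaultdict

def domain_efficiency(records, min_pages=30):
    """(domain, pages, total citations, citations/page), sorted by efficiency."""
    pages, citations = defaultdict(set), defaultdict(int)
    for r in records:
        pages[r.domain].add(r.url)
        citations[r.domain] += 1

    rows = [
        (d, len(pages[d]), citations[d], citations[d] / len(pages[d]))
        for d in pages
        if len(pages[d]) >= min_pages  # illustrative floor, not from the study
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Here youtube.com tops the efficiency ranking: (104 pages, 352 citations, 3.4).
```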
What Does Not Predict AI Citation
The absence of certain traditional SEO signals in highly cited pages is as informative as the presence of the traits above.
| Traditional SEO Signal | Correlation with AI Citation | Evidence |
|---|---|---|
| Domain authority | Slightly negative (r = -0.12 to -0.18) | SearchAtlas, 21,767 domains |
| Backlink count | Weak (r = 0.218) | Ahrefs, 75,000 brands |
| Keyword density | Negative (10% worse than baseline) | Aggarwal et al., KDD 2024 |
| FAQ schema markup | Negative (3.6 vs 4.2 citations/query) | Aggarwal et al., KDD 2024 |
| Branded search volume | Weak | Similarweb GenAI Index |
The SearchAtlas study of 21,767 domains measured domain authority correlation with AI visibility at r = -0.12 for ChatGPT and -0.18 for Perplexity. These correlations are not merely weak; they point in the wrong direction. Our data is consistent: the top cited domains are a mix of high-authority (Wikipedia, Forbes, Gartner) and moderate-authority (rtings.com, soundguys.com, techbloat.com) sites. Domain authority does not predict which pages get cited. Content relevance and structure do.
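For readers who want to run the same check on their own data, the Pearson r behind figures like SearchAtlas's is straightforward to compute. A standard-library sketch, assuming paired per-domain arrays of authority scores and AI-visibility scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# r = -0.12 to -0.18 means domain authority explains roughly 1-3% of the
# variance (r**2), and what little signal exists points downward.
```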
What the Top-Cited Pages Have in Common
Synthesizing the patterns across our dataset, the pages with the highest citation frequency and platform diversity share a consistent set of traits. These compound: a page with all of these traits is cited more frequently than a page with any subset.
- Multi-subcategory breadth. Pages spanning multiple product subcategories average 4.5 citations vs. 1.1 for single-subcategory pages. Comprehensive comparison and review content that covers adjacent use cases generates the broadest citation surface area.
- Specific statistics. The foundational GEO paper found that adding statistics is the single highest-impact content optimization across 10,000 queries. Pages with vague claims ("significant improvement") are cited less than pages with specific numbers ("30-40% improvement, Aggarwal et al. 2024"). Research from Wan et al. (ACL 2024, UC Berkeley) confirms that LLMs favor factual density over stylistic authority signals.
- Video content. YouTube pages in our dataset average 3.4 citations each, the highest of any domain. YouTube also dominates multi-platform citations, with 26 of the 100 highest-platform-diversity pages being YouTube videos. AI engines retrieve video content for product reviews, comparisons, and tutorials at disproportionate rates.
- Extraction-friendly structure. Comparison tables provide ready-made answer fragments for AI retrieval. The specialized review sites among our top-cited domains (rtings.com at 1.6 citations/page, tomshardware.com at 1.9) are built around structured product comparisons. Their citation efficiency confirms that extraction-friendly formats outperform narrative content.
- Freshness. Research shows that updating content within a 90-day window increases AI citations by 67%. AI retrieval systems that perform web searches (Perplexity, ChatGPT with browsing, Google AI Overviews) filter for recency. Stale content gets deprioritized regardless of its structural quality.
- Third-party presence. The top cited domains in our dataset are overwhelmingly third-party: Reddit, YouTube, Wikipedia, LinkedIn, Forbes, Gartner, PCGamer, Tom's Hardware. Off-site brand mention frequency (r = 0.664) outperforms backlinks (r = 0.218) as a predictor of AI visibility by a factor of three. A brand with no YouTube reviews and no Reddit presence is relying on a fraction of the available citation surface area.
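For teams auditing their own pages, a minimal, entirely illustrative checklist sketch follows. The trait names and the pass/fail framing are ours, not a model fitted on the dataset:

```python
# Hypothetical trait checklist derived from the six patterns above
TRAITS = {
    "covers_3plus_subcategories": "4.1x average citation multiplier",
    "cites_specific_statistics":  "highest-impact GEO optimization",
    "has_video_companion":        "YouTube averages 3.4 citations/page",
    "uses_comparison_tables":     "extraction-friendly answer fragments",
    "updated_within_90_days":     "recency filters in AI retrieval",
    "earns_third_party_mentions": "earned media drives most citations",
}

def citability_gaps(page):
    """Return the traits a page is missing; `page` maps trait name -> bool."""
    return [t for t in TRAITS if not page.get(t, False)]
```

A page that comes back missing, say, the video and third-party traits is competing with only its on-site content, which the data above suggests is the hardest way to win.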
Measuring Your Citation Footprint
Knowing what makes a page citable is useful. Knowing which of your pages are actually being cited, by which platforms, in response to which queries, is actionable.
Sill's monitoring pipeline tracks citation provenance at the page level. For every page cited in an AI response about your brand, we record the citing platforms, the queries that triggered the citation, the citation frequency over time, and the subcategory associations. Across 22,785 pages and 11,405 domains, this creates a dataset that reveals not just whether your brand appears, but which specific pages are doing the work.
The three-layer attribution model we described previously starts with this data. Simulated visibility (daily SOV tracking) tells you whether your overall presence is growing. Citation provenance data tells you which specific pages are driving that growth and which need attention.
The data in this post demonstrates the scale of the challenge. With 89% of pages cited only once and only 8.5% achieving multi-platform visibility, the difference between a page that contributes to AI presence and a page that does not comes down to the structural and topical traits described above. Without page-level citation data, content optimization is directionless. With it, every content investment can be targeted at the specific gaps that matter most.
See which of your pages AI engines actually cite
Sill tracks citation provenance across 22,785+ pages and four AI platforms daily. See your citation footprint, identify structural gaps, and optimize the pages that matter most.
References
- Sill Internal Data. "Citation Provenance Analysis: 22,785 pages, 26,257 citations, 11,405 domains." Sill Monitoring Pipeline, March 2026.
- Aggarwal, P., et al. "GEO: Generative Engine Optimization." KDD 2024, Princeton/Georgia Tech/IIT Delhi. arxiv.org/abs/2311.09735
- Wan, Y., et al. "Evidence-based evaluation of LLM persuasion." ACL 2024, UC Berkeley. arxiv.org/abs/2407.13008
- Ahrefs. "LLM Brand Visibility Study (75,000 brands)." ahrefs.com
- SearchAtlas. "Domain Authority vs. LLM Visibility (21,767 domains)." searchatlas.com
- Similarweb. "GenAI Brand Visibility Index 2026." Similarweb Research.
- Chen, Z., et al. "AI Search Engines and Earned Media Citations." University of Toronto, 2025.
Get Your Report
Request your first analysis today to see where you stand.
