GEO guides recommend specific structural optimizations: add statistics, use comparison tables, structure content with H2 headings, keep sections between 120 and 180 words. We scraped 1,238 pages that AI platforms actually cited across ChatGPT, Perplexity, Google AI Overviews, Gemini, Grok, and Copilot, extracted every structural feature we could measure from the markdown, and ran regression with full statistical significance testing. The structural features collectively explain about 5% of citation variance. That number is low, but it is consistent with what the largest external studies are finding: no single category of factors explains most of the variance. Fahlout's citation research found traffic explains 5% of citation variance and backlinks 3.8%. SearchAtlas found domain authority correlations are weakly negative. The picture that emerges is not that structure is irrelevant; it is that structure is one layer among several, and likely not the one with the most leverage.
TL;DR
We scraped 1,238 pages that AI platforms actually cited and ran regression on 16 structural features with full significance testing. The structural features collectively explain about 5% of citation variance (adjusted R-squared = 0.045). After Bonferroni correction, two findings survived: niche authority/service pages predict higher citation rates (r = +0.154, p < 0.001), and listicles predict lower per-page citation rates (r = -0.124, p = 0.000013). Word count, heading count, statistics density, table presence, and image count showed no statistically significant relationship with citation frequency. The 5% is consistent with external studies: SE Ranking's SHAP analysis of 2.3M pages found domain traffic (0.63) and referring domains (0.56) dwarf content structure (0.20). Wellows found semantic completeness (r = 0.87) and vector alignment (r = 0.84) are the dominant citation predictors, with traditional domain authority declining to r = 0.18. Structural formatting is a real but minor layer in the hierarchy of what determines AI citation; the dominant factors are semantic relevance, brand signal density across the web, and niche authority.

1,238 non-YouTube pages cited by AI platforms were scraped via Firecrawl, then analyzed for 16 structural features using OLS regression, Poisson regression, and Spearman rank correlation.
Sill's monitoring pipeline tracks AI responses daily across six platforms. When those responses cite external URLs, the pipeline stores them with citation frequency (how many times the URL appeared), platform diversity (how many distinct AI platforms cited it), and the citing context. For this analysis, we pulled every non-YouTube page with a successful Firecrawl scrape: 1,238 pages from our monitoring data.
From each page's scraped markdown, we extracted 16 features: word count, heading counts by level (H1, H2, H3+), list items (bulleted and numbered), link count, image count, bold text instances, table presence and row count, number/statistic density per 1,000 words, average section length, and content type classifications derived from URL patterns and title keywords (listicle/comparison, ecommerce, blog article, forum, program/service page).
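The extraction step can be sketched in a few lines. This is a minimal illustration of counting structural features from raw markdown, not our exact pipeline; the regex patterns and feature names are simplified assumptions.

```python
import re

def extract_features(md: str) -> dict:
    """Count a handful of the structural features described above from raw markdown.

    Illustrative sketch only: real extraction handles more edge cases
    (setext headings, nested lists, reference-style links, etc.).
    """
    word_count = len(re.findall(r"\S+", md))
    return {
        "word_count": word_count,
        "h1_count": len(re.findall(r"^# ", md, re.MULTILINE)),
        "h2_count": len(re.findall(r"^## ", md, re.MULTILINE)),
        "h3plus_count": len(re.findall(r"^###+ ", md, re.MULTILINE)),
        # bulleted (-, *, +) and numbered (1., 2., ...) list items
        "list_items": len(re.findall(r"^\s*(?:[-*+]|\d+\.) ", md, re.MULTILINE)),
        # links that are not images (negative lookbehind excludes the ! prefix)
        "link_count": len(re.findall(r"(?<!!)\[[^\]]*\]\([^)]+\)", md)),
        "image_count": len(re.findall(r"!\[[^\]]*\]\([^)]+\)", md)),
        "bold_count": len(re.findall(r"\*\*[^*]+\*\*", md)),
        "has_table": bool(re.search(r"^\|.+\|$", md, re.MULTILINE)),
        # number/statistic density per 1,000 words
        "numbers_per_1k": 1000 * len(re.findall(r"\d[\d,.%]*", md)) / max(word_count, 1),
    }
```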
We ran OLS regression with t-tests on every coefficient, Poisson regression (better suited to count outcomes), Spearman rank correlation (robust to the extreme skew in the data), Mann-Whitney U tests for categorical comparisons, bootstrap confidence intervals, and Bonferroni correction for multiple comparisons. We report only what survived rigorous testing.
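The Bonferroni step is the simplest piece to show. With a family of 16 tests at a family-wise alpha of 0.05, each individual p-value must clear 0.05 / 16 ≈ 0.003. A minimal sketch (the feature names and p-values below are the ones reported later in this article):

```python
def bonferroni_survivors(pvalues, n_tests=None, alpha=0.05):
    """Return (adjusted alpha, named p-values that survive Bonferroni correction).

    n_tests is the size of the full test family (16 in this analysis), which
    may be larger than the number of p-values passed in for inspection.
    """
    m = n_tests if n_tests is not None else len(pvalues)
    adjusted_alpha = alpha / m
    survivors = {name: p for name, p in pvalues.items() if p < adjusted_alpha}
    return adjusted_alpha, survivors
```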
76.1% of AI-cited pages appear only once in our data; 89.4% appear once or twice. Multi-citation pages are rare and structurally distinct.
The citation frequency distribution is heavily right-skewed. Three out of four cited pages were cited exactly once. Only 3.3% were cited five or more times. This distribution shapes every analysis that follows: any predictor needs to separate the rare high-citation pages from the large baseline of single-citation pages, and that separation is difficult for structural features alone.
| Citation Frequency | Pages | Share | Cumulative |
|---|---|---|---|
| 1 | 942 | 76.1% | 76.1% |
| 2 | 165 | 13.3% | 89.4% |
| 3 | 56 | 4.5% | 93.9% |
| 4 | 34 | 2.7% | 96.7% |
| 5+ | 41 | 3.3% | 100% |
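The share and cumulative columns above follow directly from the raw counts; a quick arithmetic check:

```python
# Citation-frequency counts from the table above
counts = {1: 942, 2: 165, 3: 56, 4: 34, "5+": 41}
total = sum(counts.values())  # 1,238 pages

# Per-bucket share of all cited pages, rounded to one decimal
share = {k: round(100 * v / total, 1) for k, v in counts.items()}

# Running cumulative share, as in the last column of the table
cumulative, running = {}, 0.0
for k, v in counts.items():
    running += 100 * v / total
    cumulative[k] = round(running, 1)
```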
The platform breakdown follows a similar pattern. ChatGPT cited 49.8% of pages in the dataset, Perplexity 25.4%, Google AI Overviews 23.8%, Grok 13.3%, Gemini 6.1%, and Copilot 1.7%. Only 1.4% of pages earned citations from four or more platforms. This concentration is consistent with what we reported in our platform divergence analysis: 91.6% of cited URLs appear on only one platform.
After Bonferroni correction across 16 tests, only page type predicted citation: niche authority pages positive (r = +0.15, p < 0.001), listicles negative (r = -0.12, p < 0.001).
We tested all 16 structural features against citation frequency and platform diversity using Pearson correlation, Spearman rank correlation, OLS regression with coefficient t-tests, and Mann-Whitney U tests for categorical variables. We applied Bonferroni correction for multiple comparisons (adjusted alpha = 0.003) and computed bootstrap 95% confidence intervals for the five strongest effects. Two findings survived every test.
| Feature | Pearson r | p-value | Bootstrap 95% CI | Survives Bonferroni |
|---|---|---|---|---|
| Program/service page | +0.154 | <0.000001 | [+0.069, +0.251] | Yes |
| Listicle/comparison | -0.124 | 0.000013 | [-0.161, -0.081] | Yes |
| Has tables | -0.062 | 0.030 | [-0.101, -0.018] | No |
| Avg section length | +0.071 | 0.012 | [-0.011, +0.186] | No |
| Word count | +0.003 | 0.920 | n/a | No |
| Numbers per 1K words | -0.022 | 0.442 | n/a | No |
| H2 headings | +0.010 | 0.730 | n/a | No |
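The bootstrap confidence intervals in the table can be sketched with a percentile bootstrap: resample (feature, citation) pairs with replacement, recompute the correlation each time, and take the 2.5th and 97.5th percentiles. This is one common bootstrap variant, assumed here for illustration; the article does not specify which variant the pipeline uses.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bootstrap_ci(x, y, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for Pearson r (resamples pairs, not values)."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot)]
```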
The overall OLS model is statistically significant (F-test p < 0.001) with an adjusted R-squared of 0.045. The model detects real signal; it is the size of that signal relative to what remains unexplained that matters. Structural formatting features account for roughly 5% of what determines whether a page gets cited by AI.
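For readers who want the arithmetic: adjusted R-squared penalizes raw R-squared for the number of predictors k via adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1). Inverting that standard formula with our n and k shows the unadjusted R² implied by the reported 0.045 (our back-calculation, not a figure from the model output):

```python
# Invert the adjusted R-squared formula:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
n, k, adj_r2 = 1238, 16, 0.045
raw_r2 = 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)
# raw_r2 comes out near 0.057: at this sample size,
# the penalty for 16 predictors is small
```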
Listicles represent 38% of cited pages by volume but predict lower per-page citation rates; Seer Interactive found listicle citations dropped 30% month-over-month in early 2026.
The negative listicle coefficient requires careful interpretation. Listicles and comparison pages make up 38% of our dataset and 43.8% of all ChatGPT-cited page types according to SE Ranking. They are the single most common format AI cites. The regression finding is not that listicles do not get cited; it is that on a per-page basis, a listicle is less likely to accumulate multiple citations or earn cross-platform visibility than other page types.
Seer Interactive's analysis of 2 million citations from November 2025 through February 2026 found ChatGPT listicle citations dropped 30% month-over-month, from 160,000 in December to 111,000 in January. The listicle share of total citations fell from 17.8% to 15.5% over three months, while Wikipedia nearly doubled its share and Reddit tripled. Thirteen of sixteen industries in the Seer dataset experienced listicle citation declines over the same period. The pattern in our data is consistent with this broader trend: listicles are common but their per-page citation strength is not what the volume numbers suggest.
The Mann-Whitney U test confirms the finding: pages classified as listicles had a mean citation frequency of 1.34 versus 1.74 for non-listicles (p = 0.0007). The distinction between volume (listicles are everywhere) and rate (individual listicles are not cited at above-average rates) is the key nuance. A marketing team writing a "Best X in 2026" listicle is entering the most competitive page type in AI search with a lower expected per-page return than a focused authority page.
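The Mann-Whitney U test behind that comparison is straightforward to sketch. This version uses the normal approximation without a tie correction, so it is a simplification of what a statistics library would do; the group data below are invented for illustration, not our listicle samples.

```python
from math import erfc, sqrt

def mann_whitney_u(a, b):
    """U statistic for sample a vs b, with a two-sided normal-approximation p-value.

    U counts pairs where a's value exceeds b's; ties count half.
    Sketch only: no tie correction, no exact small-sample p-value.
    """
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_two_sided = erfc(abs(z) / sqrt(2))  # = 2 * (1 - Phi(|z|))
    return u, p_two_sided
```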
No single factor category explains most AI citation variance; SE Ranking found domain traffic (SHAP 0.63) is the strongest individual predictor, followed by referring domains (0.56) and content length (0.20).
The 5% R-squared for structural features does not mean structure is irrelevant. It means structure alone is insufficient, and that the variance unexplained by our model is where the larger signals live. External studies that measured what we could not measure offer clues about where the remaining 95% resides.
SE Ranking's SHAP analysis of 2.3 million pages provides the most comprehensive decomposition available. Domain traffic is the strongest single predictor (SHAP value 0.63). Referring domains follow at 0.56. Content length, the strongest structural variable, registers at 0.20. Brand search volume comes in at 0.09-0.11. The hierarchy is clear: the signals that identify a page as belonging to a known, trafficked, linked-to domain carry three times the predictive weight of the strongest structural feature.
| Study | Factor | Finding |
|---|---|---|
| SE Ranking (2.3M pages) | Domain traffic | SHAP 0.63; sites with 1.16M+ visitors: 6.4 citations vs 2.4 for low-traffic |
| ConvertMate (80M+ citations) | Brand web mentions | r = 0.664 (strongest single-variable correlation measured) |
| Wellows (15,847 results) | Semantic completeness | r = 0.87; vector alignment r = 0.84; DA down to r = 0.18 |
| Fahlout | Cosine similarity | 7.3x more predictive than domain authority; DA flat across 0-80 |
| Sill (this analysis) | Structural features | R-squared = 0.045 (adjusted); page type is the only robust predictor |
Wellows' analysis of 15,847 AI Overview results found semantic completeness (r = 0.87) and vector embedding alignment (r = 0.84) are the two strongest predictors of AI Overview citation, both dwarfing traditional domain authority (r = 0.18, down from 0.43 a year earlier). Fahlout's research reached a convergent conclusion: cosine similarity between the query and the page content is 7.3x more predictive than domain authority. Pages with cosine similarity above 0.88 were selected 34.3% of the time versus 4.7% for pages below 0.75.
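Cosine similarity itself is simple to compute once you have embedding vectors for the query and the page; the hard part is the embeddings, which are produced by a model and assumed available here. A minimal sketch, with Fahlout's reported 0.88 selection threshold wired in as a named constant:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def passes_fahlout_threshold(query_vec, page_vec, threshold=0.88):
    # 0.88 is the selection threshold reported in Fahlout's research;
    # the function name is ours, for illustration
    return cosine_similarity(query_vec, page_vec) >= threshold
```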
The emerging picture across all of these studies: what matters most is whether the content semantically matches the query and whether the domain has accumulated enough signal across the web (traffic, mentions, links, community presence) to register as a known entity. Page-level formatting is a real but minor factor within that hierarchy.
SearchAtlas found domain authority negatively correlated with LLM visibility (r = -0.10 to -0.21), yet ConvertMate found brand web mentions at r = 0.664. The metrics diverge.
SearchAtlas analyzed 21,767 domains and found traditional authority metrics are weakly or negatively correlated with LLM visibility: Domain Authority r = -0.10 on OpenAI, -0.21 on Perplexity, -0.13 on Gemini. Domain Rating and Domain Power show the same pattern. This finding aligns with what we reported in our SEO-GEO analysis: traditional SEO metrics do not predict AI visibility.
The apparent contradiction is that domain traffic (SE Ranking SHAP 0.63) and brand web mentions (ConvertMate r = 0.664) are both strong predictors. The reconciliation: traffic and mentions measure something different from DA and DR scores. A niche authority site like justcollect.com, the highest-cited page in our dataset (cited by all six platforms), may have a modest Domain Authority score while having deep brand signal density within its niche: mentions on collector forums, Reddit threads, YouTube references, and review sites. The composite score that third-party tools call "domain authority" does not capture this kind of concentrated niche presence.
This reframing explains why our regression found niche authority and service pages as the strongest positive predictor. These pages tend to belong to domains with concentrated expertise signals rather than high aggregate authority metrics. SE Ranking found that sites with Reddit mentions in the 35,000-718,000 range averaged 5.5 citations, and those with Quora mentions in the 3,800-93,000 range averaged 5.3. Community presence, not domain authority scores, appears to be the mechanism.
68.7% of ChatGPT-cited pages follow logical heading hierarchies and nearly 80% include lists (AirOps, 2026). Structure is a baseline for citation, not what separates cited from uncited pages.
AirOps' 2026 State of AI Search report found that 68.7% of ChatGPT-cited pages follow logical heading hierarchies, 87% use a single H1 as their primary anchor, and nearly 80% include ordered or unordered lists. These numbers are high enough to suggest structure is a necessary condition for most citations. They are not high enough to explain why some structured pages get cited more than others. The structural features we tested are likely measuring the baseline that most cited pages share, not the variation between them.
The foundational GEO paper (Aggarwal et al., KDD 2024) tested nine optimization methods across 10,000 queries and found that adding citations to content improved visibility by 115% for rank-5 sites, adding quotations by 37%, and adding statistics by 22%. These are real effects measured under controlled conditions. Our regression does not contradict those findings; it adds context about where they sit in the overall hierarchy. The GEO optimizations work, but they work within a layer that accounts for a fraction of the total variance in who gets cited and who does not.
The practical implication: if your pages lack basic structural elements (clear headings, readable sections, supporting data), fixing that is worth doing. But if your pages already have those elements and your AI visibility is still low, adding more headings or more statistics is unlikely to close the gap. The next-highest-leverage interventions are building the brand presence, community mentions, and semantic authority that the larger studies identify as dominant factors. Our ranking of 12 GEO tactics by evidence strength reflects this hierarchy: branded web mentions (r = 0.664) and YouTube presence (r = 0.737) rank above every structural formatting tactic.
This analysis measured structural features extractable from markdown; it could not measure semantic relevance, domain reputation, content freshness, or E-E-A-T signals.
The low R-squared means one of two things: either structural features genuinely explain little of citation variance, or we measured the wrong structural features (or measured them poorly). Both explanations are probably partially true. Our features were extracted from scraped markdown, which captures heading structure and formatting but misses rendered layout, schema markup, page speed, and visual design. The content type classification is based on URL patterns and title keywords, which is imprecise. A page classified as "other" might be a homepage, a landing page, or a resource page that our regex did not catch.
More importantly, the variables with the most predictive power in external studies are ones we did not and could not measure from page content alone: domain traffic, referring domain count, brand search volume, Reddit and Quora mention frequency, review platform presence, and semantic cosine similarity between the page and the query that triggered the citation. Each of these would require joining with external data sources that are not part of the scraped page itself.
The sample of 1,238 pages also reflects the brands and industries in Sill's monitoring pipeline, which skews toward mid-market brands across specific verticals. A study of 2.3 million pages (SE Ranking) or 80 million citations (ConvertMate) will capture patterns that a 1,238-page dataset cannot. We report our findings alongside the external studies specifically because the combination is more informative than either source alone.
Sill tracks your AI Share of Voice, cited pages, platform divergence, and competitive positioning daily across six AI platforms. Structure is the starting point. Knowing where you stand is the next step.
Request your first analysis today to see where you stand.