Ask ChatGPT to recommend a brand in your category today. Ask again tomorrow with the same prompt. There is roughly a coin-flip chance that the top recommendation will be different. This is not an edge case we found by testing unusual queries; it is the baseline behavior across every platform Sill monitors. We tracked AI responses to the same prompts on the same platforms every day, then measured how much the competitive field changed between consecutive days. The answer: almost entirely. Only 2.7% of day-over-day competitor sets were identical. The average overlap between one day's response and the next was 37%. This post presents the full findings from Sill's daily monitoring data, broken down by platform, position tier, query intent type, and time horizon.
TL;DR
Sill's daily monitoring data reveals that AI brand recommendations are non-deterministic at a scale most marketers underestimate:

- Only 2.7% of day-over-day competitor sets are identical across all platforms, with an average overlap of just 37%.
- ChatGPT changes its top brand recommendation 50.1% of the time between consecutive days; Copilot changes it 78.7%; Gemini is the most stable at 28.5%.
- For 67% of query-platform pairs, the monitored brand never appears at all; only 7.4% show 100% consistent daily appearance.
- Position fragility follows a clear gradient: primary position holds 69.8% of the time, secondary holds 58.2%, and 23.1% of secondary brands disappear entirely the next day.
- Weekly monitoring is less accurate than daily (21.4% change rate vs 18.1%) because volatility compounds rather than averaging out.
- Comparison-intent queries are the most volatile (21.6% daily change), while evaluation queries are 8.6x more stable (2.5%).
- Single-point measurement captures less than half the competitive picture; reliable AI visibility measurement requires daily, multi-platform tracking with statistical baselines.

Only 2.7% of day-over-day AI competitor sets are identical; the average overlap between consecutive days is 37%, meaning 63% of mentioned brands change daily.
When Sill monitors a prompt across AI platforms, each response mentions an average of 6.1 brands. Run the same prompt on the same platform the next day and, on average, only 37% of those brands carry over. The remaining 63% are different. Across all platform-query pairs in our monitoring data, only 2.7% of consecutive-day competitor sets were exactly identical.
This is consistent with what the SparkToro study found when testing AI recommendation consistency: the same prompt produces different brand lists across sessions. Our contribution is measuring this at a daily cadence over an extended monitoring window, which reveals the scale of the churn more precisely. A single spot check of your AI visibility captures less than half of the competitive picture; the other 63% rotates in and out over subsequent days. Our SparkToro analysis covered the implications of inconsistency for measurement design.
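To make the overlap metric concrete, here is a minimal sketch in Python. The post does not pin down the exact definition, so this assumes a directional overlap (the share of today's brands that reappear tomorrow); the brand names are hypothetical, and this is an illustration rather than Sill's actual pipeline.

```python
def competitor_overlap(today: set[str], tomorrow: set[str]) -> float:
    """Share of today's mentioned brands that reappear in tomorrow's response."""
    if not today:
        return 0.0
    return len(today & tomorrow) / len(today)

# Hypothetical consecutive-day brand sets for one prompt on one platform.
monday = {"BrandA", "BrandB", "BrandC", "BrandD", "BrandE", "BrandF"}
tuesday = {"BrandA", "BrandC", "BrandG", "BrandH", "BrandI", "BrandJ"}

print(f"{competitor_overlap(monday, tuesday):.0%}")  # 33%: 2 of 6 brands carried over
print(monday == tuesday)  # False: this day pair would not count toward the 2.7% identical sets
```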
ChatGPT changes its primary brand recommendation 50.1% of the time between consecutive days; Copilot changes it 78.7%; Gemini is most stable at 28.5%.
The brand that AI engines place in the primary recommendation slot is supposed to be the strongest answer to the query. In practice, that slot reshuffles frequently. On ChatGPT, the top-recommended brand changes 50.1% of the time from one day to the next for the same prompt. This means the brand a buyer sees recommended first today has roughly even odds of being replaced by a different brand tomorrow.
| Platform | Top-Brand Change Rate (Day-Over-Day) | Competitor-Set Overlap |
|---|---|---|
| Copilot | 78.7% | 22.4% |
| ChatGPT | 50.1% | 35.5% |
| Grok | 37.9% | 41.9% |
| Google AI Overviews | 36.2% | 39.3% |
| Perplexity | 35.7% | 40.9% |
| Gemini | 28.5% | 36.9% |
Gemini is the most stable platform for primary recommendations, holding the same top brand 71.5% of the time. Copilot is the least stable, with 78.7% daily churn in the top slot and only 22.4% overlap in its full competitor sets. The range between platforms is itself a finding: a brand that appears reliably on Gemini may be invisible on Copilot, and any measurement that aggregates across platforms will hide these differences. Our platform divergence analysis covers why per-platform tracking is essential.
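As a concrete illustration of how the change rates in the table could be derived from daily logs, here is a minimal sketch; the observation sequence and brand names are invented, and the real computation may differ.

```python
def top_brand_change_rate(daily_top_brands: list[str]) -> float:
    """Share of consecutive-day pairs where the top recommendation differs."""
    pairs = list(zip(daily_top_brands, daily_top_brands[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical week of daily top-brand observations for one prompt on one platform.
observed = ["BrandA", "BrandA", "BrandB", "BrandA", "BrandC", "BrandC", "BrandA"]
print(f"{top_brand_change_rate(observed):.1%}")  # 66.7%: 4 of 6 transitions changed
```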
For 67% of query-platform pairs, the monitored brand never appears in AI responses; only 7.4% show 100% consistent daily appearance over the tracking window.
We measured how consistently brands appear across their tracked queries and platforms over the full monitoring window. The distribution is heavily skewed toward absence.
| Appearance Consistency | % of Query-Platform Pairs |
|---|---|
| Always appears (100%) | 7.4% |
| Usually appears (75-99%) | 8.4% |
| Appears half the time (50-74%) | 4.9% |
| Rarely appears (25-49%) | 5.1% |
| Almost never appears (1-24%) | 7.2% |
| Never appears (0%) | 67.0% |
The practical meaning: for most prompts on most platforms, a brand is simply absent from the AI response. Only 15.8% of query-platform pairs show the brand appearing at least 75% of the time. The remaining 84.2% are either entirely invisible or appear intermittently. This means a single spot check is deeply misleading; if you ask ChatGPT about your category once and see your brand, you may have caught one of the minority of queries where you appear consistently, or you may have caught a lucky day in a query where you appear only 30% of the time. Without daily tracking, there is no way to distinguish between those scenarios.
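For illustration, here is a small helper that maps an observed appearance rate onto the buckets in the table above. The exact boundary handling is inferred from the bucket labels, and the 30-day example is made up.

```python
def consistency_bucket(appearance_rate: float) -> str:
    """Map a query-platform pair's observed appearance rate onto the buckets above."""
    if appearance_rate == 0.0:
        return "Never appears (0%)"
    if appearance_rate < 0.25:
        return "Almost never appears (1-24%)"
    if appearance_rate < 0.50:
        return "Rarely appears (25-49%)"
    if appearance_rate < 0.75:
        return "Appears half the time (50-74%)"
    if appearance_rate < 1.0:
        return "Usually appears (75-99%)"
    return "Always appears (100%)"

# Hypothetical: the brand appeared in 9 of 30 daily responses for one pair.
print(consistency_bucket(9 / 30))  # "Rarely appears (25-49%)"
```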
Brands in primary position retain it 69.8% of the time the next day; secondary brands hold at 58.2%, with 23.1% disappearing entirely.
AI responses assign brands to position tiers: primary (the lead recommendation), secondary (mentioned with positive framing), and mentioned (referenced without endorsement). We tracked how brands move between these tiers from one day to the next, and the transition rates reveal a clear fragility gradient.
| Starting Position | Holds Position | Promoted | Demoted | Disappears Entirely |
|---|---|---|---|---|
| Primary | 69.8% | n/a | 22.9% | 7.4% |
| Secondary | 58.2% | 13.6% | 5.0% | 23.1% |
| Mentioned | 15.7% | 37.3% | 5.9% | 41.2% |
The fragility gradient is steep. Primary position is the stickiest, but even there, roughly 3 in 10 brands lose it the next day. Secondary position is where the volatility becomes acute: nearly 1 in 4 secondary brands vanish from the response entirely, rather than simply being demoted to a mention. The mentioned tier is the most unstable; 41.2% of mentioned brands disappear the next day, while 37.3% are promoted to a stronger position. This means the mentioned tier functions less like a stable floor and more like a revolving door between visibility and invisibility.
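Here is a rough sketch of how such a transition matrix could be built from daily position observations. The tier labels and the two-week observation list are hypothetical, and this is one plausible construction, not Sill's pipeline.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def transition_rates(positions: list[str]) -> dict[tuple[str, str], float]:
    """Day-over-day transition rates between position tiers for one brand.

    `positions` holds one label per day: "primary", "secondary",
    "mentioned", or "absent".
    """
    counts = Counter(pairwise(positions))                   # (from_tier, to_tier) -> count
    totals = Counter(src for src, _ in counts.elements())   # transitions starting at each tier
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Hypothetical two weeks of daily observations for one query-platform pair.
days = ["primary", "primary", "secondary", "absent", "mentioned",
        "secondary", "secondary", "primary", "primary", "absent",
        "absent", "mentioned", "primary", "primary"]
for (src, dst), rate in sorted(transition_rates(days).items()):
    print(f"{src:>9} -> {dst:<9} {rate:.0%}")
```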
Weekly SOV comparisons show 21.4% change rates versus 18.1% for daily, with 16.4% major swings versus 14.4%; volatility compounds rather than averaging out.
A reasonable assumption would be that weekly monitoring smooths out daily noise, giving you a more stable and accurate picture. The data shows the opposite. When we compared SOV scores seven days apart instead of one day apart, the change rate increased from 18.1% to 21.4%, and major swings (25+ point shifts) increased from 14.4% to 16.4%.
Volatility compounds rather than averaging out. A brand that was primary on Monday, secondary on Wednesday, and absent on Friday shows a large swing in a weekly comparison, whereas daily monitoring would have captured the trajectory. Weekly snapshots turn a visible pattern of declining position into a single unexplained jump. This is why Sill runs daily monitoring across all platforms: not because the daily data point is more important than any other, but because the sequence of daily points reveals direction and velocity that any single snapshot misses.
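A minimal sketch of the daily-versus-weekly comparison, assuming a per-query-platform daily SOV series and a 5-point materiality threshold (both assumptions; the post does not define the change-rate computation). The toy series is constructed to show the compounding effect.

```python
def change_rate_at_lag(sov: list[float], lag: int, threshold: float = 5.0) -> float:
    """Share of comparisons `lag` days apart where SOV moved by more than `threshold` points."""
    diffs = [abs(later - earlier) for earlier, later in zip(sov, sov[lag:])]
    if not diffs:
        return 0.0
    return sum(d > threshold for d in diffs) / len(diffs)

# Hypothetical three weeks of daily SOV scores (0-100) for one query-platform pair.
sov = [62, 62, 55, 55, 30, 30, 30, 48, 48, 48, 62,
       20, 20, 20, 62, 62, 35, 35, 62, 62, 62]
print(f"daily:  {change_rate_at_lag(sov, lag=1):.0%}")  # lower: small steps between days
print(f"weekly: {change_rate_at_lag(sov, lag=7):.0%}")  # higher: steps compound over a week
```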
Comparison-intent prompts show 21.6% daily SOV changes with 16.2% major swings; evaluation-intent prompts show only 2.5% changes, making them 8.6x more stable.
The type of question a buyer asks materially affects how much the AI response shuffles between days. Sill classifies monitored prompts by intent type, and the volatility differences are substantial.
| Intent Type | Daily SOV Change Rate | Major Swings (25+ pts) | Avg Score Change |
|---|---|---|---|
| Comparison | 21.6% | 16.2% | 8.7 pts |
| Use case | 19.1% | 16.1% | 8.5 pts |
| Best-of | 18.4% | 14.6% | 9.7 pts |
| Evaluation | 2.5% | 2.5% | 1.3 pts |
Comparison queries are the most volatile because they ask the model to weigh alternatives, and the relative weighting shifts between sessions. Evaluation queries, which ask about a specific brand's strengths or weaknesses, are far more stable because the model draws on a more constrained set of sources. This has a direct measurement implication: if your monitoring focuses on "best X for Y" and comparison prompts, you should expect higher day-to-day variance and need a longer baseline before drawing conclusions about trend direction. Our AI search ROI framework covers why 8-12 week baselines are necessary for reliable SOV measurement.
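To see why higher-variance intents demand longer baselines, here is a back-of-envelope power calculation. This is a textbook two-sample estimate, not Sill's methodology; it treats the average daily score changes above as a rough proxy for standard deviation and assumes independent daily samples, which real (autocorrelated) SOV series violate, so actual baselines need to be longer. The requirement scales with the square of the variance, which is the core point.

```python
import math

def days_per_period(daily_sd: float, shift: float,
                    z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Days of daily SOV samples needed in each of a before/after window to
    detect a `shift`-point change in mean SOV at 95% confidence / 80% power.
    Assumes independent daily samples; autocorrelation raises the requirement."""
    return math.ceil(2 * ((z_alpha + z_beta) * daily_sd / shift) ** 2)

# Treating the average daily score changes above as a rough proxy for sd.
print(days_per_period(daily_sd=8.7, shift=5))  # comparison prompts: ~48 days per window
print(days_per_period(daily_sd=1.3, shift=5))  # evaluation prompts: ~2 days per window
```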
Non-deterministic AI results require daily multi-platform monitoring with statistical baselines; single-day snapshots capture less than half the competitive picture.
Every finding in this analysis points to the same conclusion: AI brand recommendations are probabilistic, not fixed. A brand does not "rank" in a stable position the way it does on Google. It has a probability of appearing in any given response, and that probability varies by platform, by query intent, and over time. This makes single-point measurement fundamentally unreliable.
A monthly check of "what does ChatGPT say about our category" captures one sample from a distribution that changes daily. With 37% competitor set overlap between consecutive days, that single sample tells you what happened on that particular day, which may bear little resemblance to what happens on the other 29. The only reliable approach is frequent, repeated measurement across multiple platforms, building up a statistical picture of your appearance probability, position distribution, and competitive field over time.
This is the measurement philosophy behind Sill's monitoring architecture. Rather than reporting a single SOV number, Sill tracks daily responses across ChatGPT, Gemini, Perplexity, Google AI Overviews, Grok, and Copilot, building trend data that distinguishes genuine visibility shifts from the daily noise that affects every platform. When SOV moves, the question is whether the movement exceeds the baseline volatility for that query-platform pair: a 10-point SOV change on ChatGPT falls within normal daily variance, while the same change on an evaluation query would be highly unusual.
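As a sketch of what such a baseline check could look like in practice, here is a minimal z-score test. The threshold, history window, and numbers are illustrative assumptions, not Sill's actual model.

```python
from statistics import mean, stdev

def is_significant_move(history: list[float], latest: float, z_threshold: float = 2.0) -> bool:
    """Flag a new SOV reading as a genuine shift only if it sits more than
    `z_threshold` standard deviations from this query-platform pair's own baseline."""
    baseline_mean = mean(history)
    baseline_sd = stdev(history)
    if baseline_sd == 0:
        return latest != baseline_mean
    return abs(latest - baseline_mean) / baseline_sd > z_threshold

# Hypothetical baselines: a noisy ChatGPT comparison query vs a stable evaluation query.
comparison_history = [42, 55, 38, 61, 47, 52, 40, 58, 45, 50]
evaluation_history = [48, 49, 50, 49, 48, 50, 49, 50, 49, 48]
print(is_significant_move(comparison_history, 60))  # False: within normal variance
print(is_significant_move(evaluation_history, 60))  # True: far outside the baseline
```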
Sill monitors your brand across six AI platforms daily, building the statistical baseline that separates genuine visibility shifts from the noise that affects every AI engine.
Request your first analysis today to see where you stand.