How Do ChatGPT, Perplexity, and Google AI Overviews Decide Which Sources to Cite?
TL;DR
AI answer engines use Retrieval-Augmented Generation (RAG) to fetch relevant web pages, score them on authority, content structure, and freshness, then synthesize answers with citations. In a 2025 analysis of 8,500 Perplexity responses, 72% of cited sources had structured data markup, and the median cited page was updated within the prior 90 days. Understanding these mechanics is the foundation of effective AEO.
What is Retrieval-Augmented Generation (RAG)?
RAG is an architecture that combines a large language model with a real-time information retrieval system. Instead of relying solely on knowledge encoded during training, a RAG system queries an external index of documents, retrieves the most relevant passages, and feeds them to the language model as context for generating a response. The model then synthesizes an answer from those passages and attributes information to its sources.
The retrieval step typically uses vector search (converting text into numerical embeddings and finding the closest matches) combined with traditional keyword matching. Most production RAG systems use a hybrid approach: BM25 for keyword relevance plus dense retrieval for semantic relevance. The top candidate passages are then re-ranked by a cross-encoder model that scores each passage against the query for final selection.
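To make the hybrid retrieval idea concrete, here is a minimal, self-contained sketch. The keyword scorer is a toy stand-in for BM25 and the "embedding" is a simple bag-of-words vector; a production system would use a real BM25 index, a trained dense encoder, and a cross-encoder for the final re-ranking step.

```python
import math
from collections import Counter

def keyword_score(query, doc):
    """Toy stand-in for BM25: query-term overlap, dampened by document length."""
    q_terms = query.lower().split()
    d_terms = Counter(doc.lower().split())
    overlap = sum(d_terms[t] for t in q_terms)
    return overlap / math.sqrt(len(doc.split()) + 1)

def embed(text):
    """Toy 'embedding': a bag-of-words vector. Real systems use a trained encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query, docs, alpha=0.5, top_k=2):
    """Blend keyword and semantic scores, then keep the top_k passages.
    A production pipeline would re-rank these candidates with a cross-encoder."""
    q_vec = embed(query)
    scored = []
    for doc in docs:
        score = alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, embed(doc))
        scored.append((score, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]

docs = [
    "Schema markup helps search engines parse page structure.",
    "Our cafe serves espresso and pastries every morning.",
    "Structured data markup improves retrieval for AI answer engines.",
]
print(hybrid_retrieve("schema markup for AI engines", docs))
```

Even in this toy version, the off-topic passage scores near zero on both signals and never reaches the re-ranking stage, which is the same reason an irrelevant or poorly structured page never surfaces as a citation.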
In practical terms, this means the AI engine does not “know” your content from training. It finds your content in real time, evaluates its relevance and trustworthiness, and decides whether to cite it—every time a user submits a query. A page that is well-structured and authoritative gets retrieved repeatedly; a page that is poorly structured may never surface, regardless of its informational quality.
What signals does each platform weight?
Each AI answer platform runs its own retrieval pipeline with different weights on different signals. No platform publishes its exact ranking algorithm, but reverse-engineering studies, official documentation, and large-scale citation analyses provide a working model. The following comparison reflects findings from three independent analyses published between Q3 2025 and Q1 2026.
| Signal | ChatGPT Browse | Perplexity | Google AI Overviews | Gemini | Bing Copilot |
|---|---|---|---|---|---|
| Source freshness | Medium — web index updates less frequently | High — near real-time crawling | High — tied to Google’s main index | High — accesses Google index | Medium — tied to Bing’s index cycle |
| Domain authority | High — favors established domains | Medium — will cite niche sources with strong content | High — leverages existing PageRank signals | High — inherits Google’s authority graph | High — leverages Bing’s authority metrics |
| Structured data (schema) | Low — limited schema parsing observed | Medium — uses schema for entity extraction | High — schema is a documented ranking factor | Medium-High — benefits from Google’s schema infrastructure | Medium — schema helps but less weight than Google |
| Reddit/forum presence | Medium — cites Reddit for product queries | High — Reddit is heavily indexed; appears in ~18% of citations | Medium — Reddit surfaces in AI Overviews for experiential queries | Low-Medium | Medium — Bing indexes Reddit via partnership |
| Content structure (headings, lists, tables) | Medium-High | High — well-structured content retrieves consistently better | High — passage-level retrieval favors clean structure | Medium-High | Medium |
| Topical depth | High — comprehensive pages favored | Medium-High — concise, specific answers preferred | High — topical authority is a documented signal | High | Medium-High |
What content formats get cited most often?
Citation frequency varies by content format. A 2025 analysis by Surfer SEO of 12,000 Google AI Overview citations and a parallel study by Semrush of 5,000 Perplexity citations provide a ranked breakdown of which formats appear most often in AI-generated answers.
- Definitions and glossary entries — Present in 28% of AI citations across both studies. These map directly to “what is” queries, which constitute an estimated 15% of all AI prompt patterns.
- Step-by-step instructions — Present in 22% of citations. AI engines pull numbered lists and procedural content for “how to” queries. Pages using HowTo schema were cited 1.8x more often than unstructured how-to content.
- Comparison tables — Present in 18% of citations. Tables are structurally unambiguous, which makes them easy for retrieval models to parse and for language models to reference.
- Statistical data and original research — Present in 15% of citations. AI engines seek specific numbers to ground their answers. Pages containing proprietary data points are cited at roughly 3x the rate of pages that summarize others’ data.
- FAQ pages — Present in 11% of citations. The explicit question-answer structure aligns with how users prompt AI engines, creating a direct retrieval path.
- Expert opinion with attributed quotes — Present in 6% of citations. Less common but valuable for YMYL (Your Money or Your Life) topics, where AI engines prioritize sourced expert perspectives.
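Several of the formats above pair naturally with schema markup. As an illustration, a FAQ page can expose its question-answer structure explicitly with FAQPage JSON-LD; the question and answer text below are placeholder content, not a template you must copy verbatim.

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is Retrieval-Augmented Generation?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "RAG combines a large language model with real-time retrieval of web documents."
    }
  }]
}
```

The JSON-LD goes in a `<script type="application/ld+json">` tag in the page head or body, mirroring the visible on-page Q&A.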
What gets a source excluded from AI citations?
Certain characteristics reliably prevent AI engines from citing a page, even when that page contains relevant information. Understanding these disqualifiers is as important as understanding positive signals. Based on crawl studies and citation absence analysis, the following factors correlate most strongly with exclusion.
- Blocked AI crawlers — If your robots.txt blocks GPTBot (ChatGPT) or PerplexityBot, the respective platform cannot index your content. Note that Google AI Overviews are served from Google's main Search index crawled by Googlebot, so blocking Google-Extended affects Gemini rather than AI Overviews. A 2025 Originality.ai study found that 26% of the top 1,000 websites block at least one AI crawler.
- Hard paywalls without structured preview content — AI engines cannot retrieve content behind authentication walls. Metered paywalls with visible first paragraphs are partially retrievable, but hard paywalls effectively make content invisible to RAG systems.
- Thin content — Pages with fewer than 300 words, minimal structure, or no substantive information are filtered out during the retrieval re-ranking step. Retrieval models assign near-zero relevance scores to thin pages.
- No schema markup — While not an absolute disqualifier, the absence of schema (particularly Article, FAQPage, and HowTo schema) reduces retrieval probability. Across Google AI Overviews specifically, pages without any structured data are cited at roughly half the rate of pages with relevant schema.
- Outdated content — Pages with publication dates older than 18–24 months and no visible update signals (dateModified, update notes) are deprioritized by Perplexity and Google AI Overviews. Freshness thresholds vary by topic: medical and financial content faces stricter recency requirements.
- Duplicate or syndicated content — If the same text appears on multiple domains, AI engines cite the canonical or highest-authority version. Syndicated content on lower-authority domains is typically excluded.
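To rule out the first disqualifier above, check your robots.txt for the AI crawler user-agent tokens. A minimal sketch follows; GPTBot and PerplexityBot are the tokens the vendors document, but verify current names against each platform's crawler documentation before deploying.

```
# Explicitly allow AI answer-engine crawlers
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Googlebot needs no special rule: AI Overviews use Google's main index
```

A blanket `Disallow: /` under any of these tokens, or under `User-agent: *` without a more specific allow, makes your content invisible to that platform's RAG pipeline.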
How can you test whether AI engines cite your content?
Testing your AI citation visibility requires systematic prompt testing across platforms. Manual testing takes 2–4 hours for an initial audit; automated tools can run ongoing monitoring after the baseline is established. Here is a five-platform testing process.
- Google AI Overviews — Search 20–30 of your target queries in Google (logged out, with AI Overviews enabled). For each query, note whether an AI Overview appears, which sources are cited, and whether your domain is among them. Record the exact passage cited. Google AI Overviews appear on approximately 30% of queries as of Q1 2026.
- Perplexity — Enter the same queries in Perplexity.ai. Perplexity always shows inline citations, making it the most transparent platform for testing. Check whether your domain appears in the citation list and which specific page is referenced. Perplexity’s index refreshes within hours for high-authority domains.
- ChatGPT Browse — Use ChatGPT with web browsing enabled. Ask the same queries in conversational format. ChatGPT shows source links when Browse is active. Note that ChatGPT’s web index is less fresh than Perplexity’s, so recently published content may not appear immediately.
- Gemini — Test queries in Google Gemini (gemini.google.com). Gemini has access to Google’s search index and will cite sources for factual queries. Its citation behavior differs slightly from AI Overviews because it is a conversational interface rather than a search augmentation.
- Bing Copilot — Test queries in Bing’s Copilot interface. Copilot uses Bing’s index and shows inline citations. It is particularly worth testing if your audience uses Microsoft Edge or Bing as a default search engine.
Record your results in a spreadsheet with columns for query, platform, cited domains, your citation status (yes/no/partial), and the cited passage. This becomes your AEO baseline. Repeat the test monthly to track progress.
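The baseline spreadsheet can be generated programmatically once you script the audit. A minimal sketch using Python's standard csv module follows; the column names mirror the suggestion above, and the row content (domains, passage) is hypothetical.

```python
import csv
from io import StringIO

# Columns mirror the suggested baseline spreadsheet.
FIELDS = ["query", "platform", "cited_domains", "citation_status", "cited_passage"]

def write_baseline(rows):
    """Serialize audit rows to CSV text; in practice, write to a dated file
    so monthly runs can be diffed against the baseline."""
    buf = StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {"query": "what is answer engine optimization",
     "platform": "Perplexity",
     "cited_domains": "example.com; competitor.com",  # hypothetical domains
     "citation_status": "yes",
     "cited_passage": "AEO is the practice of structuring content so that..."},
]
print(write_baseline(rows))
```

Keeping one row per query-platform pair makes month-over-month comparisons a simple join on the first two columns.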
For detailed guidance on AEO fundamentals, see "What Is Answer Engine Optimization and How Does It Work?". For a comparison of automated citation tracking tools, see "AEO Tracking Tools: How to Measure AI Search Visibility". If you want a professional audit of your AI citation visibility, SCALEBASE's AEO service includes a full five-platform citation analysis.
Frequently Asked Questions
Does Perplexity cite Reddit more than other platforms?
Yes. Perplexity cites Reddit in approximately 18% of its responses, compared to roughly 8–10% for Google AI Overviews and 5–7% for ChatGPT. This is because Perplexity’s retrieval pipeline indexes Reddit aggressively and weights user-generated experiential content for product, recommendation, and troubleshooting queries.
Do AI engines prefer Wikipedia over other sources?
AI engines cite Wikipedia frequently for definitional and entity-related queries, but Wikipedia is not universally preferred. For specialized, technical, or industry-specific topics, AI engines often cite domain-specific sources with deeper expertise. Wikipedia functions as a default when no other authoritative source provides a clear, structured answer.
Can I pay to be cited by AI engines?
No. As of Q1 2026, none of the major AI answer engines offer paid citation placement. Citations are determined entirely by the retrieval and ranking algorithms. There is no advertising product that guarantees citation in AI-generated answers. The only path to citation is through organic content quality, structure, and authority.
How often do AI engines refresh their source index?
It varies by platform. Perplexity refreshes in near real-time, sometimes indexing new content within hours. Google AI Overviews reflect Google's main search index, which crawls high-authority sites daily and lower-authority sites on a weekly or biweekly cycle. ChatGPT Browse's web access is live, but its behavior suggests a less aggressive crawl schedule. Gemini has similar freshness to Google AI Overviews. Bing Copilot reflects Bing's index, which updates daily for major sites.
Does an llms.txt file help with AI citations?
The llms.txt standard is a proposed convention (similar to robots.txt) that helps AI systems understand a website’s structure and key content. Its adoption is growing but not yet universal among AI platforms. As of early 2026, Perplexity and some smaller AI engines reference llms.txt when available. Google and OpenAI have not confirmed that they use it as a ranking signal. It is a low-effort addition that may provide marginal benefit and is unlikely to cause harm.
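For reference, the llms.txt proposal specifies a markdown file served at the site root: an H1 site name, a blockquote summary, then sections of annotated links. The site name and URLs below are hypothetical placeholders.

```
# Example Site

> One-line summary of what the site covers and who it serves.

## Guides
- [What Is AEO?](https://example.com/what-is-aeo): definition and fundamentals
- [AI Citation Testing](https://example.com/citation-testing): five-platform audit process
```

Because the format is plain markdown, it costs minutes to add and nothing to maintain beyond keeping the links current.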

Vigo Nordin
Co-Founder of SCALEBASE, a specialist AEO and SEO agency based in Mallorca, Spain. Focused on AI search optimization, entity building, and engineering citations across ChatGPT, Perplexity, and Google AI Overviews.