How to Build an Expert Knowledge File for Claude

Where Your AI Gets Its Info

Most people don’t actually know where ChatGPT or Claude pull their answers from. The full picture: they were trained on billions of pages of the public internet (Wikipedia, Reddit, news, blogs, basically anything free to access). When they need newer info and web search is turned on, they run a live search and read the top results.

Among the sources these tools cite, Reddit shows up more than any other single site. And the implications are bigger than people realize.

Reddit signed a multi-year content-licensing deal with OpenAI in May 2024. Terms were never officially disclosed, but analysts estimate it’s worth roughly $60–70 million per year based on Reddit’s data-licensing revenue disclosures. The deal gives OpenAI licensed access to Reddit content, on top of the public Reddit pages that were already part of web-scale training data.

~$60-70M: estimated annual value of the OpenAI / Reddit licensing deal (terms undisclosed)

#1: Reddit is the most-cited domain across major AI search platforms (5WPR 2026 Citation Source Index)

Independent research from 5WPR’s 2026 AI Platform Citation Source Index found Reddit accounts for roughly 40% of all citations across major AI search platforms, the highest of any single domain. ChatGPT’s Reddit citation share briefly peaked near 60% in mid-2025 before settling lower.

So when you ask ChatGPT or Claude for advice, the answer often leans heavily on what’s upvoted on Reddit. That means everyone using AI for advice tends to get the same flavor of Reddit-derived answer.

The Hidden Layer Your AI Can’t See

Your AI doesn’t know about premium Substack newsletters behind a paywall. It can’t access podcast transcripts unless someone uploaded them. It misses the X threads from real operators that never get scraped. An entire layer of expert content is invisible to your AI by default — and that’s the layer where the actual insight lives.

The fix is straightforward: build a private knowledge file Claude can read every time you work together. Fill it with sources that AREN’T scraped by the major training runs — podcast transcripts, premium Substacks, expert X threads, niche industry newsletters — specific to YOUR field.

Now your AI is pulling from the people actually winning in your industry, not generic Reddit consensus. Here’s the exact setup I run.

01

Pick your 3-5 types of source

For my e-commerce + AI work, mine are four types: 5-10 premium Substacks from operators I trust, 3-4 podcasts I listen to weekly, the 10-15 X accounts that post real signal in my space, and a couple of industry newsletters. Whatever your field, pick the sources where YOUR experts actually publish.

02

Set up Apify to scrape + transcribe

I use Apify as the data layer. Its Store has off-the-shelf actors that scrape Substack posts, transcribe podcast audio (YouTube + Spotify), pull X threads, and grab newsletter archives. Set up one actor per source and put each one on a schedule.
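As a rough illustration of the "one actor per source" setup, here is a minimal sketch that pulls the results of an actor’s most recent successful run over Apify’s REST API, using only the Python standard library. The actor ID shown in the usage note is a placeholder, not a real actor, and the last-run dataset endpoint is written as I understand it from Apify’s API reference, so treat the URL shape as an assumption and check it against the current docs.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def last_run_items_url(actor_id: str, limit: int = 100) -> str:
    """Build the URL for the dataset items of an actor's last successful run.

    actor_id uses Apify's "username~actor-name" form (assumed here).
    """
    query = urllib.parse.urlencode(
        {"status": "SUCCEEDED", "format": "json", "limit": limit}
    )
    actor = urllib.parse.quote(actor_id, safe="~")
    return f"{API_BASE}/acts/{actor}/runs/last/dataset/items?{query}"


def fetch_items(actor_id: str, token: str, limit: int = 100) -> list:
    """Download the items as a list of dicts, sending the API token as a header."""
    request = urllib.request.Request(
        last_run_items_url(actor_id, limit),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Usage would look something like `fetch_items("someuser~substack-scraper", os.environ["APIFY_TOKEN"])`, where both the actor name and the environment variable are hypothetical. Each actor returns its own item schema, so inspect one item before building anything on top of it.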

03

Build the knowledge directory

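In the folder where you run Claude Code, create a directory called something like knowledge/ (if you use a Project in the Claude app instead, you upload the same files as project knowledge). Each source gets a subdirectory: knowledge/substacks/, knowledge/podcasts/, knowledge/x-threads/. A small sync script can then drop each actor’s output into the right folder after every run.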
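One way the filing step could work, sketched under assumptions: scraped items arrive as a list of dicts, and the field names `title`, `url`, `date`, and `text` are hypothetical (real actors each use their own schema, so map the fields accordingly). The sketch writes one Markdown file per item into `knowledge/<source>/` and skips anything already saved.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Turn a title into a lowercase, filename-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_items(items: list, root: Path, source: str) -> list:
    """Write each item to root/source/<date>-<slug>.md; return newly written paths."""
    folder = root / source
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        title = item.get("title") or "Untitled"
        date = str(item.get("date") or "undated")[:10]  # keep YYYY-MM-DD only
        path = folder / f"{date}-{slugify(title)}.md"
        if path.exists():
            continue  # already saved on an earlier run
        header = (
            f"# {title}\n\n"
            f"Source: {item.get('url', 'unknown')}\n"
            f"Date: {date}\n\n"
        )
        path.write_text(header + (item.get("text") or ""), encoding="utf-8")
        written.append(path)
    return written
```

Keeping the source URL and date at the top of every file is a deliberate choice: it lets Claude say where a claim came from and how old it is.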

04

Wire Claude into the knowledge file

In your CLAUDE.md file (the instructions file Claude Code reads at the start of a session) or in your Project instructions in the Claude app, point Claude at the directory: “Before answering any question about [my industry], search the knowledge/ directory for relevant context first.” Claude will now check your private corpus before falling back to its training data. The answers shift immediately.
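If you work in Claude Code, project-level instructions live in a CLAUDE.md file at the project root. Here is a minimal sketch that adds the instruction quoted above to that file without duplicating it on repeat runs. The heading text and the function name are my own choices for illustration, and the `[my industry]` placeholder is left for you to fill in.

```python
from pathlib import Path

INSTRUCTION = (
    "## Knowledge directory\n"
    "Before answering any question about [my industry], search the "
    "knowledge/ directory for relevant context first.\n"
)


def ensure_instruction(project_root: Path, filename: str = "CLAUDE.md") -> bool:
    """Append INSTRUCTION to the instructions file unless it is already present.

    Returns True if the file was changed, False if nothing needed doing.
    """
    path = project_root / filename
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if INSTRUCTION in existing:
        return False
    if not existing:
        separator = ""
    elif existing.endswith("\n"):
        separator = "\n"
    else:
        separator = "\n\n"
    path.write_text(existing + separator + INSTRUCTION, encoding="utf-8")
    return True
```

Calling `ensure_instruction(Path("."))` from the project root is enough. Editing the file by hand works just as well; the script only matters if you set up several projects the same way.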

05

Keep it fresh on a schedule

Set the Apify actors to run weekly so the knowledge stays current. The compounding part is real: after 3 months, you’ve got a private corpus of the smartest content in your field, all searchable by Claude. After 12 months, your AI is genuinely an expert in your space — not because you trained a model, but because you fed it the right reading list.
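A weekly schedule only helps if every source actually delivers. As a hedged sketch, the check below reports which source folders under knowledge/ have not received a new file recently. The 8-day threshold (one week plus a day of slack) and the function name are assumptions of mine, not part of the setup described above.

```python
import time
from pathlib import Path
from typing import Dict, Optional


def stale_sources(
    root: Path, max_age_days: float = 8.0, now: Optional[float] = None
) -> Dict[str, float]:
    """Map each stale source folder to the age in days of its newest file.

    A folder containing no files at all is reported with an age of infinity.
    """
    now = time.time() if now is None else now
    stale = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        mtimes = [f.stat().st_mtime for f in folder.rglob("*") if f.is_file()]
        age_days = (now - max(mtimes)) / 86400 if mtimes else float("inf")
        if age_days > max_age_days:
            stale[folder.name] = age_days
    return stale
```

Running `stale_sources(Path("knowledge"))` after the weekly sync returns an empty dict when everything is current. A non-empty result usually points at an actor that failed or a source that changed its page layout.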

The Edge This Builds

Your competitors are asking the same ChatGPT or Claude that everyone has access to. You’re asking a version that’s been quietly fed the actual newsletters, podcasts, and threads where the real operators in your space publish. Same model, completely different answers. That’s the asymmetric edge.

The Real Win

The base model is a commodity. Everyone has access to the same Claude or ChatGPT. The thing that makes your AI useful in YOUR specific work is the knowledge you feed it. Build the knowledge file. Run it weekly. Three months in, you won’t recognize the quality of the answers.


Build AI Systems That Run Your Work, Business, And Life
