2026-06-21 · 8 min read

Clean, human-written datasets for RAG: why pre-2022 text is becoming scarce

If you build retrieval-augmented generation (RAG) systems, your output quality is capped by the quality of what you retrieve. And the corpus most teams reach for first — 'just scrape the web' — is quietly getting worse. Since 2022, a growing share of public web text is itself machine-generated. Grounding or training on that text reintroduces the very errors and homogenization you were trying to avoid.

The short version

  • AI-generated text now makes up a large and rising fraction of new web content.
  • Training on model output causes measurable degradation, often called 'model collapse'; grounding on it lets subtle factual drift pass into your answers.
  • Text authored before ~2022 is a useful proxy for 'verifiably human-written', because it predates widespread generative AI.
  • Public-domain pre-2022 works (e.g. the Project Gutenberg catalogue) are license-clear in the US and generally safe to redistribute there, unlike most scraped web text. Copyright status can differ in other countries.
  • The practical move: curate a clean, provenance-attested base corpus instead of scraping indiscriminately.

Why AI-generated training data hurts RAG

Two distinct problems show up. The first is model collapse: when models are trained on data produced by earlier models, rare patterns in the distribution's tails disappear first, and successive generations converge toward bland, averaged output. The second is provenance contamination: if your retrieval corpus contains AI hallucinations, your RAG system will faithfully cite them as if they were ground truth. Retrieval makes wrong sources more convincing, not less.

Both problems are hard to detect after the fact, because AI-generated text is fluent. Fluency is exactly what makes it dangerous as a reference: it reads authoritative whether or not it is correct.
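
To make the first problem concrete, here is a toy sketch of tail loss, not a real training run. It assumes a 'model' is nothing more than the empirical frequencies of its training sample, and that each generation is fitted only to samples drawn from the previous one. The function name `resample_generations`, the Zipf-like weights, and the sample sizes are all illustrative choices.

```python
import random
from collections import Counter


def resample_generations(weights, samples_per_generation, generations, seed=0):
    """Return how many categories still have non-zero probability
    at the start and after each generation of resampling."""
    rng = random.Random(seed)
    categories = list(range(len(weights)))
    current = list(weights)
    support_sizes = [sum(1 for w in current if w > 0)]
    for _ in range(generations):
        # "Generate" a finite corpus from the current model...
        draws = rng.choices(categories, weights=current, k=samples_per_generation)
        counts = Counter(draws)
        # ...then "train" the next model on that corpus alone.
        current = [counts.get(c, 0) for c in categories]
        support_sizes.append(sum(1 for w in current if w > 0))
    return support_sizes


# Hypothetical long-tailed distribution: 200 categories, weight 1/rank.
weights = [1.0 / rank for rank in range(1, 201)]
sizes = resample_generations(weights, samples_per_generation=500, generations=10)
print(sizes)  # surviving categories per generation
```

Once a category draws zero samples it can never reappear, so the count can only shrink. Rare categories tend to go first because they are the likeliest to be missed in any finite sample.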

What 'clean' actually means

A corpus is clean for grounding purposes when you can answer three questions about every document: Who wrote it? When? Under what license can you use it? For most scraped web text, all three answers are 'unknown'. For a curated public-domain library, all three are known and checkable.

Verifiably human, pre-2022, public-domain text is one of the few corpora where provenance is not a guess.
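
The three questions can be turned into a mechanical check at ingest time. This is a minimal sketch: the `Provenance` record and the `unanswered_questions` helper are hypothetical names, and a non-empty field only shows that an answer is recorded, not that anyone has verified it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical per-document provenance record."""
    author: Optional[str]
    year: Optional[int]
    license: Optional[str]


def unanswered_questions(p: Provenance) -> List[str]:
    """List which of the three questions (who, when, license) lack an answer."""
    missing = []
    if not p.author:
        missing.append("who")
    if p.year is None:
        missing.append("when")
    if not p.license:
        missing.append("license")
    return missing


def is_clean(p: Provenance) -> bool:
    """A document is clean for grounding only if all three answers are present."""
    return not unanswered_questions(p)


# Illustrative records: an anonymous scraped page and a public-domain novel.
scraped = Provenance(author=None, year=None, license=None)
curated = Provenance(author="Jane Austen", year=1813, license="Public domain (US)")

print(unanswered_questions(scraped))
print(is_clean(curated))
```

Rejecting or quarantining documents with any unanswered question keeps 'unknown' from silently entering the retrieval corpus.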

How to source pre-2022 public-domain text safely

  • Start from the public domain. In the US, as of 2026, works published before 1931 are public domain, plus many later works whose copyright lapsed. The cutoff year advances each January.
  • Prefer curated sources over raw dumps. Project Gutenberg texts are hand-prepared and license-clear.
  • Attach provenance metadata at ingest time — author, publication year, source, license — so it survives into your vector store.
  • Chunk deliberately. For dense prose, ~800-token chunks with ~100-token overlap are a reasonable starting point; tune both against your own retrieval evaluation.
  • Keep an editorial summary per document. It helps human browsing and can improve retrieval recall when embedded alongside chunks.
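
The metadata and chunking steps above might be combined as follows. This is a sketch under stated assumptions: whitespace splitting stands in for your embedding model's tokenizer (real token counts will differ), and `chunk_tokens`, `build_records`, and the metadata field names are hypothetical rather than any particular vector store's API.

```python
def chunk_tokens(tokens, chunk_size=800, overlap=100):
    """Split a token list into windows of chunk_size that share
    `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def build_records(text, metadata, chunk_size=800, overlap=100):
    """Attach the same provenance metadata to every chunk of one document."""
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    return [
        {**metadata, "text": " ".join(chunk), "chunk_index": i}
        for i, chunk in enumerate(chunk_tokens(tokens, chunk_size, overlap))
    ]


# Illustrative document of 1,700 placeholder tokens.
demo_text = " ".join(f"w{i}" for i in range(1700))
demo_meta = {
    "author": "Jane Austen",
    "year": 1813,
    "source": "Project Gutenberg",
    "license": "Public domain (US)",
}
records = build_records(demo_text, demo_meta)
print(len(records), records[0]["author"], records[-1]["chunk_index"])
```

Copying the metadata onto each chunk, rather than storing it once per document, means a retrieved passage still carries its author, year, source, and license when it reaches the prompt.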

Where CleanSource fits

CleanSource is a curated library of pre-2022, public-domain, human-written texts — packaged as 'clean bundles' that ship the provenance attestation, an editorial summary, tags, themes, and suggested chunking right alongside the source links. The free tier lets you browse the full catalogue and download five bundles a month; themed packs and an unlimited tier exist for teams that want the whole curated set. The point is not volume. It is a verifiable floor you can cite.

Clean input is the cheapest quality lever in a RAG stack. As the rest of the web fills with slop, starting from a known-clean base is going to matter more, not less.
