Just read that AI training data sets are mostly synthetic now

I was digging through a report from Epoch AI last week and found something wild. Apparently over 60% of the data used to train new large language models is generated by other AI models rather than written by humans. I always thought the magic was in scraping real human writing from the web, but the big labs are basically feeding AI output back into new models because they've already scraped almost everything useful. It makes me wonder if we're building models that just get better at sounding human without understanding anything more deeply. Has anyone else looked into how this affects the quality of AI outputs in the long term?

2 comments

craig.nathan · 23d ago

Oh come on, is it really that bad though?

jesse988 · 23d ago · Top Commenter

Totally agree, I've been saying this for a bit now. I was watching a breakdown of how much of GPT-4's training data was synthetic and it honestly creeped me out. These models are just getting better at mimicking the patterns they already learned, not picking up any new human nuance. I remember seeing a study that showed models trained mostly on synthetic data start sounding really generic and bland after a while, like every sentence is the same safe average. It feels like we're building a feedback loop that smooths out all the weird, messy parts of real human writing, which are usually what make it interesting. This whole thing makes me worried we're gonna end up with AI that's super confident but knows less and less about the real world over time.
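
If anyone wants to poke at the feedback loop idea, here's a toy sketch. It is not from the study I mentioned and it's nothing like real LLM training. It just assumes each "generation" is built by resampling the previous generation's output, with a made-up corpus of 1000 distinct phrasings, and counts how many distinct ones survive. The function name and numbers are my own for illustration.

```python
import random

def distinct_per_generation(corpus, generations, seed=0):
    """Resample a corpus from itself and track how many distinct items survive."""
    rng = random.Random(seed)
    current = list(corpus)
    counts = [len(set(current))]
    for _ in range(generations):
        # Each new "training set" is drawn only from the previous generation's output,
        # so anything that fails to get sampled is gone for good.
        current = rng.choices(current, k=len(current))
        counts.append(len(set(current)))
    return counts

# Hypothetical corpus: 1000 distinct "phrasings", each appearing once.
counts = distinct_per_generation(range(1000), generations=20)
print(counts)
```

The count can never go back up, because a generation can only contain things the previous one already had. Rare items drop out first and never return, which is roughly the "smoothing out" I'm worried about.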