Allegations Surface Regarding OpenAI’s Use of Paywalled O’Reilly Books for AI Training
A recent study suggests OpenAI may have trained its AI models on paywalled O’Reilly books without proper licensing, raising copyright concerns.

So, OpenAI’s back in the hot seat, this time over how it trains its AI models. A fresh paper claims the company might’ve trained on paywalled O’Reilly books without saying ‘pretty please’ (or, you know, getting licensing agreements). The scoop comes from the AI Disclosures Project, which found that GPT-4o recognizes this paywalled content far more readily than its older sibling, GPT-3.5 Turbo. Awkward.
Here’s how they figured it out: they used a method called DE-COP (fancy, right?) to sniff out copyrighted stuff in the models’ training diet. Basically, if a model can reliably pick a human-written passage out of a lineup of AI-generated paraphrases of that same passage, it has probably seen the original before. Sneaky, but clever.
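Want to see the lineup idea in miniature? Here is a rough Python sketch, not the researchers' actual code. Everything in it is made up for illustration: the `quiz_accuracy` helper, the toy passages, and the two stand-in "models" (`memorizer` and `always_first`). A real run would ask an actual language model to make each pick and would need far more passages plus careful statistics.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a "model" is any function that gets a shuffled list
# of candidate passages and returns the index it believes is the original.
Guesser = Callable[[Sequence[str]], int]
Item = Tuple[str, Sequence[str]]  # (original passage, its paraphrases)


def quiz_accuracy(items: Sequence[Item], guesser: Guesser, seed: int = 0) -> float:
    """Fraction of quizzes in which the guesser picks the original passage."""
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)  # fixed seed so the shuffles are repeatable
    correct = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # hide the original at a random position
        if options[guesser(options)] == original:
            correct += 1
    return correct / len(items)


# Toy data (invented): one original plus three paraphrases per item.
items = [
    ("The quick brown fox jumps over the lazy dog.",
     ["A fast brown fox leaps over a sleepy dog.",
      "Over the lazy dog jumps a speedy fox.",
      "The dog, being lazy, is jumped over by a fox."]),
    ("Premature optimization is the root of all evil.",
     ["Optimizing too early causes most problems.",
      "Most trouble starts with tuning code too soon.",
      "Early optimization is where the evil begins."]),
]

# Stand-in for a model that saw the originals during training.
seen = {original for original, _ in items}


def memorizer(options: Sequence[str]) -> int:
    return next(i for i, text in enumerate(options) if text in seen)


# Stand-in for a model with no knowledge: it always picks the first option.
def always_first(options: Sequence[str]) -> int:
    return 0


print("memorizer:", quiz_accuracy(items, memorizer))
print("always_first:", quiz_accuracy(items, always_first))
```

With four options per quiz, pure guessing should land near 25% over many passages, so a score well above that is the hint that the model has met the original text before.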
Long story short, GPT-4o recognizes these O’Reilly books a lot better than GPT-3.5 Turbo does. But before you grab your pitchforks, the researchers admit their method’s not foolproof. Maybe OpenAI just picked up the excerpts from users pasting them into ChatGPT. Who knows?
OpenAI’s been hungry for data, hunting for top-notch training material and even cutting deals with content bigwigs. But these allegations? They’re just more fuel for the fiery debate over AI training and who owns what. Stay tuned.