Earlier this week, DeepSeek, a well-funded Chinese AI lab, released an “open” AI model that beats many rivals on popular benchmarks. The model, DeepSeek V3, is large but efficient, handling text-based tasks like coding and essay writing with ease.
It also seems to think it’s ChatGPT.
Posts on X, along with TechCrunch’s own testing, show that DeepSeek V3 identifies itself as ChatGPT, OpenAI’s AI-powered chatbot platform. Asked to elaborate, DeepSeek V3 insists it’s a version of OpenAI’s GPT-4 model released in 2023.
This is definitely reproducing as of today. In 5 out of 8 generations DeepSeekV3 claims to be ChatGPT (v4), while only thrice it claims to be DeepSeekV3.
Gives you a rough idea of the distribution of your training data.
— Lucas Beyer (bl16) (@giffmana) December 27, 2024
The delusion runs deep. Ask DeepSeek V3 a question about DeepSeek’s API, and it gives you instructions on how to use OpenAI’s API instead. DeepSeek V3 even tells some of the same jokes as GPT-4, right down to the punchlines.
So what’s going on?
Models like ChatGPT and DeepSeek V3 are statistical systems. Trained on billions of examples, they learn patterns in those examples to make predictions, like how “to whom” in an email typically precedes “it may concern.”
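The pattern-prediction idea can be illustrated with a deliberately tiny toy sketch: a bigram counter that predicts the next word from whatever word most often followed it in its training text. (Real models like GPT-4 and DeepSeek V3 are neural networks operating on tokens, not word counts; this is only an analogy.)

```python
from collections import Counter, defaultdict

# Toy "language model": count which word most often follows each word
# in a tiny training corpus, then predict by picking the most frequent
# follower. Illustrative only; not how any production model works.
corpus = (
    "to whom it may concern "
    "to whom it may concern "
    "to whom this letter comes"
).split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    # Return the most frequent follower of `word` in the corpus.
    return follows[word].most_common(1)[0][0]

print(predict_next("whom"))  # "it" follows "whom" most often -> it
```

A model trained this way can only echo the patterns in its data, which is exactly why training data provenance matters so much in the story above.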
DeepSeek hasn’t revealed much about the source of DeepSeek V3’s training data. But there’s no shortage of public datasets containing text generated by GPT-4 via ChatGPT. If DeepSeek V3 was trained on these, the model might have memorized some of GPT-4’s outputs and now regurgitates them verbatim.
“Obviously, the model is seeing raw responses from ChatGPT at some point, but it’s not clear where that is,” Mike Cook, a research fellow at King’s College London studying AI, told TechCrunch. “It could be ‘accidental’… but unfortunately, we have seen instances of people directly training their models on the outputs of other models to try to piggyback off their knowledge.”
Cook noted that the practice of training models on the outputs of rival AI systems can be “very bad” for model quality, because it can lead to hallucinations and misleading answers like those above. “Like taking a photocopy of a photocopy, we lose more and more information and connection to reality,” Cook said.
It could also violate those systems’ terms of service.
OpenAI’s terms prohibit users of its products, including ChatGPT customers, from using outputs to develop models that compete with OpenAI’s own.
OpenAI and DeepSeek didn’t immediately respond to requests for comment. However, OpenAI CEO Sam Altman posted what appeared to be a dig at DeepSeek and other competitors on X on Friday.
“It is (relatively) easy to copy something that you know works,” Altman wrote. “It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”
Admittedly, DeepSeek V3 is far from the first model to misidentify itself. Google’s Gemini and others sometimes claim to be competing models. Prompted in Mandarin, for example, Gemini says it’s Wenxinyiyan, the chatbot from Chinese company Baidu.
And that’s because the web, where AI companies source the bulk of their training data, is becoming littered with AI slop. Content farms use AI to create clickbait. Bots are flooding Reddit and X. By one estimate, 90% of the web could be AI-generated by 2026.
This “contamination,” if you will, has made it quite difficult to thoroughly filter AI outputs from training datasets.
It’s entirely possible that DeepSeek trained DeepSeek V3 directly on ChatGPT-generated text. Google was once accused of doing the same.
Heidy Khlaaf, senior AI scientist at the nonprofit AI Now Institute, said the cost savings from “distilling” an existing model’s knowledge can be attractive to developers, regardless of the risks.
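For readers unfamiliar with the term, “distillation” refers to training a smaller or newer “student” model to match a “teacher” model’s output distribution rather than raw ground-truth labels. The core of it can be sketched in a few lines (a toy illustration with made-up numbers, not a claim about how DeepSeek or OpenAI actually train):

```python
import math

# Toy sketch of knowledge distillation: the student is penalized by
# cross-entropy against the teacher's softened output probabilities.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # A higher temperature spreads probability mass, exposing more of
    # the teacher's relative preferences ("dark knowledge").
    t = softmax([x / temperature for x in teacher_logits])
    s = softmax([x / temperature for x in student_logits])
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# The closer the student's outputs are to the teacher's, the lower the loss.
close = distillation_loss([2.0, 1.0, 0.1], [2.1, 0.9, 0.2])
far = distillation_loss([0.1, 1.0, 2.0], [2.1, 0.9, 0.2])
print(close < far)  # True
```

Training on another model’s chat outputs scraped from the web amounts to a crude, unsanctioned form of the same idea, which is what makes it both cheap and risky.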
“Even with internet data now brimming with AI outputs, other models that accidentally train on ChatGPT or GPT-4 outputs would not necessarily demonstrate outputs reminiscent of OpenAI’s customized messages,” Khlaaf said. “If DeepSeek carried out distillation partially using OpenAI models, it would not be surprising.”
However, it’s more likely that a lot of ChatGPT/GPT-4 data made its way into the DeepSeek V3 training set. For one thing, that means the model can’t be trusted to identify itself. More concerning, though, is the possibility that by uncritically absorbing and iterating on GPT-4’s outputs, DeepSeek V3 could exacerbate some of that model’s biases and flaws.