The US companies already scraped the data while they could. If anything, data scraping is far, far more difficult now for everyone, for technical reasons.
Most of the new models are trained on synthetic data, on higher-quality data, or with RLHF. The reason DeepSeek is able to perform is likely that LLMs are still very new, so there is a lot of low-hanging fruit. It's no longer just about the data; we hit that limit quite some time ago.
Honestly, even from the beginning it was pretty obvious that scraped data was going to have a ton of issues. There’s too much nonsense out there, both from misinformation and from people just not being able to communicate.
That’s before you get into the ethical aspects of stealing other people’s content and the way these things are being misused.