❓ When Will the Data for Training LLMs Run Out?
In the next 2 years, humanity might face the strangest shortage in history: running out of human-created texts. Language models (LLMs) would then exhaust their training data, which could cause a scaling crisis. That is the conclusion of researchers studying AI's impact on our world.
0️⃣ "Data Drought"
2026–2032 — researchers consider this period the most likely timeframe for the complete depletion of text data for training LLMs. It could happen even sooner if models are heavily overtrained due to the AI race and the scaling of popular LLMs.
Three Main Conclusions from Researchers
1️⃣ Textual data will become the bottleneck in developing more advanced LLMs.
2️⃣ AI-generated synthetic data is still insufficiently studied. It is useful in narrow fields like mathematics and programming. Some believe such data can be dangerous, since AI might make mistakes when creating it.
3️⃣ Private data, such as personal messages, is unlikely to be used on a large scale due to legal issues.
🔠 Solutions to the Crisis
Researchers propose several ways to keep developing LLMs once text data runs short:
➡️ Synthetic data.
➡️ Training on other types of data.
➡️ Increasing data efficiency.
💲 Who Can I Sell My Data To?
Companies are already offering internet users monetary rewards for their data, which can be used to train AI models. Here are some of them:
➡️ TIKI — for access to users' mobile devices. They are interested in user behavior within apps partnered with TIKI.
➡️ Caden — for access to personal accounts on Netflix and Amazon. Earnings range from $5 to $50 per month.
➡️ Invisible offers access to paid news articles in exchange for demographic and behavioral data, including information on vaccinations and users' political affiliations. The company plans to trade this data for digital subscriptions costing between $4 and $15 per month.
@hiaimediaen
Number of the day
300 trillion tokens — the amount of text created by humanity that is currently available for training AI models.
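A rough sketch of the arithmetic behind estimates like this: if the largest training dataset grows by a constant factor every year, it eventually reaches the available stock of about 300 trillion tokens. The starting point (15 trillion tokens in 2024) and the growth factors are hypothetical values picked for illustration, not figures from the researchers.

```python
STOCK_TOKENS = 300e12  # available human-created text, per the figure above


def depletion_year(stock_tokens, start_year, start_dataset_tokens, yearly_growth):
    """Return the first year in which the dataset size reaches the stock.

    Assumes the dataset grows by the same factor every year, which is a
    simplification made for this sketch.
    """
    if yearly_growth <= 1:
        raise ValueError("yearly_growth must be greater than 1")
    year, size = start_year, start_dataset_tokens
    while size < stock_tokens:
        size *= yearly_growth
        year += 1
    return year


if __name__ == "__main__":
    # Hypothetical scenarios: slow, moderate, and aggressive dataset growth.
    for growth in (1.5, 2.0, 3.0):
        year = depletion_year(STOCK_TOKENS, 2024, 15e12, growth)
        print(f"growth x{growth} per year: stock reached in {year}")
```

The faster datasets grow, the earlier the stock is reached, which is why an intensifying AI race could pull the date forward.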