⚡️ Biggest open text dataset release of the year: SmolTalk is a 1M-sample synthetic dataset that was used to train SmolLM2.
TL;DR:
🧩 New datasets: Smol-Magpie-Ultra (400K) for instruction tuning; Smol-constraints (36K) for precise, constrained outputs; Smol-rewrite (50K) & Smol-summarize (100K) for rewriting and summarization.
🤝 Public Dataset Integrations: OpenHermes2.5 (100K), MetaMathQA & NuminaMath-CoT, Self-Oss-Starcoder2-Instruct, LongAlign & SystemChats2.0
🥇 Outperforms the new Orca AgentInstruct 1M dataset when used to train 1.7B and 7B models
🏆 Models trained on SmolTalk outperform those trained on OpenHermes and Magpie Pro on IFEval and MT-Bench
⚗️ All new synthetic datasets were generated with distilabel (see the sketch after this list)
🤗 Released under Apache 2.0 on the Hugging Face Hub
Synthetic generation pipelines and training code released.
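To make the distilabel point concrete, here is a minimal pipeline sketch, assuming distilabel 1.x; the seed instruction, model ID, and Hub repo name are placeholders for illustration, not the actual SmolTalk generation setup:

```python
# Minimal distilabel sketch (assumes distilabel >= 1.0). The seed data and
# model are placeholders; SmolTalk's real pipelines are much larger.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="toy-synthetic-data") as pipeline:
    # A single seed instruction to answer, just to show the wiring.
    load = LoadDataFromDicts(
        data=[{"instruction": "Explain what a dataset card is."}]
    )
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"  # placeholder model
        ),
    )
    load >> generate  # connect the loading step to the generation task

if __name__ == "__main__":
    distiset = pipeline.run()  # returns a Distiset with the generated rows
    # Optionally publish, as release pipelines typically do:
    # distiset.push_to_hub("your-username/toy-synthetic-data")
```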
Dataset: https://huggingface.co/datasets/HuggingFaceTB/smoltalk
Generation Code: https://github.com/huggingface/smollm
Training Code: https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2
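For quick experimentation, the dataset loads directly with the 🤗 datasets library; a minimal sketch, where the "all" config and "train" split names are assumptions based on the dataset card:

```python
# Hedged sketch: load SmolTalk with the Hugging Face datasets library.
# The "all" config and "train" split are assumptions; adjust if they differ.
from datasets import load_dataset

dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(dataset)                  # overview of features and row count
print(dataset[0]["messages"])   # first conversation as role/content turns
```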
@opendatascience