This week we will discuss Cosmopedia, a 35-billion-token dataset of synthetic textbooks, blog posts, stories, social posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
- Cosmopedia Blog Post
- Cosmopedia Datasets: Full and 100K sample (a loading sketch follows this list)
- Cosmopedia GitHub repository
- cosmo-1B: a Cosmopedia pretrained model
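
To get a feel for the data and the model before reading the blog post, here is a minimal sketch using the Hugging Face `datasets` and `transformers` libraries. The repository IDs (`HuggingFaceTB/cosmopedia-100k`, `HuggingFaceTB/cosmo-1b`) and the `prompt`/`text` column names are assumptions based on the usual Hub naming; check the dataset and model cards linked above.

```python
# A minimal sketch, assuming the 100K sample and cosmo-1B live at the Hub IDs below;
# verify against the dataset/model cards linked in the list above.
from datasets import load_dataset
from transformers import pipeline

# Peek at the 100K sample: each record is assumed to pair a seed prompt
# with the Mixtral-generated text.
sample = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
print(sample[0]["prompt"][:200])
print(sample[0]["text"][:500])

# Generate a short continuation with cosmo-1B, the model pretrained on Cosmopedia.
generator = pipeline("text-generation", model="HuggingFaceTB/cosmo-1b")
print(generator("Photosynthesis is", max_new_tokens=64)[0]["generated_text"])
```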