The Synthetic Data Playbook: Generating Trillions of the Finest Tokens

Posted by joelinho95@reddit | LocalLLaMA

Introducing the Synthetic Data Playbook: we generated over 1T tokens across 90 experiments, using 100k+ GPU hours, to figure out what makes good synthetic data and how to generate it at scale.

[https://huggingface.co/spaces/HuggingFaceFW/finephrase](https://huggingface.co/spaces/HuggingFaceFW/finephrase)