1M datasets on HF !
Posted by qlhoest@reddit | LocalLLaMA | View on Reddit | 9 comments
This community is gold! Congrats on pushing AI forward together with open datasets!
sammoga123@reddit
The anti-AI crowd will say that all that data is stolen XDDD
Silver-Champion-4846@reddit
Depends. Some is crawled from Reddit, some is Wikipedia, etc.
Environmental-Metal9@reddit
Quite a few synthetically generated or augmented in some way too!
Silver-Champion-4846@reddit
Many people draw the line at synthetic data, since it's been generated by a frontier model which has been trained on the whole internet, which contains both ethically and unethically sourced elements, the majority being the latter.
StupidScaredSquirrel@reddit
Cool, where are the gooner datasets?
llama-impersonator@reddit
anthracite-org
Environmental-Metal9@reddit
Video, image, or text? Text is definitely there. Roleplay logs and synthetically generated RP scenarios, as well as visual novels, tend to dominate the space, but there are some lit erotica plain-text ones there too. For the multimodal and image generation stuff I’m sure there are some, but from what I understand people curate their own with a lot of scraping.
NoStage9115@reddit
how many gooner datasets?
Environmental-Metal9@reddit
87