"Can't live without tool" for LLM datasets?
Posted by Secure_Archer_1529@reddit | LocalLLaMA | View on Reddit | 16 comments
I thought it would be interesting to know what tool people absolutely love using when it comes to LLM training - more specifically creating and preparing datasets?
Also, feel free to share any knowledge you consider a "cheatsheet" or too good to keep to yourself!
Have a great weekend!
lolzinventor@reddit
I realized that a local Postgres database was much better for storing and managing large datasets. Then I realized the scripts I was writing were all very similar, with minor tweaks to prompts for example, so I created this tool, which stores both the prompts and the data in a database.
It basically takes data from one column and a prompt from another, sends both to an LLM, and puts the results in a third column. This process can be cascaded across many columns and prompts, all referenced from the CLI, with helpful db ingress/egress calls. All command-line driven, and multi-threaded in case you are token rich.
https://github.com/chrismrutherford/cliDataForge
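The column-to-column idea above can be sketched in a few lines. This is not cliDataForge's actual schema or code, just a minimal illustration using sqlite3 and a stub in place of the real LLM call; the table and column names are made up:

```python
import sqlite3

def fake_llm(prompt: str, text: str) -> str:
    # Stand-in for a real LLM call; swap in your API client here.
    return f"[cleaned] {text.strip()}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pipeline (raw TEXT, cleaned TEXT)")
conn.executemany("INSERT INTO pipeline (raw) VALUES (?)",
                 [("  first sample ",), ("second sample",)])

prompt = "Clean up this text:"

# Read from one column, run the LLM, write the result to another column.
for rowid, raw in conn.execute("SELECT rowid, raw FROM pipeline").fetchall():
    conn.execute("UPDATE pipeline SET cleaned = ? WHERE rowid = ?",
                 (fake_llm(prompt, raw), rowid))

print(conn.execute("SELECT cleaned FROM pipeline").fetchall())
```

Cascading is then just repeating this step with the output column as the next input column.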
mlabonne@reddit
I made this repo that might be relevant to you: https://github.com/mlabonne/llm-datasets
I discovered the SemHash library (https://github.com/MinishLab/semhash) recently, and that's a really good one for near-deduplication. I recommend giving it a try, it works on CPU.
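For intuition, here is the near-deduplication idea in plain Python. Note this is not SemHash's API (SemHash uses semantic embeddings); this sketch uses simple word-count vectors and cosine similarity just to show the mechanism:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def near_dedupe(texts, threshold=0.9):
    # Keep a text only if it is not too similar to anything already kept.
    kept, vecs = [], []
    for t in texts:
        v = Counter(t.lower().split())
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(t)
            vecs.append(v)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",  # near-duplicate
    "completely different sentence about datasets",
]
print(near_dedupe(docs))  # drops the near-duplicate second entry
```

Embedding-based tools like SemHash catch paraphrases that word-count vectors miss, which is why they're worth using for real datasets.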
coderman4@reddit
As someone just diving into finetuning/dataset preparation, I've found your repo extremely helpful as far as organization of resources goes.
Thanks for creating it; this and augmenttoolkit are the two key things for me right now during the learning process.
Secure_Archer_1529@reddit (OP)
Thanks for commenting. Very well made: nicely structured, with a concise description. I appreciate the time you spent making this contribution. Is this a hobby for you, or do you work with LLMs daily?
DinoAmino@reddit
See for yourself ... https://huggingface.co/mlabonne
No_Afternoon_4260@reddit
The quant guy! I hope you're enjoying London!
DinoAmino@reddit
The Abliterator
fatih_u@reddit
It's a hobby then
MR_-_501@reddit
This might be a controversial take, but honestly just python with some regular expressions and string splits and concats are usually all you need. And sometimes llama-cpp with a decent model (i like to use gemma 27B) for data-cleaning/processing.
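A sketch of the kind of plain-Python cleaning this comment means; the specific patterns are just examples, not anyone's actual pipeline:

```python
import re

def clean(sample: str) -> str:
    # Typical cheap fixes: collapse runs of whitespace, strip Q:/A: markers.
    sample = re.sub(r"\s+", " ", sample).strip()
    sample = re.sub(r"^(Q:|A:)\s*", "", sample)
    return sample

raw = "Q:   What  is\nthe capital of France? "
print(clean(raw))  # "What is the capital of France?"
```

For anything the regexes can't express (rewriting, judging quality), that's where handing the row to a local model like Gemma 27B comes in.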
No_Afternoon_4260@reddit
You're not the only one. I feel a lot of us just returned to plain Python after having tried a lot of overcomplicated frameworks.
FullOf_Bad_Ideas@reddit
Tools for manual dataset investigation and small refinements:
- Tad
- OpenRefine
- Notepad++ with regex
- Sublime Text instead of Notepad++ while I'm booted into Windows
Silly, but handling 2GB text files is not a given.
A lot of dataset processing can be done with Python scripts, which DeepSeek writes well. And for cleaning or generating datasets with other LLMs, make sure to use an engine that does batched inference, so you're not waiting for one request to finish before sending the next.
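The batching point can be sketched with a thread pool keeping many requests in flight. The `call_llm` function here is a stub standing in for an HTTP request to an inference server (vLLM and llama.cpp's server are common choices, mentioned only as examples):

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(text: str) -> str:
    # Stand-in for a request to an OpenAI-compatible inference endpoint.
    return text.upper()

samples = ["clean me", "and me", "me too"]

# Keep several requests in flight instead of sending them one at a time;
# a batching server can then process them together on the GPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_llm, samples))

print(results)  # ['CLEAN ME', 'AND ME', 'ME TOO']
```

`pool.map` preserves input order, which matters when writing results back row-by-row into a dataset.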
toothpastespiders@reddit
Sadly, I've never found anything better than scripting custom solutions for each specific source of data, then going over the output by hand with a little GUI I tossed together to speed things up a bit.
I'm big on quality over quantity when it comes to data. Just very, very, slowly putting together and tailoring it to meet my needs.
The only thing that's a little different in my setup, I think, is that I sometimes leverage an in-progress dictionary and a note/short-term-memory system when extracting data from books that benefit from additional context. That makes it more like actually reading the book rather than reading isolated chunks from it. Then at the end, 'that' also gets processed into the dataset.
I'm mostly just doing it for fun though so I don't know if that's anything too unusual. Seems to work for me though.
JealousAmoeba@reddit
Guidance is extremely useful for generating structured content. Also, hot take: it's possibly the best dev experience in general for doing llama.cpp inference from Python. Its API is well designed, and it gives helpful live visual output as it's generating tokens.
https://github.com/guidance-ai/guidance
Ok-Parsnip-4826@reddit
Python and llama.cpp. I secretly believe that people who use anything else are really just procrastinators with too much time.
Responsible-Front330@reddit
I am happy to be able to just load a Q2 quantized Llama 3.3 on my own RTX 3090. Training would be unthinkable for most mortals.
Secure_Archer_1529@reddit (OP)
Totally. I use cloud for training.