"Can't live without tool" for LLM datasets?
Posted by Secure_Archer_1529@reddit | LocalLLaMA | View on Reddit | 16 comments
I thought it would be interesting to know what tool people absolutely love using when it comes to LLM training - more specifically creating and preparing datasets?
Also, feel free to share any knowledge you consider a "cheatsheet" or too good to keep to yourself!
Have a great weekend!
lolzinventor@reddit
I realized that a local Postgres database was much better for storing and managing large datasets. Then I realized the scripts I was writing were all very similar, with minor tweaks to prompts for example, so I created this tool, which stores both the prompts and the data in a database.
It basically takes data from one column and a prompt from another, sends both to an LLM, and puts the results in a third column. This process can be cascaded across many columns and prompts, all referenced from the CLI, with helpful db ingress/egress calls. All command-line driven, and multi-threaded in case you are token rich.
https://github.com/chrismrutherford/cliDataForge
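The column-to-column idea above can be sketched in a few lines. This is not cliDataForge's actual schema or code, just a minimal illustration using sqlite3 and a stub in place of the real LLM call; the table and column names are made up:

```python
import sqlite3

def fake_llm(prompt: str, text: str) -> str:
    # Stand-in for a real LLM call; swap in your API client here.
    return f"[cleaned] {text.strip()}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pipeline (raw TEXT, cleaned TEXT)")
conn.executemany("INSERT INTO pipeline (raw) VALUES (?)",
                 [("  first sample ",), ("second sample",)])

prompt = "Clean up this text:"

# Read from one column, run the LLM, write the result to another column.
for rowid, raw in conn.execute("SELECT rowid, raw FROM pipeline").fetchall():
    conn.execute("UPDATE pipeline SET cleaned = ? WHERE rowid = ?",
                 (fake_llm(prompt, raw), rowid))

print(conn.execute("SELECT cleaned FROM pipeline").fetchall())
```

Cascading is then just repeating this step with the output column as the next input column.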
mlabonne@reddit
I made this repo that might be relevant to you: https://github.com/mlabonne/llm-datasets
I discovered the SemHash library (https://github.com/MinishLab/semhash) recently, and that's a really good one for near-deduplication. I recommend giving it a try, it works on CPU.
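For intuition, here is the near-deduplication idea in plain Python. Note this is not SemHash's API (SemHash uses semantic embeddings); this sketch uses simple word-count vectors and cosine similarity just to show the mechanism:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def near_dedupe(texts, threshold=0.9):
    # Keep a text only if it is not too similar to anything already kept.
    kept, vecs = [], []
    for t in texts:
        v = Counter(t.lower().split())
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(t)
            vecs.append(v)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",  # near-duplicate
    "completely different sentence about datasets",
]
print(near_dedupe(docs))  # drops the near-duplicate second entry
```

Embedding-based tools like SemHash catch paraphrases that word-count vectors miss, which is why they're worth using for real datasets.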
coderman4@reddit
As someone just diving into finetuning/dataset preparation, I've found your repo extremely helpful as far as organization of resources goes.
Thanks for creating it; this and augmenttoolkit are the two key things for me right now during the learning process.
Secure_Archer_1529@reddit (OP)
Thanks for commenting. Very well made: nicely structured, with a concise description. I appreciate the time you spent making this contribution. Is this a hobby for you, or do you work with LLMs daily?
DinoAmino@reddit
See for yourself ... https://huggingface.co/mlabonne
No_Afternoon_4260@reddit
The quant guy! I hope you're enjoying London!
DinoAmino@reddit
The Abliterator
fatih_u@reddit
It's a hobby then
MR_-_501@reddit
This might be a controversial take, but honestly just python with some regular expressions and string splits and concats are usually all you need. And sometimes llama-cpp with a decent model (i like to use gemma 27B) for data-cleaning/processing.
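A sketch of the kind of plain-Python cleaning this comment means; the specific patterns are just examples, not anyone's actual pipeline:

```python
import re

def clean(sample: str) -> str:
    # Typical cheap fixes: collapse runs of whitespace, strip Q:/A: markers.
    sample = re.sub(r"\s+", " ", sample).strip()
    sample = re.sub(r"^(Q:|A:)\s*", "", sample)
    return sample

raw = "Q:   What  is\nthe capital of France? "
print(clean(raw))  # "What is the capital of France?"
```

For anything the regexes can't express (rewriting, judging quality), that's where handing the row to a local model like Gemma 27B comes in.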
No_Afternoon_4260@reddit
You're not the only one. I feel a lot of us just returned to plain Python after having tried a lot of overcomplicated frameworks.
FullOf_Bad_Ideas@reddit
Tools for manual dataset investigation and small refinements:
- Tad
- OpenRefine
- Notepad++ with regex
- Sublime Text instead of Notepad++ while I'm booted into Windows
Silly, but handling 2GB text files is not a given.
A lot of dataset processing can be done with Python scripts, which DeepSeek writes well. And for cleaning or generating datasets with other LLMs, make sure to use an engine that does batched inference, so you're not waiting for one request to finish before sending the next.
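The batching point can be sketched with a thread pool keeping many requests in flight. The `call_llm` function here is a stub standing in for an HTTP request to an inference server (vLLM and llama.cpp's server are common choices, mentioned only as examples):

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(text: str) -> str:
    # Stand-in for a request to an OpenAI-compatible inference endpoint.
    return text.upper()

samples = ["clean me", "and me", "me too"]

# Keep several requests in flight instead of sending them one at a time;
# a batching server can then process them together on the GPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_llm, samples))

print(results)  # ['CLEAN ME', 'AND ME', 'ME TOO']
```

`pool.map` preserves input order, which matters when writing results back row-by-row into a dataset.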
toothpastespiders@reddit
Sadly, I've never found anything better than scripting custom solutions for each specific source of data, then going over the output by hand with a little GUI I tossed together to speed things up a bit.
I'm big on quality over quantity when it comes to data. Just very, very, slowly putting together and tailoring it to meet my needs.
The only thing that's a little different in my setup, I think, is that I sometimes leverage an in-progress dictionary and a note/short-term-memory system when extracting data from books that benefit from additional context. That makes it more like actually reading the book rather than reading isolated chunks from it. Then at the end, 'that' also gets processed into the dataset.
I'm mostly just doing it for fun though so I don't know if that's anything too unusual. Seems to work for me though.
JealousAmoeba@reddit
Guidance is extremely useful for generating structured content. Also, hot take: it's possibly the best dev experience in general for doing llama.cpp inference from Python. Its API is well designed, and it gives helpful live visual output as it's generating tokens.
https://github.com/guidance-ai/guidance
Ok-Parsnip-4826@reddit
Python and llama.cpp. I secretly believe that people who use anything else are really just procrastinators with too much time.
Responsible-Front330@reddit
I am happy to be able to just load a Q2 quantized Llama 3.3 on my own RTX 3090. Training would be unthinkable for most mortals.
Secure_Archer_1529@reddit (OP)
Totally. I use cloud for training.