I created a tool to generate large synthetic datasets with visual feedback (with walk-through)

Posted by davernow@reddit | LocalLLaMA | View on Reddit | 9 comments

Hi everyone,

I’ve been working on Kiln AI, and I just added some pretty cool synthetic data generation tools:

Features

Topic-trees: generate a nested topic tree to build content breadth.
Great UI: our one-click apps for Mac and Windows provide a really nice UX for synthetic data generation (see video walkthrough).
Human Guidance: if it’s not generating quite what you want, you can inject guidance about content, style or format at any time.
Rating Tools: after generation, use our rating tools to rate the responses. Where needed, correct the output with LLM assistance.
Private and local: we can’t access your data. Run completely locally with Ollama (or bring your own API keys).
Structured data: works great with JSON formatted inputs and outputs (optional)
Auto-prompts: once you have rated a few examples, you can switch to our smart automatic prompts like few-shot, multi-shot, chain-of-thought or multi-shot-chain-of-thought (without code or writing prompts). Quality quickly get better with examples.
Open source library to load the dataset into any python project or notebook, or run the generation tasks from code. Open source OpenAPI REST API for building custom tools. The UI completely free and source-available on Github.
Collaborative and iterative: have a team? Share the dataset via Git or a shared drive, and everyone can pitch in. PM can make initial training data, QA can use it to get training+eval data to resolve issues, DS can load it for training/evals, etc.
Credits: the prompts for synthetic generation were extended from promptwrite. The code is all custom to Kiln.

What's Next

As you can probably guess, fine-tuning is coming next 😀. The goal is to make is super easy/fast to start from scratch, generate a large synthetic dataset, and evaluate a variety of methods (fine-tines, different models, prompting tactics, etc).

How to get started:

Try it out: Download for MacOS or Windows
Star it on GitHub: github.com/Kiln-AI/Kiln
Report any issues, or request/upvote feature-requests: github.com/Kiln-AI/Kiln/issues

I’d love any feedback, ideas or suggestions! Feel free to file issues or DM me.

[-]

Jinxx00111@reddit

I've been looking for a tool that would help me generate synthetic data. my data is unstructured json and contains multiturn conversation and function calls.

celsowm@reddit

Nice! Could this tool be useful for my issue: I need to create a dataset about things like: https://pge.rj.gov.br/entendimentos/enunciados

davernow@reddit (OP)

My Portuguese is rusty 😀. What’s the use case?

Legal opinions from state attorneys

VulcanizadorTTL@reddit

does it support generating tools / function calling datasets?

have you tested it on generating properties in other languages but english? for example if i specify a prompt to generate a structure do i need to specify the properties to be generated in spanish?

It uses tools under the hood when the model supports it. Both the input and output schemas are validated; nothing that doesn't conform is saved. It's technically saving the JSON structures, but it would be trivial to create a pre-processor to convert that to a tool call if you wanted to train for tool calling.

Just tried asking it for another language it worked. I added it to the "Human guidance" section of an English prompt. It probably works with prompts in other languages as well. How well this will works will mostly come down to model.

Key-Coyote-9552@reddit

Very cool, I was thinking of using AI to generate a synthetic data set for the hands-on portion of an analytics event I am hosting at my company. I’ll have to give this a shot. 👏🙌

UAAgency@reddit

This looks super interesting to me! Can you talk about some real world use cases ?

For sure, DM me!