[D] Released a 100k-sample dataset on Hugging Face
Posted by AdhesivenessSea9511@reddit | LocalLLaMA | View on Reddit | 6 comments
We’ve released a 100,000-sample Chain-of-Thought (CoT) dataset for fine-tuning local reasoning models.
Each sample includes explicit intermediate reasoning traces, rather than answer-only supervision. The goal is to improve reasoning consistency during supervised fine-tuning, especially for smaller local models.
We’re sharing it here to gather feedback from people working on local LLM fine-tuning and reasoning distillation.
I’d especially love feedback on:
- CoT length
- consistency of reasoning style
- whether full reasoning traces help or hurt smaller local models
Hugging Face:
https://huggingface.co/datasets/Kamisori-daijin/email-datasets-v2-100k
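For readers unfamiliar with the format, a CoT SFT sample generally pairs a prompt with an explicit reasoning trace plus a final answer, rather than the answer alone. The field names below are illustrative assumptions, not the dataset's actual schema:

```python
import json

# Hypothetical schema -- the actual dataset's field names may differ.
sample = {
    "prompt": "A train leaves at 3pm traveling 60 mph. How far has it gone by 5pm?",
    "reasoning": (
        "Step 1: The trip lasts 5pm - 3pm = 2 hours.\n"
        "Step 2: Distance = speed * time = 60 mph * 2 h = 120 miles."
    ),
    "answer": "120 miles",
}

# Answer-only supervision would train on just sample["answer"];
# CoT SFT trains on the trace followed by the answer, so the model
# imitates the intermediate steps as well.
target_text = sample["reasoning"] + "\nFinal answer: " + sample["answer"]
print(json.dumps(sample, indent=2))
```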
Chromix_@reddit
The scope of the dataset is quite limited: there are 100k variations of the same pattern, with the same short response pattern attached to each:
The trained response to the last point is basically:
This trains the model to hallucinate / make up details.
AdhesivenessSea9511@reddit (OP)
That’s a very fair concern, and I actually appreciate you pointing it out.
You’re absolutely right that the current dataset is structurally narrow in terms of prompt-template diversity. While it contains 100k samples, many are controlled variations of the same prompt-response pattern.
The main objective here was not factual knowledge injection, but supervised fine-tuning for reasoning/style consistency and structured response behavior.
That said, I agree that limited prompt diversity may encourage overfitting to stylistic priors and potentially increase the tendency to fabricate plausible details.
Really appreciate the feedback.
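One rough way to quantify the narrowness being discussed here is to collapse each prompt to a template skeleton and count unique skeletons. This is a sketch only; the digit- and name-masking heuristic below is an assumption, not how the dataset was actually constructed:

```python
import re

# Toy prompts standing in for dataset samples (illustrative only).
prompts = [
    "Write an email to Alice about the Q3 report.",
    "Write an email to Bob about the Q1 report.",
    "Write an email to Carol about the Q2 report.",
    "Summarize the meeting notes from Monday.",
]

def skeleton(p: str) -> str:
    # Mask digits and capitalized words to expose the shared template.
    p = re.sub(r"\d+", "<NUM>", p)
    p = re.sub(r"\b[A-Z][a-z]+\b", "<CAP>", p)
    return p

unique = {skeleton(p) for p in prompts}
ratio = len(unique) / len(prompts)
print(f"{len(unique)} unique templates / {len(prompts)} prompts = {ratio:.2f}")
```

A ratio far below 1.0 indicates that most samples are variations of a small number of underlying templates, which is the overfitting concern raised above.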
NandaVegg@reddit
I wonder where this "you are absolutely right" thing originated. It seems pretty epidemic across many models.
Dany0@reddit
Three words:
OpenAI
Mechanical Turk
AutomataManifold@reddit
Yeah, I'm guessing the early RLHF had a pretty strong feedback loop around that.
You'll note that the structure here is a weak version of the five-paragraph essay:
It's a very rigid format that is seldom the best way to present an argument, but GPT-3 in particular really liked it. You'll also note how much of the comment is just restating the comment it is replying to, thereby saying approximately nothing. In human-to-human communication, the polite framing serves as a social signal that the message has been received and understood, but since the LLM doesn't continuously learn, the acknowledgment is entirely forgotten once this message leaves the context (I rather doubt there's a retrieval system backing it up).
Words are frequently speech acts, and most LLM systems aren't hooked up to systems that allow them to actually follow through on what their words imply.
Ironically, there's a good chance that this pattern is the result of something fairly similar to the dataset in the post, where the guidelines for the humans working on the RLHF data led them to disproportionately train on this format.
AutomataManifold@reddit
What evaluation did you use for measuring the effectiveness of this dataset? Without some way of measuring reasoning effectiveness, or what the training causes the model to forget, you are apt to produce useless data.
If you are looking for style consistency, 100k is about 100 times as much data as you actually need: the LIMA results demonstrated strong style and structural training with only about a thousand data points. The key, of course, is that they produced a small set of genuinely diverse, high-quality data points.
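The LIMA-style recipe this comment alludes to amounts to deduplicating aggressively and keeping a small, diverse subset. Below is a minimal greedy near-duplicate filter using character n-gram Jaccard similarity; the n-gram size and threshold are assumptions for illustration, not values from LIMA:

```python
def char_ngrams(text: str, n: int = 4) -> set:
    """All overlapping character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def select_diverse(samples, threshold: float = 0.5, limit: int = 1000):
    """Greedy filter: keep a sample only if it is sufficiently
    dissimilar to everything already kept."""
    kept, kept_ngrams = [], []
    for s in samples:
        grams = char_ngrams(s)
        if all(jaccard(grams, g) < threshold for g in kept_ngrams):
            kept.append(s)
            kept_ngrams.append(grams)
        if len(kept) >= limit:
            break
    return kept

corpus = [
    "Write an email to Alice about the Q3 report.",
    "Write an email to Bob about the Q3 report.",   # near-duplicate, dropped
    "Summarize the meeting notes from Monday.",
]
print(select_diverse(corpus))
```

For a real 100k-sample corpus you would want an approximate method (e.g. MinHash) rather than this quadratic scan, but the principle is the same: a few thousand distinct samples beat 100k variations of one template for style training.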