[D] Released a 100k-sample dataset on Hugging Face
Posted by AdhesivenessSea9511@reddit | LocalLLaMA | View on Reddit | 6 comments
We’ve released a 100,000-sample Chain-of-Thought (CoT) dataset for fine-tuning local reasoning models.
Each sample includes explicit intermediate reasoning traces, rather than answer-only supervision. The goal is to improve reasoning consistency during supervised fine-tuning, especially for smaller local models.
We’re sharing it here to gather feedback from people working on local LLM fine-tuning and reasoning distillation.
I’d especially love feedback on:
- CoT length
- consistency of reasoning style
- whether full reasoning traces help or hurt smaller local models
Hugging Face:
https://huggingface.co/datasets/Kamisori-daijin/email-datasets-v2-100k
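For readers unfamiliar with the format, a CoT SFT sample generally pairs a prompt with an explicit reasoning trace plus a final answer, rather than the answer alone. The field names below are illustrative assumptions, not the dataset's actual schema:

```python
import json

# Hypothetical schema -- the actual dataset's field names may differ.
sample = {
    "prompt": "A train leaves at 3pm traveling 60 mph. How far has it gone by 5pm?",
    "reasoning": (
        "Step 1: The trip lasts 5pm - 3pm = 2 hours.\n"
        "Step 2: Distance = speed * time = 60 mph * 2 h = 120 miles."
    ),
    "answer": "120 miles",
}

# Answer-only supervision would train on just sample["answer"];
# CoT SFT trains on the trace followed by the answer, so the model
# imitates the intermediate steps as well.
target_text = sample["reasoning"] + "\nFinal answer: " + sample["answer"]
print(json.dumps(sample, indent=2))
```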
Chromix_@reddit
The scope of the dataset is quite limited: there are 100k variations of the same pattern, with the same short response pattern attached to each:
The trained response to the last point is basically:
This trains the model to hallucinate / make up details.
AdhesivenessSea9511@reddit (OP)
That’s a very fair concern, and I actually appreciate you pointing it out.
You’re absolutely right that the current dataset is structurally narrow in terms of prompt-template diversity. While it contains 100k samples, many are controlled variations of the same prompt-response pattern.
The main objective here was not factual knowledge injection, but supervised fine-tuning for reasoning/style consistency and structured response behavior.
That said, I agree that limited prompt diversity may encourage overfitting to stylistic priors and potentially increase the tendency to fabricate plausible details.
Really appreciate the feedback.
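One rough way to quantify the narrowness being discussed here is to collapse each prompt to a template skeleton and count unique skeletons. This is a sketch only; the digit- and name-masking heuristic below is an assumption, not how the dataset was actually constructed:

```python
import re

# Toy prompts standing in for dataset samples (illustrative only).
prompts = [
    "Write an email to Alice about the Q3 report.",
    "Write an email to Bob about the Q1 report.",
    "Write an email to Carol about the Q2 report.",
    "Summarize the meeting notes from Monday.",
]

def skeleton(p: str) -> str:
    # Mask digits and capitalized words to expose the shared template.
    p = re.sub(r"\d+", "<NUM>", p)
    p = re.sub(r"\b[A-Z][a-z]+\b", "<CAP>", p)
    return p

unique = {skeleton(p) for p in prompts}
ratio = len(unique) / len(prompts)
print(f"{len(unique)} unique templates / {len(prompts)} prompts = {ratio:.2f}")
```

A ratio far below 1.0 indicates that most samples are variations of a small number of underlying templates, which is the overfitting concern raised above.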
NandaVegg@reddit
I wonder where this "you are absolutely right" thing originated. It seems pretty epidemic across many models.
Dany0@reddit
Three words:
OpenAI
Mechanical Turk
AutomataManifold@reddit
Yeah, I'm guessing the early RLHF had a pretty strong feedback loop around that.
You'll note that the structure here is a weak version of the five-paragraph essay:
It's a very rigid format that is seldom the best way to present an argument, but GPT-3 in particular really liked it. You'll also note how much of the comment is just restating the comment it is replying to, thereby saying approximately nothing. In human-to-human communication, the polite framing serves as a social signal that the message has been received and understood, but since the LLM doesn't continuously learn, the acknowledgment is entirely forgotten once this message leaves the context (I rather doubt there's a retrieval system backing it up).
Words are frequently speech acts, and most LLM systems aren't hooked up to systems that allow them to actually follow through on what their words imply.
Ironically, there's a good chance that this pattern is the result of something fairly similar to the dataset in the post, where the guidelines for the humans working on the RLHF data led them to disproportionately train on this format.
AutomataManifold@reddit
What evaluation did you use for measuring the effectiveness of this dataset? Without some way of measuring reasoning effectiveness, or what the training causes the model to forget, you are apt to produce useless data.
If you are looking for style consistency, 100k is about 100 times as much data as you actually need: the LIMA results demonstrated strong style and structural training with only about a thousand data points. The key, of course, is that they produced a small set of genuinely diverse, high-quality data points.
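The LIMA-style recipe this comment alludes to amounts to deduplicating aggressively and keeping a small, diverse subset. Below is a minimal greedy near-duplicate filter using character n-gram Jaccard similarity; the n-gram size and threshold are assumptions for illustration, not values from LIMA:

```python
def char_ngrams(text: str, n: int = 4) -> set:
    """All overlapping character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def select_diverse(samples, threshold: float = 0.5, limit: int = 1000):
    """Greedy filter: keep a sample only if it is sufficiently
    dissimilar to everything already kept."""
    kept, kept_ngrams = [], []
    for s in samples:
        grams = char_ngrams(s)
        if all(jaccard(grams, g) < threshold for g in kept_ngrams):
            kept.append(s)
            kept_ngrams.append(grams)
        if len(kept) >= limit:
            break
    return kept

corpus = [
    "Write an email to Alice about the Q3 report.",
    "Write an email to Bob about the Q3 report.",   # near-duplicate, dropped
    "Summarize the meeting notes from Monday.",
]
print(select_diverse(corpus))
```

For a real 100k-sample corpus you would want an approximate method (e.g. MinHash) rather than this quadratic scan, but the principle is the same: a few thousand distinct samples beat 100k variations of one template for style training.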