LongPage: 300 full novels with reasoning traces for training better writing LLMs
Posted by Senior_Evidence_3793@reddit | LocalLLaMA | 10 comments
Current LLMs struggle with long-form creative writing because they lack hierarchical planning. LongPage addresses this by providing the missing reasoning scaffolds.
What it is:
- 300 complete books (Project Gutenberg classics) with full reasoning traces
- 40,000 to 600,000+ tokens per book
- Multi-layered planning: character archetypes, story arcs, world rules, scene breakdowns
- Rich structural metadata (dialogue density, pacing, narrative focus)
Why it matters: This is the "Chain of Thought for creative writing" - explicit reasoning traces showing models how to plan character development, plot progression, and maintain thematic coherence across entire books.
Training applications:
- Cold-start SFT → RL workflows with a 3-component structure (prompt, thinking, book); see the sketch after this list
- Inference-time scaffolding using reasoning traces as plans
- Hierarchical training: book-level plans → chapter expansions → scene continuations
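For illustration, here's a minimal sketch of how the 3-component structure could be assembled into a single SFT sequence. The `<think>` wrapper and parameter names are illustrative, not the dataset's exact schema; the example compose script in the dataset repo has the real layout:

```python
# Minimal sketch: one SFT training sequence from the three components.
# Tag format and names are assumptions; see the dataset's compose script.
def build_sft_sample(prompt: str, thinking: str, book: str) -> str:
    """Concatenate prompt, reasoning trace, and target book text."""
    return f"{prompt}\n<think>\n{thinking}\n</think>\n{book}"
```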
Currently 300 books, scaling to 100K. All reasoning generated by Qwen3-32B with iterative agent validation across scene → chapter → book levels.
HF Link: https://huggingface.co/datasets/Pageshift-Entertainment/LongPage
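It loads like any Hub dataset if you want to poke around (the split name and columns below aren't guaranteed; inspect what comes back):

```python
# Sketch: load LongPage with the `datasets` library and inspect the schema.
# The split name "train" is an assumption; check the dataset card.
from datasets import load_dataset

ds = load_dataset("Pageshift-Entertainment/LongPage", split="train")
print(ds.column_names)  # discover the actual fields
print(ds[0])            # one book with its reasoning trace and metadata
```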
Anyone working on long-form generation? Would love to hear what training approaches you're planning to try with this.
ohHesRightAgain@reddit
I'm really looking forward to seeing where this goes. Fantastic idea.
Free-Mud5520@reddit
What is the end goal with this one?
Interesting_Nerve_67@reddit
Noice
SnakeIsBetterThanGo@reddit
wow, can't wait to see what Anthropic does with this
Senior_Evidence_3793@reddit (OP)
Lol, better be excited about what we are going to do with it 😉
We have big plans for it, big plans
XMasterDE@reddit
Looks amazing
Stepfunction@reddit
Is there a repo for the code used to prepare the dataset? That would be incredibly useful.
Senior_Evidence_3793@reddit (OP)
Not a repo, but we did include a dataset compose file:
https://huggingface.co/datasets/Pageshift-Entertainment/LongPage/blob/main/exampel_compose.py
youarebritish@reddit
This is an interesting idea, but how have the reasoning traces been validated? In my experience, even frontier LLMs are terrible at fiction analysis. When prompted to analyze a subplot in even a very simple story that isn't in their training data, they have never once given me an answer I'd give a passing grade to.
I was reading this paper just the other day about how bad LLMs are at understanding analogies, and IMO this is one of the main reasons they are so bad at writing and understanding fiction. Analogy is to me one of the primary skills of a writer.
Senior_Evidence_3793@reddit (OP)
This part was actually quite painful to get working
TL;DR: a lot of hand-engineering and throwing tokens at the problem
Longer version:
What we did was separate the larger task of generating the synthetic reasoning traces into many small tasks: basically, every single component of the CoT was generated by its own hand-engineered agent, which performed multiple model calls to produce the final component.
Hand-engineering all of these agents took around 2 months, and inference for the 300 books cost around 20K, just to give you an idea of the scale of token consumption and manual effort that went into the dataset.
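In pseudocode, the shape of the pipeline looks roughly like this. Everything here (class names, component list, round counts) is illustrative, not our actual agent stack:

```python
# Illustrative sketch of per-component CoT agents, each making several
# model calls (draft -> critique -> revise) before emitting its piece.
# `llm` is any callable that maps a prompt string to a completion string.
from dataclasses import dataclass

@dataclass
class ComponentAgent:
    component: str       # e.g. "character_archetypes"
    rounds: int = 3      # one draft pass plus critique/revise passes

    def run(self, book_text: str, llm) -> str:
        draft = llm(f"Extract the {self.component} from:\n{book_text}")
        for _ in range(self.rounds - 1):
            critique = llm(f"Critique this {self.component} analysis:\n{draft}")
            draft = llm(f"Revise using this critique:\n{critique}\n---\n{draft}")
        return draft

def build_trace(book_text: str, llm) -> dict:
    components = ["character_archetypes", "story_arcs",
                  "world_rules", "scene_breakdown"]
    return {c: ComponentAgent(c).run(book_text, llm) for c in components}
```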
We also provide a short description of the agent stack in the README. And if you're then still not convinced about the quality of the reasoning traces, I recommend taking a look at the dataset yourself. 😉