Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp
Posted by purellmagents@reddit | LocalLLaMA | 34 comments
Hey everyone,
I’ve been building a local-first desktop PDF reader that can read technical books aloud and keep the spoken text highlighted while reading.
The original motivation was pretty practical: I read a lot of programming and technical books, but many publishers either don’t offer audio versions or charge extra for AI-generated audio. I wanted to see how far I could get with a completely local setup instead.
The app is built with Tauri 2.0 and runs locally on my Mac. For TTS I’m using Kokoro 82M. On my M1 Mac, there is a short initial wait while things warm up, but after that the generation is fast enough for normal listening. The current sentence / text segment is highlighted in the reader while the audio plays, so it still feels like reading along rather than just listening to a detached audio file.
The current pipeline is roughly:
- Load and render the PDF in the desktop app
- Extract readable text from the current section
- Split the text into chunks suitable for TTS
- Generate speech locally with Kokoro 82M
- Play the audio while highlighting the corresponding source text
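Steps 3 and 5 above interact: to highlight the spoken text, each chunk has to remember where it came from in the extracted text. A minimal sketch of that idea — the sentence regex and the chunk budget are illustrative assumptions, not the app's actual logic:

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    start: int  # character offset into the extracted text
    end: int

# Naive sentence boundary: punctuation followed by whitespace or end of text.
# A real splitter would special-case abbreviations, code, numbers, etc.
_SENT = re.compile(r'[^.!?]*[.!?]+(?:\s+|$)|[^.!?]+$')

def chunk_text(text: str, max_chars: int = 300) -> list[Chunk]:
    """Group sentences into TTS-sized chunks, keeping source offsets
    so the UI can highlight the span currently being spoken."""
    chunks: list[Chunk] = []
    buf_start = buf_end = None
    for m in _SENT.finditer(text):
        if not m.group().strip():
            continue
        if buf_start is None:
            buf_start, buf_end = m.start(), m.end()
        elif m.end() - buf_start <= max_chars:
            buf_end = m.end()
        else:
            chunks.append(Chunk(text[buf_start:buf_end].strip(), buf_start, buf_end))
            buf_start, buf_end = m.start(), m.end()
    if buf_start is not None:
        chunks.append(Chunk(text[buf_start:buf_end].strip(), buf_start, buf_end))
    return chunks
```

During playback, highlighting chunk `i` is then just selecting `text[chunks[i].start:chunks[i].end]` in the rendered view.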
The two export modes I’m thinking about are:
- A straight audiobook mode, where the PDF becomes a set of audio files, with the text optimized via llama.cpp running a Qwen 3.5 0.8B or 2B model
- A podcast-style mode, where the material is transformed into a more conversational format
The most interesting technical problems so far are:
- Keeping the generated speech aligned with the original PDF text
- Handling code snippets and tables in technical books
- Making the first generation fast enough that the app still feels interactive
After the initial 15 sentences are loaded and read aloud, I need to process the next 15 in the background so the reading continues smoothly, or maybe take a completely different approach to how things get preprocessed.
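The "process the next 15 while the current ones play" idea is essentially a bounded producer/consumer. A sketch with a background thread and a fixed-size queue — the `synthesize` callable here is a hypothetical stand-in for the Kokoro call, not the app's real API:

```python
import queue
import threading

def prefetch_audio(chunks, synthesize, window=15):
    """Yield (chunk, audio) pairs in order while a background thread
    keeps up to `window` synthesized chunks ready ahead of playback.
    `synthesize` stands in for the TTS call (text -> audio bytes)."""
    ready = queue.Queue(maxsize=window)

    def worker():
        for chunk in chunks:
            ready.put((chunk, synthesize(chunk)))  # blocks when the window is full
        ready.put(None)  # sentinel: no more chunks

    threading.Thread(target=worker, daemon=True).start()
    while (item := ready.get()) is not None:
        yield item
```

The bounded queue means synthesis never runs more than `window` chunks ahead, so memory stays flat even for a long book.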
That’s where the project is at right now. I’m still mostly building it for my own reading workflow, but if the result becomes useful enough and the codebase is not too embarrassing, I may open source it later.
More-Curious816@reddit
Do you have a video demo? I'm curious about the quality of the voice.
purellmagents@reddit (OP)
No, not yet, but I can upload one in the next few days.
Powerful_Ad8150@reddit
A sample of what you've achieved? Which languages are supported?
purellmagents@reddit (OP)
At the moment it supports German and English with two male and two female voices (my preferences). But as Kokoro supports Spanish, French, Hindi, Italian, Japanese, Portuguese and Mandarin Chinese and many different voices, they could be integrated easily. I built this app for myself so I don't have to pay for audio anymore, for example at Manning. I am sharing my journey here.
More-Curious816@reddit
Did you try [or have an idea of] using a voice in German as the source but with the output in English?
To illustrate:
1. Collect samples of speech in German of a famous painter
2. Train a TTS on them
3. The output of the TTS is in English
purellmagents@reddit (OP)
I didn't try it yet. The way I could imagine this in my app: the PDF is turned into a webpage-like structure with the translated text, which then gets read aloud in German while you see the highlighted text. Or, simpler, just generate audio in German from an English PDF. Do you mean something like that?
Mayion@reddit
Can't wait for it to be trained on voice actors to mimic anime VAs in manga form.
More-Curious816@reddit
I was thinking about that too for a while. I don't have the hardware or experience currently, but my plan was to train a model on VAs and vibe-code a light novel [epub and pdf] reader to display the text and read it in the voices of the characters.
Mayion@reddit
Can you imagine? No more anime cliffhangers, just jump on the manga from where the anime stopped and enjoy the experience
DIBSSB@reddit
Hey, awesome project! A few questions:
VibeVoice support: Any plans to add VibeVoice as a TTS option alongside Kokoro? Curious if you've looked into it at all.
Linux / Windows: Do you plan to support Linux or Windows? Many people like me don't own a Mac, so I'm wondering if this will ever be usable for me.
PDF line breaks: How are you handling PDF line breaks? Like when a sentence continues on the next line but the PDF treats it as a hard break: when I tried something similar, the TTS would pause at every line even mid-sentence, making the audio sound really unnatural. Are you cleaning the extracted text with an LLM before passing it to the audio model, or handling it some other way?
purellmagents@reddit (OP)
Thank you so much!
I just had a quick look. I would need to read into it more. You mean through Qwen TTS?
Yes, through GitHub Actions it's possible to build the Tauri app for different platforms, and I want to make it available for Linux and Windows too. If someone were available for testing before I publish, that would be great.
At the moment I just do a simple regex cleanup and it sounds a bit funny. I was just thinking about how to solve this. Integrating a locally running LLM for this could make the listening experience very slow for readers with slower computers. I will have a deeper think about this today and share the outcome with you.
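For what it's worth, a regex cleanup can get surprisingly far without an LLM if it handles the two common extraction artifacts: words hyphenated across line breaks, and soft wraps inside sentences. A heuristic sketch (these rules are my own assumptions, not the app's actual cleanup):

```python
import re

def unwrap_pdf_text(raw: str) -> str:
    """Heuristic cleanup of hard-wrapped PDF text extraction:
    1. rejoin words hyphenated across a line break ("implemen-\\ntation"),
    2. treat a newline after a non-sentence-ending character as a soft wrap,
    3. keep blank-line runs as real paragraph breaks."""
    # 1. "word-\nword" -> "wordword"
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', raw)
    # 2. single newline not preceded by sentence punctuation -> space
    text = re.sub(r'(?<![.!?:;\n])\n(?!\n)', ' ', text)
    # 3. collapse runs of blank lines into one paragraph break
    return re.sub(r'\n{2,}', '\n\n', text)
```

It will still misjudge lines that genuinely end mid-word or headings without punctuation, but it removes the per-line TTS pauses without any model in the loop.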
DIBSSB@reddit
I will test both Windows and Linux.
I am talking about https://huggingface.co/microsoft/VibeVoice-ASR
English_linguist@reddit
What happened, was VibeVoice re-released? Is this model better or worse than the OG release?
human_bean_@reddit
It doesn't handle any custom emotions or voice changes depending on the speaking character? Like for actual novels.
Radi1229@reddit
Would you consider sharing it? I also want to build the same thing for myself, because I learn through listening.
purellmagents@reddit (OP)
Sure, I could do that after I've cleaned up the codebase and added a few small features to make it a more rounded app. Is there something you'd wish for in such an app beyond just pressing play and pause?
Radi1229@reddit
Maybe to export the audio as MP3, so I can listen to it while taking a walk.
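Since the pipeline already produces per-chunk audio, export could be as simple as concatenating the clips. A stdlib sketch that writes WAV rather than MP3 (MP3 would need an external encoder such as ffmpeg or LAME; this assumes all clips share the same sample rate and format):

```python
import wave

def concat_wavs(paths, out_path):
    """Concatenate WAV clips with identical parameters into one file.
    For MP3 output, the result could be piped through an external
    encoder afterwards (e.g. ffmpeg) -- not shown here."""
    with wave.open(out_path, "wb") as out:
        params_set = False
        for path in paths:
            with wave.open(path, "rb") as clip:
                if not params_set:
                    out.setparams(clip.getparams())
                    params_set = True
                out.writeframes(clip.readframes(clip.getnframes()))
```

The `wave` module patches the frame count in the header when the output file is closed, so clips of any length can be appended.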
Steus_au@reddit
try qwen tts, it’s quite impressive.
purellmagents@reddit (OP)
What do you think is most impressive about it?
ShengrenR@reddit
Kokoro is a very different type of TTS model - it's absolutely 'great for its size' and speed and all sorts of good things.. but just run the same sentence through it and then again through qwen tts or index tts2 or moss tts, etc. If you don't notice a big difference, rock on, but I'm willing to bet you do. Kokoro will absolutely give you consistent, even generations through the whole 'audiobook', but it's not something I'd enjoy listening to for more than 2-3 minutes - the speaker will be saying each of the words, but they won't care about any of it.
purellmagents@reddit (OP)
That sounds like I have to look into Qwen TTS today to hear the difference. I just saw that they have a voice design option too, that's super interesting. If it sounds more natural than Kokoro and runs at a decent speed, then I'll switch to it. Thanks for pointing that out!
ShengrenR@reddit
The voice design won't be consistent run-to-run, so for usability you'd want to generate the sample you want with that and then use the voice clone path. Qwen-tts and moss-tts both have those options; I recommend taking a look at index tts2 as well for emotion steering. None of those will hold a candle to kokoro for speed: kokoro is 82M params, while the others are few-billion-param models. More power and capability, but at the cost of speed and performance. Will depend on your hardware.
buildingstuff_daily@reddit
the highlighted text following along with the audio is the detail that elevates this from a cool hack to something actually usable for daily reading. without that sync feature you lose your place every time you look away
kokoro 82M being the TTS engine is interesting because at that size it can actually run on modest hardware. most local TTS options are either too robotic (pyttsx3 style) or require a massive model that needs a dedicated GPU. 82M hits a sweet spot
one question: how does it handle technical content? PDFs with code blocks, equations, tables, and diagrams are where most text-to-speech falls apart. does it skip those, read them literally (which sounds terrible), or does it do something smarter like summarizing what the table shows?
iMakeSense@reddit
I feel like this is reinventing the wheel a bit. There are already a bunch of webhosted flows that do this with a lot more features on Pinokio or Dione that handle things like epubs and what have you in addition to PDFs. Could you not fork one of those and build on top of it so you could avoid re-building the implementation wheel?
slavetothesound@reddit
Dione shut down this month. I wasn't familiar so I went looking for those apps.
iMakeSense@reddit
It's migrating to a different team
purellmagents@reddit (OP)
For sure, partially it is. I always wanted to build a desktop app with Tauri; that's why I started this project. And I love the idea of an app that is not web-hosted. This one just runs on my computer and costs nothing except a bit of computation power.
iMakeSense@reddit
Oh yeah, I get that and agree. I guess what I'm saying is: why not just Tauri-ify the existing apps instead of making a whole new UI, processing flow, etc.?
Technical-Earth-3254@reddit
If it doesn't sound like the google translate reader, I'm highly interested in this project
radlinsky@reddit
Love this project idea. Keep it up!
FullstackSensei@reddit
What is the role of Qwen here, if I may ask?
purellmagents@reddit (OP)
I am just starting this task: I want to optionally turn the uploaded PDF into a podcast, and that needs some text rewriting. I thought that could be a lot of fun to listen to if done right.
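For the rewriting step, llama.cpp's bundled server exposes a POST `/completion` endpoint that takes a JSON body with a `prompt` and sampling parameters and returns the generated text in a `content` field. A sketch of building such a request for a podcast-style rewrite — the prompt wording and default port are my own assumptions, not the app's actual template:

```python
import json

def podcast_rewrite_request(passage: str, host: str = "http://127.0.0.1:8080"):
    """Build URL and JSON body for a llama.cpp server /completion call
    asking a small local model (e.g. a Qwen GGUF) to rewrite a passage
    conversationally. Returns (url, body_bytes) for an HTTP POST."""
    prompt = (
        "Rewrite the following textbook passage as a short, friendly "
        "podcast segment for listeners. Keep all technical facts.\n\n"
        f"Passage:\n{passage}\n\nPodcast segment:"
    )
    body = json.dumps({
        "prompt": prompt,
        "n_predict": 512,    # cap the length of the rewrite
        "temperature": 0.7,  # some variety, but stay on-script
    }).encode("utf-8")
    return f"{host}/completion", body
```

POSTing `body` with `Content-Type: application/json` (e.g. via `urllib.request`) and reading the response's `content` field would give the rewritten segment to feed into the TTS step.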
Eitamr@reddit
This is cool, good job!
purellmagents@reddit (OP)
Thank you!