AryanEmbered@reddit
That's not speech to speech.
That's speech to text to text to speech.
ahmetegesel@reddit
So it is STTTS
trararawe@reddit
Actually it's STTTTTS
DaleCooperHS@reddit
No, the guy just trained a full multimodal model in his basement, Sherlock. LOL
martian7r@reddit (OP)
I wish I had unlimited GPUs and datasets, would love to try it then lol
DeltaSqueezer@reddit
speech to speech is just speech to numbers to speech anyway.
martian7r@reddit (OP)
Yes, basically converting the input audio directly into the high-dimensional vectors that the LLM understands. Here is an implementation - https://github.com/fixie-ai/ultravox
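Conceptually (a minimal sketch of the idea, not Ultravox's actual code; the sizes and names below are made up for illustration), an audio encoder produces frame embeddings and a small projector maps them into the LLM's token-embedding space, so they can be fed to the LLM alongside ordinary text embeddings:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; real models differ.
AUDIO_DIM, LLM_DIM = 1024, 4096

class AudioProjector(nn.Module):
    """Maps audio-encoder frames into the LLM's embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(AUDIO_DIM, LLM_DIM),
            nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, audio_frames):          # (batch, frames, AUDIO_DIM)
        return self.proj(audio_frames)        # (batch, frames, LLM_DIM)

# audio_frames would come from a speech encoder (e.g. a Whisper-style encoder);
# the projected frames are then interleaved with text token embeddings and fed
# straight into the LLM, skipping an explicit transcription step.
audio_frames = torch.randn(1, 50, AUDIO_DIM)
llm_ready = AudioProjector()(audio_frames)
print(llm_ready.shape)  # torch.Size([1, 50, 4096])
```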
martian7r@reddit (OP)
Yes, mentioned in the title itself; building a speech-to-speech model would require huge data and resources, which a normal dev lacks. Cheers :)
__Maximum__@reddit
To be fair, they elaborated right in the title
M0shka@reddit
!remindme 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2025-04-10 13:20:23 UTC to remind you of this link
YearnMar10@reddit
Real time depends so much on your hardware… so some benchmarks with different configurations would be good. I can tell you right away, though, that Whisper Large produces seconds of delay on my machine, which makes it not "real time" imho.
well done nonetheless ofc!
martian7r@reddit (OP)
Yeah, it depends on the hardware; I was running this on an A100 machine with 100+ CPU cores 💀
YearnMar10@reddit
What’s the delay you get between speaking and receiving a spoken response back?
martian7r@reddit (OP)
Would love to hear your feedback and suggestions!
Extra-Designer9333@reddit
For TTS I'd definitely recommend checking out the fine-tuned model that tops HuggingFace's TTS models page alongside Kokoro: canopylabs/orpheus-3b-0.1-ft. Definitely check it out; I found it cooler than Kokoro despite it being way bigger. Its big advantage is good control over emotions using special tokens.
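If I remember the Orpheus-TTS README correctly, those emotion tags are written inline in the prompt text; a rough sketch (package, class, and argument names are from memory, so verify against the repo):

```python
# Sketch based on the Orpheus-TTS README (from memory; verify against
# https://github.com/canopyai/Orpheus-TTS - names here are assumptions).
import wave
from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")

# Emotion control: special tags like <laugh>, <chuckle>, <sigh> go directly
# in the text and are rendered as non-verbal sounds.
prompt = "I finally got the whole pipeline running locally <laugh> it only took all weekend <sigh>."

with wave.open("orpheus_demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 16-bit PCM
    wf.setframerate(24000)   # Orpheus outputs 24 kHz audio
    for chunk in model.generate_speech(prompt=prompt, voice="tara"):
        wf.writeframes(chunk)
```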
CommunityTough1@reddit
So, I have a similar pipeline for my web app (VAD-web, Whisper V3 Large Turbo, any LLM, and Kokoro), and I tried Orpheus, albeit through an inference provider (Chutes, I think, or maybe Replica). Way too slow for an STS-like pipeline compared to Kokoro. Kokoro can generate a paragraph in 1s or less, while Orpheus was taking around 30 seconds per paragraph. Orpheus obviously sounds much better, but the slowness killed it for me.
Extra-Designer9333@reddit
According to the developers of Orpheus, they're working on smaller versions; check out their checklist. It'll still be slower than Kokoro, but the inference gap shouldn't be as huge as it is now. https://github.com/canopyai/Orpheus-TTS
martian7r@reddit (OP)
Actually, you can try the Ultravox model; it eliminates the separate STT stage and instead fuses STT+LLM (basically converting the audio into the high-dimensional vectors the LLM can understand directly). You can then run a TTS model on the output to get better overall inference speed, but the issue is that Ultravox models are large and require a lot of computational power, i.e. GPUs.
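From memory of the Ultravox model cards on Hugging Face, usage through transformers looks roughly like this (the model id and the exact dict keys are assumptions, so double-check the card):

```python
# Rough sketch of Ultravox usage via transformers (treat model id and input
# keys as assumptions; check the model card on Hugging Face).
import transformers
import librosa

pipe = transformers.pipeline(model="fixie-ai/ultravox-v0_4", trust_remote_code=True)

audio, sr = librosa.load("question.wav", sr=16000)
turns = [
    {"role": "system", "content": "You are a friendly and helpful assistant."},
]

# The audio is consumed directly; no intermediate transcript is produced.
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```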
CommunityTough1@reddit
I mean, I was going back and forth between DeepSeek V3.1 and Gemini 2.5 Pro for the LLMs. And since it's a web app, everything is through inference provider APIs. Now, one way you COULD speed up larger/slower TTS models is splitting your LLM output by punctuation and then doing asynchronous API calls to the TTS. That's what I was going to experiment with, but my back-end is in PHP, so I'd need to write a chunk of code in Node or Golang or something that's better with async to do that.
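A minimal sketch of that punctuation-split-plus-async-TTS idea in Python (the TTS endpoint and request shape are hypothetical placeholders):

```python
import asyncio
import re
import aiohttp

TTS_URL = "https://example.com/v1/tts"   # hypothetical endpoint, for illustration only

def split_by_punctuation(text: str) -> list[str]:
    """Split LLM output into sentence-sized chunks on ., !, ? boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

async def synthesize(session: aiohttp.ClientSession, chunk: str, index: int) -> tuple[int, bytes]:
    async with session.post(TTS_URL, json={"text": chunk}) as resp:
        return index, await resp.read()

async def tts_pipeline(llm_output: str) -> list[bytes]:
    chunks = split_by_punctuation(llm_output)
    async with aiohttp.ClientSession() as session:
        # Fire all TTS requests concurrently, then restore the original order.
        results = await asyncio.gather(
            *(synthesize(session, c, i) for i, c in enumerate(chunks))
        )
    return [audio for _, audio in sorted(results)]

# audio_segments = asyncio.run(tts_pipeline("First sentence. Second one! A third?"))
```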
martian7r@reddit (OP)
Sure, will look into that; the only problem would be the tradeoff between accuracy and resources. Anyhow, the output is from the LLM, so we can tweak the prompting to get the emotion tokens and use them with the Orpheus model.
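One way to do that tweaking is just through the system prompt; a sketch using the Ollama Python client (the tag list and prompt wording are illustrative, not from the repo):

```python
# Sketch: steer the LLM (here via the Ollama Python client) to emit
# Orpheus-style emotion tags inline, so the TTS stage can render them.
import ollama

SYSTEM = (
    "You are a voice assistant. Keep replies short and conversational. "
    "Where natural, include the emotion tags <laugh>, <chuckle> or <sigh> "
    "inline; they will be rendered as non-verbal sounds by the TTS engine."
)

reply = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "How did the demo go?"},
    ],
)
print(reply["message"]["content"])  # e.g. "Honestly, better than expected <laugh>."
```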
DeltaSqueezer@reddit
Would be great if you included an audio demo so we could hear latency etc. without having to run the whole thing.
martian7r@reddit (OP)
Sure, will add a demo video and an .exe setup file for easier use
no_witty_username@reddit
Nice, I am looking for a decently fast STT-then-TTS implementation for my llama.cpp personal agent. Would love to see a demo of the quality and speed. I hope I can get this to work at real-time or close-to-real-time speeds on my machine with a 14B LLM as the inference engine. I've got an RTX 4090 I'm hoping to fit this all into at real-time speeds.
JustinPooDough@reddit
I actually did a similar thing but with wake-words as well. Will upload very soon along with a different project.
I still think this approach is very feasible for most use cases and can run with acceptably low latency as well.
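For the wake-word piece, one common approach (not necessarily what the commenter used; the access key and keyword below are placeholders) is Picovoice Porcupine:

```python
# Minimal wake-word loop with Picovoice Porcupine + PvRecorder.
# One common approach, not necessarily the commenter's implementation;
# the access key and keyword are placeholders.
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(access_key="YOUR_PICOVOICE_KEY", keywords=["jarvis"])
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        pcm = recorder.read()                    # one frame of 16-bit audio samples
        if porcupine.process(pcm) >= 0:          # returns keyword index, -1 if none
            print("Wake word detected - start capturing the user's request")
            # hand off to the STT -> LLM -> TTS pipeline here
finally:
    recorder.delete()
    porcupine.delete()
```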
__JockY__@reddit
Please post this, I'm starting to look at options for building this myself! I want an offline, non-Amazon, Alexa-like thing.
frankh07@reddit
Great job, how many GB does llama3.1 need and how many tokens per second does it generate?
martian7r@reddit (OP)
Depends on where you are running it; on an A100 machine it is around 2k tokens per second, pretty fast. It uses 17 GB of VRAM for the 8B model.
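That VRAM figure matches rough back-of-the-envelope math: an 8B-parameter model in FP16 needs about 8B × 2 bytes ≈ 16 GB just for the weights, and the KV cache plus runtime overhead account for the remaining gigabyte or so.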
frankh07@reddit
Damn, that's really fast. I tried it a while back with Nvidia NIM on an A100, it ran at 100 t/s.
martian7r@reddit (OP)
It is using TensorRT optimization; with just Ollama you cannot achieve such results.
Trysem@reddit
Is this speech to speech? Or ASR + TTS?
martian7r@reddit (OP)
It's ASR + TTS
You_Wen_AzzHu@reddit
Step 1 ✔️ resolve dependency conflicts
Step 2 ✔️ resolve bin pickling issue
Step 3 found transcribing too slow, switching to CUDA (sketch below)
Step 4 resolve the gap after each reply
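On step 3: if the transcription side is a Whisper model, moving it to GPU is usually a one-line change. A sketch with faster-whisper (assuming that wrapper; the repo may load Whisper differently):

```python
# Sketch: GPU transcription with faster-whisper (assumed wrapper; the original
# repo may use a different Whisper implementation).
from faster_whisper import WhisperModel

# device="cuda" with float16 is typically several times faster than CPU int8.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("utterance.wav", beam_size=1, language="en")
text = " ".join(segment.text.strip() for segment in segments)
print(info.language, info.language_probability)
print(text)
```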
martian7r@reddit (OP)
Resolved the dependency conflicts; they're due to the kokoro-onnx requirements, hence it should be installed separately without deps.
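In practice (assuming the kokoro-onnx PyPI package is what's meant), that boils down to installing it after the rest of the requirements with `pip install kokoro-onnx --no-deps`, so its pinned versions don't clash with the other packages.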
The gap after each reply is yet to be figured out.
You_Wen_AzzHu@reddit
I use a playback_finished event and a tts_lock. So basically playback must finish before speaking can start again. Not a big deal. It now works flawlessly as a standalone agent in less than 12 GB.
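A minimal sketch of that pattern (the names playback_finished and tts_lock come from the comment above; the playback backend and TTS interface are illustrative):

```python
# Sketch of the playback_finished / tts_lock pattern described above.
# sounddevice is used for playback purely as an example; tts_engine is a
# hypothetical object with a synthesize(text) -> (samples, sample_rate) method.
import threading
import sounddevice as sd

playback_finished = threading.Event()
playback_finished.set()          # nothing is playing at startup
tts_lock = threading.Lock()

def play_audio(samples, sample_rate):
    try:
        sd.play(samples, sample_rate)
        sd.wait()                # block this worker thread until audio is done
    finally:
        playback_finished.set()  # signal that the next utterance may start

def speak(tts_engine, text):
    with tts_lock:               # one utterance at a time
        playback_finished.wait() # don't talk over the previous reply
        samples, sr = tts_engine.synthesize(text)
        playback_finished.clear()
        threading.Thread(target=play_audio, args=(samples, sr), daemon=True).start()
```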
StoryHack@reddit
Looks cool. Things I would love to see this get:
* A separate settings file to set what you called "key settings" in the readme.
* Another setting to replace the default instructions in the agent.
* an easy docker install. Settings file could be mounted.
Does Ollama just take care of the context size, or is that something that could be in the settings?
Is there anything magic about Llama 3.1 8B, or could we pull any Ollama model (so long as we set it in agent_client.py)? Maybe have that as a setting, too?
martian7r@reddit (OP)
The LLM prompt template can be made a separate file and loaded at run time.
Will dockerize the code base; also exploring CUDA-supported Docker images for faster transcription and TTS.
Yes, Ollama has built-in settings, and the latest Llama model can also be used. I'm running on my Mac, hence the lightweight model choice, but yes, the model configuration can be changed as well.
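On the context-size question: Ollama does expose it as a per-request option (num_ctx), so it could live in a settings file alongside the model name; a sketch with the Ollama Python client (the settings file layout and values are illustrative):

```python
# Sketch: reading model name and context size from a settings file and passing
# them to Ollama. num_ctx is a standard Ollama option; the file layout is made up.
import json
import ollama

with open("settings.json") as f:
    settings = json.load(f)          # e.g. {"model": "llama3.1:8b", "num_ctx": 8192}

response = ollama.chat(
    model=settings.get("model", "llama3.1:8b"),
    messages=[{"role": "user", "content": "Hello!"}],
    options={"num_ctx": settings.get("num_ctx", 8192)},
)
print(response["message"]["content"])
```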
You_Wen_AzzHu@reddit
Two issues: the transcription speed is much slower than expected, and after the reply there is a big gap (increasing with reply length) before I can speak again.
You_Wen_AzzHu@reddit
As always, this dependency thingy is killing me. Which Python version are you running this on?