Real-time conversational AI running 100% locally in-browser on WebGPU
Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 133 comments
Weary-Wing-6806@reddit
Sick, can’t believe it’s that smooth running fully in-browser. How are you handling audio streaming and context locally? Chunked or token-wise? Been working on real-time agents lately and curious how you’re keeping latency that low.
SquashFront1303@reddit
Can somebody please wrap it in an exe?
GreenTreeAndBlueSky@reddit
The latency is amazing. What model/setup is this?
xenovatech@reddit (OP)
Thanks! I'm using a bunch of models: Silero VAD for voice activity detection, Whisper for speech recognition, SmolLM2-1.7B for text generation, and Kokoro for text-to-speech. The models are run in a cascaded but interleaved manner (e.g., sending chunks of LLM output to Kokoro for speech synthesis at sentence breaks).
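To make the sentence-break interleaving concrete, here's a minimal sketch in plain JavaScript; the `speak` callback stands in for a Kokoro synthesis call, and all names are illustrative, not the demo's actual code:

```js
// Buffer streamed LLM text and flush complete sentences to TTS as they appear.
// `speak` stands in for a Kokoro synthesis call; names are illustrative.
function makeSentenceChunker(speak) {
  let buffer = "";
  return (textChunk) => {
    buffer += textChunk;
    let match;
    // Flush every complete sentence (ending in . ! or ?) from the buffer.
    while ((match = buffer.match(/^(.*?[.!?])(\s+|$)/s))) {
      speak(match[1].trim());
      buffer = buffer.slice(match[0].length);
    }
  };
}

// Usage: feed each streamed chunk from the LLM into the chunker.
const onToken = makeSentenceChunker((s) => console.log("to TTS:", s));
["Hello", " there", ".", " How are", " you?"].forEach(onToken);
// -> to TTS: Hello there.
// -> to TTS: How are you?
```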
Mediocre_Leg_754@reddit
Which library are you using for Silero VAD?
estebansaa@reddit
nice!
Useful_Artichoke_292@reddit
Is there any small multimodal model that can take audio as input and give audio as output?
lordpuddingcup@reddit
think you could squeeze in a turn-detection model for longer conversations?
xenovatech@reddit (OP)
I don’t see why not! 👀 But even in its current state, you should be able to have pretty long conversations: SmolLM2-1.7B has a context length of 8192 tokens.
lordpuddingcup@reddit
Turn detection is more for handling when you're saying something and have to think mid-sentence, or are in an "umm" moment, so the model knows not to start looking for a response yet. VAD detects the speech; turn detection says "OK, it's actually your turn, I'm not just distracted thinking of how to phrase the rest."
sartres_@reddit
Seems to be a hard problem, I'm always surprised at how bad Gemini is at it
rockets756@reddit
Yeah, speech detection with Gemini is awful. But when I use the speech detection with Google's gboard, it's just fine lol. Fixes everything in real time. I don't know what they are struggling with.
lordpuddingcup@reddit
There are good models for it, but it's additional compute and sort of a niche issue, and to my knowledge none of the multimodal models include turn detection.
deadcoder0904@reddit
I doubt it's a niche issue.
It's the first thing every human notices, because all humans love to talk over others unless they train themselves not to.
lenankamp@reddit
https://huggingface.co/livekit/turn-detector
https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector
It's an ONNX model, but limited to English, since turn detection is language-dependent. I'd love to see it as an alternative to VAD in a clear presentation like you've done before.
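For anyone curious, running an ONNX classifier like that in the browser is straightforward with onnxruntime-web. A rough sketch below; the tensor names and preprocessing are assumptions, not livekit's actual API (see the plugin source for the real pipeline):

```js
// Rough sketch: run an end-of-turn classifier ONNX model in the browser.
// Input/output names and preprocessing are assumptions, not livekit's actual API.
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("turn_detector.onnx");

async function endOfTurnProbability(tokenIds) {
  // Pack token ids (from the model's tokenizer) into an int64 tensor.
  const input = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(BigInt)),
    [1, tokenIds.length]
  );
  const results = await session.run({ input_ids: input });
  // Assume a single scalar output: probability that the speaker's turn ended.
  return results[session.outputNames[0]].data[0];
}
```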
natandestroyer@reddit
What library are you using for smolLM inference? Web-llm?
xenovatech@reddit (OP)
I'm using Transformers.js for inference 🤗
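For anyone unfamiliar, text generation with Transformers.js on WebGPU looks roughly like this; a minimal sketch, assuming the onnx-community SmolLM2 checkpoint and a q4f16 dtype (the demo's exact settings may differ):

```js
// Minimal Transformers.js text-generation sketch on WebGPU.
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/SmolLM2-1.7B-Instruct",
  { device: "webgpu", dtype: "q4f16" } // quantized weights to fit browser memory
);

const messages = [
  { role: "system", content: "You are a helpful voice assistant." },
  { role: "user", content: "Tell me a joke about atoms." },
];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```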
GamerWael@reddit
Also, I was wondering: why did you release kokoro-js as a standalone library instead of implementing it within transformers.js itself? Is the core of Kokoro too dissimilar from a typical text-to-speech transformer architecture?
xenovatech@reddit (OP)
Mainly because Kokoro requires additional preprocessing (phonemization), which would bloat the transformers.js package unnecessarily.
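For reference, standalone usage looks roughly like this; a sketch following the kokoro-js README, where the model id, dtype, and voice are illustrative:

```js
// Sketch of kokoro-js as a standalone package (phonemizer ships inside it).
import { KokoroTTS } from "kokoro-js";

const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-v1.0-ONNX",
  { dtype: "q8" } // quantized for a smaller download
);
const audio = await tts.generate("The atom joke again? Classic.", {
  voice: "af_heart",
});
audio.save("output.wav"); // or play it via the Web Audio API in-browser
```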
GamerWael@reddit
Oh it's you Xenova! I just realised who posted this. This is amazing. I've been trying to build something similar and was gonna follow a very similar approach.
natandestroyer@reddit
Oh lmao, he's literally the dude that made transformers.js
natandestroyer@reddit
Thanks, I tried web-llm and it was ass. Hopefully this one performs better
GreenTreeAndBlueSky@reddit
Incredible. Source code?
xenovatech@reddit (OP)
Yep! Available on GitHub or HF.
worldsayshi@reddit
This is impressive to the point that I can't believe it.
Do you have/know of an example that does tool calls?
GreenTreeAndBlueSky@reddit
Thank you very much! Great job!
BusRevolutionary9893@reddit
Please.
worldsayshi@reddit
They posted it.
Niwa-kun@reddit
all on a single laptop?! HUH?
phormix@reddit
Gonna have to try integrating some of those with Home Assistant (other than Whisper which is already a thing)
lenankamp@reddit
Thanks, your spaces have really been a great starting point for understanding the pipelines. Looking at the source, I saw a previous mention of Moonshine and was curious about the reasoning behind the choice between Moonshine and Whisper for ONNX. Mind enlightening me? I recently wanted Moonshine for the accuracy but fell back to Whisper in a local environment due to hardware limitations.
ExplanationEqual2539@reddit
Since when does KokoroTTS have a Santa voice?
Key-Ad-1741@reddit
Was wondering if you tried Chatterbox, a recent TTS release: https://github.com/resemble-ai/chatterbox. I haven't gotten around to testing it, but the demos seem promising.
Also, what is your hardware?
xenovatech@reddit (OP)
Chatterbox is definitely on the list of models to add support for! The demo in the video is running on an M4 Max.
die-microcrap-die@reddit
How much memory on that Mac?
bornfree4ever@reddit
The demo works pretty okay on an M1 from 2020. The model is very dumb, but the STT and TTS are fast enough.
Mediocre_Leg_754@reddit
Is the Silero VAD reliable for running in the browser?
cogeng@reddit
I managed to get it to run on linux with chromium after setting the #enable-vulkan and #enable-unsafe-webgpu flags but the result is that the AI just moans at me.
No I'm not kidding. Yes it's very funny and slightly disturbing.
had12e1r@reddit
This is so cool
Medium_Win_8930@reddit
Great tool, thanks a lot. Just a quick tip for people: you might need to disable the KV cache, otherwise the context of previous conversations will not be stored/remembered properly. This enables true multi-turn conversation. It seems to be a bug; I'm not sure if it's due to the browser or version I'm using, but I'm surprised xenovatech didn't mention this issue.
Aldisued@reddit
This is strange... On my MacBook M3, it is stuck loading both on the Hugging Face demo site and when I run it locally. I waited several minutes on both.
Any ideas why? I tried Safari and Chrome as browsers...
squatsdownunder@reddit
It worked perfectly with Brave on my M3 MBP with 36GB of RAM. Could this be a memory issue?
xenovatech@reddit (OP)
For those interested, here's how it works:
- A cascade of models, run in an interleaved manner, to enable low-latency, real-time speech-to-speech generation.
- Models: Silero VAD for voice activity detection, Whisper for speech recognition, SmolLM2-1.7B for text generation, and Kokoro for text-to-speech
- WebGPU: powered by Transformers.js and ONNX Runtime Web
Link to source code and online demo: https://huggingface.co/spaces/webml-community/conversational-webgpu
CheetahHot10@reddit
this is awesome! thanks for sharing
cdshift@reddit
I get an unsupported device error on your space. For your GitHub, are you working on an install readme for us noobs to this?
dickofthebuttt@reddit
Try Chrome; it didn't like Firefox for me. It takes a hot minute to load the models, so be patient.
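You can also feature-detect WebGPU up front with the standard browser API; if this check fails, the unsupported-device error is expected:

```js
// Quick WebGPU availability check before loading any models.
if (!("gpu" in navigator)) {
  console.error("WebGPU is not supported in this browser.");
} else {
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? "WebGPU adapter found" : "No suitable GPU adapter");
}
```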
cdshift@reddit
Thanks, u/dickofthebuttt
CheetahHot10@reddit
thank you dick, great name too
monerobull@reddit
Edge browser worked for me when firefox gave that error.
paranoidray@reddit
Ah, well done Xenova, beat me to it :-)
But if anyone else would like an (alpha) version that uses Moonshine, lets you use a local LLM server, and lets you set a prompt, here is my attempt:
https://github.com/rhulha/Speech2SpeechVAD
winkler1@reddit
Tried the demo/webpage. Super unclear what's happening or what you're supposed to do. I can do a private YouTube video if you want to see user reaction.
paranoidray@reddit
Na, I know it's bad. Didn't have time to polish it yet. Thank you for the feedback though. Gives me energy to finish it.
LyAkolon@reddit
I recommend taking a look at the recent OpenAI dev day videos. They discuss how they got the interruption mechanism working, and how the model knows where you interrupted it, since it doesn't work the way we do. It's really neat, and I'd be down to see how you could fit that within this pipeline.
Numerous-Aerie-5265@reddit
Amazing! We need a server version to run locally. How hard would it be to modify?
hanspit@reddit
Dude, this is awesome. This is exactly what I wanted to make; now I have to figure out how to do it on a locally hosted machine with Docker. Lol
Numerous-Aerie-5265@reddit
Let us know if you make any headway!
gamblingapocalypse@reddit
Excellent!!!
kunkkatechies@reddit
does it use JS speech-to-text and text-to-speech models ?
xenovatech@reddit (OP)
Yes! All models run with WebGPU acceleration: Whisper for speech-to-text and Kokoro for text-to-speech.
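A minimal sketch of the speech-to-text side, assuming a Whisper checkpoint like Xenova/whisper-base (the demo may use a different size):

```js
// Speech-to-text with Transformers.js on WebGPU.
import { pipeline } from "@huggingface/transformers";

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-base",
  { device: "webgpu" }
);

// Input is 16 kHz mono Float32Array PCM, e.g. captured via the Web Audio API.
// One second of silence here as a placeholder.
const audio = new Float32Array(16000);
const { text } = await transcriber(audio);
console.log(text);
```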
everythingisunknown@reddit
Sorry, I'm a noob. How do I actually open it after cloning the git repo?
solinar@reddit
You know, I had no idea (and probably still mostly don't), but I got it running with support from https://chatgpt.com/ using the o3 model and just asking each step what to do next.
kunkkatechies@reddit
Awesome! How about RAM usage?
Useful_Artichoke_292@reddit
Latency is so low, amazing demo.
fwz@reddit
are there any similar-quality models for other languages, e.g. Arabic?
kkb294@reddit
Nice, can we achieve this on mobile? If yes, that would be amazing 🤩
Trisyphos@reddit
Why a website instead of a normal program?
FistBus2786@reddit
The web is the new normal. Nobody is going to download some random guy's program and run it on their machine.
Trisyphos@reddit
Then how do you run it locally?
FistBus2786@reddit
You're right, it's better if you can download it and run it locally and offline.
This web version is technically "local", because the language model is running in the browser, on your local machine instead of someone else's server.
If the app can be saved as a PWA (progressive web app), it can run offline too.
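A minimal service worker would cover the app shell; note that Transformers.js already caches downloaded model weights in the browser's Cache Storage, so the heavy files persist across sessions. A sketch, with an illustrative file list:

```js
// sw.js — cache the app shell so the page itself loads offline.
const CACHE = "app-shell-v1";
self.addEventListener("install", (e) => {
  e.waitUntil(caches.open(CACHE).then((c) => c.addAll(["/", "/index.html"])));
});
self.addEventListener("fetch", (e) => {
  // Serve from cache first, fall back to the network.
  e.respondWith(caches.match(e.request).then((hit) => hit || fetch(e.request)));
});
```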
do-un-to@reddit
... little buddy.
skredditt@reddit
Do you mean to tell me there are models I can embed in my front end to do stuff?
HateDread@reddit
I'd love to run this locally with a different model (not SmolLM2-1.7B) underneath! Very impressive.
xenovatech@reddit (OP)
You can modify the model ID [here](https://huggingface.co/spaces/webml-community/conversational-webgpu/blob/main/src/worker.js#L80) -- just make sure that the model you choose is compatible with Transformers.js!
The Nicole voice has been around for a while :) Check out the VOICES.md for more information
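For example, a hypothetical swap (the variable name and surrounding code in worker.js may differ):

```js
// Point the text-generation pipeline at any Transformers.js-compatible model.
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct", // was: SmolLM2-1.7B-Instruct
  { device: "webgpu", dtype: "q4f16" }
);
```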
ulyssesdot@reddit
How did you get past the no-async webgpu buffer read issue?
paranoidray@reddit
I think workers
CallMeBigPoppa95@reddit
w00t!
mr_happy_nice@reddit
RX 6600, on win10, chrome
Upstairs_Lettuce_746@reddit
Nice
Benna100@reddit
Super cool. Could this work with screensharing?
Ni_Guh_69@reddit
The output audio is muffled for me and full of static noise. Anyone else?
smallfried@reddit
Nice nice! What's that hardware that you're running on?
jmellin@reddit
Impressive! You’re cooking!!
I, like the rest of the degenerates, would love to see this open source so that we could make our own Jarvis!
xenovatech@reddit (OP)
It is open source! 😁 both on GitHub and HF
05032-MendicantBias@reddit
Great, I'm building something like this. I think I'll port it to python and package it.
HugoDzz@reddit
Awesome work as always !!
vamsammy@reddit
Trying to run this locally on my M1 Mac. I first issued "npm i" and then "npm run dev". Is this right? Where do the models get downloaded?
TutorialDoctor@reddit
Great job. I never thought about sending Kokoro audio in chunks. You should turn this into a Tauri desktop app and improve the UI. I'd buy it for a one-time purchase.
https://v2.tauri.app/
Tomr750@reddit
have you got experience with speaker diarisation?
FlyingJoeBiden@reddit
Wild, is this open source?
xenovatech@reddit (OP)
I'm glad you like it! 🤗 And yes, it is open source!
- GitHub: https://github.com/huggingface/transformers.js-examples/tree/main/conversational-webgpu
- HF: https://huggingface.co/spaces/webml-community/conversational-webgpu/tree/main
hummingbird1346@reddit
Holy shit! I've been looking for something like this forever. If I change the Hugging Face (SmolLM2-1.7B) model address in the files App.jsx and worker.js, I would technically be able to run it with a different model, right? Hopefully going for a Gemma or Qwen model when I'm fine with a little more latency. But damn, it already looks so well done. This is exactly what I was looking for.
sartres_@reddit
Why did you use SmolLM2 over newer <2B models?
c_punter@reddit
Have you tried cloning/training your own voice models to use in it?
sourceholder@reddit
Also, is there a Jarvis option?
Kholtien@reddit
Will this work with AMD GPUs? I have a slightly too old AMD GPU (RX 7800XT) and I can't get any STT or TTS working at all.
onebaldegg@reddit
Hmm, I'm getting this error. Maybe my laptop can't run this?
OceanRadioGuy@reddit
If you make a Docker for this I will personally bake you a cake
cromagnone@reddit
I will deliver it.
👀 but really, it might get there.
IntrepidAbroad@reddit
If I make a Docker for this, will you bake me a cake as fast as you can?
JohnnyLovesData@reddit
For you and your baby
IntrepidAbroad@reddit
You do love data!
mattjb@reddit
The cake is a lie.
Thatisverytrue54321@reddit
IntrepidAbroad@reddit
Wait, what? That was nearly 18 years ago?!?
trash-boat00@reddit
The second voice is gonna be used in a sinful way.
banafo@reddit
Can you give our ASR model a try? It's WASM, doesn't need a GPU, and you can skip Silero. https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm
entn-at@reddit
Nice use of k2/icefall and sherpa! I’ve been hoping for it to gain more popularity.
seattext@reddit
How big are the models? <100GB?
OfficialHashPanda@reddit
Just a couple GB. It uses SmolLM2-1.7B.
deepsky88@reddit
OMG so amazing! This is a revolution! How much for the project?
xenovatech@reddit (OP)
$0! It’s open-source on GitHub and HF
sharyphil@reddit
Cool, this is the future.
Thank you for showcasing this, OP.
CountRock@reddit
What's the hardware/GPU/memory?
BuildAQuad@reddit
What kind of GPU are you running this with?
No-Break-7922@reddit
Why don't you turn this into a real-time translator and put all language course providers out of business already?
DominusVenturae@reddit
Kokoro is only English. XTTSv2 is a little slower and requires a lot more VRAM, but it knows like 12 languages.
YearnMar10@reddit
Kokoro isn’t only English.
CaptTechno@reddit
open-source this please!
Far_Buyer_7281@reddit
Kokoro is nice, but maybe chatterbox would be a cool option to add.
IntrepidAbroad@reddit
Niiiiiice! That was/is fun to play with - unsure how I got into a conversation about music with it and learned about the famous song "I Heard it Through the Grapefruit" which had me in hysterics.
More seriously - started to look at options for on-device conversational AI options to interact with something I'm planning to build so this was an option posted at just the right time. Cheers.
florinandrei@reddit
The atom joke seems to be the standard boilerplate that a lot of models will serve.
DerTalSeppel@reddit
Neat! What's the spec of that Mac?
r4in311@reddit
We won't get the full source right? ;-)
dickofthebuttt@reddit
Damn that page takes a hot minute to load
White_Dragoon@reddit
It would be even cooler if it could do video chat conversations; that would be perfect for mock interview practice, since it would be able to see body language and give feedback.
Clout_God6969@reddit
Why are you getting downvoted?
Snipedzoi@reddit
💔
No-Search9350@reddit
Now we are talking.
osamako@reddit
Teach me master...!!!
Conscious-Trifle9460@reddit
You cooked dude! 👏
nderstand2grow@reddit
Yeah, no. No end user likes having to spend minutes downloading a model the first time they use a website. And this already existed thanks to MLC LLM.
Inflation_Artistic@reddit
link?