Live demo of LocalVQE: Tiny ~1M param audio model that cancels echo and noise in realtime

Posted by richiejp@reddit | LocalLLaMA | 10 comments

[-]

basil232@reddit

Nice to see new models in this space. How much delay or time shift can the model tolerate when canceling echo? I would love to try this to cancel music or sound from, for example, a TV in the mic recordings of my smart speaker. However, the recordings would not be perfectly aligned. Can you say something about the "maximum echo distance"?

[-]

BobDerFlossmeister@reddit

How does this fare against DeepFilterNet?
Also the git repo linked in the model card doesn't seem to be accessible: https://github.com/LocalAI-io/LocalVQE

[-]

richiejp@reddit (OP)

Ah that link is some slop, the repo is https://github.com/localai-org/LocalVQE

DeepFilterNet doesn't do echo suppression as far as I can tell. I don't know how the quality compares for noise suppression or how large DeepFilterNet is.

[-]

Silver-Champion-4846@reddit

How was this trained? IINNNTERESTINGGGGG

[-]

richiejp@reddit (OP)

On my 16 GB NVIDIA RTX 5700 Ti, with a lot of PyTorch profiling.

[-]

Silver-Champion-4846@reddit

How would you rate it against the industry standard (Krisp, which is what Discord uses)?

[-]

richiejp@reddit (OP)

I'm not really sure if I have used Krisp. However, I suspect that the cloud models are either generative or have a generative layer which rebuilds the speaker's voice without interference from background noise. LocalVQE just uses a mask, which is a lot faster than having a diffusion or transformer layer, but with a mask (and not a very high-resolution one) the noise bleeds into the voice.
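
To make the mask point concrete, here is a toy sketch of mask-based suppression in general, not LocalVQE's actual code. The spectra are made-up numbers, the function names are hypothetical, and magnitudes are treated as simply additive (which ignores phase). It shows roughly why a coarse mask lets noise through: when several frequency bins share one gain, that gain cannot fully keep the voice and fully reject the noise at the same time.

```python
# Illustrative only: toy magnitude spectra for one audio frame,
# not LocalVQE's real pipeline or data.

def ideal_ratio_mask(speech, noise):
    """Per-bin gain in [0, 1]: the share of each bin's magnitude that is speech."""
    return [s / (s + n) if (s + n) > 0 else 0.0 for s, n in zip(speech, noise)]

def coarsen(mask, band_size):
    """Lower the mask resolution: every bin in a band gets the band's mean gain."""
    out = []
    for start in range(0, len(mask), band_size):
        band = mask[start:start + band_size]
        mean_gain = sum(band) / len(band)
        out.extend([mean_gain] * len(band))
    return out

def apply_mask(mixture, mask):
    """Scale each frequency bin of the mixture by its gain."""
    return [m * g for m, g in zip(mixture, mask)]

def residual_error(estimate, speech):
    """Total absolute difference between the estimate and the clean speech."""
    return sum(abs(e - s) for e, s in zip(estimate, speech))

# Hypothetical frame: speech occupies two bins, noise occupies the rest.
speech = [0.0, 4.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
noise = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
mixture = [s + n for s, n in zip(speech, noise)]

mask = ideal_ratio_mask(speech, noise)
fine = apply_mask(mixture, mask)                          # one gain per bin
coarse = apply_mask(mixture, coarsen(mask, band_size=4))  # one gain per 4 bins

print("fine error:  ", residual_error(fine, speech))
print("coarse error:", residual_error(coarse, speech))
```

With one gain per bin, the toy mask recovers the speech exactly. With one gain per four bins, the noise-only bins keep some energy and the speech bins get attenuated. A real model has to predict the mask from the noisy input rather than compute it from known clean speech, so this only illustrates the resolution effect.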
