Python package for working with LLMs over voice
Posted by Ok_Train_9768@reddit | Python | View on Reddit | 10 comments
Hi All,
I've set up a Python package that makes it easy to interact with LLMs over voice.
You can set it up locally and start interacting with LLMs via microphone and speaker.
What My Project Does
The idea is to abstract away the speech-to-text and text-to-speech parts, so you can focus on just the LLM/Agent/RAG application logic.
Currently it uses AssemblyAI for speech-to-text and ElevenLabs for text-to-speech, though that would be easy enough to make configurable in the future.
Setting up the agent locally looks like this:
from os import getenv
from voiceagent import VoiceAgent  # import path assumed from the repo name

voice_agent = VoiceAgent(
    assemblyai_api_key=getenv('ASSEMBLYAI_API_KEY'),
    elevenlabs_api_key=getenv('ELEVENLABS_API_KEY')
)

def on_message_callback(message):
    print(f"Your message from the microphone: {message}", end="\r\n")
    # add any application code you want here to handle the user request,
    # e.g. send the message to the OpenAI Chat API
    return "{response from the LLM}"

voice_agent.on_message(on_message_callback)
voice_agent.start()
So you can use any logic you like in the on_message_callback handler, i.e. you're not tied down to any specific LLM model or implementation.
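For example, here's a minimal sketch of a callback wired to a stand-in backend (`generate_reply` is hypothetical; in practice it would call whatever LLM/agent/RAG logic you use, and the returned string is presumably what gets spoken back):

```python
def generate_reply(message: str) -> str:
    # Hypothetical stand-in for your LLM call, e.g. an OpenAI-compatible
    # chat completion request; stubbed here so the sketch is self-contained.
    return f"Echo: {message}"

def on_message_callback(message: str) -> str:
    # The handler just needs to map a transcript string to a reply string.
    print(f"Your message from the microphone: {message}")
    return generate_reply(message)
```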
I just kickstarted this off as a fun project after working a bit with Vapi
It has a few issues, and latency could definitely be better. It could also be good to look at some integrations/setups using frontends/browsers.
Would be happy to put some more time into it if there is some interest from the community
The package is open source and available on GitHub and PyPI. More info and installation details here:
https://github.com/smaameri/voiceagent
Target Audience
Developers working on LLM/AI applications who want to integrate voice capabilities. The project is currently in the development phase, not production ready.
Comparison
Vapi has a similar solution, though this one is open source.
_rundown_@reddit
What’s your roadmap look like?
The speaker / microphone loop issue should be a fairly simple solve to start — if the LLM is speaking, stop listening and/or stop transcribing and/or stop sending transcriptions to the LLM.
If this also took care of interrupting the LLM, I’d be implementing it tomorrow.
Would love to see a config option that allows for an api key, base url, model name, etc so that we could use our own backends (in a simple way).
Keyword listen option…
Ok_Train_9768@reddit (OP)
Hi. Thanks for getting back, and for your suggestions on the issues also!
Next item would probably be to try to get some setup working via a browser, even if that means working on some corresponding JS library
As for the roadmap, it would be great if that could be guided by people actually using the setup and making feature requests.
I actually tried that, stopping the microphone stream while the LLM is speaking. It created a short but weird sound each time the stream was turned back on. Probably some audio configuration thing I need to tweak.
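For what it's worth, one way to sidestep that pop entirely (a sketch, not the package's actual API) is to leave the microphone stream open and just gate the transcripts in software while the agent is speaking:

```python
import threading

class HalfDuplexGate:
    """Drop transcripts while the agent is speaking, without touching the
    audio stream itself (closing/reopening a stream is what tends to
    produce the click/pop on restart)."""

    def __init__(self):
        self._speaking = threading.Event()

    def start_speaking(self):
        self._speaking.set()

    def stop_speaking(self):
        self._speaking.clear()

    def accept(self, transcript: str) -> bool:
        # Forward transcripts to the LLM only while the agent is silent
        return not self._speaking.is_set()
```

If the stream really does need restarting, a few milliseconds of ramped gain (a fade-in) on the first buffer after resume is the usual fix for the click.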
Interrupting the LLM is an interesting one for sure. Not quite sure yet how to get that working. It probably involves running the LLM speaker output on a separate process/thread, so that we can still listen for audio input and cancel the "speaking" process when needed. Probably an asyncio/await/threading type rabbit hole.
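A rough sketch of that thread-plus-cancel approach (names are hypothetical; `play_chunk` stands in for writing one audio buffer to the speaker):

```python
import threading
import time

def speak_interruptible(chunks, play_chunk, stop_event):
    """Play TTS audio chunks on a worker thread; bail out as soon as
    stop_event is set (e.g. when the mic detects the user barging in)."""
    def run():
        for chunk in chunks:
            if stop_event.is_set():
                return  # abandon the rest of the utterance
            play_chunk(chunk)
    t = threading.Thread(target=run, daemon=True)
    t.start()
    return t

# Usage sketch with a fake playback function:
played = []
stop = threading.Event()
t = speak_interruptible(
    range(100),
    lambda c: (played.append(c), time.sleep(0.01)),
    stop,
)
time.sleep(0.05)
stop.set()  # user started talking; cut playback short
t.join()
```

The key point is that the stop check happens between chunks, so playback stops within one buffer's worth of latency rather than at the end of the utterance.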
"If this also took care of interrupting the LLM, I’d be implementing it tomorrow." -> you mean you would look at implementing the package I shared?
"Would love to see a config option that allows for an api key, base url, model name, etc so that we could use our own backends (in a simple way)" -> could you explain a little more on this please? You mean using your own model for one of the speech-to-text or text-to-speech parts?
"Keyword listen option…" -> Like a "Siri, ....." type thing?
Thanks again for the feedback and comments! :)
_rundown_@reddit
Fair points.
Yes, I meant implementing your code. I already have a custom STT and TTS setup locally. I was going to start with voice messages and videos, but live speaking is on my roadmap (hence my interest in implementing your code and not having to develop it myself). Which dovetails into your next question…
The purpose of using STT and TTS locally is to keep everything private. So, when implementing new OSS in this space, I only do so when the code makes it easy for me to use my local setup. Thankfully, most things use the OpenAI "standard" for APIs, which makes things easier for me and OSS devs. Abstracting a "config" class in your code to handle different API interfaces for ElevenLabs and other TTS (like OpenAI), and AssemblyAI and other STT (like Whisper), would immediately make this more accessible to the community.
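Something like that config abstraction might look like this (a sketch; the field names and defaults are illustrative, not part of the actual package):

```python
from dataclasses import dataclass

@dataclass
class SpeechBackendConfig:
    """Hypothetical config for an OpenAI-compatible STT or TTS backend."""
    api_key: str
    base_url: str = "https://api.openai.com/v1"
    stt_model: str = "whisper-1"
    tts_model: str = "tts-1"

# Pointing at a self-hosted OpenAI-compatible server instead of a hosted API:
local = SpeechBackendConfig(
    api_key="not-needed-locally",
    base_url="http://localhost:8000/v1",
    stt_model="whisper-large-v3",
)
```

Since many local servers (LocalAI, vLLM, llama.cpp's server, etc.) expose the OpenAI API shape, a single base_url plus model-name pair covers a lot of backends.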
For keyword, yes exactly, like “Siri”, “hey Google”, etc. This would allow a user to specify a keyword to trigger the transcription vs it constantly transcribing.
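A transcript-level version of that gate could look like this (a sketch; production wake-word detection usually runs on the raw audio with a small dedicated model such as Porcupine or openWakeWord, rather than on transcribed text):

```python
from typing import Optional

def extract_command(transcript: str, keyword: str = "hey agent") -> Optional[str]:
    """Return the text after the wake word, or None if the wake word is
    absent (in which case nothing is forwarded to the LLM)."""
    t = transcript.strip()
    if t.lower().startswith(keyword):
        command = t[len(keyword):].strip(" ,.")
        return command or None
    return None
```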
Think it’s a good start. For my particular use case, I need more QoL features for it to be immediately implementable.
Ok_Train_9768@reddit (OP)
Btw, found this neat STT Python lib yesterday. Not sure if you've heard of it:
https://github.com/KoljaB/RealtimeSTT
The author also has one for TTS called RealtimeTTS, and one that brings them together in a package called AIVoiceChat, for AI voice interaction type stuff; all available in his repos.
Probably the neatest STT/TTS Python packages I've seen yet. Sharing in case it's useful.
_rundown_@reddit
Good find!
Ok_Train_9768@reddit (OP)
That is some great feedback. Appreciate that. Will let you know if I make some more progress on any of these. Good luck on your project also. Thanks again :)
Sweet_Computer_7116@reddit
Did you purposefully or accidentally push your dist to main?
Ok_Train_9768@reddit (OP)
It's actually the first time I've published a Python package. Yeah, did a bit of searching on that, and it seems best practice is generally not to commit that to source control, so I'll get it removed. Thanks for that!
Sweet_Computer_7116@reddit
No worries. I actually asked because I published my first package like yesterday. Exciting stuff!
Ok_Train_9768@reddit (OP)
Oh nice. Care to share it also?