making my own ai waifu app that can teach me any language.
Posted by aziib@reddit | LocalLLaMA | View on Reddit | 61 comments
using gemma-4-E4B-it for the llm
her voice uses omnivoice tts, with an api i made using fastapi
3d model made by me using vroid studio
right now it supports image uploads, web search, and voice and video calls like grok ani.
i'm surprised that the gemma 4 model can follow my prompt well without uncensoring the model.
Haroombe@reddit
Tbh i would not want to learn a language from an AI, it sounds so unnatural and uncanny. You are better off watching youtube videos or using anki
finevelyn@reddit
This may be true, assuming it has no effect on how much time you will spend learning, which is a big assumption to make.
Haroombe@reddit
I don't understand. Are you saying that this product (ai waifu) will get you to spend more time learning a language, and for that reason it will be helpful?
If that is your argument I would say just watch v-tubers in your language, or if someone is a goon maxxer watch AVs or adult stuff lol.
finevelyn@reddit
Pretty much. More time spent and more engaged. It doesn't matter if it's a waifu app or anything else that keeps you engaged, as long as you keep doing it.
Just watching videos you will most likely get bored out of your mind, and then stop learning altogether.
Haroombe@reddit
I would argue watching videos is more interesting than typing or talking to a chatbot. People already doomscroll or watch video essays for countless hours, that's the problem of our generation. Now imagine if you started watching the content you love in the language you are trying to learn. Sure you won't understand at the beginning, but if the content is in a genre or niche you already know and are interested in (say cars, for example), your interest can supersede the language barrier.
z_latent@reddit
I can confirm that works. A friend of mine learned Japanese incredibly fast (<1 year) and incredibly well by watching tons and tons of let's plays in Japanese. He said music lyrics also helped in his case.
But I'd say, if someone finds it interesting to chat with an AI, it could work fine for them too. As long as it has correct pronunciation and usage of the language.
Blastronomicon@reddit
You’re correct btw. The more time spent interacting with and using the language, the faster it is learned. The language learning process is all but understood EXCEPT for how the Defense Language Institute does it, because they claim local, micro-level societal fluency for graduates in record time - as in, graduates not only learn the whole language, they also learn a specific micro dialect constrained down to village-level fluency.
All we really know is that the program starts with zero at-home language usage. I believe cohorts are sectioned off, but some documentaries also say they’re mixed for out-of-class time and have to use the language they study.
KitsuMateAi@reddit
Is it a Unity application, or web based? Share your secrets
Just wondering, as Unity has support for ONNX models, which allows running them from the application.
Tomstachy@reddit
Is it unity app? Or web based?
ELPascalito@reddit
Albeit the VRM bones are jittery, this looks and sounds lovely! Are you running both the LLM and TTS on the same machine? I presume this requires a moderately strong setup, especially in memory capacity, no?
aziib@reddit (OP)
yes, still figuring out how to fix the jitter, probably because the animation is generated using hy-motion. yes, the llm + tts run on the same machine. omnivoice is quite light, it only takes 4gb if you load it without the whisper model and prepare the caption text beforehand. for the llm i'm using the q4_k_m quantized gemma 4 e4b it with a 64k context window, which still fits in my 16gb vram with flash-attention on.
ELPascalito@reddit
How lovely, but yeah the memory needed is no joke, especially since this style of app interaction requires fast speeds, particularly for text generation so the TTS can start asap (unless you're streaming, perhaps). anyhow, best of luck!
honglac3579@reddit
I did it using piper tts, it's quite fast even on cpu, which lets my pipeline achieve real-time conversation
ELPascalito@reddit
True, but Piper's quality was not enough to match my goal. it seems TTS models are either focused on full quality or max speed, with no medium quality-to-size option 😅
honglac3579@reddit
Indeed, i often have to tweak and tune the post-processed audio to achieve my expected result (still not quite what i expected), while a larger model can do it by just adding some text or audio for zero-shot voice cloning :v
aziib@reddit (OP)
yes, still figuring out the memory problem. right now i just keep a maximum of 30 messages in the log, so she forgets early messages and it doesn't fill up the context window.
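The trimming OP describes can be sketched as a simple sliding window over the chat log (function and variable names here are illustrative, not from the actual app):

```python
MAX_MESSAGES = 30  # keep only the most recent messages

def trim_history(history, max_messages=MAX_MESSAGES):
    """Drop the oldest messages so the prompt stays within the context window."""
    if len(history) <= max_messages:
        return history
    return history[-max_messages:]

# usage: the system prompt is normally kept separately and re-prepended each turn,
# so only the conversational turns get trimmed
history = [{"role": "user", "content": f"msg {i}"} for i in range(50)]
history = trim_history(history)
print(len(history))  # 30
```

A count-based window is the simplest approach; a token-based budget (summing per-message token counts until the limit is hit) would track the 64k context more precisely.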
SmartCustard9944@reddit
For animations you can typically blend between them, but I'm not sure which stack you are using.
ego100trique@reddit
Everyday we stay further away from god lmao
Glittering_News_1455@reddit
hey btw if you face issues with censorship there is this variant
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
I tested it and didn't face any censoring on pretty wild stuff
and also, does Omnivoice work in realtime for you? on my GTX 1050 it takes 20 seconds to generate a cloned voice
aziib@reddit (OP)
i'm using an rtx 4060 ti, and it's fast, near instant as you can see. i also call the tts api sentence by sentence instead of on the whole message to make it faster, and modified it with a custom api that sets the steps to 32.
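The sentence-by-sentence approach OP mentions can be sketched roughly like this, with a naive regex splitter and a stubbed `synthesize` callback standing in for the real FastAPI TTS route (all names here are hypothetical):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def speak_streaming(text, synthesize):
    """Hand each sentence to the TTS backend as soon as it is ready,
    instead of waiting for the whole LLM reply to be voiced at once."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)  # in the real app: POST to the TTS endpoint

reply = "Hello! I can help you practice. Shall we start with greetings?"
audio_chunks = list(speak_streaming(reply, synthesize=lambda s: f"<audio:{s}>"))
print(len(audio_chunks))  # 3
```

The win is latency: playback of the first sentence can start while later sentences are still being synthesized.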
shoraaa@reddit
i'm literally making something similar lol, with less focus on the frontend and more on an autonomous waifu (incorporating a proactive agentic mindset into a 2d waifu)
FerLuisxd@reddit
How do you handle searching the web? Brave api?
aziib@reddit (OP)
duckduckgo search
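OP only says "duckduckgo search"; one common way to wire that in from Python is the `ddgs` package (an assumption, not confirmed by OP). A minimal sketch of turning search hits into LLM context, with the network call stubbed out:

```python
# one common route is the `ddgs` package (pip install ddgs):
#   from ddgs import DDGS
#   results = DDGS().text("how to learn japanese", max_results=3)
# each result is a dict with roughly "title", "href"/"url", and "body" keys.

def results_to_context(results, max_chars=1500):
    """Flatten search hits into a plain-text block to prepend to the LLM prompt."""
    lines = [f"- {r['title']}: {r['body']}" for r in results]
    return "\n".join(lines)[:max_chars]

# stubbed results so the sketch runs without network access
fake = [
    {"title": "JLPT guide", "body": "Levels N5 to N1 explained."},
    {"title": "Hiragana chart", "body": "All 46 basic kana."},
]
print(results_to_context(fake))
```

Capping the context length matters here, since search snippets otherwise eat into the 30-message budget OP mentioned.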
misha1350@reddit
Man-made horrors
mpasila@reddit
E4B is probably not big enough for other languages outside of English (other than maybe Spanish and some other large languages), at least I didn't have much success. The bigger models perform much better.
Training-Event3388@reddit
This should be illegal
ransuko@reddit
Hmm. Is this... Tsundere... Mesugaki?
jikilan_@reddit
Oh come on! this is what you use an LLM for?
By the way, where is the GitHub url? 😁
aziib@reddit (OP)
it still has so many bugs. i've been thinking of making it open source once it's ready.
jikilan_@reddit
Any plan to give her access to the camera? I bet the interaction would be very interesting
RedParaglider@reddit
We're curing cancer, right?
_-_David@reddit
Cool use of a small model. Do you plan to use the audio input capability of the model? If this is 90% tsundere waifu, no biggie. But if you're seriously interested in using it to learn another language, I'd make some adjustments.
aziib@reddit (OP)
yes, i've been working on voice input. the only downside is that with the built-in browser speech to text, waiting for the ai's response takes much longer than just typing the message and hitting enter (i think it's a bug). i also tried setting up whisper, but that causes delay too because it runs on the cpu. still looking for a speech-to-text alternative that is fast.
honglac3579@reddit
I use the groq api for the whisper model, you can try that as an experiment
_-_David@reddit
nvidia parakeet is fast and accurate. I'd try that out. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
honglac3579@reddit
Can't wait to see it on github, my man of culture
honglac3579@reddit
Btw, you could add a speech-to-text model so you can communicate with your waifu voice to voice, that would be blissful
SkyNetLive@reddit
You can’t waifu without ai. It’s right there
martinerous@reddit
Good stuff.
Some time ago I had an idea to build something like that using Nvidia's Audio2Face but, as usually happens, did not have enough time. But at least I started something - finetuned my own FasterWhisper Turbo model for the Latvian language with a lower WER, finetuned VoxCPM to speak Latvian (and now they released VoxCPM2 and I need to train it again LOL), created my own UI frontend app for adventure roleplays... but no 3D avatars yet - I'm secretly hoping that somebody will create an out-of-the-box "drop in your photo reference, get a real-time TTS talking head" solution, but nothing like that is in sight yet.
fagenorn@reddit
Super nice way to learn a new language!
I recommend you check out Qwen3 TTS for the voice. I am working on finetuning a voice for my app and the quality is blowing my mind (vs kokoro, which I was using before).
It is a bit heavier: using gguf it takes around 2-3 gigs of vram and the RTF is around 0.2, but it's so worth it once you get it working. demo https://voca.ro/1gjKTnWxzwAP
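For readers unfamiliar with the metric: a real-time factor (RTF) of 0.2 means synthesis takes 0.2 seconds per second of output audio, so the rough latency estimate is just a multiplication:

```python
def tts_latency(audio_seconds, rtf=0.2):
    """Rough synthesis time: RTF * duration of the output audio.
    RTF < 1.0 means faster than real time."""
    return rtf * audio_seconds

# a 5-second sentence would take about 1 second to synthesize
print(tts_latency(5.0))  # 1.0
```

Combined with sentence-by-sentence streaming, an RTF of 0.2 keeps playback comfortably ahead of generation.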
aziib@reddit (OP)
i want to use it too, but it's too heavy on the vram. with omnivoice i can manage only 3gb of vram by loading the model without whisper and manually transcribing the reference voice with text. it also supports many languages, which is why i'm using it.
NoLeading4922@reddit
How does her motion work? Is it also generated by some ai model?
aziib@reddit (OP)
the motion is generated using hy-motion, but it's not live. i generate the motions first and save them as fbx, and they get played in the app on certain emotions.
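OP's emotion-triggered playback of pre-baked clips could look roughly like this; the mapping, file paths, and fallback behavior are all hypothetical, not from the actual app:

```python
# hypothetical mapping from detected emotion tags to pre-baked FBX clips
EMOTION_CLIPS = {
    "happy": "anims/happy_wave.fbx",
    "sad": "anims/sad_idle.fbx",
    "neutral": "anims/idle.fbx",
}

def pick_clip(emotion):
    """Fall back to the neutral idle clip for unknown emotions."""
    return EMOTION_CLIPS.get(emotion, EMOTION_CLIPS["neutral"])

print(pick_clip("happy"))     # anims/happy_wave.fbx
print(pick_clip("confused"))  # anims/idle.fbx
```

The emotion tag itself would typically come from the LLM (e.g. asking it to prefix replies with a mood label the backend strips before TTS).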
NoLeading4922@reddit
You can try dartcontrol, it's live
Conscious-Hair-5265@reddit
You gotta work on the jiggle physics
ProfessionalSpend589@reddit
Ok, now I see how LLMs can be dangerous.
semperaudesapere@reddit
What model are you using? This isn't natural-sounding english.
sunshinecheung@reddit
lol, looks good
InstaMatic80@reddit
Wow, that looks amazing! Would you please share more info about how to do it? I mean, you created a 3d model, but how do you integrate the animations and audio? What’s the backend? Are you planning to open source it?
PangurBanTheCat@reddit
I'm surprised more people haven't done this yet. I think Grok did something like it? But I haven't heard anything since.
tbh quite a large amount of people use AI for less wholesome purposes... just seems like a match made in heaven to add a visual waifu to one.
Complex_Tea_1244@reddit
I saw jitter twice near the cute cat things, is hy-motion doing the motion here too? I mean seriously, I wonder how that occurred.
aziib@reddit (OP)
yes, hy-motion has some issues when applied to the 3d model. i will replace it with something else, probably better animation from ai motion capture of video, like quick magic.
i_do_too_@reddit
Would love to learn how you created the animation
aziib@reddit (OP)
using hy-motion on comfyui, it's still janky and i will replace it later with a better one.
Complex_Tea_1244@reddit
Bravo!
ThePirateParrot@reddit
Ahah, i tried something similar at some point, with mixamo animations, a mood meter, a lifecycle, etc. Then i got bored and now it's in one of those abandoned-projects folders.
ThomasMalloc@reddit
Nice. The most important part for language learning is good feedback on actual speech. If you had feedback from audio recording, it could be legit.
Dazzling_Equipment_9@reddit
It's only suitable for demonstration, isn't it?
logseventyseven@reddit
man that's completely fine but why does it look like a child? I mean I do ERP too so I'm not anti-gooner or something but this is just messed up since you're calling it a "waifu app"
Woof9000@reddit
very science
Beautiful_Egg6188@reddit
Yoooooooo!!