making my own ai waifu app that can teach me any language.
Posted by aziib@reddit | LocalLLaMA | View on Reddit | 61 comments
using gemma-4-E4B-it for the llm
her voice uses omnivoice tts, with an api i made using fastapi
3d model made by me using vroid studio
right now it supports image uploads, web search, and voice and video calls like grok ani.
i'm surprised that the gemma 4 model can follow my prompt well without uncensoring the model.
Haroombe@reddit
Tbh i would not want to learn a language from an AI, it sounds so unnatural and uncanny. You are better off watching youtube videos or using anki
finevelyn@reddit
This may be true, assuming it has no effect on how much time you will spend learning, which is a big assumption to make.
Haroombe@reddit
I don't understand. Are you saying that this product (ai waifu) will get you to spend more time learning a language, and for that reason it will be helpful?
If that is your argument I would say just watch v-tubers in your language, or if someone is a goon maxxer watch AVs or adult stuff lol.
finevelyn@reddit
Pretty much. More time spent and more engaged. It doesn't matter if it's a waifu app or anything else that keeps you engaged, as long as you keep doing it.
Just watching videos you will most likely get bored out of your mind, and then stop learning altogether.
Haroombe@reddit
I would argue watching videos is more interesting than typing or talking to a chatbot. People already doomscroll or watch video essays for countless hours, that's the problem of our generation. Now imagine if you started watching the content you love in the language you are trying to learn. Sure you won't understand at the beginning, but if the content is in a genre or niche you already know and are interested in (say cars, for example), your interest can supersede the language barrier.
z_latent@reddit
I can confirm that works. A friend of mine learned Japanese incredibly fast (<1 year) and incredibly well by watching tons and tons of let's plays in Japanese. He said music lyrics also helped in his case.
But I'd say, if someone finds it interesting to chat with an AI, it could work fine for them too. As long as it has correct pronunciation and usage of the language.
Blastronomicon@reddit
You’re correct btw. The more time spent interacting with and using the language, the faster it is learned. The language learning process is all but understood EXCEPT for how the Defense Language Institute does it, because they claim local, micro-level societal fluency for graduates in record time - as in, graduates not only learn the whole language, they also learn a specific micro dialect constrained down to village-level fluency.
All we really know is that the program starts with zero at-home language usage. I believe cohorts are sectioned off, but some documentaries also say they’re mixed for out-of-class time and have to use the language they study.
KitsuMateAi@reddit
Is it a Unity application, or web based? Share your secrets
Just wondering, as Unity has support for ONNX models, which allows running them from the application.
Tomstachy@reddit
Is it unity app? Or web based?
ELPascalito@reddit
Albeit the VRM bones are jittery, this looks and sounds lovely! Are you running both the LLM and TTS on the same machine? I presume this requires a moderately strong setup, especially in memory capacity, no?
aziib@reddit (OP)
yes, still figuring out how to fix the jitter, probably because the animation is generated using hy-motion. yes, the llm + tts run on the same machine. omnivoice is quite light, it only takes 4gb if you load it without the whisper model and prepare the caption text beforehand. for the llm i'm using the q4_k_m quantized gemma 4 e4b it with a 64k context window, which still fits in my 16gb vram with flash-attention on.
ELPascalito@reddit
How lovely, but yeah the memory needed is no joke, especially since this style of app interaction requires fast speeds, particularly for text generation so the TTS can start asap (unless you're streaming, perhaps). anyhow, best of luck!
honglac3579@reddit
I did it using piper tts, it's quite fast even on cpu, which lets my pipeline achieve real-time conversation
ELPascalito@reddit
True, but Piper's quality was not enough to match my goal. it seems TTS models are either focused on full quality or max speed, with no medium quality-to-size option 😅
honglac3579@reddit
Indeed, i often have to tweak and tune the post-processed audio to achieve my expected result (still not quite what i expected), while a larger model can do it by just adding some text or audio for zero-shot voice cloning :v
aziib@reddit (OP)
yes, still figuring out the memory problem. right now i just keep a maximum of 30 messages in the log, so she forgets early messages and it doesn't fill up the context window.
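The trimming OP describes can be sketched as a simple sliding window over the chat log (function and variable names here are illustrative, not from the actual app):

```python
MAX_MESSAGES = 30  # keep only the most recent messages

def trim_history(history, max_messages=MAX_MESSAGES):
    """Drop the oldest messages so the prompt stays within the context window."""
    if len(history) <= max_messages:
        return history
    return history[-max_messages:]

# usage: the system prompt is normally kept separately and re-prepended each turn,
# so only the conversational turns get trimmed
history = [{"role": "user", "content": f"msg {i}"} for i in range(50)]
history = trim_history(history)
print(len(history))  # 30
```

A count-based window is the simplest approach; a token-based budget (summing per-message token counts until the limit is hit) would track the 64k context more precisely.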
SmartCustard9944@reddit
For animations you can typically blend between them, but I'm not sure which stack you are using.
ego100trique@reddit
Everyday we stay further away from god lmao
Glittering_News_1455@reddit
hey btw if you face issues with censorship there is this variant
https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
I tested it and didn't face any censoring on pretty wild stuff
and also, does Omnivoice work in realtime for you? on my GTX 1050 it takes 20 seconds to generate a cloned voice
aziib@reddit (OP)
i'm using an rtx 4060 ti, and it's fast, near instant as you can see. i also call the tts api sentence by sentence instead of on the whole message to make it faster, and modified it with a custom api that sets the steps to 32.
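The sentence-by-sentence approach OP mentions can be sketched roughly like this, with a naive regex splitter and a stubbed `synthesize` callback standing in for the real FastAPI TTS route (all names here are hypothetical):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def speak_streaming(text, synthesize):
    """Hand each sentence to the TTS backend as soon as it is ready,
    instead of waiting for the whole LLM reply to be voiced at once."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)  # in the real app: POST to the TTS endpoint

reply = "Hello! I can help you practice. Shall we start with greetings?"
audio_chunks = list(speak_streaming(reply, synthesize=lambda s: f"<audio:{s}>"))
print(len(audio_chunks))  # 3
```

The win is latency: playback of the first sentence can start while later sentences are still being synthesized.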
shoraaa@reddit
i'm literally making something similar lol, with less focus on the frontend and more on an autonomous waifu (incorporating a proactive agentic mindset into a 2d waifu)
FerLuisxd@reddit
How do you handle searching the web? Brave api?
aziib@reddit (OP)
duckduckgo search
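OP only says "duckduckgo search"; one common way to wire that in from Python is the `ddgs` package (an assumption, not confirmed by OP). A minimal sketch of turning search hits into LLM context, with the network call stubbed out:

```python
# one common route is the `ddgs` package (pip install ddgs):
#   from ddgs import DDGS
#   results = DDGS().text("how to learn japanese", max_results=3)
# each result is a dict with roughly "title", "href"/"url", and "body" keys.

def results_to_context(results, max_chars=1500):
    """Flatten search hits into a plain-text block to prepend to the LLM prompt."""
    lines = [f"- {r['title']}: {r['body']}" for r in results]
    return "\n".join(lines)[:max_chars]

# stubbed results so the sketch runs without network access
fake = [
    {"title": "JLPT guide", "body": "Levels N5 to N1 explained."},
    {"title": "Hiragana chart", "body": "All 46 basic kana."},
]
print(results_to_context(fake))
```

Capping the context length matters here, since search snippets otherwise eat into the 30-message budget OP mentioned.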
misha1350@reddit
Man-made horrors
mpasila@reddit
E4B is probably not big enough for other languages outside of English (other than maybe Spanish and some other large languages), at least I didn't have much success. The bigger models perform much better.
Training-Event3388@reddit
This should be illegal
ransuko@reddit
Hmm. Is this... Tsundere... Mesugaki?
jikilan_@reddit
Oh come on! this is what you use an LLM for?
By the way, where is the GitHub url? 😁
aziib@reddit (OP)
it still has so many bugs. i've been thinking of making it open source once it's ready.
jikilan_@reddit
Any plan to give her access to the camera? I bet the interaction would be very interesting
RedParaglider@reddit
We're curing cancer, right?
_-_David@reddit
Cool use of a small model. Do you plan to use the audio input capability of the model? If this is 90% tsundere waifu, no biggie. But if you're seriously interested in using it to learn another language, I'd make some adjustments.
aziib@reddit (OP)
yes, i've been working on voice input. the only downside is that with the built-in browser speech to text, waiting for the ai's response takes much longer than just typing the message and hitting enter (i think it's a bug). i also tried setting up whisper, but that causes delay too because it runs on the cpu. still looking for a speech-to-text alternative that is fast.
honglac3579@reddit
I use the groq api for the whisper model, you can try that as an experiment
_-_David@reddit
nvidia parakeet is fast and accurate. I'd try that out. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
honglac3579@reddit
Can't wait to see it on github, my man of culture
honglac3579@reddit
Btw, you could add a speech-to-text model so you can communicate with your waifu voice to voice, that would be blissful
SkyNetLive@reddit
You can’t waifu without ai. It’s right there
martinerous@reddit
Good stuff.
Some time ago I had an idea to build something like that using Nvidia's Audio2Face but, as usually happens, did not have enough time. But at least I started something - finetuned my own FasterWhisper Turbo model for the Latvian language with a lower WER, finetuned VoxCPM to speak Latvian (and now they released VoxCPM2 and I need to train it again LOL), created my own UI frontend app for adventure roleplays... but no 3D avatars yet - I'm secretly hoping that somebody will create an out-of-the-box "drop in your photo reference, get a real-time TTS talking head" solution, but nothing like that is in sight yet.
fagenorn@reddit
Super nice way to learn a new language!
I recommend you check out Qwen3 TTS for the voice. I am working on finetuning a voice for my app and the quality is blowing my mind (vs kokoro, which I was using before).
It is a bit heavier: using gguf it takes around 2-3 gigs of vram and the RTF is around 0.2, but it's so worth it once you get it working. demo https://voca.ro/1gjKTnWxzwAP
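For readers unfamiliar with the metric: a real-time factor (RTF) of 0.2 means synthesis takes 0.2 seconds per second of output audio, so the rough latency estimate is just a multiplication:

```python
def tts_latency(audio_seconds, rtf=0.2):
    """Rough synthesis time: RTF * duration of the output audio.
    RTF < 1.0 means faster than real time."""
    return rtf * audio_seconds

# a 5-second sentence would take about 1 second to synthesize
print(tts_latency(5.0))  # 1.0
```

Combined with sentence-by-sentence streaming, an RTF of 0.2 keeps playback comfortably ahead of generation.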
aziib@reddit (OP)
i want to use it too, but it's too heavy on the vram. with omnivoice i can manage only 3gb of vram by loading the model without whisper and manually transcribing the reference voice with text. it also supports many languages, which is why i'm using it.
NoLeading4922@reddit
How does her motion work? Is it also generated by some ai model?
aziib@reddit (OP)
the motion is generated using hy-motion, but it's not live. i generate the motions first and save them as fbx, and they get played in the app on certain emotions.
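OP's emotion-triggered playback of pre-baked clips could look roughly like this; the mapping, file paths, and fallback behavior are all hypothetical, not from the actual app:

```python
# hypothetical mapping from detected emotion tags to pre-baked FBX clips
EMOTION_CLIPS = {
    "happy": "anims/happy_wave.fbx",
    "sad": "anims/sad_idle.fbx",
    "neutral": "anims/idle.fbx",
}

def pick_clip(emotion):
    """Fall back to the neutral idle clip for unknown emotions."""
    return EMOTION_CLIPS.get(emotion, EMOTION_CLIPS["neutral"])

print(pick_clip("happy"))     # anims/happy_wave.fbx
print(pick_clip("confused"))  # anims/idle.fbx
```

The emotion tag itself would typically come from the LLM (e.g. asking it to prefix replies with a mood label the backend strips before TTS).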
NoLeading4922@reddit
You can try dartcontrol, it's live
Conscious-Hair-5265@reddit
You gotta work on the jiggle physics
ProfessionalSpend589@reddit
Ok, now I see how LLMs can be dangerous.
semperaudesapere@reddit
What model are you using? This isn't natural-sounding english.
sunshinecheung@reddit
lol, looks good
InstaMatic80@reddit
Wow, that looks amazing! Would you please share more info about how to do it? I mean, you created a 3d model, but how do you integrate the animations and audio? What’s the backend? Are you planning to open source it?
PangurBanTheCat@reddit
I'm surprised more people haven't done this yet. I think Grok did something like it? But I haven't heard anything since.
tbh quite a large amount of people use AI for less wholesome purposes... just seems like a match made in heaven to add a visual waifu to one.
Complex_Tea_1244@reddit
I saw jitter twice near the cute cat things, is hy-motion doing the motion here too? I mean seriously, I wonder how that occurred.
aziib@reddit (OP)
yes, hy-motion has some issues when applied to the 3d model. i will replace it with something else, probably better animation from ai motion capture of video, like quick magic.
i_do_too_@reddit
Would love to learn how you created the animation
aziib@reddit (OP)
using hy-motion on comfyui, it's still janky and i will replace it later with a better one.
Complex_Tea_1244@reddit
Bravo!
ThePirateParrot@reddit
Ahah, i tried something similar at some point, with mixamo animations, a mood meter, a lifecycle, etc. Then i got bored and now it's in one of those abandoned-projects folders.
ThomasMalloc@reddit
Nice. The most important part for language learning is good feedback on actual speech. If you had feedback from audio recording, it could be legit.
Dazzling_Equipment_9@reddit
It's only suitable for demonstration, isn't it?
logseventyseven@reddit
man that's completely fine but why does it look like a child? I mean I do ERP too so I'm not anti-gooner or something but this is just messed up since you're calling it a "waifu app"
Woof9000@reddit
very science
Beautiful_Egg6188@reddit
Yoooooooo!!