brahh85

Me visiting this sub

Posted by Scutoidzz@reddit | LocalLLaMA | View on Reddit | 79 comments

Moss tts 1.5 8b Examples. It is the currently best voice cloning model for English as of June 2026

Posted by 9r4n4y@reddit | LocalLLaMA | View on Reddit | 52 comments

brahh85@reddit

i want to try this ggml implementation, but i barely have time [https://github.com/pwilkin/openmoss](https://github.com/pwilkin/openmoss)

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

Posted by mudler_it@reddit | LocalLLaMA | View on Reddit | 40 comments

brahh85@reddit

Thank you so much for the project, it's awesome because of the universal support for almost all hardware. Do you plan to include subtitle generation?

Don’t bite me for that question please…

Posted by Thin_Pollution8843@reddit | LocalLLaMA | View on Reddit | 79 comments

brahh85@reddit

People spend money on stupider things, like fancy cars, homes, vacations, home appliances, clothes, sneakers, smartphones... Spending money on hardware is money well spent when there is a global shortage, GPU prices are skyrocketing (my MI50 went up 3.5x in price in less than 6 months) and cloud providers are putting restrictions on plans (with more to come). To be honest, when I ordered them 6 months ago I had second thoughts for some time, but now that I'm seeing prices get crazy, I feel I did the right thing buying 3, and that I probably should have bought one more.

One letter to appease them all

Posted by ivari@reddit | LocalLLaMA | View on Reddit | 70 comments

It was fun while it lasted... They're advertising now.

Posted by Local-Cardiologist-5@reddit | LocalLLaMA | View on Reddit | 43 comments

brahh85@reddit

OP enjoys the idea that open weights will cease to exist and that labs will turn their backs on us. The problem with that view is that we have never had more open-weight models than now, we have never had more labs contributing, and the gap between closed weights and open weights is smaller than ever. We have models from 0B to 1.7T, and the best model per parameter, Qwen 3.6 27B, was released just one month ago by Alibaba. 2 weeks later Alibaba released SAE-Res Qwen 3.5, and then Qwen webworld.

Next year we're getting 0.5T model from Grok

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 200 comments

brahh85@reddit

That's kind of the problem: Grok released some weights, but in a way that adds nothing useful to the community. Cohere has a valuation of $7 billion and xAI of $250 billion, yet Cohere's contribution to the community was way more significant.

TTS Benchmark Comparison (all known TTS up until May 2026)

Posted by UkieTechie@reddit | LocalLLaMA | View on Reddit | 60 comments

brahh85@reddit

Related to TTS: running one on an MI50 is a bit chaotic because of PyTorch and its dependencies, but this one uses ggml [https://github.com/ServeurpersoCom/omnivoice.cpp](https://github.com/ServeurpersoCom/omnivoice.cpp) so it works with Vulkan, CUDA, Metal, CPU... and so far it's the best I've found for my language (I had to clone a voice to get the accent).

Is there any reason for an uncensored model if you have no interest in roleplaying?

Posted by vick2djax@reddit | LocalLLaMA | View on Reddit | 271 comments

brahh85@reddit

Some models would let you die rather than give you medical advice. Some models won't translate strong words. Some models won't let you debug a coding problem, because they think you are hacking. Some models won't answer historical questions. Some models won't answer questions about current news, even if you feed it to them, because they refuse to believe that scenarios like the USA bombing Iran could happen, and the model will think you are trying to write fake news. A heretic model saves you from a lot of this stupid corporate stuff.

GPT 5.5 "secret sauce" is just having the thinking be some stupid caveman mode?

Posted by JustFinishedBSG@reddit | LocalLLaMA | View on Reddit | 154 comments

Rejoice, if Qwen doesn't release any new local model, it's a blessing in disguise

Posted by crowtain@reddit | LocalLLaMA | View on Reddit | 19 comments

brahh85@reddit

>So yes let's pray that Qwen stop releasing their models

Sure, Qwen no longer releasing new models is the best thing for local AI, like suicide is the best way to live a life.

I fine-tuned Cohere Transcribe to support diarization and timestamps

Posted by iamMess@reddit | LocalLLaMA | View on Reddit | 25 comments

Qwen has no incentive to release new open source models quickly because the glazing on this sub makes it unnecessary.

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 41 comments

Qwen3.7 Max Preview hitting Arena 13 is a bigger signal than the rank

Posted by Top-Cardiologist1011@reddit | LocalLLaMA | View on Reddit | 3 comments

brahh85@reddit

I remember reading, when Junyang Lin quit, that the main problem Alibaba had was the lack of resources for the Qwen team, and that the resignation was a way to draw attention to this issue and make the bosses responsible for it. I think the bosses are trying to quiet the doubts by releasing models more often, so we forget about that episode.

They probably gave the Qwen team more resources, and Qwen also dropped the 122B or the 9B to upgrade the architecture faster, rather than delivering a full set of models per version.

We were surprised by 3.5 27B, then 3.6 27B left us astonished, and a 3.7 27B that surpasses that would increase the hype even more. Alibaba's strategy is probably what we see a lot in this subreddit: trying to produce a model so intelligent and so good at code that people pushed away from the clouds by prices and cuts start using Qwen, boosting the brand (as an alternative to Anthropic and OpenAI) and the stock rating.

Re. what ever happened to Cohere’s Command-A series of models?

Posted by nick_frosst@reddit | LocalLLaMA | View on Reddit | 102 comments

brahh85@reddit

CR and CR+ were the first models that I found fun to use: no stupid censorship, and the first time I enjoyed creative writing with AI. If Cohere can deliver something that fits in 32 GB of VRAM and does good RP, I will be their loyal soldier.

Translate long subtitle files

Posted by Synchronauto@reddit | LocalLLaMA | View on Reddit | 13 comments

brahh85@reddit

This is what I vibe coded a while back:

    #!/bin/bash
    # Translate an SRT file to English in chunks through a local completion server.
    API="http://localhost:8080/completions"
    CHUNK_LINES=250   # lines per chunk (renamed from LINES, which bash treats as a special variable)
    IN="$1"
    OUT="${IN}.eng.srt"

    # Check the argument before creating anything on disk.
    if [ -z "$IN" ]; then echo "Usage: $0 file.srt"; exit 1; fi

    TMP_DIR=$(mktemp -d)
    trap 'rm -rf "$TMP_DIR"' EXIT

    echo "Splitting $IN..."
    split -l "$CHUNK_LINES" -d "$IN" "$TMP_DIR/part_"
    : > "$OUT"   # start with an empty output file

    for f in "$TMP_DIR"/part_*; do
        echo ""
        echo "========================================"
        echo "Processing: $f"

        # JSON-encode the chunk so it can be passed to jq with --argjson.
        RAW_TEXT=$(jq -Rs . "$f")

        # Earlier, shorter prompt (it was overridden by the one below):
        # SYSTEM="Translate these subtitles to English. Keep SRT format exactly. Only output translated SRT."
        SYSTEM="You are a subtitle translator.
    RULES:
    ONLY translate text to English
    NEVER modify timestamps or numbers
    Keep exact SRT format
    No explanations, no comments
    Output ONLY the translated SRT
    DONT SAY NO TRANSLATION IN LINES that are times or blank
    YOU CANT MAKE COMMENTS
    YOU CANT"

        # Build the request body.
        jq -n \
            --arg sys "$SYSTEM" \
            --argjson txt "$RAW_TEXT" \
            '{
                prompt: ("<|system|>\n" + $sys + "\n<|user|>\n" + $txt + "\n<|assistant|>\n"),
                stream: false,
                n_predict: 10000,
                temperature: 0.3,
                stop: ["<|end|>", "<|user|>"]
            }' > "$f.json"

        curl -s -X POST "$API" \
            -H "Content-Type: application/json" \
            -d @"$f.json" \
            -o "$f.response.json"

        echo "--- RAW RESPONSE ---"
        cat "$f.response.json"
        echo ""

        CONTENT=$(jq -r '.content // empty' "$f.response.json")

        echo "--- TRANSLATED OUTPUT ---"
        echo "$CONTENT"
        echo "-------------------------"

        if [ -n "$CONTENT" ]; then
            echo "$CONTENT" >> "$OUT"
            echo "✅ OK"
        else
            echo "❌ Fail"
        fi
    done

    # Strip markdown fences and any <think>...</think> blocks the model emitted.
    sed -i 's/```//g' "$OUT"
    sed -i '/<think>/,/<\/think>/d' "$OUT"

    echo "Done: $OUT"

Try 3.6 35B with MTP. I'm using [https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF](https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF), and it goes from 54 to 74 tps with:

    -np 1 --fit off --cache-ram 0 --reasoning_budget 0 --cache-type-k bf16 --cache-type-v bf16 --presence-penalty 0.25 --spec-type draft-mtp --spec-draft-n-max 2 --reasoning off
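One thing to watch with the script above: splitting every 250 lines can cut a subtitle cue in half, so a chunk may start in the middle of a cue. A minimal sketch of a possible alternative, splitting on the blank lines between cues instead. The helper name `split_srt_by_cues` and its arguments are hypothetical (not part of the original script), and it assumes cues are separated by blank lines, as in a well-formed SRT file:

```shell
#!/bin/bash
# Hypothetical helper: split an SRT file into chunk files holding at most
# $2 cues each, cutting only on the blank lines between cues.
# Output files are named <prefix>000, <prefix>001, ...
split_srt_by_cues() {
    local in="$1" cues_per_chunk="$2" prefix="$3"
    # Strip carriage returns so CRLF files still have truly blank separator lines.
    tr -d '\r' < "$in" | awk -v n="$cues_per_chunk" -v prefix="$prefix" '
        BEGIN { RS = ""; ORS = "\n\n" }   # paragraph mode: one record per cue
        {
            if ((NR - 1) % n == 0) {      # open a new chunk file every n cues
                if (out != "") close(out)
                out = sprintf("%s%03d", prefix, chunk++)
            }
            print > out                   # write the cue plus a blank line
        }'
}
```

In the script this would stand in for the `split -l` call, for example `split_srt_by_cues "$IN" 60 "$TMP_DIR/part_"`. The value 60 is only an illustrative chunk size, and the `part_*` loop would pick up the resulting files unchanged.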

Why use Quants other than Unsloth

Posted by FeiX7@reddit | LocalLLaMA | View on Reddit | 41 comments

brahh85@reddit

For my MI50 and ROCm, bartowski always had the fastest quants: [https://www.reddit.com/r/LocalLLaMA/comments/1rmt315/2x\_mi50\_32gb\_quant\_speed\_comparison\_version\_2/](https://www.reddit.com/r/LocalLLaMA/comments/1rmt315/2x_mi50_32gb_quant_speed_comparison_version_2/)

It's also about stability. While unsloth makes a lot of changes to improve things, I find that bartowski's quants are more reliable. They behave the way I expect them to behave, and if something doesn't work, it's probably because I need to update llama.cpp. When I try other quants and find problems, sometimes the problem is the quant, or the template, or llama.cpp... I don't want to waste time debugging multiple possibilities.

There are also some models that are only quantized by mradermacher.

Also, unsloth doesn't do heretic models.

There are also ubergarm's quants for ik\_llama, which are the best for extreme quantizations like IQ1, IQ2 or IQ3 of huge models.

So there are plenty of reasons why it's great that there is more than one provider.

Weird performance depending on quant

Posted by WhiskyAKM@reddit | LocalLLaMA | View on Reddit | 8 comments

brahh85@reddit

For CPU it's like the previous redditor said (it converts to fp16 more easily). For GPU it's the same on old hardware (MI50), while newer GPUs have native int4, which triggers hardware acceleration. Different quants need different computation. This is a comparison: [https://www.reddit.com/r/LocalLLaMA/comments/1rmt315/2x\_mi50\_32gb\_quant\_speed\_comparison\_version\_2/](https://www.reddit.com/r/LocalLLaMA/comments/1rmt315/2x_mi50_32gb_quant_speed_comparison_version_2/). That's why, when I'm looking for high speed, I go with Q4 instead of IQ5 or IQ4, unless I need extra accuracy (coding).

GitHub - pwilkin/openmoss: OpenMOSS pure C++ pipeline based on GGML

Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 8 comments

brahh85@reddit

Beyond the hell that Python is, there is another hell: using old and non-Nvidia hardware with PyTorch, because many Python TTS engines just ignore it. So your project is a silver lining, because it gives support (thanks to ggml) and hardware acceleration to TTS, which is critical for this use case. Thank you so much for your altruism.

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

Posted by ai-infos@reddit | LocalLLaMA | View on Reddit | 80 comments

brahh85@reddit

ROCm is easier now, AMD supports it again, unofficially: [https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/comment/oku2hli/?context=3](https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/comment/oku2hli/?context=3)

Is SillyTavern the most underrated frontend? Could it be an interface with potential trapped in a silly name? Or is it just for a niche?

Posted by Spiderboyz1@reddit | LocalLLaMA | View on Reddit | 82 comments

brahh85@reddit

Yeah, so many trashy people in this thread. Their only contribution is disrespect that aims to demotivate the heroes who dedicate a part of their lives to creating and sharing something awesome and free with the community.

More Qwen3.6-27B MTP success but on dual Mi50s

Posted by legit_split_@reddit | LocalLLaMA | View on Reddit | 33 comments

brahh85@reddit

If you want to upgrade ROCm: [https://www.reddit.com/r/LocalLLaMA/comments/1syqoby/comment/oix3sw6/?context=3](https://www.reddit.com/r/LocalLLaMA/comments/1syqoby/comment/oix3sw6/?context=3)

I upgraded to 7.12 with [https://www.reddit.com/r/LocalLLaMA/comments/1s8thlo/build\_script\_for\_llamacpp\_for\_rocm\_including\_mi50/](https://www.reddit.com/r/LocalLLaMA/comments/1s8thlo/build_script_for_llamacpp_for_rocm_including_mi50/) and it was super easy.

vLLM ROCm has been added to Lemonade as an experimental backend

Posted by jfowers_amd@reddit | LocalLLaMA | View on Reddit | 93 comments

brahh85@reddit

About educational resources: the ones in the playbooks are "human friendly". What about having some "AI friendly" ones, for example hardcore-level examples and explanations that I can feed into Qwen 3.6 27B's context, say to write kernels for the GPU?

So if I want to write a kernel in HIP, on one side I have a human explanation, and on the other side I have the AI boosted with 100k tokens of context on the subject.

It would probably work better with 1T-parameter models, but the idea is the same.

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will.

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 253 comments

ZAYA1-74B-Preview: Scaling Pretraining on AMD

Posted by TKGaming_11@reddit | LocalLLaMA | View on Reddit | 34 comments

brahh85@reddit

Rather than something that beats Qwen 3.6, I want something that complements it. For example, Gemma is great for creative writing, and I would like more models like that.

DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.

Posted by spencer_kw@reddit | LocalLLaMA | View on Reddit | 166 comments

brahh85@reddit

These recent shortages of cloud tokens weren't caused by SaaS, but by openclowns. They are worse than the companies that will crash. In fact, those companies are also being screwed by them now.

DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.

Posted by spencer_kw@reddit | LocalLLaMA | View on Reddit | 166 comments

brahh85@reddit

There won't be outages, simply because prices will kick many people out first. It's the same thing that happens with healthcare: the hospitals are supplied, but many people can't afford to go to them.

Common and Obscure Models and Ways to Find Them [ Human Written ]

Posted by iMakeSense@reddit | LocalLLaMA | View on Reddit | 20 comments

brahh85@reddit

>I don't know why people keep touting whisper. These are more accurate, hallucinate less, and or run faster

Not true for non-English. And for English, people already have working workflows using Whisper, and when they see the alternative models make a mistake, they prefer the old (and predictable) one over the new. In my case, for English audio, Parakeet wasn't able to catch a lot of the words that Whisper did, probably because of the nature of the audio I gave it (not clean), so it was less useful than Whisper (if Parakeet was at 70%, Whisper was at 95%). And for Spanish audio, Parakeet started to output English in the middle of the transcription, so it was trash for that use case.

>Feels like GIMP vs. Krita.

That's how you make GIMP users hate you for free.

>Whisper hallucinates because it's train off Youtube data.

Exactly my use case: transcribing podcasts from YouTube.

Common and Obscure Models and Ways to Find Them [ Human Written ]

Posted by iMakeSense@reddit | LocalLLaMA | View on Reddit | 20 comments

brahh85@reddit

Well, this is not the standard, but it's the closest relative to llama.cpp: [https://github.com/leejet/stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)

DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.

Posted by spencer_kw@reddit | LocalLLaMA | View on Reddit | 166 comments

brahh85@reddit

The problem with cloud tokens is that soon we won't be able to pay for them, because demand will exceed supply by a lot. That's why multiple plans have been cut in recent weeks. In the near future we will be very happy to have bought a GPU and to turn kWh into millions of tokens per day, because we won't have any other source of inference. And the next step will be buying solar panels, because datacenters will end up absorbing the people's power, especially in some regions of the USA.

Open Weights Models Hall of Fame

Posted by Equivalent_Job_2257@reddit | LocalLLaMA | View on Reddit | 32 comments

Kv cache quantization: ignorance, or malice?

Posted by wombweed@reddit | LocalLLaMA | View on Reddit | 94 comments

brahh85@reddit

talking about models and how they resist cache quantization [https://www.reddit.com/r/LocalLLaMA/comments/1suh3sz/gemma\_4\_and\_qwen\_36\_with\_q8\_0\_and\_q4\_0\_kv\_cache/](https://www.reddit.com/r/LocalLLaMA/comments/1suh3sz/gemma_4_and_qwen_36_with_q8_0_and_q4_0_kv_cache/)

GPT 5.5 just leaked its chain of thought to me in codex, and it looks like an idea from 5 months ago in this sub.

Posted by Homeschooled316@reddit | LocalLLaMA | View on Reddit | 61 comments

brahh85@reddit

That's like speculative prefill [https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash\_10x\_prefill\_speedup\_over\_llamacpp\_at\_128k/](https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/), though speculative prefill is fancier.

Help with MI50 and llama.cpp/ROCm 7.2

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 7 comments

brahh85@reddit

This is the one I tried: [https://www.reddit.com/r/LocalLLaMA/comments/1pkvc85/comment/ntysctk/?context=3](https://www.reddit.com/r/LocalLLaMA/comments/1pkvc85/comment/ntysctk/?context=3)

This is another that should work: [https://www.reddit.com/r/LocalLLaMA/comments/1s8thlo/build\_script\_for\_llamacpp\_for\_rocm\_including\_mi50/](https://www.reddit.com/r/LocalLLaMA/comments/1s8thlo/build_script_for_llamacpp_for_rocm_including_mi50/)

This is a comparison of ROCm and Vulkan on the MI50: [https://www.reddit.com/r/LocalLLaMA/comments/1rmt315/2x\_mi50\_32gb\_quant\_speed\_comparison\_version\_2/](https://www.reddit.com/r/LocalLLaMA/comments/1rmt315/2x_mi50_32gb_quant_speed_comparison_version_2/)

What command are you using for llama.cpp? What quant is your model?

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B

Posted by Holiday_Purpose_3166@reddit | LocalLLaMA | View on Reddit | 24 comments

brahh85@reddit

This swaps natural language for another language with a higher density of information per token, so the CoT is kept and shouldn't be damaged too much. In fact, for the MoE it even improves, and for the dense model it stays the same. A while back I tried making models use Chinese for their reasoning traces, for the same reason.

Deepseek V4 AGI comfirmed

Posted by Swimming-Sky-7025@reddit | LocalLLaMA | View on Reddit | 186 comments

brahh85@reddit

This is the gap between chinese models and american models, the models that the pentagon use in iran would have killed all the kids and given all the oranges to trump.

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

brahh85@reddit

The part about not releasing the 397B weights is an assumption you made based on zero facts, unless you want to give credit to Twitter rumors, or to media outlets that elevated those rumors to news (clickbait) without providing a single piece of evidence. So far, all the Qwen LLMs have had their weights released.

Question about downloading AI to run locally

Posted by Individual-Party1661@reddit | LocalLLaMA | View on Reddit | 6 comments

brahh85@reddit

In theory, you don't need to uninstall Windows. I would install Ubuntu on one of the SSDs you have, and then Ubuntu's boot loader (GRUB) lets you choose whether to boot Linux or Windows every time you turn on the PC.

I have a different OS installed on each of my SSDs, because that way, if something breaks, I always have another one to use.

Also, if one day you want to upgrade your SSD to a better one, this makes it easier to clone the SSD with your OS to the new disk and enjoy the speed without having to reinstall. Right now I'm using Ubuntu 24.04.

Question about downloading AI to run locally

Posted by Individual-Party1661@reddit | LocalLLaMA | View on Reddit | 6 comments

brahh85@reddit

With just the 3060, try Qwen 3.5 9B.

If we count your RAM plus the 3060, try Gemma 4 26B. It should be a bit slow, because part of the model will sit in RAM, but you will get a good idea of what you can get.

If you like it but consider it too slow, then buy the Nvidia P102-100 to speed things up. If you are looking at buying an old GPU, I would recommend the more VRAM the better; that's why I bought MI50s with 32 GB. 10 GB looks like an improvement, but maybe for just a bit more money you can get a GPU that lets you do more things, like the difference between loading a 9B model and loading Qwen 3.5 27B or Gemma 4 31B. It's like comparing a knife with a sword.

From my perspective, if you aren't doing too much, the 3060 and your RAM could probably be enough for your needs.

About the OS: give Linux a try, and if you like it and things run smoothly, just move to Linux. I've been using it for the last 20 years. For some setups Linux is the only way to unlock the full potential of the hardware, because Windows is just bloatware.

About the GPU: other people may recommend buying a 3090, and if you can afford it, I support that idea too. For example, if you are coding all the time with Claude Code, the combo of the 3060 and the 3090 looks great.

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 117 comments

brahh85@reddit

can you try this as draft model? [https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) any result would be interesting , it will show if the abliteration degrades the model or not, or if some areas get improved (probably math and coding)

32 gb or 64 gb of ddr5

Posted by Worried-Register4465@reddit | LocalLLaMA | View on Reddit | 10 comments

brahh85@reddit

As a daily driver I would use a model around 30B, plus a 100B-400B one for world knowledge or for the tasks that the 30B can't accomplish, so I don't have to resort to API models. As long as it writes faster than I read, dual channel works OK.

32 gb or 64 gb of ddr5

Posted by Worried-Register4465@reddit | LocalLLaMA | View on Reddit | 10 comments

brahh85@reddit

Thanks for the tip, I plan to buy the motherboard and the CPU after i get the ddr5. If only we had quadchannel CPU by then it would be perfect.

FlashAttention (FA1–FA4) in PyTorch - educational implementations focused on algorithmic differences

Posted by shreyansh26@reddit | LocalLLaMA | View on Reddit | 1 comments

brahh85@reddit

Thank you so much. I have been looking for something like this since [https://www.reddit.com/r/LocalLLaMA/comments/1s614i8/built\_a\_simple\_pytorch\_flashattention\_alternative/](https://www.reddit.com/r/LocalLLaMA/comments/1s614i8/built_a_simple_pytorch_flashattention_alternative/)

Writing kernels for my GPU is something I never thought I would be able to do. I have always had Linus's line in my mind: "Do you pine for the nice days of minix-1.1, when men were men and wrote their own device drivers?"

32 gb or 64 gb of ddr5

Posted by Worried-Register4465@reddit | LocalLLaMA | View on Reddit | 10 comments

brahh85@reddit

If you can afford it, go as big as you can. With time, 64 GB will end up being useful. But if you buy 32, you will start regretting having spent the money there and not on 64. If prices were more decent, I would recommend you go for 128. That's what I have in mind: buy 128 when the prices are right.

FT - China’s Alibaba shifts towards revenue over open-source AI

Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 132 comments

Hugging Face launches a new repo type: Kernels

Posted by clem59480@reddit | LocalLLaMA | View on Reddit | 24 comments

brahh85@reddit

It isn't [https://github.com/goabiaryan/awesome-gpu-engineering](https://github.com/goabiaryan/awesome-gpu-engineering): on Hugging Face you get the source code in 2 clicks. The GitHub repo is nice as a guide for learning, but the process of feeding code to an AI is way more tortuous.

Hugging Face launches a new repo type: Kernels

Posted by clem59480@reddit | LocalLLaMA | View on Reddit | 24 comments

brahh85@reddit

i want to vibe code some kernels, and having a selection of good ones in the same place saves me a lot of effort in my experiments [https://huggingface.co/kernels-community/kernels](https://huggingface.co/kernels-community/kernels)

EXAONE 4.5 released

Posted by Secure_Smoke_4280@reddit | LocalLLaMA | View on Reddit | 42 comments