What's the best NSFW (roleplay) model I can run on an RTX 3060 12GB?
Posted by gfy_expert@reddit | LocalLLaMA | 48 comments
Greetings! What NSFW model can I run on an RTX 3060 12GB so that it fits in VRAM?
I'm OK with something smaller, like 8B, whether a finetune or a standard model.
Is the Kobold app still a good/top choice these days?
Other specs: 5700X3D overclocked, 32GB 3600, 990 Pro 1TB.
Any help would be really appreciated!
Thank you so much!!! And please don't downvote; I've searched a lot, but there have been no relevant posts in the past few months.
PangurBanTheCat@reddit
Cydonia-22B-v1.2-IQ4_XS.gguf
gemma-2-27b-it-SimPO-37K.i1-IQ3_XXS.gguf
EVA-Qwen2.5-32B-v0.0-Q2_K.gguf
magnum-v4-22b-IQ4_XS.gguf
Mistral-Small-22B-ArliAI-RPMax-v1.1.i1-IQ4_XS.gguf
input_a_new_name@reddit
lyra-gutenberg-mistral-nemo, mistral-nemo-gutenberg-doppel, violet_twilight-0.2
These three are, IMO, the best all-around 12B finetunes for RP, and I've tried basically everything in the 12B range at this point.
ArsNeph@reddit
I'm a fellow 3060 12GB user. I wouldn't run anything less than 7B, so at 8B, Llama 3 finetunes like Stheno 3.2 8B and similar models are quite good. Frankly, though, they're a little dumb compared to larger models. I'd highly recommend moving up to Mistral Nemo 12B and its finetunes: at Q5KM you can fit 16k context, and at Q6 you can fit 8k. You should get around 15 tk/s. I'd recommend UnslopNemo 12B and Magnum V4 12B; I've also heard that Starcannon is quite good. If you want to run an even better model, I'd recommend Mistral Small 22B at Q4KM with partial offloading, which should get you about 5-8 tk/s. Notable finetunes are Cydonia, Magnum V4, and ArliAI RPMax. I wouldn't go higher than 22B; it starts becoming way too slow. Make sure you use DRY and the correct instruct template.
firefox56@reddit
How are you doing these calculations of what can fit in 12 gigs of VRAM at the different quants?
Or did you just do trial and error? I've been doing trial and error, but it takes forever because of the file sizes and download times. I'm hoping to figure out some formula so I can just download the ones I know are likely to fit.
Quiet_Joker@reddit
Here is what I can fit on my GPU with llama.cpp, 32GB of RAM, and 12GB of VRAM:
I can run 1-14B models at 8 bits (30-33 layers sent to the GPU, the rest in RAM),
22B at 8 bits (22 layers sent to the GPU),
27B at 5 bits (16 layers sent to the GPU),
and 32B at 5 bits (~11 layers sent to the GPU).
Speed suffers, but this is what I can run using the highest-quality quant my system can take without caring about performance.
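If you want to set a split like this up from Python instead of a launcher GUI, here's a minimal sketch using llama-cpp-python; the model path and layer count are placeholders borrowed from the 22B row above, not a tested config.

```python
# Minimal partial-offload sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path and layer split are illustrative placeholders, not a tested recipe.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Cydonia-22B-v1.2-IQ4_XS.gguf",  # any local GGUF file
    n_gpu_layers=22,  # layers kept in VRAM; the rest stay in system RAM
    n_ctx=8192,       # context window; more context means a bigger KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```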
ArsNeph@reddit
I've got you :) If you just need a quick estimate of how large your model is, a simple rule of thumb is at 8-bit, 1 billion parameters equals 1 gigabyte. So on average, Q8 14B is 14GB, and Q8 34B is 34GB. At 4-bit, the file size becomes just about half of that, so a Q4 14B is about 7GB, and Q4 34B is about 17GB. Note that Q4KM quants are slightly higher bit, around 4.65, and therefore slightly larger. Then, take this number, and add 2-3 GB for context overhead. That is a rough ballpark of whether it will fit completely in VRAM or not.
However, if you want a more precise number, I highly suggest using this VRAM calculator https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
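If you'd rather not eyeball it, that rule of thumb is easy to put into a few lines of Python. The numbers below just encode the estimate above (bytes per parameter from the bit width, plus 2-3 GB for context overhead); treat it as a ballpark, not a guarantee.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     context_overhead_gb: float = 2.5) -> float:
    """Rough VRAM estimate: model weights plus a flat allowance for context.

    Encodes the rule of thumb above: at 8-bit, 1B params ~= 1 GB, so weight
    size scales linearly with bits per weight. Context overhead is a guess.
    """
    weights_gb = params_billion * bits_per_weight / 8.0
    return weights_gb + context_overhead_gb

# Examples matching the figures above:
print(round(estimate_vram_gb(14, 8.0), 1))   # Q8 14B     -> ~16.5 GB (14 GB weights + overhead)
print(round(estimate_vram_gb(14, 4.65), 1))  # Q4KM 14B   -> ~10.6 GB
print(round(estimate_vram_gb(22, 4.65), 1))  # Q4KM 22B   -> ~15.3 GB, too big for 12 GB alone
```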
gnat_outta_hell@reddit
Newbie here: what is DRY?
ArsNeph@reddit
DRY, short for "Don't Repeat Yourself", is a sampler that's been implemented in a lot of frontends and backends. It prevents the model from repeating itself as more information enters the context. It can be considered a replacement for repetition penalty, though the two can also be used together. I can pretty confidently say that it does a reasonable job at reducing repetition.
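If your backend exposes DRY over its HTTP API, turning it on is just a matter of sending the sampler fields with the request. The hedged sketch below assumes a llama.cpp-style /completion endpoint on localhost and commonly used field names (dry_multiplier, dry_base, dry_allowed_length); names and defaults vary between frontends, so check your backend's docs.

```python
# Hedged sketch: enabling DRY via a local llama.cpp-style /completion endpoint.
# The DRY field names are assumptions; verify them against your backend's API docs.
import requests

payload = {
    "prompt": "You are a roleplay partner. Continue the scene:\n",
    "n_predict": 200,
    "temperature": 0.8,
    "dry_multiplier": 0.8,    # 0 disables DRY; ~0.8 is a common starting point
    "dry_base": 1.75,         # how sharply the penalty grows with repeat length
    "dry_allowed_length": 2,  # short repeats up to this length are not penalized
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
print(resp.json()["content"])
```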
ChengliChengbao@reddit
This one is a tad outdated, but Fimbulvetr V2 11B is still solid. I also like Lyra V4 12B, which is based on a newer model.
gfy_expert@reddit (OP)
can you please post download links?
ChengliChengbao@reddit
Here are the GGUF quants:
Fimbulvetr: https://huggingface.co/Lewdiculous/Fimbulvetr-11B-v2-GGUF-IQ-Imatrix
Lyra: https://huggingface.co/Lewdiculous/MN-12B-Lyra-v4-GGUF-IQ-Imatrix
Also, if you want something smaller, Stheno is pretty nice too.
Stheno: https://huggingface.co/Lewdiculous/Llama-3.1-8B-Stheno-v3.4-GGUF-IQ-Imatrix
gfy_expert@reddit (OP)
Fimbulvetr is censored, and Lyra is giving me errors.
ChengliChengbao@reddit
Fimbulvetr has never refused my prompts before, like ever. Must be your configs, try changing your system prompt.
gfy_expert@reddit (OP)
thanks! trying my luck with MN-12B-Lyra-v4-Q6_K-imat.gguf and Ministral-8B-Instruct-2410-Q6_K_L.gguf
Mission_Bear7823@reddit
I've heard good things about Mistral Small / 12B finetunes, haven't tried them myself though.
mayo551@reddit
Cydonia.
With llama.cpp, you can fit (almost) every layer in 11GB of VRAM for the Q2 GGUF with ~20k context if you quantize the context.
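For reference, here's a hedged sketch of what quantizing the context can look like from llama-cpp-python; the parameter names (flash_attn, type_k, type_v) and GGML constants are assumptions that depend on your build, so treat it as a pointer rather than a recipe.

```python
# Sketch of a quantized KV cache with llama-cpp-python; parameter names and
# GGML type constants are assumptions that may vary across versions.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="models/Cydonia-22B-v1.2-Q2_K.gguf",  # placeholder path
    n_gpu_layers=-1,                  # try to keep every layer in VRAM
    n_ctx=20480,                      # ~20k context, as described above
    flash_attn=True,                  # needed for quantized KV cache in llama.cpp
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize the K cache to 8-bit
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # quantize the V cache to 8-bit
)
```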
chloralhydrat@reddit
I've used his Rocinante (12B) model so far, with a Q4 quant. I've never run a model with Q2 quantization. I wonder how such coarse quantization with more parameters compares to a more finely quantized, less parametrized model? Does anybody have experience with this (i.e. 12B Q4 vs. 22B Q2)?
mayo551@reddit
You can run a Q3 Cydonia on a 2080 Ti with 51 out of 57 layers on the GPU at 10k Q8 context. You could shed a layer or two and add more context, or you could probably use Q4 context at 24k-30k.
With 51 out of 57 layers on the GPU, it processes 7k of context in about a minute and finishes the response in about a minute and a half.
chloralhydrat@reddit
I'm not asking about the speed, but rather about the quality of the output.
mayo551@reddit
Allow me to clarify even further!
You said you'd never run a model with a Q2 quant. I was letting you know it's still possible to run it at Q3 with 11GB of VRAM.
You would likely be able to run Q4 with 12GB VRAM with decent performance.
If you are wondering about quality of output, download it and find out?
mayo551@reddit
Try it and tell us!
gfy_expert@reddit (OP)
Any download links, please?
mayo551@reddit
Google the following:
“The drummer huggingface”
This will bring up a list of all of TheDrummer's models (almost entirely ERP models). Find Cydonia (it's one of the recent ones) and grab the GGUF quant you want.
gfy_expert@reddit (OP)
Problem is, it's 22B and I'm afraid it won't fit in my modest 12GB of VRAM.
supereatball@reddit
You can offload it into ram though...
gfy_expert@reddit (OP)
AI: My policy is to avoid generating or discussing NSFW
mayo551@reddit
It fits like 52 layers on my 11GB 2080ti and the rest goes on system RAM.
Because most of the layers are on the GPU it’s still fast and responsive.
Since you have 12GB of VRAM instead of 11, you can likely fit the entire Q2 on your GPU with around 10k context.
Linkpharm2@reddit
GGUF is slower and should be used when you need to offload to RAM. EXL2 is faster, and since you're trying to keep it all in VRAM, use that. TabbyAPI is good. I recommend Cydonia 3bpw.
BlipOnNobodysRadar@reddit
If I can piggyback off this post: I've been doing some side projects that use local LLMs through the API with LM Studio. I think this is obviously suboptimal, both because of GGUF inefficiency and the lack of ability to do batching.
Could you point me at better backends I could use? Does exl2 run on llama.cpp?
CheatCodesOfLife@reddit
I'm not sure there's anything inherently wrong with the format.
If I fully offload a model onto a single 3090, it's very fast these days.
That said, the lack of tensor parallelism makes it painfully slow when using more than one GPU.
You'd definitely want to use EXL2 or AWQ for multi-GPU.
I recommend TabbyAPI, an OpenAI API drop-in replacement (quick client example below):
https://github.com/theroyallab/tabbyAPI
There's also this self-contained chat web ui:
https://github.com/turboderp/exui
And of course the swiss army knife of LLM inference:
https://github.com/oobabooga/text-generation-webui
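Since TabbyAPI exposes an OpenAI-compatible endpoint, any OpenAI client can talk to it. A minimal sketch, assuming it's running on its default local port; the URL, API key, and model name are placeholders for whatever your own config uses:

```python
# Minimal sketch: pointing the official openai client at a local
# OpenAI-compatible server such as TabbyAPI. The URL, key, and model
# name are placeholders for your local setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed local TabbyAPI address
    api_key="your-tabby-api-key",         # whatever key you configured
)

reply = client.chat.completions.create(
    model="local-model",  # placeholder; TabbyAPI serves whichever model you loaded
    messages=[{"role": "user", "content": "Stay in character and greet the party."}],
    max_tokens=200,
)
print(reply.choices[0].message.content)
```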
BlipOnNobodysRadar@reddit
I need the local LLM to go through thousands of queries, so "fast enough" means something different at scale. If I were using it just 1:1 for chat, GGUF would be plenty fast.
Thanks for the links, you're the second person to recommend tabbyAPI so I'll try that first
CheatCodesOfLife@reddit
Right. I was responding in the context of this guy's post you piggybacked where he's doing role playing.
Exllamav2, while very fast and flexible (e.g. parallel processing with 3 or 5 GPUs, not just multiples of 2), is still designed with single conversations and queuing in mind.
If you're doing massive batch tasks like synthetic data generation, etc then you'll really want to look at vllm and AWQ quants.
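For that kind of bulk workload, vLLM's offline API will batch the prompts for you. A minimal sketch, with the AWQ model name as a placeholder rather than a recommendation:

```python
# Minimal vLLM batch-generation sketch; the model name is a placeholder and
# an AWQ quant is assumed, as suggested above for multi-query workloads.
from vllm import LLM, SamplingParams

prompts = [f"Summarize scenario #{i} in two sentences." for i in range(1000)]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="some-org/some-awq-model", quantization="awq")  # placeholder repo id
outputs = llm.generate(prompts, sampling)  # vLLM schedules and batches internally

for out in outputs[:3]:
    print(out.outputs[0].text)
```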
Anthonyg5005@reddit
Use TabbyAPI. It has batching and runs as an OpenAI- and KoboldAI-compatible API server.
Barafu@reddit
That's not as much of a difference now, since Kobold started using flash attention and quantized context
LoafyLemon@reddit
Prompt processing is still slower in Kobold, no?
gfy_expert@reddit (OP)
Any link, pretty please, to Cydonia 3bpw? And can I fit it in only 12GB of VRAM?
wakigatameth@reddit
NemoMix Unleashed 12B is the absolute champion. Quant 8, offload 32 layers to the GPU, and you get 6 tokens/s, which is workable.
Kat-@reddit
I grabbed `NemoMix-Unleashed-12B.Q5_K_M.gguf` and it was a huge upgrade from `c4ai-command-r-08-2024-Q4_K_M.gguf`
When `Mistral-Small-22B-ArliAI-RPMax-v1.1-GGUF` was released, I thought, "Nice. Let's try." It's weird, but, NemoMix is "playful" in a way I can't get ArliAI to be.
wakigatameth@reddit
NemoMix is also more controllable, because you can tell it to keep responses short and it can, at least for a while, comply. Other "creative" LLMs often struggle with this, including ArliAI RPMax.
gfy_expert@reddit (OP)
Bartowski or MarinaraSpaghetti? (Found via Google.)
JohannWolfgangGoatse@reddit
Bartowski is always a safe bet.
gfy_expert@reddit (OP)
Thanks! Trying my luck with NemoMix-Unleashed-12B-Q6_K_L.gguf.
shirotokov@reddit
taking notes for my dog
lacerating_aura@reddit
Recently tried Ministral 8B. It was good; I didn't expect that from an 8B. I got the Q8 quant because it allows about 90% offload in KoboldCpp. Haven't played with lower quants.
gfy_expert@reddit (OP)
can you post a download link pls?
lacerating_aura@reddit
Choose quant for yourself.
https://huggingface.co/bartowski/Ministral-8B-Instruct-2410-GGUF
BokuNoToga@reddit
Thank you kind stranger
gfy_expert@reddit (OP)
Thanks! After consulting ChatGPT, trying my luck with Ministral-8B-Instruct-2410-Q6_K_L.gguf.