what's the best NSFW (roleplay) model I can run on an RTX 3060 12GB?
Posted by gfy_expert@reddit | LocalLLaMA | View on Reddit | 90 comments
greetings, what NSFW model can I run on an RTX 3060 12GB so that it fits in VRAM?
I'm OK with something smaller, like an 8B, be it a finetune or a standard model
is kobold app still good/top choice now ?
other specs: 5700x3d overclocked, 32gb 3600, 990 pro 1tb
any help would be really greatly appreciated!
thank you so much!!! and please don't downvote, I've searched a lot but found no relevant posts from the past months.
tito-victor@reddit
I've got a 3060ti 8gb and 32GB of RAM, any recommendations for NSFW models?
gfy_expert@reddit (OP)
Read what was written here, pick your model and offload it to VRAM. Some programs that load LLMs will auto-offload the rest of the model to CPU RAM (DDR4/5)
ArsNeph@reddit
I'm a fellow 3060 12GB user. I wouldn't run anything less than 7B, so at 8B, Llama 3 finetunes like Stheno 3.2 8B and similar models are quite good. However, frankly, they're a little dumb compared to larger models. I'd highly recommend moving up to Mistral Nemo 12B and its finetunes; at Q5KM you can fit 16k context, at Q6 you can fit 8k. You should get around 15 tk/s. I'd recommend UnslopNemo 12B and Magnum V4 12B. I've also heard that Starcannon is quite good. If you want to run an even better model, I'd recommend Mistral Small 22B at Q4KM with partial offloading. You should get about 5-8 tk/s. Notable finetunes are Cydonia, Magnum V4, and ArliAI RPMax. I wouldn't go higher than 22B, it starts becoming way too slow. Make sure you use DRY and the correct instruct template
Alternative_Welder95@reddit
I know I'm a little late, but what settings do you use in SillyTavern? On Magnum's Hugging Face page it recommends context and instruct templates, but I don't see anything about temperature, min P, etc.
ArsNeph@reddit
Quite late indeed, but no matter. Personally, I hit neutralize samplers, leave temp at 1, set min P to 0.02, and DRY to 0.8. Context length depends on the individual model; check the RULER benchmark to see true context lengths. Mistral Nemo advertises 128k, but actually only supports 16k. Setting it any higher than the native context length will cause severe degradation. You may want to take a look at MarinaraSpaghetti's Hugging Face profile, as her sampler settings are also well regarded. At this point in time, I don't recommend Magnum V4 or Starcannon anymore, try Mag-Mell 12B. It uses ChatML. UnslopNemo is cool too.
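If you're not on SillyTavern, you can send roughly the same settings straight to the backend. Treat this as a sketch only: I'm assuming a KoboldCpp-style API on the default port 5001, and the DRY field names from recent KoboldCpp builds, so check your backend's API docs if it complains.

```python
import requests

# Hypothetical sketch: the sampler settings above sent to a KoboldCpp-style
# /api/v1/generate endpoint. Port and field names are assumptions.
payload = {
    "prompt": "### Instruction:\nContinue the scene.\n### Response:\n",
    "max_length": 300,
    "temperature": 1.0,     # temp left at 1
    "min_p": 0.02,          # min P 0.02
    "top_p": 1.0,           # other samplers neutralized
    "rep_pen": 1.0,         # DRY stands in for classic repetition penalty here
    "dry_multiplier": 0.8,  # DRY 0.8
    "dry_base": 1.75,       # common default, tune to taste
    "dry_allowed_length": 2,
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(r.json()["results"][0]["text"])
```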
Alternative_Welder95@reddit
I really appreciate it, I'm definitely sticking with Mag Mell these days, it's giving me good results in general and much fresher and cleaner than other previous models, thank you for taking the time to answer me!
ArsNeph@reddit
NP :)
firefox56@reddit
How are you doing these calculations of what can fit in 12 gigs of VRAM at the different quants?
Or did you just do trial and error? I've been doing trial and error, but it takes forever because of the file sizes and download times. I'm hoping to figure out some formula so I can just download the ones I know are likely to fit.
ArsNeph@reddit
I've got you :) If you just need a quick estimate of how large your model is, a simple rule of thumb is at 8-bit, 1 billion parameters equals 1 gigabyte. So on average, Q8 14B is 14GB, and Q8 34B is 34GB. At 4-bit, the file size becomes just about half of that, so a Q4 14B is about 7GB, and Q4 34B is about 17GB. Note that Q4KM quants are slightly higher bit, around 4.65, and therefore slightly larger. Then, take this number, and add 2-3 GB for context overhead. That is a rough ballpark of whether it will fit completely in VRAM or not.
However, if you want a more precise number, I highly suggest using this VRAM calculator https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
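And if you'd rather script that rule of thumb than do it in your head, here's a quick sketch. The numbers are estimates only; real GGUF file sizes and context overhead vary by architecture and context length.

```python
# Rough back-of-the-envelope VRAM estimate from the rule of thumb above.
def estimate_vram_gb(params_b: float, bits_per_weight: float, context_overhead_gb: float = 2.5) -> float:
    """params_b: parameters in billions; bits_per_weight: e.g. 8 for Q8, ~4.65 for Q4_K_M."""
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB
    return weights_gb + context_overhead_gb

print(estimate_vram_gb(12, 5.5))   # Nemo 12B at ~Q5_K_M -> roughly 10-11 GB, fits on a 3060
print(estimate_vram_gb(22, 4.65))  # Mistral Small 22B at Q4_K_M -> ~15 GB, needs partial offload
```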
firefox56@reddit
Thank you for this! As it turns out, I only have an 8GB 3060Ti, so my options are super limited.
But I do have a pretty decent AMD laptop with a Ryzen 7 PRO + 32GB of RAM, so now I'm trying to figure out how to find the best model for an 8-core CPU and just have that be a dedicated AI device.
As for the OP, I've been doing OK with 4_0_4_8 quants on my phone and running a Mistral 7B on my Samsung S23 Ultra, but I'll prob have to find something slightly smaller because it gets really slow right when things are getting good :)
ArsNeph@reddit
NP! For CPU inference, speed is limited by memory bandwidth, so generally you'd only really get good speeds on a 8B, acceptable speeds on a 12b, and slow speeds on a 22b. I wouldn't try going higher than that for real time usage. Personally, I'd just say run like a Q5KM of a 12B while offloading as much as you can to GPU and call it a day. As for your phone, I can't really say much. I wouldn't advise going below 7B, but you could try Llama 3.2 3B fine-tune at Q8 if you really need more speed
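To put a rough number on that bandwidth limit, here's a ballpark sketch. The bandwidth figure (dual-channel DDR4-3200, ~50 GB/s theoretical) and the file sizes are assumptions, and real throughput lands below this ceiling.

```python
# Very rough ceiling for CPU generation speed: each token has to stream the
# whole quantized model through RAM, so tok/s <= bandwidth / model size.
def max_tokens_per_s(model_size_gb: float, ram_bandwidth_gbs: float = 50.0) -> float:
    return ram_bandwidth_gbs / model_size_gb

for name, size_gb in [("8B Q5_K_M", 5.7), ("12B Q5_K_M", 8.5), ("22B Q4_K_M", 13.3)]:
    print(f"{name}: ~{max_tokens_per_s(size_gb):.1f} tok/s upper bound")
```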
Quiet_Joker@reddit
Here is what I can fit with llama.cpp, 32GB of RAM and 12GB of VRAM:
I can run 1-14B models at 8 bits (30-33 layers sent to the GPU, the rest in RAM)
22B at 8 bits (22 layers sent to GPU)
27B at 5 bits (16 layers sent to GPU)
32B at 5 bits (~11 layers to GPU)
Speed suffers, but this is what I can run using the highest-quality quant my system can take without caring about performance.
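For reference, here's a minimal llama-cpp-python sketch of the same kind of split. The model path and layer count are just placeholders along the lines of the numbers above; tune n_gpu_layers until you sit just under your VRAM limit.

```python
# Hypothetical partial-offload setup with llama-cpp-python (same idea as
# passing -ngl to llama.cpp directly).
from llama_cpp import Llama

llm = Llama(
    model_path="models/Mistral-Small-22B-Q8_0.gguf",  # example path
    n_gpu_layers=22,   # ~22 layers on a 12GB card for a 22B Q8, rest in RAM
    n_ctx=8192,
)

out = llm("Write a short scene opener.", max_tokens=200)
print(out["choices"][0]["text"])
```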
firefox56@reddit
Yeah, I really don't want to be under 10 tokens/s at the start of the convo; I've seen it get down to 3-4 by the end of the context.
I only have an 8GB 3060 Ti, it turns out, so in my trial-and-error testing it's pretty tough to fit anything bigger than a 10.7B SOLAR model.
gnat_outta_hell@reddit
Newbie here: what is DRY?
ArsNeph@reddit
DRY, short for "Don't Repeat Yourself", is a sampler that's been implemented in a lot of frontends and backends. It prevents the model from repeating itself as more information enters the context. It can be considered a replacement for repetition penalty, though it can also work alongside it. I can pretty confidently say that it does a reasonable job at reducing repetition.
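If you're curious how it works under the hood, the rough idea (as I understand the implementation, so treat this as an illustration, not the backend's actual code) is an exponential penalty on tokens that would extend a repetition of earlier context:

```python
# Rough illustration of the DRY idea: a token that would extend an n-token
# repetition of earlier context gets a penalty that grows exponentially in n
# once n exceeds the allowed length.
def dry_penalty(repeat_len: int, multiplier: float = 0.8, base: float = 1.75, allowed_length: int = 2) -> float:
    if repeat_len < allowed_length:
        return 0.0  # short overlaps are free, so common phrases survive
    return multiplier * base ** (repeat_len - allowed_length)

for n in range(1, 8):
    print(n, round(dry_penalty(n), 2))  # penalty (subtracted from the logit) ramps up fast
```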
Fluffy-Feedback-9751@reddit
Does it get rid of the times when the model will repeat a previous reply almost exactly?
ArsNeph@reddit
Generally speaking, it should. It's designed to break loops and penalize overly probable tokens. It doesn't always work though; you may have to raise the multiplier a little depending on how bad it is.
Jim__my@reddit
What frontend do you use that has DRY?
Expensive-Paint-9490@reddit
Silly Tavern with llama.cpp has DRY. Kobold.cpp too, both standalone and with Silly Tavern. They have XTC (exclude top choices) too which is very good for creative tasks.
TheRealGentlefox@reddit
Also, the creator of both recommends using them together. I highly recommend it too; I've had amazing results. It's the kind of thing that makes a 13B model feel more creative than a 70B model.
yuppieliam@reddit
SillyTavern has DRY
Caffdy@reddit
how do I activate it/use it?
superfluid@reddit
Bear in mind that ST is a frontend, and your backend (for example oobabooga's text-generation-webui) must support it too.
Caffdy@reddit
so, does Oobabooga support it? that's the one I actually use
yuppieliam@reddit
On the sampler settings, click "sampler select" then tick DRY related samplers.
Spirited_Example_341@reddit
for me on Backyard AI, that model (Nemo 12B, I mean) seems too large to run smoothly on a GTX 1080 Ti though.
for now I find Llama 3 8B works fine; it's a good balance between usefulness and size, doesn't really seem that "dumb" to me, and produces pretty nice content.
ArsNeph@reddit
Try using a smaller quant, 1080Ti has 11GB VRAM, not 12
bearbarebere@reddit
Hijacking this to add that my list still holds up: https://www.reddit.com/r/LocalLLaMA/s/wWqoKbZGw3
lacerating_aura@reddit
Recently tried Ministral 8B. It was good; I didn't expect that from an 8B. I got the Q8 quant because it allows for about 90% offload in kcpp. Haven't played with lower quants.
TheRealGentlefox@reddit
I can't find EXL2 quants for it. Isn't that much superior when you know it's going to fit in VRAM anyway?
gfy_expert@reddit (OP)
can you post a download link pls?
lacerating_aura@reddit
Choose quant for yourself.
https://huggingface.co/bartowski/Ministral-8B-Instruct-2410-GGUF
BokuNoToga@reddit
Thank you kind stranger
gfy_expert@reddit (OP)
thanks! after consulting chat gpt trying my luck with Ministral-8B-Instruct-2410-Q6_K_L.gguf
mayo551@reddit
Cydonia.
With llama.cpp, you can fit (almost) every layer in 11GB VRAM for the Q2 GGUF with ~20k context if you quantize the context.
chloralhydrat@reddit
... I've used his Rocinante (12B) model so far, with a Q4 quant. I've never run a model with Q2 quantization. I wonder how such coarse quantization with more parameters compares to a more finely-quantized, less-parameterized model? Does anybody have experience with this (i.e. 12B Q4 vs. 22B Q2)?
mayo551@reddit
You can do a Q3 cydonia on a 2080ti with 51 layers out of 57 on the GPU at 10k q8 context. You could shed a layer or two and add more context or you could use q4 context @ 24k-30k probably.
With 51 layers out of 57 on the GPU it processes 7k context in about 1 minute and finishes the response in 1.5.
chloralhydrat@reddit
I'm not asking about the speed, but rather about the quality of the output.
petrus4@reddit
a} Your current video card is the same as my current one, but I would also recommend 64 GB of system RAM.
b} Once you have done that, get the Q8 of this. Load as much of it into video ram as you can, and the rest into system ram. You can do that in Kobold by specifying the number of layers. Start small until you figure out how much VRAM each layer costs you.
mayo551@reddit
Allow me to clarify even further!
You said you've never run a model with a Q2 quant. I was letting you know it's still possible to run it at Q3 with 11GB VRAM.
You would likely be able to run Q4 with 12GB VRAM with decent performance.
If you are wondering about quality of output, download it and find out?
mayo551@reddit
/u/chloralhydrat you can run the 22B Q4 with 8k context with 45 layers on 11GB VRAM and 13 layers on RAM.
Performance is not bad. You would likely be able to offload another layer (or two) onto RAM, bringing up your context window to maybe 16k?
Note that I am using cache-type-k of q8_0 and cache-type-v of q5_0 for context. You could lower those down to q4 but lose accuracy, however you would be able to fit like triple the context onto the current setup without shedding layers.
Have fun!
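If it helps, here's roughly what that setup looks like as a llama.cpp server launch. The model path is an example and exact flag spellings can shift between builds, so double-check --help on yours.

```python
# Sketch of launching llama-server with partial offload and a quantized KV
# cache (the settings described above). Path and exact values are examples.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "models/Cydonia-22B-Q4_K_M.gguf",  # example path
    "-ngl", "45",                 # 45 layers on the GPU, rest in system RAM
    "-c", "8192",                 # 8k context
    "--cache-type-k", "q8_0",     # quantized K cache
    "--cache-type-v", "q5_0",     # quantized V cache
    "--flash-attn",               # quantized V cache needs flash attention in llama.cpp
])
```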
mayo551@reddit
Try it and tell us!
gfy_expert@reddit (OP)
any download links,please?
mayo551@reddit
Google the following:
“The drummer huggingface”
This will bring up a list of all of Drummer's models (almost entirely ERP models). Find Cydonia (it's one of the recent ones) and grab the GGUF quant you want.
gfy_expert@reddit (OP)
problem is it's 22B and I'm afraid it won't fit my modest 12GB VRAM
supereatball@reddit
You can offload it into ram though...
gfy_expert@reddit (OP)
AI:My policy is to avoid generating or discussing NSFW
mayo551@reddit
It fits like 52 layers on my 11GB 2080ti and the rest goes on system RAM.
Because most of the layers are on the GPU it’s still fast and responsive.
Since you have 12GB VRAM instead of 11, you can likely fit the entire Q2 on your GPU with like 10k context.
EpicFuturist@reddit
How about 10 GB? I have a 3080 and have been looking to do something extremely similar but everyone either has 12 GB or higher
gfy_expert@reddit (OP)
Open your own topic, buddy. Mention your full PC specs. Some of the models posted here, in smaller sizes, will work for you.
superfluid@reddit
That's not exactly fair... I mean, someone could have told you to do a search for one of the near-weekly threads on this topic.
bearbarebere@reddit
A lot of the ones on my post will work for you! https://www.reddit.com/r/LocalLLaMA/comments/1fmqdct/favorite_small_nsfw_rp_models_under_20b/
wakigatameth@reddit
NemoMix Unleashed 12b is the absolute champion. Quant 8, offload 32 layers to GPU, you get 6 tokens/s which is workable.
Nice_Squirrel342@reddit
I have tried this model (Q6K) with different settings/templates and with different system prompts, but it always writes my actions and dialogues for me, even though I have stated very clearly and explicitly that this is absolutely unacceptable. This is almost never the case with other models I use.
wakigatameth@reddit
I use quant 8, and this model did hundreds of RPs with me without writing my actions for me. Of course, I also have these rules:
MANDATORY: LIMIT RESPONSES TO 1 PARAGRAPH. BE SUCCINCT. DO NOT RAMBLE.
Never describe smells.
Never attribute actions or words to me that I haven't inputted myself.
Never describe my actions.
Never continue the narrative on my behalf - you only control the woman.
Never assume my future actions.
Never describe anyone's feelings.
Never judge.
Never summarize what's already been said.
Never end the story.
Never add narrator's philosophical musings.
Do not describe women's thoughts - they should express them via speech.
Nice_Squirrel342@reddit
Hmm, mine is the following prompt:
You are a brilliant and versatile writer. Your task is to write a role-play based on the information below.
Make use of onomatopoeia for sounds in narration. Vocalize moans, murmurs, and laughter in dialogue.
(Here is going the part where I tell AI not to speak on my behalf.)
Create actions and dialog for {{char}} and other characters as needed, adding and removing them where suitable and fulfilling their roles. REMEMBER to NEVER write actions, dialog, or thoughts or emotions for {{user}}. Never produce anything from {{user}} directly. Only respond indirectly to {{user}}'s actions from the point of view of others.
Ensure you uphold spatial awareness and abide by logical consistency along with cause and effect. Strive for natural feel, employing the “show, don't tell” principle.
Move the ongoing scene slowly, always with a hook for {{user}} to continue.
Indicate thoughts and internal dialogues using backticks (`). Unless explicitly stated otherwise, characters cannot read each other's thoughts. Thoughts are used solely to reveal characters' feelings and emotions, so do not respond to them directly.
I just take the good lines from different prompts and combine them to make the most of it.
wakigatameth@reddit
I think your instructions are generally too complex for a 12b model to parse. I am surprised you can deal with Magnum using those instructions. Mine work for Nemo but Magnum just won't shut up.
Nice_Squirrel342@reddit
Well, the thought part is my own addition and it works perfectly. However, some models were answering them as if they were direct speech, so I had to add this clarification as to why thoughts shouldn't be answered.
I don't know if it's too complex, I've tried to keep it to the most necessary parts that I think are important for my roleplay.
Not really sure what you mean by it not shutting up. Does it write too long?
I usually just trim everything if the model doesn't let me say a word. I've never had that with Magnum. Although I do switch between models a lot. What I especially like about Magnum is that it sometimes inserts a ‘Status’ window. It looks cool, like:
Stress level 50%,
Yandere level 3❤️❤️❤️
Actions: arms crossed, nails sinking into palms.
But in general the model is not without flaws, of course. It has too much positivity bias for my taste, and it's also quite lewd (not that I'm against NSFW, but I prefer slow-burn).
wakigatameth@reddit
You may find that plain Mistral Instruct is best at following instructions in general. It's just too laconic in its output for my taste.
Kat-@reddit
I grabbed `NemoMix-Unleashed-12B.Q5_K_M.gguf` and it was a huge upgrade from `c4ai-command-r-08-2024-Q4_K_M.gguf`
When `Mistral-Small-22B-ArliAI-RPMax-v1.1-GGUF` was released, I thought, "Nice. Let's try." It's weird, but, NemoMix is "playful" in a way I can't get ArliAI to be.
wakigatameth@reddit
NemoMix is also more controllable, because you can tell it to keep responses short and it can, at least for a while, comply. Other "creative" LLMs often struggle with this, including ArliAI-RPMax.
gfy_expert@reddit (OP)
bartowski or MarinaraSpaghetti? (via Google)
JohannWolfgangGoatse@reddit
Bartowski is always a safe bet.
gfy_expert@reddit (OP)
thanks! trying luck with NemoMix-Unleashed-12B-Q6_K_L.gguf
ChengliChengbao@reddit
This one is a tad outdated, but Fimbulvetr V2 11B is still solid. I also like Lyra V4 12B, which is based on a newer model.
gfy_expert@reddit (OP)
can you please post download links?
ChengliChengbao@reddit
Here are the GGUF quants:
Fimbulvetr: https://huggingface.co/Lewdiculous/Fimbulvetr-11B-v2-GGUF-IQ-Imatrix
Lyra: https://huggingface.co/Lewdiculous/MN-12B-Lyra-v4-GGUF-IQ-Imatrix
Also, if you want something smaller, Stheno is pretty nice too.
Stheno: https://huggingface.co/Lewdiculous/Llama-3.1-8B-Stheno-v3.4-GGUF-IQ-Imatrix
gfy_expert@reddit (OP)
Fimbulvetr is censored, lyra getting errors
superfluid@reddit
Based on my experience, I find that quite... surprising.
ChengliChengbao@reddit
Fimbulvetr has never refused my prompts before, like ever. Must be your configs, try changing your system prompt.
gfy_expert@reddit (OP)
thanks! trying my luck with MN-12B-Lyra-v4-Q6_K-imat.gguf and Ministral-8B-Instruct-2410-Q6_K_L.gguf
OwnSeason78@reddit
Rocinante 1.1 12B, absolutely.
Ketsuyaboy@reddit
Nemomix Unleashed 12B at Q6_K
PavelPivovarov@reddit
My workhorses for now are L3-8B-Niitama-v1 and Rocinante-12B-v1.1. Both are playful, don't rush and have good situational awareness. Not the latest models out there but I enjoy working with them.
bearbarebere@reddit
!remindme 1 hour to check this out
RemindMeBot@reddit
I will be messaging you in 1 hour on 2024-11-07 08:53:39 UTC to remind you of this link
i_need_good_name@reddit
'asking for a friend' kinda post
PangurBanTheCat@reddit
Cydonia-22B-v1.2-IQ4_XS.gguf
gemma-2-27b-it-SimPO-37K.i1-IQ3_XXS.gguf
EVA-Qwen2.5-32B-v0.0-Q2_K.gguf
magnum-v4-22b-IQ4_XS.gguf
Mistral-Small-22B-ArliAI-RPMax-v1.1.i1-IQ4_XS.gguf
input_a_new_name@reddit
lyra-gutenberg-mistral-nemo, mistral-nemo-gutenberg-doppel, violet_twilight-0.2
These three are imo the best all-around 12B finetunes for RP, and I've tried basically everything in the 12B range at this point.
Mission_Bear7823@reddit
I've heard good things about Mistral Small / 12B finetunes, haven't tried them myself though.
Linkpharm2@reddit
GGUF is slower and should be used when you need to offload to RAM. EXL2 is faster, and you're trying to keep it all in VRAM, so use that. TabbyAPI is good. I recommend Cydonia at 3bpw.
BlipOnNobodysRadar@reddit
If I can piggyback off this post, I've been doing some side projects that utilize local LLMs through API w/ LMStudio. I think this is obviously suboptimal both because of the .gguf inefficiency and lack of ability to do batching.
Could you point me at better backends I could use? Does exl2 run on llama.cpp?
CheatCodesOfLife@reddit
I'm not sure there's anything inherently wrong with the format.
If I fully offload a model onto a single 3090, it's very fast these days.
Yeah, and the lack of tensor parallelism makes it painfully slow when using more than 1 GPU.
You'd definitely want to use EXL2 or AWQ for multi-GPU.
I recommend TabbyAPI, an OpenAI API drop-in replacement:
https://github.com/theroyallab/tabbyAPI
There's also this self-contained chat web ui:
https://github.com/turboderp/exui
And of course the swiss army knife of LLM inference:
https://github.com/oobabooga/text-generation-webui
BlipOnNobodysRadar@reddit
I need the local LLM to go through thousands of queries, so "fast enough" means something different at scale. If I were using it just 1:1 for chats, GGUF would be plenty fast.
Thanks for the links, you're the second person to recommend tabbyAPI so I'll try that first
CheatCodesOfLife@reddit
Right. I was responding in the context of this guy's post you piggybacked where he's doing role playing.
Exllamav2, while very fast and flexible (e.g. parallel processing across 3 or 5 GPUs, not just multiples of 2), is still designed with single conversations and queuing in mind.
If you're doing massive batch tasks like synthetic data generation, etc then you'll really want to look at vllm and AWQ quants.
Anthonyg5005@reddit
Use TabbyAPI. It has batching and runs as an OpenAI- and KoboldAI-compatible API server.
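Quick sketch of what that looks like from Python once it's running, since it exposes an OpenAI-compatible endpoint. The port, API key, and model name here are placeholders; use whatever your tabbyAPI config specifies.

```python
# Minimal sketch: hitting a local TabbyAPI instance through the OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="your-tabby-api-key")

resp = client.chat.completions.create(
    model="whatever-exl2-model-you-loaded",  # placeholder model name
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```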
Barafu@reddit
That's not as much of a difference now, since Kobold started using flash attention and quantized context
LoafyLemon@reddit
Prompt processing (PP) is still slower in Kobold, no?
gfy_expert@reddit (OP)
any link, pretty please, to Cydonia 3bpw? can I fit it in only 12GB VRAM?
shirotokov@reddit
taking notes for my dog