MMLU: NeuralDaredevil 8B Abliterated vs Abliterated 70B Llama 3

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | 8 comments

I have been extremely impressed with NeuralDaredevil Llama 3 8B Abliterated. I'm running it at Q8, and apparently its MMLU is about 71.8. I also tried running the abliterated 3.5 70B Llama 3. It is good, but I can only run it at IQ2_XXS on my 3090. Looking at the GitHub page and how quants affect the 70B, the MMLU ends up being around 72 as well. My question is, has anyone put them through their paces and compared the two? I am not a programmer so I haven't tested code, but I notice the 8B model acts more "human-like". It's hard to explain. I can force the 8B model to a 16k+ context, but I can only get 4k context on the 70B model. Any thoughts? I would like a NeuralDaredevil 70B model based on Llama 3.

8 Comments

Enfiznar@reddit

I love that model, it's the only local one I currently use

durden111111@reddit

I run IQ2_M Euryale 70B on my 3090 with 6k context. It's usable for chatting. XXS is just too lobotomized.

matteogeniaccio@reddit

I tested NeuralDaredevil 8B Abliterated ("ND" from now on) vs the original Llama 70B at IQ2_XS.

- The original Llama 70B at IQ2_XS is immensely better than ND 8B.
- ND 8B is slightly better than the original Llama 8B in my benchmarks. It has some weak spots, so you might want to switch between the two depending on your use case.

llama.cpp should be able to run IQ2_XS with 24GB of VRAM. I tested this with two 3060 12GB cards in parallel. I enabled unified memory in llama.cpp by editing ggml-cuda.cu and replacing cudaMalloc with cudaMallocManaged, so it doesn't crash with an out-of-memory error.
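
For a rough sense of why a 70B model at IQ2_XS is borderline on 24GB, the weight size can be estimated from the parameter count and the nominal bits per weight. This is a back-of-the-envelope sketch, not a measurement: the ~70.6B parameter count and the bits-per-weight figures are assumptions (none of them come from the thread), and real GGUF files tend to come out somewhat larger because some tensors are kept at higher precision.

```python
# Rough weight-size estimate for a quantized model.
# Assumptions (not from the thread): ~70.6e9 parameters for Llama 3 70B,
# and nominal bits-per-weight values for each quant type.
NOMINAL_BPW = {
    "IQ2_XXS": 2.06,
    "IQ2_XS": 2.31,
    "IQ2_M": 2.70,
}


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bpw in NOMINAL_BPW.items():
        print(f"{name}: ~{weight_size_gb(70.6e9, bpw):.1f} GB")
```

Under these assumptions the IQ2_XS weights alone land at roughly 20 GB, which leaves only a few GB for the KV cache and anything else already sitting in VRAM.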

My_Unbiased_Opinion@reddit (OP)

Upvoted, and thank you. I think I wasn't able to use IQ2_XS because I had some other small stuff in VRAM (this is my daily gaming PC), so I'll look into it again with a fresh reboot and everything closed. Last question: I see imatrix and "i1" versions of GGUF models. Are these the same? I know what an imatrix is, but I have no idea what an i1 is.

matteogeniaccio@reddit

i1 is the imatrix version as opposed to static quants. My suggestion is to use bartowski's version that has everything: imatrix, iquants, support for latest llama.cpp... https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF

My_Unbiased_Opinion@reddit (OP)

Thanks again. What context are you running with IQ2_XS?

matteogeniaccio@reddit

I use the default 8k. Most inference tools support KV cache quantization if you run out of memory for your context. In llama.cpp you can enable it by appending "-ctk q8_0 -ctv q8_0" to your launch command. You can also keep the context in CPU RAM with "-nkvo", but it is too slow for me. Llama 70B also performs well with attention self-extend to increase context to around 20k, but it's slooooow!
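
To see how much memory the context takes and what quantizing the KV cache saves, the cache size can be estimated from the model shape. This is a minimal sketch, assuming the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128), 2 bytes per element for f16, and 34 bytes per block of 32 elements for q8_0. None of these numbers come from the thread, and llama.cpp's actual allocation includes extra buffers on top.

```python
# Rough KV-cache size estimate.
# Assumptions (not from the thread): Llama 3 70B has 80 layers, 8 KV heads
# and a head dimension of 128; f16 uses 2 bytes per element; q8_0 stores
# 32 elements in 34 bytes (32 int8 values plus one f16 scale).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Bytes needed for the K and V caches at a context of n_ctx tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V elements
    return per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    for cache_type in BYTES_PER_ELEMENT:
        size = kv_cache_bytes(80, 8, 128, 8192, cache_type)
        print(f"{cache_type}: {size / 2**30:.2f} GiB at 8k context")
```

Under these assumptions the 8k context costs about 2.5 GiB at f16 and a bit over half that at q8_0, which is why the "-ctk q8_0 -ctv q8_0" option helps when VRAM is tight.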

k4ch0w@reddit

You sir are a legend. Thank you for helping plebs like me.