Running on cpu :(
Posted by Frizzy-MacDrizzle@reddit | LocalLLaMA | 4 comments
I am in the midst of a POC project at work, and all I have is 4 AMD Epyc cores, and those are essentially virtualized. Does anyone have any tricks? Additionally, the KV cache performs badly in system memory, and I have to clear it by adding all the no-cache options, sps 1, etc. I have 32 GB of memory and the model loads fine: Mistral 7B Q4_K_M.
To add, this is part of a RAG system, and the retrieved context gets piped into the system prompt. I was on Ollama but have since moved to llama-server.
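For what it's worth, the "context piped into the system prompt" step can be sketched roughly like this. This is a minimal sketch, not the OP's actual pipeline: the chunk contents, the template, and the Mistral-style `[INST]` wrapping are all assumptions to adjust to your own setup.

```python
# Sketch of assembling retrieved RAG context into the system prompt
# before sending the final string to llama-server. Chunks and template
# are placeholders, not the OP's real pipeline.

def build_prompt(retrieved_chunks, question):
    """Join retrieved context into a single system prompt block."""
    context = "\n\n".join(retrieved_chunks)
    system = (
        "You are a precise assistant. Answer ONLY from the context below.\n"
        "### Context\n" + context
    )
    # Mistral-style instruction wrapping; swap in whatever chat
    # template your model actually uses.
    return f"<s>[INST] {system}\n\n{question} [/INST]"

prompt = build_prompt(
    ["Acme Corp, balance 1204.50, phone 555-0101."],
    "List each record as JSON.",
)
```

The resulting string would go in the `prompt` field of a POST to llama-server's `/completion` endpoint.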
Please suggest things, and I will say whether I have tried them or will try them. The output is close, but not quality. Example: when asked to emit 8 JSON records with 4 fields (name, company, balance, phone), the balance is always off, and there is no pattern to which balance goes missing.
I can't really list exactly what I have tried, and I'm not asking for full solutions, since it is probably working about as well as it can. Just tips and tricks, please.
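One tip for the off-by-one balances: instead of hoping the model formats JSON correctly, you can constrain the output. Recent llama.cpp builds let llama-server restrict generation to a JSON Schema (or a GBNF grammar) via a field in the `/completion` request body; the exact field name has varied across versions, so check your build's docs. A sketch of the schema and payload, assuming the `json_schema` field:

```python
import json

# JSON Schema for the 8-record output with the 4 fields the OP named.
record_schema = {
    "type": "array",
    "minItems": 8,
    "maxItems": 8,
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "company": {"type": "string"},
            "balance": {"type": "number"},
            "phone": {"type": "string"},
        },
        "required": ["name", "company", "balance", "phone"],
    },
}

# Request body for llama-server's /completion endpoint; the
# "json_schema" field is an assumption based on recent builds.
payload = {
    "prompt": "Extract the 8 records below as a JSON array.\n...",
    "n_predict": 512,
    "temperature": 0.0,  # deterministic extraction
    "json_schema": record_schema,
}
body = json.dumps(payload)  # POST this to http://localhost:8080/completion
```

Constrained decoding guarantees well-formed JSON with the right keys; it won't fix a wrong balance value, but it removes missing fields as a failure mode.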
ML-Future@reddit
Hi, Mistral isn't a good model for creating JSON structures.
My recommendation is that you try models like Gemma4 2b or Qwen3.5 2b.
Try to use less aggressive quantization, such as Q8, to avoid confusing the model.
I've recommended 2b models for faster CPU performance, but there are also more capable 4b versions.
Some-Ice-4455@reddit
Agreed, Qwen is really good at that stuff in my experience.
Frizzy-MacDrizzle@reddit (OP)
Mistral is just generally slow at generation, I have found. My output is plain text, though; I have seen the JSON problem and I'm avoiding that format. Thanks. I'm going to try the smaller models again. I had context issues before, but I know far more now than I did a week ago, since moving to llama-server.
ML-Future@reddit
Also remember that with small models you have to work harder on the prompts and be very explicit.
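In that spirit, here is a toy example of the kind of explicit, example-laden extraction prompt that tends to help small models. The wording and field names are illustrative, not a tested recipe:

```python
# Small models follow instructions better when the prompt spells out the
# exact output format and shows one worked example. Template is illustrative.
EXTRACTION_PROMPT = """Extract every record from the text below.
Output ONLY a JSON array. Each record has exactly these keys:
"name", "company", "balance" (a number, no currency symbol), "phone".

Example record:
{"name": "Jane Doe", "company": "Acme", "balance": 1204.5, "phone": "555-0101"}

Text:
{source_text}"""

# str.replace is used instead of str.format so the literal JSON braces
# in the template don't need escaping.
prompt = EXTRACTION_PROMPT.replace(
    "{source_text}", "Bob Ray, Initech, $88.20, 555-0199"
)
```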