Help needed: Ollama > qwen3.6 in OpenCode on 64Gb M4
Posted by Konamicoder@reddit | LocalLLaMA | 10 comments
Hi Ollama team!
I’d love your advice as to what I’m doing wrong. I’m running Ollama on an M4 MacBook Pro with 64GB RAM, and I’m trying to use OpenCode with qwen3.6-35b-a3b-q4_K_M as the selected model. I made a Modelfile version of the model with the following parameters:
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
PARAMETER temperature 0.6
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.0
PARAMETER repeat_last_n 64
I figure a context length of 32K should be fine for my system with 64GB RAM.
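For reference, this is roughly how that Modelfile gets assembled and registered (a sketch; the FROM line and the new tag name are assumptions, so point FROM at whatever base tag `ollama list` shows on your machine):

```shell
# Write the Modelfile to disk. FROM names the base model the
# PARAMETER lines are layered on top of (tag assumed here).
cat > Modelfile <<'EOF'
FROM qwen3.6-35b-a3b-q4_K_M
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
PARAMETER temperature 0.6
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.0
PARAMETER repeat_last_n 64
EOF
# Then register it under a new tag (hypothetical name):
#   ollama create qwen3.6-35b-a3b-32k -f Modelfile
```

`ollama show <tag> --modelfile` is a quick way to confirm the parameters actually took effect.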
But when I launch OpenCode with this command…
ollama launch opencode --model qwen3.6-35b-a3b-q4_K_M
…and issue a simple cd command to point OpenCode at my project folder, RAM instantly pegs at 100 percent and the system locks up: the mouse cursor starts stuttering across the screen. Activity Monitor shows two instances of Ollama chewing up 30GB and 15GB of my available RAM. I have to force-quit Ollama for the system to calm down.
Based on the details I have shared, can someone help me detect the root cause of the issue? Even better, suggest a fix?
Thanks in advance!
Kagemand@reddit
Context is way too low.
Konamicoder@reddit (OP)
I’m willing to tweak the context parameter. I was just under the impression that a larger context = more RAM usage, which is why I started at what I thought was a conservative 32k, planning to increase gradually while testing performance. But OpenCode hard-choked immediately at the 32k level. Another data point arguing against low context being the issue: the same 32k setting seems to work in Codex CLI but fails in OpenCode. So the data seems to point toward some negative interaction between Ollama, qwen3.6, and OpenCode.
wasnt_in_the_hot_tub@reddit
You can probably use 128k or higher on your system. Keep in mind OpenCode is already eating a third of your total 32k context.
Objective-Stranger99@reddit
Don't use Ollama. All the tuning you did means you're ready for llama.cpp. Ollama is a beginner-friendly wrapper; llama.cpp is both faster and more configurable.
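To make the suggestion concrete, here's a sketch of serving a model with llama.cpp's bundled server (the model path is a placeholder, and the port choice is an assumption; flag names are from the llama-server CLI):

```shell
# Install llama.cpp (Homebrew ships the llama-server binary on macOS).
# brew install llama.cpp

# Serve a local GGUF with a 32k context over an OpenAI-compatible API.
llama-server \
  -m ~/models/qwen3.6-35b-a3b-q4_K_M.gguf \  # placeholder path to your GGUF
  -c 32768 \      # context window, matching the num_ctx above
  -ngl 99 \       # offload all layers to the GPU (Metal on Apple Silicon)
  --port 8080     # API then lives at http://localhost:8080/v1
```

Anything that speaks the OpenAI API (OpenCode included) can then point at `http://localhost:8080/v1`.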
Konamicoder@reddit (OP)
Thanks for the confidence boost. How do I connect llama.cpp as the backend for OpenCode?
Objective-Stranger99@reddit
You create an opencode.jsonc file (in ~/.config/opencode for Linux) and use the documentation to set up your server configuration for opencode. Could you let me know what OS you are on? I can share my configuration if you want. It will take a bit of tinkering but it will be worth it. I went from being unable to run Qwen3.5 35B on Ollama to running it at a good 20 t/s on llama.cpp.
Konamicoder@reddit (OP)
I am on macOS (latest version). I imagine the directory path for OpenCode config files would be similar on macOS as on Linux, since they are both *nix-based systems.
Objective-Stranger99@reddit
Yes, you are right there.
r1str3tto@reddit
I can’t recommend oMLX highly enough. The context caching actually works. (!!) It’s kind of miraculous to process a 100k+ token prompt and then get instant follow-up responses on it.
taking_bullet@reddit
Get ready for reading comments from Ollama haters 🫡