Can I improve performance for qwen 3.6 27b?

Posted by wgaca2@reddit | LocalLLaMA | 40 comments

Hardware
OS: Windows 11 Pro 10.0.26200, Build 26200
CPU: Intel Core Ultra 7 270K Plus, 24 cores / 24 threads, max clock 3.7 GHz
RAM: 32 GB DDR5 @ 5600 MHz, 2x16 GB Crucial CP16G56C46U5.C8D
GPU: 2x NVIDIA GeForce RTX 3090, 24 GB VRAM each, compute capability 8.6
NVIDIA driver: 596.21
Windows GPU driver: 32.0.15.9621

Model
Name: qwen36-q6-tools-192k-nothink:latest
Ollama model ID: 42e91752a44b
Architecture: qwen35
Parameters: 26.9B
Quantization: Q6_K

Ollama Runtime / Model Parameters
GPU offload: 65/65 layers, 100% GPU
Configured context: 196,608 tokens
num_ctx: 196,608
num_batch: 1,024
num_predict: 8,192
temperature: 0.45
top_k: 20
top_p: 0.8
repeat_penalty: 1
stop tokens: <|im_start|>, <|im_end|>
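For reference, the per-request options above map onto the body that Ollama's /api/generate endpoint accepts. A minimal sketch of that payload (the prompt is a placeholder; option names follow the post):

```python
import json

# Request body for Ollama's /api/generate, built from the parameters above.
options = {
    "num_ctx": 196608,
    "num_batch": 1024,
    "num_predict": 8192,
    "temperature": 0.45,
    "top_k": 20,
    "top_p": 0.8,
    "repeat_penalty": 1,
    "stop": ["<|im_start|>", "<|im_end|>"],
}
payload = {
    "model": "qwen36-q6-tools-192k-nothink:latest",
    "prompt": "...",  # placeholder
    "options": options,
}
print(json.dumps(payload, indent=2))
```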

Runner Settings Observed In Ollama Logs
FlashAttention: enabled
KV size: 196,608
Parallel: 1
NumThreads: 8
UseMmap: false
MultiUserCache: false
LoRA: none
GPU layers: 65

Observed Load With num_batch 1024
Total model memory reported by Ollama: ~38.6 GiB
All 65/65 layers offloaded to GPU

Layer / Memory Split From Load Log
CUDA0: 35 layers, weights 9.4 GiB, KV cache 7.6 GiB, compute graph 843.8 MiB
CUDA1: 30 layers, weights 10.2 GiB, KV cache 8.1 GiB, compute graph 1.5 GiB
CPU: weights 994.6 MiB, compute graph 20.0 MiB
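A back-of-the-envelope calculation from the load-log figures above (7.6 GiB + 8.1 GiB of KV cache for the 196,608-token window) shows what the context size alone costs, and what a smaller num_ctx would save:

```python
# Per-token KV-cache cost derived from the numbers in the load log.
GIB = 1024 ** 3
kv_bytes = (7.6 + 8.1) * GIB  # KV cache across both GPUs
num_ctx = 196608

kv_per_token = kv_bytes / num_ctx          # ~84 KiB of KV per token of context
kv_at_32k = 32768 * kv_per_token / GIB     # ~2.6 GiB if num_ctx were 32,768
print(f"{kv_per_token / 1024:.1f} KiB/token, {kv_at_32k:.1f} GiB at 32k ctx")
```

In other words, roughly 15.7 GiB of the ~38.6 GiB total is pure KV cache; shrinking the window (or quantizing the KV cache, if your Ollama build supports it) frees most of that.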

Currently getting 2,000-5,000 tokens/s on prompt evaluation and 15-20 tokens/s on generation. Is that the limit for this context size?
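To put those rates in wall-clock terms, here is a rough sketch assuming a fully filled 196,608-token window and the configured 8,192-token num_predict (the observed rates are taken from the post; real prompts are usually much shorter than the window):

```python
# Wall-clock implications of the observed throughput.
num_ctx, num_predict = 196608, 8192

prefill_s = (num_ctx / 5000, num_ctx / 2000)     # ingest a full 196k prompt
decode_s = (num_predict / 20, num_predict / 15)  # generate 8,192 tokens
print(f"prefill {prefill_s[0]:.0f}-{prefill_s[1]:.0f}s, "
      f"decode {decode_s[0]:.0f}-{decode_s[1]:.0f}s")
```

That is on the order of 39-98 s just to ingest a full context, before any tokens come back, which is why prompt-evaluation speed dominates at long contexts.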