Qwen 3.5 27B - quantize KV cache or not?
Posted by Spicy_mch4ggis@reddit | LocalLLaMA | 40 comments
I’m getting mixed answers on the tradeoff between weight quantization and KV cache quantization with the Qwen 3.5 model family.
In some sources I read that this model's architecture is not really negatively affected by q8 quantization of the K or V cache.
I’m currently running Q6_K weights with a bf16 KV cache. It fits on my GPU with around an 80k context window. Apparently the documentation suggests not going below a 128k context window.
I’m trying to judge the tradeoff between going to q4 weights or a q8 KV cache, either of which would get me above a 128k context window.
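To put rough numbers on that tradeoff, here is a minimal back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are placeholders and NOT the real Qwen 3.5 27B configuration; swap in the values from the model's config or GGUF metadata. The bits-per-weight figures are the nominal llama.cpp block sizes (real GGUF files mix tensor types, so actual file sizes differ somewhat), and 27e9 is just the nominal parameter count.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, bits_per_elem):
    """Approximate bytes for the K and V caches over all attention layers."""
    elems = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return elems * bits_per_elem / 8

def weights_bytes(n_params, bits_per_weight):
    """Approximate bytes for the weights at a uniform bits-per-weight."""
    return n_params * bits_per_weight / 8

# HYPOTHETICAL shape values, for illustration only (not the real model config).
N_ATTN_LAYERS = 16
N_KV_HEADS = 4
HEAD_DIM = 256
N_PARAMS = 27e9  # nominal "27B"

# Nominal bits per element for llama.cpp block formats.
KV_BITS = {"bf16": 16.0, "q8_0": 8.5}          # q8_0: 34 bytes per 32 values
WEIGHT_BITS = {"Q6_K": 6.5625, "Q4_K": 4.5}    # per 256-value super-block

for name, bits in WEIGHT_BITS.items():
    print(f"weights {name}: {weights_bytes(N_PARAMS, bits) / GIB:.1f} GiB")

for n_ctx in (80_000, 131_072):
    for name, bits in KV_BITS.items():
        size = kv_cache_bytes(n_ctx, N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, bits)
        print(f"KV {name} @ {n_ctx} tokens: {size / GIB:.2f} GiB")
```

With the real shape values plugged in, comparing the two printed groups shows how much VRAM each option frees up. It says nothing about output quality, which is the other half of the tradeoff.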
Thanks!