Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B
Posted by exact_constraint@reddit | LocalLLaMA | 8 comments
Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice speed increase for working on the same code base.
Here's a baseline llama-bench test:
$: llama-bench-vulkan -m 'Qwen3.6-27B-UD-Q4_K_XL.gguf'
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon AI PRO R9700 (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | Vulkan | 99 | pp512 | 1050.13 ± 0.54 |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | Vulkan | 99 | tg128 | 31.26 ± 0.01 |
build: 97895129e (8863)
My llama-server run flags:
llama-server-vulkan -m '/Qwen3.6-27B-UD-Q4_K_XL.gguf' --mmproj '/mmproj-BF16(3).gguf' -np 1 -ngl 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 0.00 --jinja --chat-template-kwargs '{"preserve_thinking": true}' -ub 2048 -fa 1 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 --host 0.0.0.0 --port 8180
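For anyone who hasn't looked at how this works: here's a toy sketch of the general n-gram lookup drafting idea (this is NOT llama.cpp's actual implementation - the function and variable names are made up, and I'm assuming --spec-ngram-size-n / --draft-min / --draft-max map roughly onto n / draft_min / draft_max below):

```python
# Toy sketch of n-gram lookup drafting (not llama.cpp's code).
# Idea: if the last n tokens of the context already appeared earlier, guess
# that whatever followed that earlier occurrence will follow again, and let
# the model verify the whole guess in one batched forward pass.

def propose_ngram_draft(context, n=24, draft_min=12, draft_max=48):
    """Return a list of draft tokens, or [] if there is no usable match."""
    if len(context) <= n:
        return []
    tail = context[-n:]
    # scan backwards so the most recent prior occurrence wins
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            follow = context[start + n:start + n + draft_max]
            return follow if len(follow) >= draft_min else []
    return []

# Repetitive coding/agent sessions make matches common; a wrong draft is fine,
# the model just rejects the bad tokens during verification.
ctx = list("def foo():\n    return 1\n\n# copy:\ndef foo():\n    return")
print(propose_ngram_draft(ctx, n=8, draft_min=1, draft_max=16))
```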
Stats Summary:
--- Prompt Processing (PPS) Statistics ---
Mean: 549.60 t/s
Median: 519.19 t/s
P95: 936.60 t/s
StdDev: 240.80 (Stability)
Range: 64.18 - 1015.91 t/s
--- Token Generation (Tok/s) Statistics ---
Mean: 28.80 t/s
Median: 28.20 t/s
P95: 45.34 t/s
StdDev: 6.78 (Stability)
Range: 16.49 - 53.63 t/s
Total Tokens Generated: 87840
$:~/Documents/llama_perf$ python3 parse_performance_stats_full.py
== Prompt Processing (PPS) Analysis ==
Effective Avg: 549.60 t/s (Token-Weighted)
Median (P50): 519.19 t/s
Tail (P99): 958.31 t/s
Stability(CV): 43.8% (JITTERY)
Skewness: 0.04 (Symmetric)
== Token Generation (Tok/s) Analysis ==
Effective Avg: 1697.20 t/s (Token-Weighted)
Median (P50): 28.20 t/s
Tail (P99): 51.39 t/s
Stability(CV): 23.5% (JITTERY)
Skewness: 1.40 (Burst Heavy)
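For anyone who wants to sanity-check these numbers from the raw rows below, here's a minimal sketch of the kind of math involved (this is not the actual parse script - the token-weighted mean, CV and skewness definitions here are my assumptions):

```python
# Minimal sketch: summary stats from (pps, tok_s, gen_tokens) rows.
# Not the actual parse_performance_stats*.py; token-weighted mean,
# CV = stdev/mean and Fisher skewness are assumptions about what it reports.
import statistics as st

rows = [
    (72.51, 25.76, 340),    # (prompt t/s, gen t/s, generated tokens)
    (330.16, 22.49, 709),
    (345.13, 20.84, 1820),
    # ... remaining rows from the raw data below ...
]

def percentile(xs, p):
    xs = sorted(xs)
    k = (len(xs) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

gen = [r[1] for r in rows]
tokens = [r[2] for r in rows]

weighted_avg = sum(g * t for g, t in zip(gen, tokens)) / sum(tokens)
mean, stdev = st.mean(gen), st.stdev(gen)
skew = sum(((g - mean) / stdev) ** 3 for g in gen) / len(gen)

print(f"median {st.median(gen):.2f} t/s | P95 {percentile(gen, 0.95):.2f} t/s | "
      f"CV {100 * stdev / mean:.1f}% | token-weighted {weighted_avg:.2f} t/s | "
      f"skew {skew:.2f}")
```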
Raw data:
$:~/Documents/llama_perf$ python3 parse_performance_stats.py
Task ID | PPS (Prompt) | Tok/s (Gen) | Gen Tokens
------------------------------------------------------------
7824 | 72.51 | 25.76 | 340
8053 | 330.16 | 22.49 | 709
8629 | 345.13 | 20.84 | 1820
10286 | 64.18 | 28.11 | 181
10372 | 309.37 | 19.31 | 123
10496 | 360.21 | 27.07 | 891
11071 | 345.78 | 34.59 | 1595
11810 | 349.13 | 21.83 | 389
12124 | 304.43 | 27.89 | 438
12364 | 320.76 | 24.20 | 408
12673 | 304.25 | 22.16 | 281
12899 | 281.09 | 19.12 | 286
13188 | 777.57 | 25.27 | 1428
14644 | 970.67 | 30.00 | 231
14863 | 834.32 | 32.17 | 98
14944 | 651.29 | 35.26 | 90
15012 | 690.06 | 28.15 | 98
15101 | 706.03 | 30.84 | 97
15177 | 678.13 | 39.51 | 100
15243 | 695.42 | 28.46 | 85
15330 | 347.35 | 27.75 | 83
15404 | 527.11 | 28.71 | 79
15485 | 495.88 | 28.83 | 73
15552 | 757.88 | 28.85 | 70
15610 | 754.61 | 27.08 | 106
15716 | 343.11 | 30.13 | 82
15784 | 597.03 | 28.51 | 77
15848 | 724.77 | 25.24 | 91
15932 | 612.62 | 40.13 | 87
15986 | 603.72 | 28.13 | 125
16105 | 545.72 | 27.96 | 105
16212 | 140.18 | 30.04 | 53
16256 | 518.56 | 27.60 | 1330
17587 | 705.96 | 27.46 | 336
1 | 891.36 | 27.73 | 1644
1621 | 689.95 | 30.96 | 750
2238 | 87.37 | 27.05 | 348
2593 | 86.72 | 27.15 | 2003
4593 | 86.10 | 27.07 | 161
4728 | 431.04 | 26.33 | 178
4900 | 86.53 | 28.26 | 112
4987 | 87.27 | 27.09 | 161
5129 | 346.48 | 28.73 | 104
5214 | 426.83 | 37.51 | 147
5295 | 369.10 | 27.33 | 74
5371 | 258.20 | 27.12 | 172
5545 | 82.23 | 28.34 | 83
5619 | 78.99 | 39.80 | 163
5711 | 342.33 | 25.94 | 103
5814 | 557.16 | 27.15 | 92
5908 | 82.57 | 24.07 | 112
6011 | 655.56 | 16.87 | 255
6250 | 538.12 | 16.73 | 259
6509 | 226.40 | 19.07 | 78
6572 | 380.42 | 17.08 | 84
6650 | 369.20 | 17.92 | 176
6805 | 542.54 | 19.01 | 133
6917 | 508.31 | 17.65 | 711
7567 | 592.44 | 21.26 | 113
0 | 825.63 | 26.19 | 258
265 | 570.25 | 26.75 | 170
410 | 400.81 | 24.33 | 97
501 | 495.63 | 25.28 | 153
649 | 602.06 | 22.47 | 315
871 | 317.47 | 16.50 | 746
1616 | 75.78 | 16.49 | 105
1717 | 458.49 | 16.79 | 111
1830 | 135.83 | 16.80 | 347
0 | 837.89 | 26.31 | 764
794 | 651.57 | 24.01 | 116
905 | 224.91 | 25.38 | 80
969 | 551.64 | 29.70 | 81
1029 | 547.99 | 24.96 | 89
1118 | 545.28 | 25.38 | 86
1187 | 596.21 | 25.20 | 81
1267 | 387.68 | 25.03 | 83
1342 | 526.17 | 25.98 | 616
1960 | 795.61 | 23.57 | 177
2169 | 518.94 | 24.00 | 75
2245 | 487.28 | 28.62 | 84
2307 | 519.44 | 26.36 | 218
2506 | 83.51 | 25.92 | 184
2674 | 317.34 | 25.31 | 101
2756 | 491.71 | 25.41 | 690
3424 | 540.33 | 33.60 | 184
3529 | 511.05 | 28.57 | 106
3601 | 523.09 | 27.26 | 471
4014 | 518.84 | 25.74 | 251
4238 | 82.16 | 23.83 | 163
4401 | 338.39 | 46.13 | 83
4437 | 324.35 | 23.52 | 126
4560 | 248.12 | 25.89 | 81
4634 | 443.34 | 24.78 | 182
4804 | 463.62 | 28.23 | 83
4872 | 438.71 | 31.26 | 635
5352 | 504.33 | 22.47 | 96
5439 | 277.02 | 25.48 | 179
5596 | 506.73 | 39.77 | 179
5687 | 493.95 | 23.50 | 69
5757 | 523.45 | 25.08 | 110
5869 | 105.32 | 23.02 | 67
5938 | 200.24 | 24.93 | 316
6256 | 555.49 | 45.34 | 175
6327 | 466.26 | 24.61 | 262
0 | 761.08 | 24.29 | 139
160 | 505.55 | 22.34 | 117
271 | 256.61 | 28.42 | 83
322 | 426.93 | 30.01 | 97
388 | 482.84 | 27.16 | 96
463 | 494.38 | 24.48 | 1150
1613 | 259.32 | 23.89 | 73
1683 | 167.49 | 23.52 | 80
1755 | 318.21 | 24.25 | 3084
4834 | 318.37 | 22.71 | 88
4909 | 451.91 | 24.01 | 160
5051 | 429.60 | 24.10 | 112
5144 | 426.04 | 24.11 | 1209
6326 | 563.82 | 23.99 | 207
6529 | 512.83 | 34.04 | 90
6585 | 498.78 | 28.49 | 92
6656 | 492.01 | 24.35 | 104
6738 | 484.51 | 29.75 | 92
6797 | 450.49 | 29.46 | 95
6859 | 437.55 | 23.36 | 650
7504 | 235.33 | 23.13 | 81
7568 | 405.40 | 27.63 | 126
7661 | 426.11 | 22.62 | 137
7798 | 351.68 | 28.88 | 100
7865 | 445.78 | 23.28 | 122
7981 | 398.07 | 22.79 | 155
8136 | 265.58 | 22.67 | 83
8201 | 375.09 | 23.50 | 446
8623 | 419.87 | 23.31 | 921
9516 | 424.62 | 23.22 | 98
9594 | 399.86 | 23.04 | 557
10133 | 410.36 | 30.93 | 85
10180 | 445.30 | 26.01 | 82
10240 | 384.94 | 25.42 | 147
10356 | 369.66 | 22.97 | 312
10670 | 1011.00 | 29.40 | 153
10819 | 735.71 | 30.75 | 65
10877 | 912.32 | 28.97 | 92
10969 | 829.14 | 28.24 | 132
11108 | 710.79 | 28.56 | 94
11195 | 694.49 | 29.13 | 129
11313 | 440.72 | 28.87 | 67
11373 | 736.58 | 43.25 | 100
11431 | 278.92 | 28.97 | 89
11513 | 564.79 | 30.91 | 97
11585 | 464.87 | 32.45 | 93
11659 | 605.83 | 28.62 | 63
11715 | 727.11 | 28.05 | 180
11879 | 643.30 | 30.79 | 126
11985 | 665.26 | 29.20 | 149
12111 | 492.23 | 27.98 | 72
12176 | 695.06 | 26.40 | 164
12340 | 558.65 | 26.57 | 2933
15263 | 447.12 | 21.40 | 271
15534 | 1015.91 | 30.65 | 87
15619 | 923.95 | 30.58 | 1613
17127 | 455.62 | 21.57 | 186
17307 | 939.74 | 31.02 | 70
17371 | 897.35 | 33.11 | 1213
18401 | 450.77 | 23.31 | 694
19047 | 939.26 | 30.94 | 71
19112 | 921.63 | 29.57 | 1399
20514 | 440.08 | 21.55 | 179
20680 | 941.92 | 30.28 | 86
20769 | 916.08 | 29.72 | 213
20985 | 630.99 | 28.39 | 90
21076 | 783.87 | 29.83 | 90
21153 | 869.66 | 31.89 | 141
21270 | 559.49 | 28.48 | 163
21434 | 781.38 | 29.42 | 115
21543 | 783.60 | 33.50 | 129
21647 | 542.43 | 29.70 | 88
21728 | 681.01 | 30.92 | 282
21984 | 583.15 | 27.92 | 108
22092 | 87.14 | 26.63 | 117
22207 | 552.15 | 28.99 | 90
22284 | 648.15 | 27.79 | 110
22394 | 758.16 | 29.34 | 103
22482 | 570.20 | 28.52 | 1171
23655 | 449.73 | 22.45 | 191
23840 | 913.13 | 30.05 | 102
23944 | 924.18 | 29.36 | 249
24198 | 797.90 | 30.26 | 76
24266 | 859.60 | 28.60 | 155
24419 | 613.57 | 29.71 | 87
24498 | 696.11 | 34.20 | 105
24578 | 654.08 | 29.09 | 107
24678 | 601.79 | 29.27 | 96
24759 | 667.10 | 28.99 | 116
24868 | 700.61 | 34.60 | 110
24952 | 722.68 | 27.95 | 2270
27224 | 434.52 | 22.17 | 373
27586 | 920.69 | 30.19 | 82
27670 | 923.33 | 29.41 | 135
27802 | 878.87 | 28.93 | 159
27967 | 697.86 | 29.29 | 101
28061 | 694.84 | 35.07 | 114
28150 | 724.74 | 36.25 | 84
28209 | 362.26 | 34.01 | 87
28277 | 726.33 | 33.11 | 119
28375 | 738.59 | 27.36 | 95
28470 | 571.26 | 25.75 | 94
28562 | 372.33 | 28.18 | 80
28631 | 598.19 | 29.04 | 97
28721 | 669.38 | 25.55 | 108
28821 | 396.21 | 31.45 | 86
28887 | 618.82 | 27.92 | 2077
30958 | 429.42 | 22.30 | 405
31356 | 916.46 | 30.26 | 75
31433 | 897.39 | 36.61 | 949
32154 | 417.12 | 34.14 | 398
32348 | 940.13 | 30.26 | 71
32421 | 921.72 | 46.64 | 1434
33187 | 422.44 | 49.40 | 397
33303 | 937.79 | 32.47 | 105
33395 | 924.34 | 29.25 | 1684
35077 | 418.33 | 48.17 | 421
35215 | 928.92 | 30.81 | 78
35287 | 906.27 | 29.21 | 2857
38060 | 422.58 | 48.37 | 402
38182 | 936.60 | 34.20 | 72
38240 | 916.12 | 44.28 | 3143
39949 | 421.28 | 44.29 | 415
40073 | 939.96 | 30.25 | 75
40150 | 905.92 | 40.91 | 1662
41202 | 412.22 | 47.27 | 403
41325 | 938.87 | 30.36 | 76
41403 | 916.59 | 38.85 | 1532
42476 | 399.14 | 48.52 | 402
42586 | 938.19 | 34.64 | 74
42645 | 915.96 | 32.35 | 1551
43997 | 407.69 | 53.03 | 383
44096 | 930.86 | 31.11 | 68
44157 | 919.13 | 29.52 | 853
45012 | 398.91 | 49.45 | 387
45118 | 935.23 | 30.34 | 83
45203 | 925.79 | 52.86 | 1615
45981 | 396.90 | 48.34 | 390
46092 | 936.96 | 30.29 | 88
46182 | 915.64 | 53.63 | 2544
exact_constraint@reddit (OP)
UPDATE: Collected more data. I restarted llama-server and ran Qwen3.6 27B without the ngram flags to get a 'baseline' real-world performance profile. Obviously a little scant on datapoints (~11k generated tokens vs ~80k for the ngram run). Performance was very consistent, and I had work to do lol. Ngram results use the same flags as in the OP.
I haven't done any testing on ngram-mod vs ngram-map-k yet - from some brief research, it *looks* like -mod is the preferred method, even for single-GPU inference. Not sure though. Either way, here are my janky runtime flags vs a baseline:
TLDR:
No speculative decoding baseline - real debugging session, ~125-150k context:
Ngram-mod decoding, ~125-150k context:
Finanzamt_Endgegner@reddit
In my testing with pi + qwen3 27b I used these parameters and had a more consistent speed up (;
Clear-Ad-9312@reddit
Which is funny, because each ngram-* type has a different performance uplift (rough sketch of the difference below):
- ngram-simple looks for a previous matching n-gram and inserts the following m-gram.
- ngram-map-k looks for a previous matching n-gram and inserts the following m-gram but uses an internal hash-map of n-grams in the current context window.
- ngram-mod uses a hash pool which is shared across all server slots. The hash pool is a map from n-gram hash to the next token (not the next m-gram as in ngram-map).
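Roughly the data-structure difference, if that helps (hypothetical Python, not llama.cpp's actual code - the key/value shapes are just my reading of the PR text):

```python
# Hypothetical illustration of the two lookup structures described above
# (not llama.cpp's code; shapes and sizes are assumptions based on the PR text).

# ngram-map-k: map from the n-gram itself to the m tokens that followed it,
# built from the current context window of one slot.
ngram_map = {}                        # {(t1, ..., tn): [m follow tokens]}
def map_k_update(tokens, n, m):
    for i in range(len(tokens) - n - m + 1):
        ngram_map[tuple(tokens[i:i + n])] = tokens[i + n:i + n + m]

# ngram-mod: one hash pool shared across all server slots, mapping an n-gram
# hash to a single next token. With -np 1 there's only one slot, so the
# "shared across slots" part isn't buying anything.
POOL_SIZE = 1 << 16
shared_pool = {}                      # {hash(n-gram) % POOL_SIZE: next token}
def mod_update(tokens, n):
    for i in range(len(tokens) - n):
        shared_pool[hash(tuple(tokens[i:i + n])) % POOL_SIZE] = tokens[i + n]
```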
OP used ngram-mod, which looks to take advantage of parallel server slots? But OP ran with -np 1, which makes it worse than ngram-map-k, which seems to be more geared toward a single parallel context?
exact_constraint@reddit (OP)
!!! Thanks for this. Going to do some more testing based on this info. Last night was real fast and loose, just quickly scanned the llama.cpp PRs covering ngram and grabbed some runtime flags to test.
Middle_Bullfrog_6173@reddit
Could you explain the speed increase? From the stats it looks like the same or slightly lower performance, unless I misunderstand the numbers.
exact_constraint@reddit (OP)
The main difference is context size - the llama-bench result is basically the absolute best case: empty context, small prompt size. During actual use in OpenCode, running over 100k context, speeds tend to run in the low 20 tok/sec range.
Qualitatively, so far it feels a little faster, too. The instances where tok/sec jumps over the baseline 30, into the 40-50 range, seem to be skewed to prompts where the agent is responding to a question about something it’s just done, or making small tweaks. Makes conversations a little snappier.
We’ll see how performance works out with a longer sample size.
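Back-of-envelope on those 40-50 tok/sec bursts (my own rough math, not anything llama.cpp reports): with self-drafting the draft is basically free, so throughput roughly scales with how many drafted tokens get accepted per decode step.

```python
# Rough back-of-envelope only. Assumes draft generation is ~free and verifying
# a short draft costs about one normal decode step, which ignores real overheads.
baseline_tps = 30.0   # roughly the non-speculative tg rate seen here
burst_tps = 48.0      # one of the faster tasks above

# burst ≈ baseline * (1 + accepted_per_step)  ->  solve for accepted_per_step
accepted_per_step = burst_tps / baseline_tps - 1
print(f"~{accepted_per_step:.1f} extra accepted draft tokens per verify step")
```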
CalligrapherFar7833@reddit
Show us a bench with 256k/128k context, with ngram and without?
Zc5Gwu@reddit
Same, I was kind of expecting a before and after. It's hard to tell what is being compared.