Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models
Posted by Creative-Regular6799@reddit | LocalLLaMA | View on Reddit | 41 comments
I spent the past week testing a simple question:
Small local models often look weak inside coding agents. But how much of that is actually model weakness, and how much is scaffold mismatch?
So I held the model fixed and changed only the scaffold.
Same Qwen3.5-9B Q4 weights in both conditions.
Same Aider Polyglot benchmark.
Full 225 exercises.
Results:
- vanilla Aider: 19.11%
- little-coder: 45.56% mean pass@2 across two full runs
little-coder is not a new model. It is a scaffold I adapted to the behavioral profile of a ~10B local model: bounded reasoning budget, a Write guard that refuses to overwrite existing files, explicit workspace discovery, and small per-turn skill injections instead of one huge static preamble.
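To give a flavor of the Write guard piece, the shape is roughly this (a simplified sketch, not the actual little-coder code; the function name and return messages are made up for illustration):

```python
from pathlib import Path

def guarded_write(path: str, content: str, allow_overwrite: bool = False) -> str:
    """Refuse to overwrite existing files unless explicitly allowed.

    Returns a message the agent loop feeds back to the model, so the
    model learns the file already exists instead of silently clobbering it.
    """
    target = Path(path)
    if target.exists() and not allow_overwrite:
        return (f"REFUSED: {path} already exists. "
                f"Read it first, then use an edit tool instead of Write.")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"Wrote {len(content)} bytes to {path}"
```

The point is that small models overwrite files far more often than big ones, so the guard turns a silent destructive action into a recoverable tool error.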
This is not a conference paper. There are obvious things a proper paper would still want:
- more replications
- component ablations
- more model families
- maybe a second benchmark
But the effect size was large enough that I thought it was worth sharing now (I don’t have time to do the above unfortunately).
My takeaway is fairly narrow:
at this scale, coding-agent benchmark results are not just properties of model weights. They are also properties of scaffold–model fit.
I suspect sub-10B local models may have been written off too early in coding-agent evaluation.
Full write-up, code, and numbers here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
Would be very interested in replication attempts, failure cases, or reasons you think this would not generalize.
_-_David@reddit
"This is not a conference paper."
"But"
I love the fuck out of this post.
Far-Low-4705@reddit
Don't use a reasoning budget; if it ever hits the budget, its performance is far worse than if you had just used instruct mode.
I'd suggest just leaving reasoning untouched and unbounded.
look@reddit
Ha. And this excerpt from an analysis that just finished running against Qwen 3.6 Plus:
DefNattyBoii@reddit
Do you have more info on this? I was using this in my conf.ini:
reasoning-budget = 4096
reasoning-budget-message = "...\n Considering the limited time by the user, I have to give the solution based on the thinking directly now."
look@reddit
The example above was from generating training data for a sequence classifier. Going from full thinking to none had an 8% accuracy drop on my test set. Giving it a truncating (no stop message) 256 budget recovered that 8%.
I’ve since run a larger test set at 256, 512, 1024, and full, and found it got to 99.9% with just 512. I’m now running the full dataset at 512.
This was a fairly specialized use, without a stop message at all, but I find that helps with more general tasks. The most important thing I’ve found with the stop message (for small Qwen 3.5 models at least) is to add a newline at the end of your message.
The rest of the message itself doesn’t seem to matter all that much, but the newline had a significant impact. I use something like this:
(I’ll look up my exact message later. On my phone at the moment.)
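The truncate-then-conclude flow I'm describing can be sketched like this (a hypothetical helper, not llama.cpp's built-in budget feature; `generate` stands in for any completion API, and the stop message wording is illustrative):

```python
def budgeted_think(generate, prompt: str, budget: int) -> str:
    """Run reasoning with a hard token budget, then force the answer.

    `generate(text, max_tokens, stop)` is any completion function
    (e.g. a wrapper around a llama.cpp /completion call).
    """
    stop_message = "\nConsidering the limited time, I will answer directly now.\n"
    # Let the model think, but hard-cap the number of generated tokens.
    thinking = generate(prompt + "<think>\n", max_tokens=budget,
                       stop=["</think>"])
    # Truncated or not, close the thinking block with the stop message;
    # the trailing newline cleanly ends a possibly mid-sentence thought.
    prefix = prompt + "<think>\n" + thinking + stop_message + "</think>\n"
    return generate(prefix, max_tokens=1024, stop=None)
```

The trailing newline on the stop message is the detail that mattered most in my tests.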
Ell2509@reddit
This work is very useful. Thanks for sharing.
look@reddit
Hmm. Is there data to back that up? Mine is anecdotal, but I see improved performance on Qwen3.5 0.8b with reasoning but a small budget that it nearly always hits.
Far-Low-4705@reddit
yes, if you look at the PR in llama.cpp for the reasoning-budget feature, they did performance benchmarks and it absolutely tanked reasoning performance.
look@reddit
I’m familiar with the results mentioned in https://github.com/ggml-org/llama.cpp/issues/20632 but that is about graceful termination with budgets vs the truncated termination.
And I use truncated termination in an application on Qwen3.5, and it definitely benefits from a short, truncated reasoning over no reasoning at all. My case might be an exception, but I doubt it is that rare.
I did find that the message you inject at the end matters a great deal, though. I’d not be shocked if the other results you’ve seen were using an ineffective conclusion message.
Far-Low-4705@reddit
I still think it's a sign that this is a hacky solution: suboptimal at best, and it can result in unexpected behaviors.
At the very least, it’s going to completely mess with tool calling.
Best to just use it as it was natively trained imo
look@reddit
That’s fair. I just use truncation with LLM-as-classifier type applications, not any agent application that would be tool calling. The model is more like the tool I am using in that scenario, and I’m often just reading off the first-token logprobs directly, not even the actual output text.
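For reference, the logprob-reading part has roughly this shape (a sketch; the dict format stands in for whatever top-logprobs structure your server returns, and the label tokens are examples):

```python
import math

def classify_from_logprobs(top_logprobs: dict[str, float],
                           labels: tuple[str, ...] = ("yes", "no")) -> str:
    """Pick a label by comparing first-token logprobs.

    `top_logprobs` maps candidate first tokens to their logprobs, as
    returned per position by llama.cpp / OpenAI-style APIs. Labels not
    present in the top-k are treated as impossibly unlikely.
    """
    scores = {lab: top_logprobs.get(lab, -math.inf) for lab in labels}
    return max(scores, key=scores.get)
```

No output text ever gets parsed; the first token's distribution is the classification.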
metmelo@reddit
Great job! I wonder why people don't optimize more harnesses for small models.
vatta-kai@reddit
I’m building one! A browser agent with a custom-built scaffold that can work with small local models. I tested against Llama 4 Scout 17B 16E (old, I know) and even much smaller ones like Gemma E4B. It needs refinement, but it consistently performs well even on complex tasks at a fraction of the cost.
I sincerely believe local models with custom scaffolding will be very very useful.
ArtfulGenie69@reddit
It's more frustrating hehe
thrownawaymane@reddit
How robust is the non-Ollama support? I'd wager most who are going to try this out or contribute to the project are running something more robust
Creative-Regular6799@reddit (OP)
Just added llama.cpp support! Thanks again for the tip
TitwitMuffbiscuit@reddit
Using llama.cpp on Windows, I don't get the right context.
Little-coder shows:
I think the relevant code is https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp
Anyway:
/props gives the max context as set by -c, but also exposes a whole lot more, like the whole Jinja chat template.
/slots also shows the max context but it's the current state:
Hope this is helpful.
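If it helps, here's a minimal sketch of reading both endpoints from Python (the key names reflect recent llama.cpp server builds and can change between versions, so treat them as assumptions):

```python
import json
import urllib.request

def fetch_json(url: str):
    """GET a JSON endpoint, e.g. the llama.cpp server's /props or /slots."""
    with urllib.request.urlopen(url) as r:
        return json.loads(r.read())

def extract_ctx(props: dict, slots: list) -> dict:
    """Pull context sizes out of /props and /slots responses.

    /props holds static server config (including the chat template);
    /slots holds live per-slot state.
    """
    return {
        "max_ctx": props.get("default_generation_settings", {}).get("n_ctx"),
        "slot_ctx": [s.get("n_ctx") for s in slots],
    }
```

Usage would be something like `extract_ctx(fetch_json(base + "/props"), fetch_json(base + "/slots"))` against a running server.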
Creative-Regular6799@reddit (OP)
Unfortunately I only wrote it for Ollama, but I can add support for others as well
swfsql@reddit
Cool discovery! Perhaps when a turn ends, you could remove the previous turn's skill injection, even if this means doing a little prefill? This should save context and help the model not focus on things that should no longer be important.
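Something like this, roughly (the `"skill"` tag convention here is invented for the sketch, not little-coder's actual message format):

```python
def prune_skill_injections(messages: list[dict]) -> list[dict]:
    """Drop skill-injection messages from earlier turns, keeping only
    the most recent one.

    Assumes each injection was tagged with a "skill" flag when appended.
    Removing old ones changes the prompt prefix, so the server has to
    re-prefill from the first changed position — that's the cost paid
    for the smaller, more focused context.
    """
    skill_idx = [i for i, m in enumerate(messages) if m.get("skill")]
    stale = set(skill_idx[:-1])  # everything but the latest injection
    return [m for i, m in enumerate(messages) if i not in stale]
```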
Creative-Regular6799@reddit (OP)
That is a cool idea! Will try it out during the weekend (you can fork and try yourself if you get to it before me)
swfsql@reddit
Thanks, please let me know if you manage to test this.
I apologize but I don't have enough total RAM to run this model, not even a Q3 variant.
I was thinking back to this, and I think "erasing past cache" from the Gated Delta Net states may not be as easy as it is for attention. In theory it is possible to "reverse-forward" and recover previous states, but you'd most likely need to back up the state you intend to "roll back into" (restore), i.e. make a restoration point for the GDN states before injecting something that is intended to be evicted; only then can you "move the clean states forward" with the prefill.
Creative-Regular6799@reddit (OP)
No need to apologize at all! Will try it out. BTW, I ran little-coder with an extremely small model (9B parameters, <8GB RAM), so maybe it will fit your hardware?
New_Comfortable7240@reddit
So I ran the Aider benchmark limited to C++ with Qwen3.5 35B and indeed got better numbers
Creative-Regular6799@reddit (OP)
Now running it with qwen3.6 35B, very curious to see the results
New_Comfortable7240@reddit
Well, in my case it went well with Qwen 3.6 35B.
I tweaked some of the options a bit and got 21/26 in C++.
here is my llama.cpp script if useful
Tailored to my 3060, I got 30-40 tps (tg). The only downside is TTFT, around 20s, but only the first time a session starts (from llama.cpp's point of view, all the activity by little-coder is one session); after that it works really well.
Creative-Regular6799@reddit (OP)
That’s a shocker! I wonder if more models benefit from this kind of coding agent
rarogcmex@reddit
Have you tried any bigger models with little-coder (the special scaffold)? Is there less of a difference?
Creative-Regular6799@reddit (OP)
I thought about it, and it might be that I am onto a secret sauce here (though very unlikely). Honestly just didn’t have time to test it yet. Will try to get to it by the end of the week if nobody else tries before that
Taenk@reddit
This tracks with newer research showing that the harness may matter more than the model itself, or rather that the harness explains more variance in performance than model choice.
Have you compared the performance of larger or even frontier models in your harness vs vanilla harnesses? I’m curious whether and how much larger models benefit from more "sophisticated" harnesses, or whether they just benefit from more breathing room.
More generally, I've noticed halfway decent prompting really levels up smaller models. I haven't benchmarked specific skill files though; there is conflicting data on their effectiveness.
Creative-Regular6799@reddit (OP)
Thank you for the comment. I didn’t test it with larger models yet, that is a natural next step
fragment_me@reddit
Do I understand it right that you used two different temperature settings, one for your little-coder and the other for the regular setup? If so, doesn't that skew results?
Creative-Regular6799@reddit (OP)
That’s a great question, and my answer is that it might, although I observed no qualitative difference.
I initially ran Aider with the same temperature of 0.3 that I have set in little-coder, and it degraded performance (not on the Polyglot benchmark, but on my own examples and experimentation). I figured it wouldn’t be fair to change Aider’s configuration and then test it, so I accepted the difference in temperature.
Another example of this: I found that for the Aider baseline, litellm times out and resets if the response takes too long, so I made the timeout longer; that way I won’t count these as Aider failures for no good reason.
So yes, the difference in temperature really is there, but I found it would be less of a confound to leave the temperatures as they are
jadbox@reddit
How about against OpenCode?
Creative-Regular6799@reddit (OP)
Great question. I can put it against that as well
dtdisapointingresult@reddit
Impressive, very nice.
Any chance you could try it with https://huggingface.co/agentscope-ai/QwenPaw-Flash-9B so we have a comparison? It's a finetune of Qwen 3.5 9B by a different Alibaba team (the ones making their OpenClaw-style assistant QwenPaw), designed for better agentic performance.
Ok-Measurement-1575@reddit
Nice. Where's the github?
SadBBTumblrPizza@reddit
Nobody clicks links anymore, do they? It's at the bottom of the article.
lannistersstark@reddit
Then you have to click through to the article first.
thrownawaymane@reddit
https://github.com/itayinbarr/little-coder/tree/main
SourceCodeplz@reddit
Great write-up! As it happens, I am actually working on a coding agent, and this was really helpful and encouraging!
tett_works@reddit
Very impressive results! This approach makes so much sense that I wouldn't be surprised if the big AI companies already discovered it internally, but kept it quiet to keep everyone dependent on their larger, more expensive models.