I've created a LoRA for Gemma 3 270M making it probably the smallest thinking model?
Posted by Firstbober@reddit | LocalLLaMA | 14 comments
https://huggingface.co/firstbober/gemma-3-270M-it-smol-thinker
Here is an example of the output:
```
==================== THINKING ====================
Here is the thinking process:
- This is a large community with a wide range of interests
- Users can ask questions, share experiences, and discuss local events
- The rules are generally open-ended and allow for creativity
- However, the rules may be unclear or incomplete
<|thinking_end|>
==================== RESPONSE ====================
r/LocalLLaMA is a large, open-source question answering subreddit. Its rules are generally open-ended, allowing users to ask questions and share their experiences. However, the rules might be unclear or incomplete depending on the current state of the community.
<|response_end|>
```
It doesn't have much knowledge baked in, but with prompting it can give some interesting results.
Lore:
I've been working on it for a few days. At first I just wanted to adapt it locally for function calling without using FunctionGemma. When that worked out (more or less), I moved on to adding some thinking. The dataset was procedurally generated, plus some entries made with Qwen 3.6 35B A3B (Q4 quants) and GLM 5.1.
The biggest hurdle was figuring out how to make it keep the format. I settled on rank 24, a max length of 768 for the training data, and a customized loss function that applies a 20x penalty for not using the proper tags. Because of that the loss stayed at around 7, but the effect is there.
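For reference, here's a minimal sketch of how such a tag-weighted loss could look in plain PyTorch; the tag token IDs, the function name, and the exact weighting scheme are my assumptions, not the actual training code.
```
# Minimal sketch (assumption, not the actual training code): a causal-LM loss
# that up-weights positions whose target token is one of the formatting tags,
# so dropping <|thinking_end|> / <|response_end|> costs roughly 20x more.
import torch
import torch.nn.functional as F

TAG_TOKEN_IDS = torch.tensor([256001, 256002])  # placeholder ids for the two tag tokens

def tag_weighted_lm_loss(logits, labels, tag_weight=20.0):
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
        ignore_index=-100,
    )
    flat_labels = shift_labels.view(-1)
    weights = torch.ones_like(per_token)
    weights[torch.isin(flat_labels, TAG_TOKEN_IDS.to(flat_labels.device))] = tag_weight
    mask = (flat_labels != -100).float()
    return (per_token * weights * mask).sum() / (weights * mask).sum()
```
Weighting a handful of tokens that heavily also inflates the reported loss value, which would be consistent with it hovering around 7 even when the formatting sticks.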
I wanted to add longer examples, but my RTX 3050 4GB Mobile is kinda not enough; with a train batch size of 1 and gradient accumulation steps of 2, this is the best I could do.
Another interesting thing: Claude/Gemini were saying that a bigger `gradient_accumulation_steps` essentially meant a larger effective batch size, but without actually increasing the per-device batch size. This accounted for like 40% of all of my headaches, with the model spitting out utter garbage and random Chinese slop characters.
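For what it's worth, that is roughly how it works; here's a simplified, runnable sketch of the accumulation pattern with a toy model (not TRL's actual internals). With `per_device_train_batch_size=1` and `gradient_accumulation_steps=2`, the effective batch size is 2.
```
import torch

# Toy model/optimizer just to make the sketch runnable; the point is the
# accumulation pattern, not the model itself.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(1, 4), torch.randn(1, 2)) for _ in range(4)]

accum_steps = 2  # gradient_accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps  # scale so the sum acts like a mean
    loss.backward()                      # gradients accumulate in .grad across micro-batches
    if (step + 1) % accum_steps == 0:    # one optimizer step per accum_steps micro-batches
        optimizer.step()
        optimizer.zero_grad()
```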
Well, I think that's all, here are all the relevant training parameters:
```
SFTConfig:
per_device_train_batch_size=1,
gradient_accumulation_steps=2,
per_device_eval_batch_size=1,
learning_rate=1e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.10,
weight_decay = 0.1,
load_best_model_at_end=True,
LoraConfig:
n_rank = 24
r=n_rank,
lora_alpha=n_rank,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.15,
task_type="CAUSAL_LM",
```
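Roughly how those settings could be wired together with TRL and PEFT (a sketch under assumptions: the output_dir, eval/save strategies, dataset variables, and the base-model string are placeholders I've filled in, and argument names can shift between TRL/Transformers versions):
```
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

n_rank = 24
peft_config = LoraConfig(
    r=n_rank,
    lora_alpha=n_rank,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.15,
    task_type="CAUSAL_LM",
)
args = SFTConfig(
    output_dir="gemma-3-270m-smol-thinker",   # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=1,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    weight_decay=0.1,
    eval_strategy="epoch",                    # assumed; needed for load_best_model_at_end
    save_strategy="epoch",                    # assumed; must match eval_strategy
    load_best_model_at_end=True,
)
trainer = SFTTrainer(
    model="google/gemma-3-270m-it",           # base model from the post
    train_dataset=train_dataset,              # assumed to be prepared elsewhere
    eval_dataset=eval_dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
```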
Oh, also: increasing alpha to 2x the rank, as recommended in the paper, kinda broke everything. This was another thing that was pretty frustrating to figure out.
I plan to continue and train some more adapters with other ideas; maybe I'll switch to Qwen 3.5 0.8B when I buy a card with enough VRAM? I don't know. One thing I'll definitely do is a thinking adapter for FunctionGemma, as it would fix my issues with function calling to some degree.
MasterLJ@reddit
I like the idea of fine-tuning LoRA adapters for specialized reasoning models that know one particular technology well, like "go manage this namespace inside k8s; here is a LoRA adapter with k8s knowledge and some context around the service/logs, etc."
Just please remember that LoRA leaks cross-tenant.
InterestRelative@reddit
LoRA adapters don't have much knowledge inside; they steer the conversation style. If the model is tiny and doesn't know anything about k8s, a LoRA won't help.
Queasy-Contract9753@reddit
That's amazing! It's great to see such a sophisticated approach to making a model this small do things. Do you think one could train it for conversation with a character and lore?
Like, for example, making a Captain Picard with Star Trek lore.
RanklesTheOtter@reddit
Yeah, that kind of thing works well with SFT if you have a dataset prepared. I taught a small model to talk and RP as my character.
qwen_next_gguf_when@reddit
Does applying LoRA on the up and down projections actually work?
Firstbober@reddit (OP)
I have no idea, but I guess it turned out fine? When I get to it, I will try without those projs; that should also save some VRAM...
Silver-Champion-4846@reddit
Ablation studies mmmmm
MuDotGen@reddit
I'm still curious to see how well this would work with 1-bit models like Bonsai, which can contain 10x more parameters for the same size with some loss in accuracy. Trade-offs for sure, but it's quite fast. I'm just curious how much these SLMs' abilities can be augmented with good tool-following LoRAs.
RanklesTheOtter@reddit
Very interesting. This was a thought of mine; I think you had a similar one.
What if there was a small model that could reason cleverly? If it had web search and memory tools, it wouldn't need much baked-in knowledge, because it could find and reason about what it needed.
Firstbober@reddit (OP)
Exactly, although this attempt has a broken calculator tool. I believe that grounding the model fully in tools, even at a performance hit, is the future, as this would enable small reasoning models to perform as well as large ones.
Here is an entry from my smol dataset:
With memory tools, web search, a logic-statement prover, and other tools, it could perform as well as larger models, and do so on local machines without much processing power, much faster.
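To make that concrete, here's a toy version of the loop such a tool-grounded model could run; the tool names, the tag format, and the parsing are all hypothetical illustrations, not the format this adapter was trained on.
```
import re

# Hypothetical tool stubs; a real setup would hit an actual search backend
# and memory store.
def web_search(query: str) -> str:
    return "stub search result for: " + query

def memory_lookup(key: str) -> str:
    return "stub memory entry for: " + key

TOOLS = {"web_search": web_search, "memory_lookup": memory_lookup}

def run_with_tools(generate, prompt, max_turns=4):
    """Call `generate` (the model) repeatedly, feeding tool results back in."""
    context = prompt
    output = ""
    for _ in range(max_turns):
        output = generate(context)
        call = re.search(r"<tool>(\w+)\((.*?)\)</tool>", output)
        if call is None:
            return output  # the model answered directly, no tool needed
        name, arg = call.group(1), call.group(2)
        result = TOOLS.get(name, lambda a: "unknown tool")(arg)
        context += output + "\n<tool_result>" + result + "</tool_result>\n"
    return output
```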
RanklesTheOtter@reddit
Definitely! Great minds think alike. I think some instances of "Hmm... I honestly don't know, even with my available tools." might be helpful too, to minimize hallucinating.
Firstbober@reddit (OP)
I was thinking about GRPO for forcing the model to admit to not knowing, but it's very VRAM-hungry, although I saw that Unsloth also has their own implementation of it, so it may be worth giving it a try. Also, such training should be more effective than for all-knowing models, because they don't need to optimize their parameters for paths they don't know; they can kinda infer it from the tool output.
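If anyone wants to poke at that, here's a rough sketch of what a "reward honest uncertainty" setup could look like with TRL's GRPOTrainer (Unsloth wraps a similar interface); the phrase list, model string, output_dir, and dataset are assumptions, and a real reward would also need to check whether the question was actually unanswerable.
```
from trl import GRPOTrainer, GRPOConfig

# Assumed phrases marking an honest "I don't know"; assumes a plain-text
# dataset format where each completion is a string.
IDK_PHRASES = ["i don't know", "i'm not sure", "even with my available tools"]

def honesty_reward(completions, **kwargs):
    rewards = []
    for completion in completions:
        text = completion.lower()
        # Naive: rewards any admission of uncertainty; a real setup would only
        # reward it on questions the model genuinely cannot answer.
        rewards.append(1.0 if any(p in text for p in IDK_PHRASES) else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="google/gemma-3-270m-it",          # placeholder base model
    reward_funcs=honesty_reward,
    args=GRPOConfig(output_dir="grpo-idk"),  # placeholder config
    train_dataset=train_dataset,             # assumed to be prepared elsewhere
)
```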
RanklesTheOtter@reddit
I was able to do Unsloth GRPO with a DoRA on 16GB of VRAM for Qwen 3.5 4B, so it should be perfectly doable. It was way slower than SFT though: 7 hours vs. 1.5 hours on a 5060 Ti.
Firstbober@reddit (OP)
Good to know, thanks for the data!