Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane

Posted by SoAp9035@reddit | LocalLLaMA | View on Reddit | 172 comments

So ive been running PI Coding Agent with a the Qwen3.6 35b a3b q4_k_xl model for some real projects and honestly didn't expect it to work this good.

The real game changer was the plan-first skill file i created. Like it actualy follows what you say and does everything step by step without going off the rails. Used it on actual production stuff and it held up.

Here's the skill file if anyone wants to try it:

---
name: plan-first
description: Structured planning workflow for any coding task. Use at the start of every new feature, bug fix, refactor, or implementation request. Analyzes the project, asks up to 5 clarifying questions, creates a TODO.md, gets user approval, then executes task by task. Never writes code before a plan is approved.
---

# Plan-First Workflow

## Rules

- NEVER write code, create files, or run commands before a TODO.md is approved.
- NEVER assume missing information. Ask instead.
- NEVER skip steps. Follow phases in order.
- NEVER go off-plan. If new work is discovered, add it to TODO.md and ask for approval before doing it.

---

## Phase 1 — Analyze the Project

Read the project silently before asking anything. Check:

1. Directory structure (top 2 levels)
2. `package.json`, `pubspec.yaml`, `go.mod`, `requirements.txt`, `Cargo.toml`, `pom.xml`, or equivalent
3. Existing dependencies and their versions
4. Build system and scripts (`Makefile`, `scripts/`, CI config)
5. `README.md` or `README.*`
6. Any existing `TODO.md`, `TASKS.md`, `.todo`, or open issue files

Do not output analysis results unless directly relevant to your questions.

---

## Phase 2 — Ask Clarifying Questions (One Round Only)

After analysis, identify gaps that would block correct implementation.

- Ask **at most 5 questions** in a single message.
- Only ask what is **critical and cannot be inferred** from the codebase.
- Number the questions.
- Do not ask about things already answerable from the project files.
- Do not split into multiple rounds — this is your only chance to ask.

Example format:

```
Before I create the plan, I need a few things clarified:

1. Should the new endpoint require authentication?
2. Is there a preferred database (the project has both SQLite and Postgres configs)?
3. Should existing tests be updated, or only new ones added?
```

Wait for the user's response before proceeding.

---

## Phase 3 — Create TODO.md

Using the analysis and the user's answers, write a `TODO.md` file in the project root.

### TODO.md Structure

```markdown
# TODO

## Goal
One sentence describing what will be built or fixed.

## Tasks

### 1. <Phase Name>
- [ ] <Concrete, measurable action>
- [ ] <Concrete, measurable action>

### 2. <Phase Name>
- [ ] <Concrete, measurable action>
- [ ] <Concrete, measurable action>

## Notes
Any constraints, decisions, or known risks recorded here.
```

### Requirements

- Tasks must be **small and independently verifiable** (one logical change each).
- Order tasks by **dependency** (prerequisites first).
- Each task must be checkable as done/not done.
- No vague items like "fix things" or "improve code".

After writing the file, show the full contents to the user and ask:

```
I've created TODO.md. Does this plan look correct?
Reply YES to start, or tell me what to change.
```

---

## Phase 4 — Revision Loop (if needed)

If the user requests changes:

1. Ask targeted follow-up questions to resolve the disagreement.
2. Rewrite `TODO.md`.
3. Show the updated plan and ask for approval again.

Repeat until the user approves.

---

## Phase 5 — Execute the Plan

Once approved:

1. Work through tasks **in order**, one at a time.
2. After completing each task, mark it done in `TODO.md`:
   - Change `- [ ]` to `- [x]`
3. State which task you are starting before you begin it.
4. Do not start the next task until the current one is complete.
5. Do not perform any work not listed in `TODO.md`.

If you discover that an unlisted task is required:
- Stop.
- Add it to `TODO.md` under a `## Discovered Tasks` section.
- Tell the user what was found and why it is needed.
- Ask for approval before continuing.

When all tasks are marked `[x]`, write:

```
All tasks in TODO.md are complete.
```

Defenetly worth trying if you havent already. Local models have come a long way fr

[-]

onefourten_@reddit

Glad someone is getting success! My Qwen / Pi / oMLX combo keeps getting stuck in a loop…

M4 Max 36Gb

[-]

besmin@reddit

Use llama server, it’s been really great for me. Mlx has issues with prompt caching with moe models.

[-]

VegetaTheGrump@reddit

It's a bug in omlx. Check the issues page, they're working on it. If not omlx, it's probably a missing "preserve_thinking" setting specific to 3.6 or the backend isn't quite supporting that yet.
I'm looking for a replacement. Gonna try vMLX.

[-]

R_Duncan@reddit

Thanks man, will definitely try this

[-]

SoAp9035@reddit (OP)

Here's my llama.cpp configs:

/home/abk/llamacpp/llama-server \
  --model /home/abk/llm-models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --port 8001 \
  --alias qwen3.6-35b-a3b \
  -c 131072 \
  -n 32768 \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --repeat-penalty 1.00 \
  --presence-penalty 0.00 \
  --fit on \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --chat-template-kwargs '{"preserve_thinking": true}'

[-]

NicholasCureton@reddit

I've 8GB VRAM and 16GB VRAM. Context size is 128000. 38t/s. Linux TTY console to strip down all GUI bloats so I can finally run some model on my pc. I also use my own inference cli client for llama.cpp with my own tools like bash commands, read/write files, internet access, crawl pages, search pages etc, which btw, made by OmniCoder 9B (Qwen3.5 9B). So just saying Hi to model didn't cost 10,000 tokens unlike OpenCode. OminiCoder run in 68t/s in my pc.

[-]

Pangocciolo@reddit

You have your own agent? I also had the same crazy idea. How do you make the agent respect the project indentation settings?

[-]

NicholasCureton@reddit

I only use python so default 4 spaces indentation is fine. If you use different indentation, system prompts, tool results, directly tell LLM, or just use vim to re-indent entire file, or use exact match string replacement tool for LLM including spaces. I made tool results return extra rules that keep reminding LLM about constraints after each tool calls. Idk it's standard way or not or right way. It's work for me. Haha. Exact match tool have pros and cons especially for 9B models. Poor thing having trouble writing exact string. Which language do you use btw?

[-]

Pangocciolo@reddit

I'm writing in bash with jq. As inefficient and dangerous as it can be. It's not that it's going anywhere, but the popular ones were timeouting constantly on my slow hardware.

[-]

NicholasCureton@reddit

Are you trying to let LLM JSON? my experience is LLM can't write perfect JSON. I'll to instruct it to create a python script that will mathematically calculate all brackets and format JSON instead of writing it manually. That's solved JSON problem for me.

[-]

Pangocciolo@reddit

No, I use json to save the chat context on a big temp file. Then with jq I manipulate the context file.

[-]

Maximum@reddit

38t/s? What quants? What does your command line look like? What is your hardware? There must be a catch.

[-]

NicholasCureton@reddit

Qwen3.6-35B-A3B-APEX-I-Compact.gguf 17.3GB

blenderkosai@trixie:~$ fastfetch
        _,met$$$$$gg.          blenderkosai@trixie
     ,g$$$$$$$$$$$$$$$P.       -------------------
   ,g$$P""       """Y$$.".     OS: Debian GNU/Linux 13 (trixie) x86_64
  ,$$P'              `$$$.     Kernel: Linux 6.12.63+deb13-amd64
',$$P       ,ggs.     `$$b:    Uptime: 16 mins
`d$$'     ,$P"'   .    $$$     Packages: 3165 (dpkg)
 $$P      d$'     ,    $$P     Shell: bash 5.2.37
 $$:      $$.   -    ,d$$'     Display (AOC 27"): 1920x1080 @ 60 Hz in 27" [External]
 $$;      Y$b._   _,d$P'       DE: GNOME 48.7
 Y$$.    `.`"Y$$$$P"'          WM: Mutter (Wayland)
 `$$b      "-.__               WM Theme: Colloid-Teal
  `Y$$b                        Theme: Colloid-Teal [GTK2/3/4]
   `Y$$.                       Icons: Colloid-Teal [GTK2/3/4]
     `$$b.                     Font: Roboto (11pt) [GTK2/3/4]
       `Y$$b.                  Cursor: Adwaita (24px)
         `"Y$b._               Terminal: alacritty 0.15.1
             `""""             Terminal Font: Jetbrains Mono (12.0pt)
                               CPU: AMD Ryzen 5 3600 (12) @ 4.21 GHz
                               GPU: NVIDIA GeForce RTX 5060 [Discrete]
                               Memory: 2.48 GiB / 15.55 GiB (16%)
                               Swap: 268.00 KiB / 29.80 GiB (0%)
                               Disk (/): 26.05 GiB / 228.12 GiB (11%) - ext4
                               Disk (/home): 513.12 GiB / 656.85 GiB (78%) - ext4
                               Disk (/media/blenderkosai/Red_SSD_1T): 840.13 GiB / 931.47 GiB (90%) - exfat
                               Local IP (wlxd03745edc3e2): 192.168.43.222/24
                               Locale: en_US.UTF-8

Nothing special. Just use TTY mode on linux. GPU idle is 500MB. RAM is 650MB. llama-server -m Qwen3.6-35B-A3B-APEX-I-Compact.gguf --fit on --fit-ctx 128000 --fit-target 256 -np 1 -fa on -b 2048 -ub 2048 -ctk q8_0 -ctv q8_0 --chat-template-kwargs "{\"preserve_thinking\": true}" --draft-min 1 --draft-max 8 --temp 0.6 --top-p 0.95 --top-k 20 That is launch parameters.

[-]

hudsonab123@reddit

I have a setup with the same specs, ddr4 or 5?

[-]

SoAp9035@reddit (OP)

DDR5 5600Mhz

[-]

hudsonab123@reddit

Ah ok good to know. I’ve got ddr4 and I suspect it makes a pretty big perf difference.

[-]

JustSayin_thatuknow@reddit

The huge difference is not between ddr4 and ddr5, it is rather about if yours is using dual channel or not.

[-]

Kholtien@reddit

I’ve got a similar setup to OP except DDR4 and 16 GB VRAM and I get about 25-30 tps

[-]

hockeyketo@reddit

It doesn't really change it that much in my experience. If you're using RAM, you're already paying a big perf hit compared to VRAM. I couldn't find any quick LLM benchmarks, but in gaming it's almost negligible.

[-]

SoAp9035@reddit (OP)

I think It will run ok. Last week I setup llama.cpp and qwen3.6 35b q1_m to a old 16gb ram school pc. It was working 10 t/s. I gave it a few html webos and games. It did ok work but it worked!

[-]

vorwrath@reddit

How are you running PI Agent? I was thinking about checking it out, but hadn't really decided how to lock it down, as it seems to have full shell access by default.

I guess it needs to be in a container of some sort, but is there a quick and simple method to spin that up when moving between different projects?

[-]

quantyverse@reddit

Just spin up a docker container and mount a local folder into it. Than Pi Agent can only destroy the content of the folder but not the rest of your computer. Let me know if you need help with that.

[-]

aeroumbria@reddit

Is there a better way to control what it is and isn't allowed to do? Container is an option, but it doesn't really protect itself from destroying its own progress or messing up the repository. Sometimes you do want more granular controls like "no git commit", "no switching branches" or "never run 'rm' at all". Is there a mechanistic (non-prompting) way to add these features?

[-]

wasatthebeach@reddit

You can easily make the git read only. Just mount the .git folder separately read-only.

[-]

blackhawk00001@reddit

Docker sandbox (the new sbx utility, not old desktop 'sandbox') is promising but currently in development. Good news is they are taking feedback on their git page.

Hermes supposedly works with it so maybe pi will. I recently installed it to try with claude but currently the container vm it spins up does not recognize the environment variable overrides I need for using llama.cpp as the backend for claude. I also haven't figured out how to use a specific version of claude code yet as it seems to only pull latest image at this time.

[-]

HongPong@reddit

"nono" is still alpha (not security fully vetted at all) but offers this for Claude opencode and some other profiles

[-]

vorwrath@reddit

Thanks, that seems sensible. I'm already running llama.cpp in a docker container, so should hopefully be able to set the networking up to that Pi can communicate with that and not the rest of my network.

Just dipping a toe into coding agents for the first time, as I've previously wanted more manual control over changes and been all about the copy/paste life. But now that more capable local options are becoming viable, I'm more interested in exploring them.

[-]

blackhawk00001@reddit

Docker sandbox (the new sbx utility, not old desktop 'sandbox') is promising but currently in development. Good news is they are taking feedback on their git page.

Hermes uses a more generic template I think. I'm checking out Pi so will try sbx if I can find time.

[-]

SoAp9035@reddit (OP)

You can install it with just simple command then type any directory pi, done!

https://pi.dev

[-]

Amazing_Upstairs@reddit

What part allows you to run such a large model on only 8Gb vram? My computer became totally unstable with bsod and I have 24 gb vram and 64gb ram on a pc

[-]

MuDotGen@reddit

I've just discovered Pi.dev, and it's literally the only harness so far that has been not terrible for me so far. Heck, it somehow works pretty decently so far even with Qwen3.5-4b with vision model for images. My PC at home has an RTX 2070 Super (8gb) with 32gb of system RAM. I've got to try this myself. What GPU and OS are you using?

[-]

SoAp9035@reddit (OP)

I'm thinking the same. Pi is simple and just works. RTX 4070 Mobile 8GB and Omarchy (Arch Linux).

[-]

MuDotGen@reddit

That's awesome. I just learned about fit on from your post now and turns out it automatically does what I was testing with manually offloading a certain number of layers to the GPU, so this is a lot more convenient too. Thanks for the info!

[-]

Maximum@reddit

What is your hardware exactly? That's pretty good speed

[-]

SignificantActuary@reddit

Running the same except also --spec-default on my RTX 1070 with 8GB VRAM. Getting 11-14 t/s. Slow but usable.

[-]

Client_Hello@reddit

The GTX 1070 was a great CPU but it did not do ray tracing. That started with the RTX 2000 series.

[-]

OldKaleidoscope7@reddit

I know you're using llama, but I think you can improve your speed if you mess with force offload experts (I'm on LMStudio, so I can't help much), but I got 22 t/s with 8GB VRAM + 64 GB DDR4 2400 RAM, I believe it can be better

[-]

Real_Ebb_7417@reddit

How much better does Qwen get with temp 0.6 vs default temp 1.0?

[-]

IShitMyselfNow@reddit

That's their recommended settings for coding.

[-]

Real_Ebb_7417@reddit

I know that. I just wonder how big is the difference.

[-]

danigoncalves@reddit

isnt it --fit on by default on the latest llamacpp versions?

[-]

PaceZealousideal6091@reddit

Hey! Thanks a lot for sharing this. Good work. How much PP speed do you get? And how well does the setup works for you? Does it drop tool calls? How many attempts to get it right?

[-]

SoAp9035@reddit (OP)

I get 275 t/s. It works really well for my current projects. I haven't tried it on a project from scratch yet, but I think it would work fine. As for dropped tool calls, I'd say roughly 1 out of 10 attempts. It usually just one or two retries needed to get it right.

[-]

PaceZealousideal6091@reddit

yeah.. getting about the same. I was feeling something is was off. With 3.5, i was getiing about 400-600 tps for PP and 26-30 tps TG in the range of 32k to 131k context. somehow, 3.6 is running slower.

[-]

SoAp9035@reddit (OP)

That's odd. What quant are you running and what parameters are you using in llama.cpp? Maybe there's something in the setup causing the slowdown.

[-]

PaceZealousideal6091@reddit

I have about the same hardware as yours. Rtx 4070 8GB, 32 GB DDR5 RAM, Intel i7 13620H. So, the speed you are getting is about what I am getting too. Parameters I am using is also about similar to what you are using. So, there's definitely a bit of slowdown from qwen 3.5.

[-]

wtfihavetonamemyself@reddit

I’m curious what led you to the K XL vs the K M version of q4? Do you find q8 cache is a much better experience over q4?

With no context shift do you have your own compact command?

[-]

According_Study_162@reddit

wait you have only 8GB VRAM? dam got to try this.

[-]

SoAp9035@reddit (OP)

Yep.

[-]

ibishitl@reddit

This is almost my same exact setup right now
Pi + qwen/qwen3.6-35b-a3b on a Macbook Pro M4 Pro 48Gb Ram

Is super fast and smart to complete my tasks, I'm already canceled my IDE suscription and Claude Suscription too

[-]

Heavy-Focus-1964@reddit

did you tweak your omlx setup? what coding harness did you use?

i tried to get it to clean up 7 classes of lint error and it just had a panic attack and went into a loop

[-]

ibishitl@reddit

I'm using LM studio for the Model, I just changed temp=1 and thats it, will take a look at omlx, looks great!

[-]

Heavy-Focus-1964@reddit

oh yeah you gotta get on the MLX tip for apple silicon

[-]

ibishitl@reddit

I can download MXL version from LM Studio, and after trying omlx I will stick with LM Studio, for me it feels just like a better tool
Have you tried setting the temperature to 1? I mean, that is what I saw in a few posts
I just entered local models not too long ago, haven't tested that much yet, this week I just learned what was a MOE version hahaha

[-]

Heavy-Focus-1964@reddit

we're all pretty new to it, and the ground shifts every day...

a while back i'd heard that LM Studio didn't leverage MLX as much as it could or something like that. oMLX is supposed to be more performant.

so far i've only used oMLX to generate embeddings, which it was fine for. i've been trying to use it as a backend for coding agents and so far haven't been able to get satisfactory results. now i'm not sure if it's an inherent difference to the oMLX runtime, my model settings, or something else. i'm going to try LM Studio and Ollama and some other backends and see I get better results.

[-]

GrehgyHils@reddit

Please report back as I'm also on the fence on using lm studio or omlx

[-]

sarsarhos@reddit

what variant/quant are you using? also any tips for other settings? I have m4 pro too but 24 gb, would love to hear any recommendations.

[-]

kuleg@reddit

I had the same problem with the latest omlx. No matter what I’ve changed in the model or omlx settings it was getting stuck in a loop after a while. Went back to Unsloth (llama.cpp)

[-]

Heavy-Focus-1964@reddit

huh. well that’s disappointing. i’ll look into that, thanks

[-]

AdOk3759@reddit

Is Pi a harness? Is it better than the recently released little-coder?

[-]

Fluffywings@reddit

No idea so I asked Gemini. I verified nothing.

Both Pi Agent (often referred to as Pi.dev) and little-coder are modern, open-source CLI coding agents designed to orchestrate LLMs for software development. However, they take fundamentally different approaches to solving the problem of AI coding assistance.

Pi.dev is built around minimalism and extreme extensibility for any model (cloud or local), while little-coder is a highly specialized scaffold designed to make small, locally hosted models punch above their weight class.

Here is how they compare to help you decide which is best for your workflow.

Pi Agent (Pi.dev)

Created by Mario Zechner, Pi is built on the philosophy that most coding agents are bloated "spaceships with 80% unused functionality." Instead of forcing you into a specific way of working, Pi acts as a lightweight foundation.

Core Philosophy: Radically minimal. Out of the box, it only gives the LLM four tools: read, write, edit, and bash.
Extensibility: This is Pi's superpower. It features a TypeScript SDK that allows you to easily plug in "Pi Packages" via npm or Git. You can inject custom prompt templates, skills, or even full autonomous loops (like pi-autoresearch for benchmarking optimizations).
Target LLMs: It is agnostic. While it works beautifully with local setups via Ollama, it is equally comfortable routing to frontier cloud models like Anthropic's Claude Pro, OpenAI, or Google Gemini.
Best For: Developers who want a clean, un-opinionated foundation they can customize to their exact enterprise workflow or CI/CD pipelines without wrestling with a rigid agent framework.

little-coder

Created by Itay Inbar, little-coder is essentially an architectural hack to make consumer-hardware-friendly models (5 GB to 25 GB) perform like massive frontier models on standard coding benchmarks.

Core Philosophy: Heavy optimization and guardrails for smaller models. Small LLMs (like Qwen3.5-9B or Qwen3.6-35B) often hallucinate, burn through context windows, or disastrously overwrite files if given too much freedom. little-coder constraints them to keep them on track.
Key Optimizations:
- Thinking Budgets & Compaction: It actively manages context, preventing small models from entering endless loops and automatically compacting the context window when it gets too full.
- Write-vs-Edit Invariants: It enforces strict rules at the tool level so a small model can't accidentally overwrite an entire file when it just meant to edit a few lines.
- Workspace Awareness: It auto-discovers specs (README.md, CLAUDE.md, etc.) and reads them before the model acts, injecting domain knowledge cleanly.
Target LLMs: Local models run through Ollama or llama.cpp on consumer laptop GPUs (e.g., 8 GB to 24 GB VRAM).
Best For: Developers running entirely local, offline setups who want the highest possible coding accuracy out of smaller open-weights models without paying for cloud API keys.

Feature Comparison

Feature	Pi.dev (Pi Agent)	little-coder
Primary Goal	Minimal, customizable foundation for all LLMs.	Strict scaffolding to maximize small local LLM performance.
Model Focus	Cloud (Claude, GPT, Gemini) & Local (Ollama).	Strictly Local (Ollama, llama.cpp).
Built-in Tooling	Barebones (`read`, `write`, `edit`, `bash`).	Advanced guardrails (Write-vs-Edit invariants).
Extensibility	High (TypeScript SDK, npm/Git packages).	Low (Focused on a specific, optimized architecture).
Context Management	Standard API handling.	Aggressive auto-compaction and "thinking budgets".
Hardware Requirement	None (if using cloud) / Varies (if local).	Designed for consumer laptops (8 GB+ VRAM).

The Verdict

Choose Pi.dev if you have a powerful LLM (like Claude 3.5 Sonnet or GPT-4o) or a specific, complex workflow you want to automate. Its extensibility makes it the better choice for power users who want to build custom tools and scripts on top of an agent.
Choose little-coder if you are running models like Qwen 9B or 35B locally on your laptop and want them to actually succeed at complex, multi-step coding tasks without breaking your codebase.

[-]

explorigin@reddit

little-coder, as of a few days ago, was it's own thing. But that changed. It is now implemented as extensions on top of pi.dev

[-]

I_HAVE_THE_DOCUMENTS@reddit

Bruh trim down the slop.

[-]

lerboenner@reddit

little-coder is pi + 16 extensions + 30 skill markdown files + a Python benchmark harness

little-coder is an attempt to optimize pi for locally run models.

[-]

AdOk3759@reddit

I see! That’s good to hear then.

So little coder for small models, pi for using cloud models via API?

[-]

SEC_intern_@reddit

Holy hell identical setup. Got open box 48GB M4 Pro for $1.7k just before the ramocalypse and it has been the best investment so far. I've published some simple benchmarks here in case someone is in similar shoes..

[-]

ibishitl@reddit

I would love to be runnin 27b q8_0, but in my case it is slow enought to not wanting to use it
Right now I'm running 25b-a3b Q6_K and it is pretty good for my use case

[-]

wbuc1@reddit

Thank you for sharing this! I've just started looking into running local models with a coding agent. Do you have any tips or advice for a newcomer in this space?

[-]

pdycnbl@reddit

do you use any plugins with it?also how do u interact with it primarily? cli?

[-]

SoAp9035@reddit (OP)

No plugin. I use CLI. Open my project directory and just start giving instructions, etc.

[-]

Exact_Golf_1072@reddit

How do you send sampling Params on your requests using Pi?

I have LM Studio running the same model as you and I'm unable to add temperature / presence_penalty, etc to the requests Pi sends

[-]

SoAp9035@reddit (OP)

Just use llama.cpp. With these configs: https://www.reddit.com/r/LocalLLaMA/s/PXL2OsGgMS

[-]

audiophile_vin@reddit

I'm using this as well with qwen3.6 27b and is mind blowing I can do this locally now. I came across this article via pi!

Plan mode is available as an extension in official examples: https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent/examples/extensions/plan-mode

[-]

NewRooster1123@reddit

The setup is interesting, but the part I would want documented is how you keep the conventions from drifting over time. Local model performance is only half the story here; reproducibility usually lives in versioned prompts, tool boundaries, and a clean AGENTS file. If you ever share a writeup, that operational layer would probably help people more than the model name alone.

[-]

oxygen_addiction@reddit

What is your setup like? Prompt, etc.
I'd like to replicate this. Thanks!

[-]

NewRooster1123@reddit

[-]

audiophile_vin@reddit

- Global AGENTS.md at \~/.pi/agent/AGENTS.md with my conventions and workflow prefs

- Skills for things like web search (searing default and tavily secondary), GitHub CLI, knowledge base lookup (karpathy's llm wiki idea)

- Extensions for custom tools (ask mode, plan mode, voice mode)

- Qwen 27B via a local provider config

[-]

oxygen_addiction@reddit

Thanks

[-]

iamapizza@reddit

How do you install this plan extension or is it out of the box

[-]

audiophile_vin@reddit

grab the extension folder from GitHub and then copy the extension folder into your extensions dir

\~/.pi/agent/extensions/plan-mode/

[-]

itroot@reddit

Cool! What are you using to access reddit?

[-]

nunofgs@reddit

Has anyone run qwen + pi successfully with lmstudio? Mine seems to loop and forget what it’s doing. Seems like a bug

[-]

sine120@reddit

I'm kind of at the point that Pi is really the only agent worth using. Everything else feels so bloated now.

[-]

ea_man@reddit

Well Aider can load with pretty much no context and it won't fuck up file edits, it's focused on the file you add and won't fuck up the whole system.

[-]

sine120@reddit

No offense but I have no idea what you're talking about.

[-]

ea_man@reddit

I'm talking about: https://aider.chat/ if you wanna dive into.

[-]

sine120@reddit

I mean the fucking the system up thing. I know what Aider is.

[-]

ea_man@reddit

install, change permissions, delete, move files, execute commands

[-]

sine120@reddit

You mean bash? Aider and OpenCode have bash tools as well.

[-]

ea_man@reddit

Yeah but aider asks you before use and it works on added files.

[-]

sine120@reddit

Put it in a sandbox?

[-]

LocoMod@reddit

You've been heavily promoting it in comments this week to skirt around the self promotion rule of this sub. It's a pretty obvious campaign. Pi does not offer any substantial benefit over existing harnesses.

[-]

sine120@reddit

I'm not a maintainer or affiliated with it in any way, I just don't have 20+ seconds to wait for OpenCode's massive system prompt on my system, and comparatively Qwen3.6 seems to do get better results in it. For consumer systems where your PP speeds are under 2k, it's just really good.

I have 16GB of VRAM and have to run Qwen out of CPU if I don't want Q2 quants, which means my PP speeds are in the ballpark of 200-600 depending on context depth. If OpenCode's system prompt is 10k+ tokens and I have 30k tokens of work going on, I'm literally waiting minutes every time context is modified, which is often.

[-]

LocoMod@reddit

You can modify that system prompt, or any of the other features in OpenCode as you see fit. I have not looked at the OpenCode system instructions but there is usually a pretty good reason for configuring a complex set of instructions. You want to be explicit about certain things. You want to put some safeguards in place. That system prompt could be the difference between an agent building an app with OpenCode, or an agent deleting your precious home directory using Pi.

But I get it. If you are compute limited then you have to make tradeoffs somewhere.

[-]

sine120@reddit

I could keep messing with OpenCode until it's as performant as Pi, or I can just use Pi. It has such a long system prompt to inform about all its tools. I'm not a power user, so read, write, edit and bash in a sandbox works fine for me. An agent can do stupid things in both harnesses, but on my cheaper hardware and smaller harness I don't have to wait for those stupid things.

[-]

FusionX@reddit

Actually, pi is pretty well-regarded in an otherwise vibecode filled space. It's the only project, I can trust.

The dev has a pretty sensible approach and philosophies when it comes to the project. You can go through their blog.

[-]

sn2006gy@reddit

little-coder sounds dope!

I just wonder why the effort isn't being pushed into a common framework/harness as the "upper harness" that is the layer between coding tool <> upper harness <> openai api - seems like we could get more standardization there to make more models punch above their weight by also adopting a schema/control plane idea on top of it. That's what i'm thinking / building with https://github.com/supernovae/open-cot

[-]

mouseofcatofschrodi@reddit

do you get any loop in the thinking?

I'm getting many times loops using pi (or others), when it already has coded the solution. The job is done, it keeps thinking in loops. With preserve_thinking true or false.

[-]

rpkarma@reddit

3.5 or 3.6? 3.5 requires presence_penalty, 3.6 requires repeat_penalty

Check the system cards for both: following their parameter sets for each made a big difference for me

[-]

mouseofcatofschrodi@reddit

3.6 (mlx 4 bit):

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

I've been using it with omlx, next I will see if in LM Studio works better, and also downloading the unsloth 4bit mlx to check if they are better.

Other checks will be to go for a 6 or 8 bit and less context window

[-]

SoAp9035@reddit (OP)

I rarely get a loop. If it loops, I just stop it, undo the previous prompt, and run it again.

[-]

mouseofcatofschrodi@reddit

what sampling and engine do you use?

[-]

talk_nerdy_to_m3@reddit

Not very agentic if you have to constantly monitor it

[-]

arthor@reddit

how is this different / better from plan mode in opencode.. ?

[-]

Pleasant-Shallot-707@reddit

Let Mario (the developer) explain: https://youtu.be/RjfbvDXpFls?si=oKQjE3Q4SZ54tvIy

[-]

Corosus@reddit

Yep ok, thats convinced me, gonna give it a try again. I've been reluctant to add more things to opencode, just feels like bloat, im still using low to the ground build/plan mode + web search mcp as for most of my work I prefer to review it.

[-]

Corosus@reddit

Wait wtf qwen 3.6 27b responds immediately in pi, in opencode it takes like 10 seconds, simplicity ftw, and when i say "test" pi just says "sup" but opencodes initial prompt bloat has it suddenly trying to explore a codebase

[-]

HongPong@reddit

ohh I'll have to try this then.. how do you deal with sandboxing pi?

[-]

kfl@reddit

I like gondolin https://github.com/earendil-works/gondolin

There is a Pi + Gondolin extension that runs pi tools inside a micro-VM and mounts your project at /workspace.

[-]

Corosus@reddit

I've still been meaning to set something up myself for sandboxing. I'll probably go for a docker container that has access to llama.cpp running on the host.

[-]

SoAp9035@reddit (OP)

I don't know if I can say that this plan-first skill is better than OpenCode's. OpenCode is slow for me because of its big system prompt and other stuff, I don't know why. Pi is basically lightweight and works well with this skill.

[-]

sagiroth@reddit

So we back to writing md files ? Looping back to the beginning

[-]

rpkarma@reddit

I mean it’s literally all still text descriptions sent to the bag of matmuls at the end of the day

[-]

cleverusernametry@reddit

Wdym all systems are literally just that, they just add obfuscation layers that pretend to be some new capability/abstraction like skills, modes, plugins etc

[-]

Pleasant-Shallot-707@reddit

Did you set up any guardrails? What do you think coding agents run on?

[-]

sn2006gy@reddit

Cheap price to pay to use a local model where you need to be more explicit because it doesn't have a huge upstream harness trying to vibe it for you.

Advantages/disadvantages for sure. In the hands of a coder, this works.. "spec driven development" but in the hands of a novice they don't know what they don't know so Anthropic will do it better.

[-]

CrushingLoss@reddit

I appreciate your SKILL.md file! I'm using it now in PI to try and re-create a classic TI-994/A game. Will post results when it finishes.

Biggest issue I had was making sure i had wide enough context window and max tokens. So far, so good. I'm running on a Mac Studio M2 Max; 96GB. Getting about 35 tok/s through Pi or Opencode; about 50 just benchmarking through oMLX.

[-]

emiliobay@reddit

That rule about making it read the project silently before asking anything is the exact fix for the most annoying part of using agents right now. Whenever a model goes completely off the rails on a real project, it's usually because it skipped checking the existing directory structure and just guessed how things were wired up. Forcing the TODO.md approval step before a single line of code gets written changes the whole dynamic from babysitting a rogue script to actually managing a decent plan.

Getting into coding recently by heavily relying on Claude Code and Cursor, my biggest trap is always letting the AI run away with a bad assumption that trashes the local setup. I end up spending an hour just reverting changes because it confidently hallucinated a dependency that wasn't even in my package.json. Dropping this specific phase-by-phase structure into my setup is going to save me from those endless rollback loops when I'm just trying to glue a basic feature together.

[-]

rm-rf-rm@reddit

can you share your pi settings/config JSON?

Im not sure how involved it will be to migrate claude code hooks, rules, skills etc. to Pi.

[-]

SoAp9035@reddit (OP)

Honestly my setup is super minimal. I only have the llama.cpp connection configured via models.json and the plan-first skill file I already shared. That's literally it. But here is the github link that should answer everything about migration and config: https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent

Also my \~/.pi/agent/models.json file:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8001/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "qwen3.6-35b-a3b", "contextWindow": 131072, "maxTokens": 32768 }
      ]
    }
  }
}

[-]

Gueleric@reddit

Have you benchmarked performance as context grows? I find that with limited VRAM setup it start outs fast but as context fills it lags the PC and slows down to a crawl.

[-]

invincibles@reddit

Complete newbie to this but very enthusiastic.

I used the same model with LM Studio on windows 11, RTX 5060 16GB, 32 GB RAM.

I used it in Android Studio to code a kotlin application. It was very bad experience. I feel i am doing something wrong.

Any pointers?

[-]

jimmytoan@reddit

At 11-14 t/s on 8GB VRAM, you're running most of the layers CPU-offloaded which means time-to-first-token is noticeably longer. For interactive coding that's usable, but you lose some of the tight feedback loop that makes AI coding feel fast. Curious what context window you're running at that VRAM constraint - at 8GB you're probably limited to 8-16K effective context, which is fine for isolated function work but starts to show its limits when the agent needs to hold multiple files and test results in memory simultaneously.

[-]

quantyverse@reddit

I know we should not compare them with models like Sonnet-4.6. But what is your opinion on that, how far are we away from that ? Also did you have the chance to test qwen3.6-27b already ?

[-]

RapidRaid@reddit

Yesterday i tried Qwen3.6 27b on my 5090/128GB DDR5 rig (fresh llamacpp build, ran with Vulkan since someone mentioned CUDA had an issue, pi + opencode) and its awesome. I think its not quite on Sonnet level yet but very very close. I say maybe 85%ish there. I always try my same prompt on all the models I tested: "create a minecraft clone for web/nodejs" and then watch the results. Every attempt I previously tried with Qwen 3.5, M2.5, Gemma4, etc always gave me either working movement and just a plane or one chunk with broken movement and broken camera. Qwen 3.6 was now the first model that matched what Sonnet was one shotting before: a procedual chunk generating world, working movement, working camera, no inside faces between blocks, stable 60 fps. I then tried to prompt a bit more to get a few features and there the differences are starting to show a bit in comparison to sonnet. It did reason about how to implement swimming, but disregarded block placement in water, i then wanted it to fix it, which caused it to only place the blocks above water, etc. In the end i prompted until it got it right, but in my mind i think sonnet would've figured that out earlier, without so much manual prompting.
But if online models would dissappear tomorrow, i wouldn't be too sad now since this model truly punches above its weight and is really useable.

[-]

SoAp9035@reddit (OP)

I changed my setup (VS Code Copilot and OpenCode) to this simple setup, and it did what I told it to do. I think that if your target is to edit or make some changes to current projects, that would work, but for large, from-the-ground-up projects, it's hard for that model. The 27B dense model is not really runnable for me; I get around 5 t/s with zero context. That's kind of bad.

[-]

HongPong@reddit

this is way more useful than silly stuff from garry tan

[-]

RMK137@reddit

This is great, thanks for sharing. Any idea how to get pi to show the thought trace for this mode when respondingl? I can't see it for some reason, and hide_thinking is set to false in settings.

[-]

IrisColt@reddit

THANKS!!! Will definitely try it!!!

[-]

chuvadenovembro@reddit

Basta criar o arquivo "plan-first.md" (com o plano que você escreveu) na pasta do projeto e no prompt orientar a a llm ler esse arquivo?

[-]

SoAp9035@reddit (OP)

Global skill: \~/.pi/agent/skills/plan-first/SKILL.md

Project level skill: \~/test-project/.pi/skills/plan-first/SKILL.md

[-]

chuvadenovembro@reddit

Obrigado pelas orientações, criei em skill global para testar

[-]

SoAp9035@reddit (OP)

No problem! Let me know how it goes.

[-]

chuvadenovembro@reddit

Eu confesso que ainda não tenho parâmetro, mas testei com qwen 3.6 27b de 8 e 4bits, mas o processo demorou demais e desisti (foram algumas horas), então testei com o qwen 3.6 35 de 8 e 4bits também e da ultima vez que vi no de 4bits, ele estava em looping (ja tive esse problema em outros hardless e com falta de contexto...mas não deveria ser o caso desse projeto, pois é um porjeto novo...Então estou testando agora em um modelo chamado zen4 pro max, rapidamente a llm criou pastas (mas ainda não terminou), eu achei que o skill pudesse melhor os modelos qwen, mas acho que não deu certo pra mim, estou usando omlx como servidor de inferencia (carrego o modelo com parametros padrão...mas testei alguns ajustes como dfash e desligar pensamento, mas sem muito sucesso...), estou usando mac studio m2 ultra com 128gb de memoria...Ao longo dos dias eu vou testar outros modelos e falo sobre notoriedade do skill em seguir o plano, obrigado novamente.

[-]

SoAp9035@reddit (OP)

I am sorry and really surprised that you had a bad experience. For me it did not take that long and it worked fine. It might be something related to the model parameters or the inference setup.

I definitely want to improve the skill, so your feedback helps a lot. Thank you for testing and sharing your results. Let me know how it goes with the other models.

[-]

chuvadenovembro@reddit

Eu estava com erros no meu servidor de inferencia, estou testando novamente após resolver esse problema e o agente esta executando o trabalho de forma organizada, obrigado novamente

[-]

biller23@reddit

Do you guys use the model for your agents with thinking enabled?

[-]

casual_butte_play@reddit

How’d you point Pi at your local model/server? I’ve done the Claude Code hack(s) for months now but somehow tripping up getting Pi going using my local llama-server :\

[-]

SoAp9035@reddit (OP)

Like below to this file: \~/.pi/agent/models.json

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8001/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "qwen3.6-35b-a3b", "contextWindow": 131072, "maxTokens": 32768 }
      ]
    }
  }
}

[-]

Jeidoz@reddit

I am curious, does Pi have something similar to opencode plugins for auto-detection of available models at OpenAI endpoint?

Also, have you tried to compile and use WebUI version of Pi, or you work only with CLI edition?

[-]

Slashh1@reddit

Did you have context length issues. I configured the model.json without models context window or maxTokens and after around 4k token, pi gave a 400 error. So I configured models.json to 32768 context and 16384 maxTokens with the same Qwen 3.6 35b-a3b. I have a 12gb 4070 card and also plan to use pi locally for development and wanted to know if 131072 context spills your model onto ram.

[-]

SoAp9035@reddit (OP)

You can actually leave the context window and maxTokens empty in models.json, those aren't critical. The llama.cpp config is what really matters for controlling that. And yes, if you try to use 131072 context with a 12GB card, it will definitely spill into RAM.

[-]

worldwidesumit@reddit

model/settings.json

[-]

Pleasant-Shallot-707@reddit

Pi doesn’t care what you use. Just edit your .pi/agent/model.json to include the local model.

If you’re having problems connect pi to something like OpenCode go and ask it to help you set up the local connection.

[-]

Jeidoz@reddit

Sorry, if it may sound rude, but your "skill" is sounds much similar to SpecKit AKA "Specification Driven Development" with agents. 😅

[-]

Igot1forya@reddit

Replying so I can come back and test this later. Nice work OP!

[-]

FusionX@reddit

How are you getting it to follow agents.md? It just ignores it for me completely, despite being 2-3 lines.

[-]

SoAp9035@reddit (OP)

Use it as follows; Pi may not support your method.

Global skill: ~/.pi/agent/skills/plan-first/SKILL.md

Project-level skill: ~/test-project/.pi/skills/plan-first/SKILL.md

[-]

FusionX@reddit

is it not possible without skills? I've been trying to add general rules and guidelines applicable to all sessions

[-]

SoAp9035@reddit (OP)

Skills can act like general guidelines as well.

[-]

philmarcracken@reddit

can I make a skill to output a mermaid diagram and have it refer back to it, as things get larger?

[-]

ducksoup_18@reddit

How does pi compare to opencode? Im running that now paired with 2 3060s so i THINK i should have enough vram for decent context size with 3.6. Would love some feedback.

[-]

quantyverse@reddit

It is a minimalistic agent. If I am not wrong also part of the backbone of OpenClaw. So what you get is a minimalistic System Prompt and an agent which is super flexible which can write its own extensions in typescript. So you can extent it how you like it and make it your own unique Agent.

[-]

Finanzamt_Endgegner@reddit

imo its better, at least for local models, it doesnt fill as much context with instructions which helps a LOT.

[-]

talk_nerdy_to_m3@reddit

I'm very impressed with your results! Slow, but amazing that you got this to work on your machine. I downloaded Pi and had a hard time hooking up a local model. Finally figured it out, then didn't really know what to do. I look forward to trying out your method!

[-]

SoAp9035@reddit (OP)

Thanks! I've been getting good results with my current ongoing projects. Right now I'm testing it out on a project from scratch to see how it handles that. I'll let you know how it goes!

[-]

Clean_Initial_9618@reddit

Hi is pi really good got qwen3.6:27b setup on my RTX 3090 and 64gb ram. Looking to move away from my claude code subscription it's too expensive broke to afford it anymore was looking for local options. So thought will ask you is it really worth it ?

[-]

apeapebanana@reddit

literally asked pi with Qwen3.6 35b to rip out sillytavern memory system, asked claude/gemini-pro to fix the leaks and gaps, qwen build it, cloud-llm to double check.

vibed out personal memory systems, oh, then asked and compare how hermes does it.

oh, use 27B and gemma31B to cross check the plan, different perspectives. plan and build.
we're on a crazy train. choo choo!

[-]

annodomini@reddit

Pi is a minimalist, but extensible agent harness. So it's really good for providing you a lightweight base, and allowing you to use skills and/or plugins to customize it. It doesn't do nearly as much out of the box as Claude does (no plan mode, for example), but it provides you the tools to build your own workflow the way you like it, instead of filling up the context with 10s of thousands of tokens for a huge system prompt and a lot of tools like Claude or OpenCode provide.

[-]

SoAp9035@reddit (OP)

I have been using OpenCode with Qwen 3.6 35B, and it was really using too much context and was slow. Then I switched to Pi. Pi is really lightweight and fast; I recommend it.

[-]

SoAp9035@reddit (OP)

Make sure you use this skill that I shared it makes a big difference.

[-]

Positive_Kale@reddit

You guys believe it is realistic for me to run it on my iMac M3 with 24 Gb memory?

And do you just run Pi or the local via ollama, la studio or similar?

[-]

SoAp9035@reddit (OP)

You can try with the q2_k_xl model it will work great as well. You can also try the q4 model with mmap, I think that would work too.

I run it with llama.cpp.

[-]

getmevodka@reddit

No, but maybe q2 or q3 xl. Look for the xl versions, dynamic quants punch above their weight

[-]

bigh-aus@reddit

A variation I'm looking at at the moment - is to separate out the steps into different prompts, in new context windows, and also finding areas you can do tasks in parallel to better utilize the gpu...

[-]

jacek2023@reddit

I use pi coding agent with Gemma 26B and I agree it's worth trying

[-]

Intelligent_Lab1491@reddit

Did you tell pi to do everything in a subagent to Save context

[-]

SoAp9035@reddit (OP)

I actually didnt...

[-]

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)