Can an application layer improve local model output quality?
Posted by ayechat@reddit | LocalLLaMA | View on Reddit | 16 comments
Hi -
I am building a terminal-native tool for code generation, and one of the recent updates packaged a local model (Qwen 2.5 Coder 7B, downloaded on first use). Initial user response to this addition was favorable, but I have my doubts: the model is fairly basic and does not compare in quality to online offerings.
So I am planning to improve the RAG capabilities for building a message with relevant source file chunks, add a planning call, add a validation loop, maybe do multi-sample generation with re-ranking, etc.: all the common techniques that, when implemented properly, can improve output quality.
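To make the idea concrete, here is a minimal sketch of what I have in mind, assuming generic retrieve/llm/validate callables (the names are illustrative, not Aye Chat's actual API):

```python
# Rough sketch only: retrieve -> plan -> multi-sample -> validate.
# retrieve/llm/validate are placeholders, not Aye Chat's real API.
from typing import Callable, List

def answer(task: str,
           retrieve: Callable[[str, int], List[str]],  # returns relevant code chunks
           llm: Callable[[str], str],                   # one completion call
           validate: Callable[[str], bool],             # e.g. run linters / tests
           samples: int = 3) -> str:
    context = "\n\n".join(retrieve(task, 8))

    # Planning call: ask the model only for the steps, no code yet.
    plan = llm(f"Context:\n{context}\n\nTask: {task}\nList the concrete edit steps, no code.")

    # Multi-sample generation, each conditioned on the same plan.
    candidates = [
        llm(f"Context:\n{context}\n\nPlan:\n{plan}\n\nTask: {task}\nWrite the code.")
        for _ in range(samples)
    ]

    # Validation loop as a crude re-ranker: first candidate that passes checks wins.
    for cand in candidates:
        if validate(cand):
            return cand
    return candidates[0]  # fall back to an unvalidated sample
```

A real re-ranker would score all candidates rather than taking the first one that validates, but this is the shape of the pipeline.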
So, the question: I believe (hope?) that with all of those implemented, a 7B can be bumped to roughly the quality of a 20B. Do you agree that is possible, or do you think it would be wasted effort and that kind of improvement would not happen?
The source is here - give it a star if you like what you see: https://github.com/acrotron/aye-chat
Icy_Bid6597@reddit
There is a limit to what a 7B can do, simply because of how much knowledge can be baked in.
Additional context may of course help. It is definitely a path worth following. A lot depends on how it is passed in and on how the model was trained. Keep in mind that smaller models tend to lose context faster, so more is not always better.
Many coding agents report that RAG is hard / inefficient for coding tasks. The speed at which a codebase changes forces you to reindex content really often. They tend to migrate towards MCP tooling and other code discovery mechanisms.
There is a blog post from the Cline engineers: https://cline.bot/blog/why-cline-doesnt-index-your-codebase-and-why-thats-a-good-thing It is not very in-depth, but it touches on some of the issues.
ayechat@reddit (OP)
Thanks for the reply and for the link!
That post, however, does not apply to the offline processing use case. Taking their 3 main problem points in turn:
On chunking: they describe following semantic links through imports, etc. - that technique is still hierarchical chunking, and I am planning to implement it as well: it's straightforward.
On keeping the index current: the claim that this is too hard is just not true - there are multiple ways to solve it. One, for example, is continuous indexing at low priority in the background; another is monitoring for file changes and reindexing only the differences (see the sketch below). I have already implemented a first iteration of this: the index stays current.
On security: we are talking about an offline mode of operation, so nothing leaves the machine with Aye Chat: it implements the embedding store locally, with ChromaDB and the ONNXMiniLM_L6_V2 model.
So as you can see, none of those premises apply here.
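Here is roughly what I mean by reindexing only the differences - a stripped-down sketch, not the actual Aye Chat code (it assumes ChromaDB's default ONNX MiniLM embedder and one chunk per file for brevity):

```python
# Diff-only reindexing sketch (illustrative, not Aye Chat's actual code).
# One chunk per file for brevity; real chunking would be finer-grained.
import hashlib
import pathlib
import chromadb

client = chromadb.PersistentClient(path=".aye_index")
col = client.get_or_create_collection("code_chunks")  # default embedder is ONNX MiniLM-L6-v2

def reindex_changed(root: str) -> None:
    # Current content hashes on disk.
    files = {str(p): hashlib.sha1(p.read_bytes()).hexdigest()
             for p in pathlib.Path(root).rglob("*.py")}

    # What the index already knows: id -> stored hash.
    known = col.get(include=["metadatas"])
    stored = {i: m["sha1"] for i, m in zip(known["ids"], known["metadatas"])}

    changed = [f for f, h in files.items() if stored.get(f) != h]
    removed = [f for f in stored if f not in files]

    if changed:
        col.upsert(ids=changed,
                   documents=[pathlib.Path(f).read_text() for f in changed],
                   metadatas=[{"sha1": files[f]} for f in changed])
    if removed:
        col.delete(ids=removed)
```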
And then, as part of the solution, they claim that the context window does not matter because Claude and ChatGPT models are now at 1M context windows - but once again that does not apply to locally hosted models: I am getting a 32K context with Qwen 2.5 Coder 7B on my non-optimized setup with 8 GB of VRAM.
The main reason I think it may work is this: answering a question involves "planning what to do" and then "doing it". Models are good at "doing it" if they are given all the necessary info, so if we offload the "planning" into the application itself, I think it may work.
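Concretely (again just an illustrative sketch, not the shipped implementation): the application fixes the plan as a sequence of narrow, fully specified steps, and the model is only ever asked to execute one of them:

```python
# Sketch only: the application owns the plan; the model just executes narrow steps.
# Function names are placeholders, not Aye Chat's real API.
from typing import Callable, Dict, List, Tuple

def edit_file(path: str, instruction: str,
              read: Callable[[str], str],
              llm: Callable[[str], str]) -> str:
    # One fully specified step: the model sees the file and exactly one change to make.
    prompt = (f"File: {path}\n---\n{read(path)}\n---\n"
              f"Apply exactly this change and return the full updated file:\n{instruction}")
    return llm(prompt)

def run_plan(steps: List[Tuple[str, str]],  # (path, instruction) pairs chosen by the app
             read: Callable[[str], str],
             llm: Callable[[str], str]) -> Dict[str, str]:
    # The model never decides what to do next; it only "does it".
    return {path: edit_file(path, instruction, read, llm) for path, instruction in steps}
```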
Thanks again for the reply!
Icy_Bid6597@reddit
Point 3 about security definitely does not affect you.
The remaining two are described in a very shallow way, but they are still valid. That does not mean it is impossible - just hard.
Let's take point two. Imagine a large codebase, thousands of files. Keeping an index up to date is hard and compute-intensive. Each git pull, merge, or rebase might change a lot of files.
In case of conflicts, files might not even be structurally valid for a while. You have to keep an eye on what changed, remove stale entries from the index, and add new ones.
Depending on how you build the indexes, it might take a while to build, e.g., an HNSW index. What do you do in the meantime?
Again, that does not mean it is a dead end. Just a hard engineering problem to solve.
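(One common pattern, sketched very roughly and not tied to any particular tool: keep serving the old index while the replacement builds in the background, then swap.)

```python
# Rough illustration: double-buffer the index so queries always hit a complete
# index while an expensive rebuild (e.g. HNSW) runs in the background.
import threading
from typing import Callable, List

class SwappableIndex:
    def __init__(self, build: Callable[[], object]):
        self._build = build            # expensive step, e.g. a full HNSW rebuild
        self._lock = threading.Lock()
        self._current = build()        # initial blocking build

    def query(self, q: str, search: Callable[[object, str], List[str]]) -> List[str]:
        with self._lock:
            index = self._current      # readers never see a half-built index
        return search(index, q)

    def rebuild_in_background(self) -> threading.Thread:
        def _worker():
            fresh = self._build()      # may take a while on a large codebase
            with self._lock:
                self._current = fresh  # atomic swap, no downtime for queries
        t = threading.Thread(target=_worker, daemon=True)
        t.start()
        return t
```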
Chunking code is also definitely a challenge. Not only does it depend on the underlying technology, but, for example, methods do not live in isolation from their classes, and classes are just one part of the use case. Some classes are used in a particular way.
Maybe you are asking your agent to handle adding a particular product to the cart on an e-commerce site. There might be an add_to_cart() method somewhere, but knowing that is not nearly enough. Maybe it is part of a service that needs to be injected; maybe it is a CQRS command handler that expects you to post a particular message on a message queue.
Finding a method is one thing, understanding how it is used is another.
It does not mean that it is unsolvable with chunking. It is just not nearly enough in a simple approach.
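For instance, a slightly smarter chunker can at least keep a method together with its class header and docstring, so add_to_cart() is retrieved with a hint about the class it belongs to - but even that tells you nothing about how the class is wired into the rest of the system. Very rough, Python-only sketch:

```python
# Illustrative AST-based chunking: each method chunk carries its class name and
# docstring. It still does not capture how the class is used elsewhere.
import ast
from typing import List, Tuple

def chunk_module(source: str, path: str) -> List[Tuple[str, str]]:
    chunks: List[Tuple[str, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            header = f"# {path} :: class {node.name}\n# {ast.get_docstring(node) or ''}"
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    chunk_id = f"{path}::{node.name}.{item.name}"
                    body = ast.get_source_segment(source, item) or ""
                    chunks.append((chunk_id, header + "\n" + body))
    return chunks
```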
Using RAG for knowledge retrieval is fairly simple; using it for code is definitely possible, but harder :D
BTW, I also do not agree with them regarding model context size.
ayechat@reddit (OP)
Yes, all good and valid points. Just hard, not impossible :) There is also the more philosophical question of whether anybody would even try to use a small local model on a large codebase, but that's for another day :)
If I may - since you have clearly thought about these things: how do you (personally) do code generation today, and what is the hardest thing you are facing with your current tool?
Icy_Bid6597@reddit
Honestly, I have not found any solution capable of doing a good job in a large codebase. Even Cursor with Sonnet 4.5 messes things up and does not follow instructions directly (even if the final result could work, it often goes against our code structure policies).
They are great for starting up, and then they get lost. I suspect it is mostly due to the tooling, not the models themselves, so your project still makes a lot of sense.
Agentic mode is still helpful for debugging some things. Splitting instructions into multiple steps seems to benefit all of the models a lot.
ayechat@reddit (OP)
That's interesting: so if I am hearing you correctly, in your environment/company you have coding standards of some sort, and the tool's output does not match them even after you spend time prompting for it, correct?
If you will: is that the main issue with your large codebase, or are there others (e.g., the wrong files updated, relevant files not found, etc.)?
I also want to say that I appreciate your replies to no end. I am still early in development, and figuring out the main pain points is the biggest thing right now - so I know what to address first. Thank you very much!
segmond@reddit
Application layers do improve model quality; that's why we have agents. Duh.
ayechat@reddit (OP)
Exactly, but the question is different: how much of an impact do they have - is it 30% or 300%?
segmond@reddit
Obviously there is a major limit based on the model, but the impact will also be relative to the logic, the application-layer implementation details, and the nature of the problem.
ayechat@reddit (OP)
The problem: AI-assisted code generation. Let's say Python; let's say AWS development: Lambda functions + Terraform scripts. Those are all fairly stand-alone mini-projects, so the context is small to begin with. The difference between a "small" local model and a large online version is the corpus of data that went into training - but if you use your existing codebase to substitute for that, one can argue the results may become comparable (deep transformer layer differences aside: that is why a 7B cannot become comparable to a ~1T-parameter model).
Icy_Bid6597@reddit
For small codebases it may make sense. But small models behave weirdly sometimes. They make more mistakes, are more sensitive to unusual input, and so on.
Even in simple data transformations I find cases where a big model has an almost 100% success rate while small ones jump between 80-90%. Most of the test cases are solved, but the failures are often weird and hard to comprehend.
ayechat@reddit (OP)
That's actually higher than I expected. I only added the offline model because there seemed to be some demand for it, so 80-90% is encouraging: for those who do not want their code to leave their machine, I think that is an acceptable tradeoff, especially for smaller projects, where I suspect the percentages are higher.
SlowFail2433@reddit
Yeah, frameworks can bring a 7B up to 20B quality for sure.
ayechat@reddit (OP)
Can you elaborate? It is difficult to tell whether you are serious or not.
SlowFail2433@reddit
7B LLMs can beat 1T models given good training and a good inference framework
ayechat@reddit (OP)
I see.