[Model Release] I trained a 9B model to be an agentic Data Analyst (Qwen3.5-9B + LoRA). The base model failed 100% of the time; this LoRA completes 89% of workflows without human intervention.
Posted by Awkward_Run_9982@reddit | LocalLLaMA | 26 comments
Hey r/LocalLLaMA,
Most of us know the struggle with local "Agentic" models. Even good ones at the 4B-14B scale are usually just glorified tool-callers. If you give them an open-ended prompt like "Analyze this dataset and give me insights," they do one step, stop, and wait for you to prompt them to "continue."
I wanted to see if a small <10B model could achieve true autonomy through weights, rather than relying on massive external prompting frameworks.
What I built: I took agentscope-ai/CoPaw-Flash-9B (which is based on the Qwen3.5-9B architecture) and trained a LoRA specifically for end-to-end data analysis workflows.
The Secret Sauce (Training Data): Instead of standard instruction tuning, I constructed massive, multi-step trace datasets covering real-world scenarios (finance, education, sports data). The LoRA was trained not just to call tools, but to plan, execute, debug Python code, visualize, and summarize in a continuous loop until the job is done.
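To make "multi-step trace" concrete, here's a heavily simplified sketch of what one training sample might look like (field names are illustrative, not my exact schema): each example is one complete plan -> execute -> summarize workflow, serialized as a single JSONL line.

```python
import json

# Simplified sketch of one multi-step trace sample (illustrative field
# names). The model is trained on the whole loop, not a single
# instruction/response pair.
trace = {
    "system": "You are an autonomous data analyst. Work until the task is done.",
    "messages": [
        {"role": "user", "content": "Analyze sales.csv and give me insights."},
        {"role": "assistant",
         "content": "Plan: 1) load data 2) clean 3) plot 4) summarize.",
         "tool_call": {"name": "python",
                       "code": "import pandas as pd\ndf = pd.read_csv('sales.csv')"}},
        {"role": "tool", "content": "OK: 5000 rows, 12 columns"},
        {"role": "assistant", "content": "Monthly revenue grew 12%; chart saved. Done."},
    ],
}

line = json.dumps(trace)  # one JSONL line per complete workflow
print(len(json.loads(line)["messages"]))  # -> 4
```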
The Results (See Benchmark Image 2): I tested it on 29 real Kaggle datasets using a custom framework (max_turns=50, context=128K).
- Base Model: Averages 1.2 iterations and stops. 0% completion rate. Produces zero usable output.
- With My LoRA: Averages 26 autonomous iterations. Writes Python, plots charts, and achieves an 89.7% natural completion rate with ZERO human intervention.
It basically turns a 9B model into a junior data analyst you can run locally on 12GB-24GB VRAM.
VRAM Requirements (vLLM):
- bf16 (Single GPU): ~22GB
- 8-bit: ~12GB
- 4-bit: ~6GB
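If you're serving it with vLLM, the launch line looks roughly like this (a sketch only; exact flags vary by vLLM version, so check `vllm serve --help` on yours):

```shell
# Hypothetical launch line for the bf16 (~22GB) configuration.
# Swap in a quantized checkpoint / --quantization flag for the 8-bit and 4-bit figures.
vllm serve agentscope-ai/CoPaw-Flash-9B \
  --enable-lora \
  --lora-modules data-analyst=jason1966/CoPaw-Flash-9B-DataAnalyst-LoRA \
  --max-model-len 131072 \
  --dtype bfloat16
```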
Links:
- 🤗 LoRA Weights: jason1966/CoPaw-Flash-9B-DataAnalyst-LoRA
- Inference Framework: IIIIQIIII/data-analyst (You'll need this to handle the tool-calling loop)
- Demo/Showcase: https://dataanalyst.locoremind.com/
⚠️ A Call to the Community (Looking for Compute/Sponsorship):
This one-week experiment proved something important: Small models CAN be fully autonomous agents if trained on scenario-based workflows.
Data analysis is just the beginning. I want to apply this methodology to build local, truly autonomous agents for Coding (Software Engineers), Research Assistants, and more.
However, I am currently bottlenecked by hardware and funding. Training these continuous-workflow datasets takes significant juice, and I want to scale this to create state-of-the-art open agents.
If anyone here has access to compute grants, GPU clusters they are willing to sponsor, or if there are organizations/backers interested in funding the development of open-source local agents, please reach out to me via DM.
Let's build local agents that actually do the work for us. Happy to answer any questions about the training process, data generation, or deployment in the comments!
GoodnessIsTreasure@reddit
When I was on my own fine tune spree a few years ago, I found modal.com. They give some credits every month so maybe that could help you. Otherwise there was a marketplace like rental site for indie GPUs.. That also is pretty affordable for quick trainings. Maybe someone knows the site name.
Outrageous_Recover56@reddit
Free up some compute by writing your own posts?
keepthepace@reddit
Honestly when it is informative and correct, I don't care if it is generated.
exaknight21@reddit
This post is considerably better. imho, this is good.
keepthepace@reddit
This is the type of posts that makes /r/localllama so great
mivog49274@reddit
is banana bread concrete or a concrete demand would include concrete inside the banana bread recipe ? nano bananabread-8B is killer though (hit me if you want huggingface link) good job btw smh
Creative_Bottle_3225@reddit
gguf?
Unlucky-Message8866@reddit
mind sharing how you trained it? did you use unsloth? i've been preparing an anti-slop dataset based on stupid things the llms do and i would really like to fine-tune qwen3.5 27b as well. i tried a few things so far but as usual many scripts/tools/libraries were broken as of the last time i tried (mainly because of hw/model incompatibilities)
Awkward_Run_9982@reddit (OP)
Yeah the dependency hell is real man. I actually got both Unsloth and Axolotl to work eventually for this.
The main trap with Qwen3.5 is that since it's fundamentally a VLM, if you just throw pure text/tool-calling data at it, the scripts often freak out and crash because of the vision layers. You gotta make sure the vision modules are properly ignored or handled in the config.
Tbh my lazy hack was just spinning up a Claude Code agent to monitor my terminal logs. Love the anti-slop 27B idea btw, we desperately need a model that doesn't say "delve into" every two seconds. Ping me if you hit specific Qwen errors during your run, I've probably banged my head against them already.
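Concretely, "make sure the vision modules are ignored" mostly comes down to filtering your LoRA target modules to the language stack only. A rough sketch of the idea (module names are illustrative; inspect your actual checkpoint with `model.named_modules()` before trusting any pattern):

```python
import re

def text_only_lora_targets(module_names):
    """Keep only language-model projection layers as LoRA targets,
    skipping anything under the vision tower. Name patterns are
    illustrative, not guaranteed to match every Qwen checkpoint."""
    wanted = re.compile(r"(q|k|v|o)_proj$|(gate|up|down)_proj$")
    return [n for n in module_names
            if "visual" not in n and "vision" not in n and wanted.search(n)]

modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "visual.blocks.0.attn.qkv",              # vision tower: must be excluded
    "vision_tower.encoder.layers.1.mlp.fc1", # vision tower: must be excluded
]
print(text_only_lora_targets(modules))
# -> ['model.layers.0.self_attn.q_proj', 'model.layers.0.mlp.gate_proj']
```

The resulting list is what you'd pass as `target_modules` in your PEFT/Unsloth LoRA config, so the vision layers never get adapter weights attached.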
randomrealname@reddit
I posted asking for some clarification separately before I saw you replied to this.
On the visual part, can I ask why you aren't getting the model to produce graphs etc. as part of its code, then feeding the images back in for analysis? That way you keep the V part included. I'm impressed; I hope you can reply to my other comment, I'd love to hear more of your insights.
randomrealname@reddit
Impressive, mind sharing your data acquisition process?
glenrhodes@reddit
Training on successful error-recovery traces is a really smart way to handle it. The throw-out-the-spirals approach makes total sense too. Most fine-tuning datasets assume clean runs but messy real-world data means your model needs to see what a good retry actually looks like. Curious whether you tried DPO on the failure cases or purely SFT on the winning traces.
Beginning-Window-115@reddit
Weird how your comments are getting downvoted. This is perfect for people with small GPUs, and you giving this out for free is amazing.
Competitive_Book4151@reddit
Hey, yeah - the "agentic" models often just stop and wait for prompts mid-workflow. Frustrating! Cognithor's designed to handle full end-to-end workflows autonomously, with built-in planning, code execution, and iteration loops (no external frameworks needed). If you're experimenting with LoRA/autonomy in data analysis, it might align with your setup. Just a heads-up: no tool-calling dead ends here.
GitHub: github.com/Alex8791-cyber/cognithor
mbrodie@reddit
I'm running qwen 3.5 35b Q8 on 2 x 7900xtxs at about 50 tps and currently have him running through Hermes agent with full honcho memory setup.
I was reading this post and like the sound of the project, so I'm genuinely asking: what would the benefit be to me to switch, and what kind of performance could I expect with the model I'm currently using?
Thanks in advance!
Awkward_Run_9982@reddit (OP)
Man, that 2x 7900 XTX setup is an absolute beast for running a 35B.
The main reason I train specialized 9B (and even 4B) models isn't to beat a generalist 35B one-on-one, but to achieve massive parallel throughput.
I recently had to analyze 10 years of financial reports in a single day, so I spun up 64 of these small models concurrently to do the heavy data crunching, and just used Claude for the final summary.
With your 48GB of VRAM, the benefit of switching would be turning your rig into a highly parallel 'swarm' that can process 4 or 5 different datasets simultaneously, rather than waiting linearly on one large model.
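The fan-out itself is the easy part; a toy sketch of the idea (here `analyze()` is just a placeholder for one full agent run against a local endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(dataset):
    """Placeholder for one complete agent workflow against a small local
    model instance; a real version would call the inference endpoint."""
    return f"{dataset}: done"

# Each worker drives its own dataset end-to-end instead of queuing
# behind one large model.
datasets = ["sales.csv", "grades.csv", "matches.csv", "prices.csv"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze, datasets))
print(results)
```

The same pattern scales to the 64-instance run: the bottleneck becomes how many model replicas your VRAM holds, not the orchestration code.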
mbrodie@reddit
Yeah, the project itself is super interesting honestly... you've done a lot with such a small model, it's quite impressive!
I've been doing some tuning on a 4B for character personalities and it's yielding some really promising results. The thing I like most about all this is there's just so much new stuff every day to explore!
My original comment was actually a reply to the cognithor post; I was curious about that project...
But this project is actually very interesting to me for its actual abilities, it's fantastic!
false79@reddit
This is super cool. I really like seeing more of these smaller models being able to specialize and therefore saving a lot of time while being able to be run on consumer hardware locally
Awkward_Run_9982@reddit (OP)
Couldn't agree more. The era of 'one giant model for everything' is shifting towards a swarm of small, highly-specialized local models.
Exact_Guarantee4695@reddit
the workflow-trace training approach is really interesting, makes total sense that training on full multi-step traces vs single instruction pairs would fix the stop-after-one-step problem. curious how it handles cases where the python code errors mid-workflow though, does it recover and retry or does it just spiral into repeating the same broken code?
Awkward_Run_9982@reddit (OP)
You hit the exact nail on the head. Because real-world datasets are inherently messy (random nulls, weird formats, etc.), hitting Python exceptions mid-workflow isn't just a possibility, it's guaranteed.
So building up that "self-correction" muscle was actually the main focus of the training. When the tool returns a traceback error, the model reads it and rewrites the patch instead of just halting.
As for how I prevented the infamous death spiral of repeating the same broken code... here is my dirty secret: I just threw those out. During the data curation phase, if a trace spiraled into an infinite loop of unrecoverable errors, I just dumped it in the trash. I only kept the traces where it successfully recovered. Turns out, if you strictly train it on successful "error -> fix -> proceed" patterns, it actually learns how to break out of the loop.
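The curation filter is honestly just a repeated-code detector. A simplified sketch (not my exact code; here a trace is reduced to the list of code strings the model emitted per iteration):

```python
def is_spiral(trace, window=3):
    """Flag a trace as a 'death spiral' if the model emitted the exact
    same code block `window` times in a row."""
    for i in range(len(trace) - window + 1):
        if len(set(trace[i:i + window])) == 1:
            return True
    return False

def curate(traces):
    # Keep only traces that recovered; dump the infinite-loop ones.
    return [t for t in traces if not is_spiral(t)]

good = ["load_csv", "plot_fails", "plot_fixed", "summary"]   # error -> fix -> proceed
bad  = ["load_csv", "broken", "broken", "broken"]            # unrecoverable loop
print(len(curate([good, bad])))  # -> 1
```

A real pipeline would also fuzzy-match near-identical retries, but exact repetition already catches the worst offenders.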
Creative_Bottle_3225@reddit
gguf?
Awkward_Run_9982@reddit (OP)
No gguf at the moment, haven't gotten around to quantizing it. Sorry!
denoflore_ai_guy@reddit
Ooooooo
stylehz@reddit
Damn what an amazing job! I will try out the smaller model.
Awkward_Run_9982@reddit (OP)
Thanks man! Let me know how it runs on your rig. If it does something stupid or breaks down, just let me know here or open an issue on HuggingFace