Llama3.2:1B
Posted by ranoutofusernames__@reddit | LocalLLaMA | View on Reddit | 96 comments
Llama3.2:1B on CPU and 8GB RAM
Great for code generation and one-off requests. It degrades in long conversations; the 3B, although a bit slower in that setup, handles longer chat history better.
cms2307@reddit
Incredible how fast we’ve come since the original ChatGPT launch. 1b models providing answers in the same realm of quality.
ranoutofusernames__@reddit (OP)
Absolutely crazy. Small models are AI for the masses. They’ll be running everywhere soon
vibjelo@reddit
Sadly in that case, the masses shall remain dumb
ranoutofusernames__@reddit (OP)
Why do you say that?
vibjelo@reddit
Because 1B models aren't really useful for anything besides simple autocomplete and similar.
So if the masses use those to educate themselves, we'll be as smart tomorrow as we are today.
Future_Might_8194@reddit
I use Llama 3.2 3B in a chain, and it's better than a one-shot from any model. You know what answers (for example) math questions faster and better than a large model?
A 3B RAG'd up to a calculator.
When you just load models up in a chat app, you're just getting the demo. Start putting agent chains together with outside data and tools, and suddenly an incredibly obedient 3B that doesn't confuse retrieved data with its training data is so much better.
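A minimal sketch of that pattern (all names hypothetical, and the small model's conversational step is stubbed with a template): the chain extracts the arithmetic, hands it to a real calculator that cannot hallucinate, and the model only has to phrase the verified result.

```python
import ast
import operator as op
import re

# Safe arithmetic evaluator -- the "calculator" tool that cannot hallucinate.
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

def calculate(expr: str) -> float:
    """Evaluate a plain arithmetic expression via the AST (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def answer_math(question: str) -> str:
    # Step 1: a tool router pulls the arithmetic out of the question.
    expr = re.search(r"\d[\d\.\s\+\-\*/\(\)]*", question).group().strip()
    result = calculate(expr)
    # Step 2: in a real chain, a small instruct model would phrase `result`
    # conversationally; here that call is stubbed with a template.
    return f"The answer to {expr} is {result}."
```

The point of the design is that the number in the final answer comes from the calculator, never from the model's weights.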
cms2307@reddit
Any specific numbers on how much better 3b plus a calculator is than large models without? I’ve been interested in this for a while but it seems like people really aren’t trying this setup, despite what looks to me like obvious advantages
Future_Might_8194@reddit
No. It's just that any model can hallucinate, no matter the size, but a calculator won't. A small, much faster model that is instructed to simply relay outside information in a conversational way will read a calculator more accurately than a large, slower model working the math itself and trying not to hallucinate.
cms2307@reddit
Would you say that small models without calculators are a reliable way to solve math problems? Let’s assume we’re doing basic calculus or something, can they get the answers right 50% of the time? 75%? 90%? I’m very interested to hear about this because I literally can’t find anyone else talking about it
Future_Might_8194@reddit
I mean, try it. I don't think it's ever twisted the answer for me if it's given the right answer from a calculator. I'm sorry, I don't have numbers. I have an agent chain I've been piecing together since Hermes was on Mistral.
treverflume@reddit
Can I do this with a 3B on mobile do you think?
he_he_fajnie@reddit
RAG, search, and summary are actually all you need. It doesn't have to know stuff; it needs to "think" and rephrase without hallucinating, and that's it.
tcika@reddit
It actually doesn't even need to think at all. LLMs have two main issues: hallucinations and an inability to reason. And if you use the model for RAG, its inherent knowledge becomes "toxic" and you really don't want it to fake your RAG data. So small models (like Qwen 2.5 3B or Qwen 2 VL 7B) are all you need. They do the job and they are cheap to host.
I have a custom use-case with a long-living multi-agent system and I found no real difference between smaller and larger models in terms of the end result. The reasoning part is done by a separate module with a bunch of external tools anyways.
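A rough sketch of that RAG pattern (names hypothetical; the retriever is a toy keyword-overlap stand-in for a real vector store): retrieval picks the grounding passage, and the prompt confines the small model to rephrasing it rather than drawing on its own weights.

```python
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Toy keyword-overlap retriever standing in for a vector store."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Build a prompt that keeps the model's own knowledge out of the answer."""
    context = "\n".join(retrieve(query, docs))
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you don't know.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

With a prompt like this, a 3B model's job reduces to rephrasing the retrieved text, which is exactly the regime where small models hold up.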
cms2307@reddit
Can you give me some more info about that second part? How do you work reasoning into your workflow
tcika@reddit
Let’s start with the fact that the entire system was written from scratch so don’t hit me too hard with your keyboards when I open source it :D
My system is essentially split into several semi-independent modules communicating with each other when out-of-domain actions are needed. One of these modules is what I call the "logic reasoning module", and it is essentially a bunch of narrow specialized agents serving as glue between the task and the bag of algos I found in the wild. One of its purposes is to apply formal logic to check whether the text given to it is correct, and to formally infer properties of some parts of the text (for example, if the text mentions a certain door, the system needs to ensure that the door, given its previously learned properties and a textual description, is indeed a door and has no undesirable properties such as being a broken door or a hard-to-open door). Another thing this module does is decision making. An agent generator creates state evaluation agents and all the other necessary entities from blueprints and then sends their actor references to the algorithm, such as MCTS for example.
But I gave up on making this module work as I wanted and came up with a reasoning habit module instead. That one is a meta-module that essentially keeps track of the entire set of system activities and tries to detect any sorts of patterns, and its sub-module then tries to create a “shortcut”. The thing is, these learned patterns have individual scores w.r.t. the skill they were made for. These patterns essentially compete with each other for the right to be used in their respective cases. Basically, a schizo form of a reinforcement learning approach.
There's much more to it than what I already described, but I'm too sleepy so nope. And yes, you don't really need large LMs for it to function, like at all. Yes, they will give you somewhat better results, but their cost is a big oof.
P.S.: I use a knowledge graph with a few extensions (like the one that resembles frames), and this graph also has a temporal component and simple node-level version control. I just ran out of hobbies and I really wanted to see how exactly my attempt to build all that would fail, so here I am :D
cms2307@reddit
This is really interesting, I think the part about applying formal logic to questions could be really good and should be explored more. Maybe a good way to do it would be to fine tune a model on either restructuring or labeling an input question using formal logic, but I really don’t know the specifics of fine tuning. Good work though!
tcika@reddit
In-context learning is more than enough most of the time, actually. And I'm not a fan of keeping tens of different networks, since it would impose many architectural and computational restrictions for the sake of minor improvements. And the reasoning habit module supersedes the logic module for most tasks after gaining enough experience :-) This one is a good find!
Although it cost me ~$350 (in API credits) to run it for an hour with Qwen 2 VL 72B and Claude 3.5...
Because it spawned ~13k instances of agent (actor) types in total. I'm glad that only 50 were able to run at the same time.
vibjelo@reddit
Have you actually tried 1B models? They can barely form coherent sentences...
cms2307@reddit
Do you see the post you're replying to? That's a 1B model doing more than just forming coherent sentences
qwesz9090@reddit
My experience with llama 3.2:1b was the same, it was pretty incoherent. But llama 3.2:3b seriously impressed me. Still incredibly small and it seemed usefully coherent.
Ok_Cow_8213@reddit
I'm no expert in all of this LocalLLaMA stuff, but in my experience smaller models tend to hallucinate more, refuse to reply, reply with something unrelated, or just reply with the same text that was in the prompt. And the smallest stuff I have tested has been 3B models. It's so bad for me that I really don't understand how people are finding them useful at all at this stage.
cms2307@reddit
They do hallucinate, but they're useful in certain situations where you don't care about 100% accuracy of information. I haven't tested 3.2 1B and 3B very extensively, so I can't say if they're actually at GPT-3.5's level, but conversation-wise they're definitely on par. I don't feel like I have to dumb down my prompts very much, as opposed to something like TinyLlama from way back when.
JFHermes@reddit
But surely in coding situations you do want 100% accuracy. Who wants to sit around trying to get a small model on track? You would just code it yourself at that point.
Other stuff I totally get but coding seems like a poor use case for a small local model.
ggone20@reddit
Do you (or any other human on earth) code with 100% accuracy? No.
That said, the small models are really good at things like summarizing or rewriting in different tones, or taking in context and making inference on the input - think a calendar and ‘what time is my meeting’ or a sales report and ‘how much revenue last quarter’. Or think about realtime conversation advice/coaching when paired with STT where it listens to your conversation and warns of any non-factual comments or biases. Etc, etc, etc.
There are TONS of valuable uses for AI on the edge that don’t require ‘100% accuracy’ as that statement doesn’t even mean anything lots of times. Not only that but 3B can still do function calling, which makes it superhuman anyway.
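As a sketch of what function calling with a small model looks like (tool names are hypothetical, and the model's reply is shown as a literal string rather than a real inference call): the model emits a JSON tool call, and the host code parses and dispatches it.

```python
import json

# Registry of tools the model is allowed to call (hypothetical examples).
TOOLS = {
    "get_meeting_time": lambda calendar, title: calendar.get(title, "not found"),
}

def dispatch(model_output: str, calendar: dict) -> str:
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(calendar, **call["arguments"])

# A 3B model prompted with the tool schema might reply with something like:
reply = '{"name": "get_meeting_time", "arguments": {"title": "standup"}}'
```

This is the "calendar and 'what time is my meeting'" case above: the model only has to pick a tool and fill in arguments; the actual lookup is exact.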
It’s amazing Meta gives these away for free. Insanity.
draeician@reddit
Can someone give me some examples of when you don't care if the Model is accurate? The only thing I can come up with is Fiction Writing, but even there if you've outlined something you'd want the model to still be accurate to your outline. You wouldn't want the protagonist changing to an alien race, or switching planets, or changing from a rock star to a hermit in the span of a sentence.
ggone20@reddit
You're mistaking 'accurate' for 'factual'. I gave you three perfect examples. Humans aren't 100% accurate or factual, and you work and talk with them, right?
AardvarkFuture4165@reddit
tbh I would say your examples are correct... basically easy lookups that are simple to answer correctly, simple recalls where the answer is plainly there, no need for a big model
Wild_King4244@reddit
What models did you try?
Ok_Cow_8213@reddit
One I can remember off the top of my head that was especially bad is Mini Orca 3B
Various-Operation550@reddit
Dude these models in our field are like saying “i tried computers in 1987 - nothing special”
ConObs62@reddit
1987 was a good year for computers... the obviousness of their utility far exceeded these autocompletion tools
Various-Operation550@reddit
No
TechnoByte_@reddit
That's an ancient model, llama 3.2 3B and Qwen 2.5 3B are much, much better than that
nixed9@reddit
released June 26, 2023
crazy pace in this field
crappleIcrap@reddit
It is true, though. The naysayers have only focused on the doom graphs of increased power and computation of the largest models, saying it outpaces compute. In reality, models of all sizes have become better; so much work has been done that a 1B-parameter model today makes last year's 1B-parameter model look like Cleverbot.
2016YamR6@reddit
I use 1B and 3B models in my chain of prompts for the intermediary decisions that need to be made, so I don't have to make as many calls to the API or load a 34B model
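A sketch of that routing pattern (everything here is hypothetical; in the real chain the small local model would make the call, and a keyword heuristic stands in for its yes/no answer): a cheap model classifies each step, and only hard steps go to the big model or the API.

```python
def route(task: str) -> str:
    """Stub for a small local model deciding if a step needs the big model.
    A keyword heuristic stands in for the 1B/3B classifier's answer."""
    hard = any(w in task.lower() for w in ("prove", "refactor", "analyze"))
    return "api-34b" if hard else "local-3b"

def run_chain(steps: list[str]) -> list[tuple[str, str]]:
    """Assign each intermediary step in the chain to a backend."""
    return [(step, route(step)) for step in steps]
```

The win is purely economic: most intermediary decisions are cheap, so the expensive backend is only touched when the router says so.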
MINIMAN10001@reddit
I should experiment more because I hear this same thing across llama 1b to 8b depending on the particular one shot being asked
Zealousideal-Ask-693@reddit
What are you using to host the LLM? The only local hosting tool I’ve seen is GPT4ALL but I’d like to find something easier to fine tune and custom train.
ranoutofusernames__@reddit (OP)
Ollama + PeerJS
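For anyone curious what talking to Ollama locally looks like, here is a minimal sketch against its REST API (the model name is just an example; actually sending the request assumes an Ollama server running on the default port):

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "llama3.2:1b") -> urllib.request.Request:
    """Build the POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With an Ollama server running, this would return the completion:
# resp = urllib.request.urlopen(build_request("Write a haiku about RAM"))
# print(json.loads(resp.read())["response"])
```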
Apgocrazy@reddit
Dope!!! you gave me some inspiration
Over-Dragonfruit5939@reddit
What UI are you using?
ranoutofusernames__@reddit (OP)
Something I wrote for the device
Over-Dragonfruit5939@reddit
Nice it looks awesome
masteryoyogi@reddit
Did the code work?
ranoutofusernames__@reddit (OP)
Haven’t tested this specific one yet but I’ve been using it to code in JS this whole week. Pretty good, everything has worked so far.
ButterflySpecialist@reddit
What is the accuracy percentage of the code snippets? Have you figured that out yet/ how do you figure that out lol.
Orolol@reddit
From what I can read, it should work, yes
ventilador_liliana@reddit
It's amazing, and at Q4_K_M it's also very good in Spanish, all in only 800MB
Obvious-Theory-8707@reddit
What is the UI you are using ?
ranoutofusernames__@reddit (OP)
It's a UI I built for the device!
No-Ocelot2450@reddit
I've used a bigger version on a 6GB GPU (even 4GB suffices) using LM Studio or llama.cpp directly. It's not fast enough to use in any "production" task, but acceptable for personal use.
But in terms of generalization capabilities, 3.1 and 3.2 are not impressive: a lack of comprehension and overall logic in the smaller versions. Gemma 2, Qwen 2.5, and even the latest Microsoft Phi are better
ranoutofusernames__@reddit (OP)
Definitely agree. I wouldn’t recommend using stuff it spits out for production. For the average joe though, it’s very good. Especially 3B, at least in my opinion has been a good model to quickly ask about random things, debug etc… The plan is to run “standard” models on a GPU based device but obviously it’ll be way more expensive and larger in size.
Perfect-Campaign9551@reddit
I can run 3.2 1B on my phone....
ranoutofusernames__@reddit (OP)
What phone do you have?
Perfect-Campaign9551@reddit
Moto G 5G 2024. 3.2 1B runs at about 4 t/s using the PocketPal app
cerchez07@reddit
what is this ui you are you using?
ranoutofusernames__@reddit (OP)
Something I made for my AI device project
Thinking about adding the ability to run the code, but I'm not sure people will want that since there are full-featured IDEs.
MoffKalast@reddit
Wrong, the Pi 5 is manufactured in Wales. :P
ggone20@reddit
Yea but the case is 3D printed and components assembled here! Lol
MoffKalast@reddit
I once worked with a company that made their entire product in China, but then sent them to HK where they only uploaded the software so it could be technically labelled as "Made in HK" and get around import restrictions.
The regulators were seemingly totally fine with it so I guess OP is in the clear, haha.
TheOwlHypothesis@reddit
I want hardware that "crystalizes" an LLM, in other words it can only run as the LLM that was flashed to it. I can imagine a dedicated piece of hardware would have performance gains. It would be good for a project like this and all local LLM enthusiasts.
Although I could also see no one doing this because of the cost and inflexible nature of it. I'm not even sure it's possible.
Mescallan@reddit
Veritasium had a video a few years ago on a startup that converted NAND flash modules into analog neural networks.
Analog is the future, but we need to reach a capabilities plateau before it's reasonable to hardcode weights
swagonflyyyy@reddit
I feel like an ASIC would be what you're looking for.
el_isma@reddit
Like an FPGA? But AFAIK they don't have enough RAM (unless you want to run something tiny)
TheOwlHypothesis@reddit
Not exactly. I'm not a hardware person so IDK what exactly to call it. But I imagine it would be a special class of hardware that is similar to a GPU but "hard coded", or I guess hard wired in this case in a way that the LLM weights are the only thing that it runs.
MidnightHacker@reddit
Isn't that kinda what Groq is doing right now?
my_name_isnt_clever@reddit
I think an ASIC might be the idea you're looking for. There are some attempts, the issue right now is that everything is moving so fast it's very risky to hard commit to the transformers architecture when there is a high chance we end up with something better.
ranoutofusernames__@reddit (OP)
That’s my goal for the next next version. Not only dedicated model but dedicated board too. Ground up designed to be lightweight.
That being said, building on a popular platform is very important for this stage for many reasons.
mr_happy_nice@reddit
That's a pretty tasty UI there partner. I love your spacing.
LilaSchneemann@reddit
Maybe Llama 3.3 can teach devs how to waste even more screen space.
ranoutofusernames__@reddit (OP)
Thank you!
Different-Effect-724@reddit
Hey, great taste on the UI. Did you make your own or is this an open-source package I can find?
ranoutofusernames__@reddit (OP)
Hey! Working on it, doing some documentation and cleaning up some code. Some minor things to add. If you put your email on the site, I’ll send an email update when it’s on GitHub. I’ll most likely post it here too though so you don’t have to
upquarkspin@reddit
21.63 t/s on iPhone 13!!!
dazld@reddit
How did you run on iPhone? Have had very little luck using apps so far.
upquarkspin@reddit
https://apps.apple.com/app/id6502579498
punkpeye@reddit
The UI looks interesting. Reminds me of concept art from sci-fi movies.
330d@reddit
Sure, I'll create a neural network in Python!
import neuralnetwork ....
princetrunks@reddit
I should put this on my pi5
my_name_isnt_clever@reddit
When I first used GPT-2 in AI Dungeon it blew my mind and felt like the future. But it was running from some data center somewhere, it was still out of reach. Now we can run better models on a Raspberry Pi. I love technology.
Hungry-Loquat6658@reddit
this UI looks cool
ranoutofusernames__@reddit (OP)
Thanks!
RealBiggly@reddit
If you want to really impress me, ask it to create a simple click-n-play installer for that GUI, for Windows?
I bet ya can't! Betcha?
And I bet you couldn't add lorebooks and character creation to it, with character images n stuff, using normal GGUF files from the same directory as my other apps, I'm betting that's WAY beyond its means...
Like totally?
;)
StyMaar@reddit
LM.rs has a desktop GUI (but there's no pre-compiled binary AFAIK, you'd need to compile it yourself)
RealBiggly@reddit
I use Backyard.ai and was jus' teasin' the fella, but yeah that's a nice GUI...
ranoutofusernames__@reddit (OP)
Heading that way. Already have an electron version for v1 that can be ported to all platforms.
Everything else you mentioned, coming very soon ;)
gami13@reddit
why electron? just use native winui3
ranoutofusernames__@reddit (OP)
True, eventually 100% that’s the goal. But between doing CAD, procurement, shipping, coding and everything else, it’ll take time so having a single codebase for all platforms using electron will be a good stopgap until all native releases. Trying to get this in the hands of as many people as possible as fast as possible.
RealBiggly@reddit
\o/ I like you already! :D
EastSignificance9744@reddit
I run a 13B on my 16GB RAM CPU at that speed. Why's this so slow?
ranoutofusernames__@reddit (OP)
Which CPU? This is running on a Pi
EastSignificance9744@reddit
oh makes sense, I'm on a i7 ice lake
ranoutofusernames__@reddit (OP)
Ah yeah, that’ll do it.
synw_@reddit
Impressive. I've never seen a 1B that can output acceptable code, apart from DeepSeek 1.3B
Expensive-Apricot-25@reddit
thats a crazy UI, it looks so cool
RandiyOrtonu@reddit
damn