When are we getting consumer inference chips?
Posted by SnooStories2864@reddit | LocalLLaMA | View on Reddit | 133 comments
Dumb question but I genuinely don't get it. Billions of $ poured into AI startups the last few years and nobody has shipped a consumer chip with a model built in? Like a $200 stick that runs Llama 3 at reading speed, 30W, plug into your desktop, done.
Taalas is kinda doing this but only aimed at datacenters. Why tho? Today's OS models are already good enough for 90% of what most people actually need and will still be for years. The "model will be obsolete before the chip tapes out" argument feels weaker every month.
Starting to wonder if the whole industry is just trying to milk consumers through API subscriptions forever instead of selling the chip once. Feels like it would be trivially profitable to ship a $300 "Llama in a box" and call it a day but I guess no one wants the recurring revenue to stop.
What am I missing
silentus8378@reddit
because reading speed isn't enough. Based on my usage, I need enterprise-grade GPUs.
SexyAlienHotTubWater@reddit
Weird answers. Taalas literally has a chip like this that you can see here:
https://taalas.com/products/
And chat with here:
https://chatjimmy.ai/
It gets 20,000 tok/s, Llama 8b.
Last I heard (a few weeks ago) they're developing the next chip with a better model baked into it, I believe one of the Qwen 27b models. But I wouldn't be surprised if they're holding off for now given the rate of model releases.
synn89@reddit
Holy crap. I was getting 15,595 tok/s having Llama 8B write short stories. I know it's only Llama 8B, but that's still a hell of a lot of intelligence available basically instantly.
There's a lot of code-logic routing you could do with a card like that in your data center. I'd prefer something that could run a 27B though. Imagine even 5k tok/s on Qwen 3.6 27B available locally. Insane.
Middle_Bullfrog_6173@reddit
That's a 2.5kW server not a 30W dongle though.
SexyAlienHotTubWater@reddit
No it's not. Look at the power input to the card - it's a PCIe 4-pin adapter. You physically can't push 2.5 kW into a port like that, and you can't dissipate 2.5 kW from a GPU die.
The power draw is almost certainly for a server with multiple of these cards.
Middle_Bullfrog_6173@reddit
I was just going by what your link says, but 10x to a server sounds plausible. Still quite a bit larger even for a quantized 8B model.
SexyAlienHotTubWater@reddit
Yeah their messaging is terrible lol.
If you think about it in terms of tokens per watt (I am bending the numbers a bit since you can't map it directly), each watt gets you 68 tok/s, so in a sense it's much more power-efficient than a 30W dongle would be.
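A quick sanity check on that figure. The assumption here (mine, not Taalas's) is that the ~68 tok/s per watt comes from per-card power of roughly 295 W, while the 2.5 kW number is for the whole server:

```python
# Back-of-envelope check on the tokens-per-watt claim above.
# Assumption (not from Taalas): ~295 W is a single card's draw;
# 2.5 kW is the whole server.
card_tps = 20_000          # quoted Llama 8B throughput
card_watts = 295           # assumed single-card draw
server_watts = 2_500       # quoted server power

per_card_efficiency = card_tps / card_watts      # ~68 tok/s per watt
per_server_efficiency = card_tps / server_watts  # 8 tok/s per watt

print(round(per_card_efficiency), per_server_efficiency)  # 68 8.0
```

Either way you slice it, the per-watt throughput is far beyond what a 30W dongle at reading speed would deliver.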
Middle_Bullfrog_6173@reddit
Yeah, but I suspect a lot of that efficiency comes from parallelism. If you are trying to run a single user request per device it may not look so efficient any more.
But impossible to know since we really have just marketing material for a demonstration product.
Independent_Plum_489@reddit
You’re massively underestimating memory bandwidth requirements.
SnooStories2864@reddit (OP)
Happy to understand that constraint better, to be honest. Wouldn't a small OS model baked into silicon counterbalance this requirement? At least to some extent?
PurpleWinterDawn@reddit
Memory speed is the name of the game with token generation.
For prompt processing, all the tokens are already available, so the tokens are processed in batches, layer by layer. This lowers the amount of bandwidth needed.
For token generation, when a token is generated, it is then passed through ALL the layers again. So each new token sees all the layers to determine how it relates to the previous layers (the Attention mechanism).
So let's say you have a 7B model in Q8 format for simplicity. That's 7GB of data that needs to travel through the RAM bus for each generated token.
If you have a RAM bus with a bandwidth of 700GB/s, that's a maximum throughput of 100tps at the start. Then each added token adds some processing overhead to do the whole Attention mechanism over all the previous tokens.
What you need isn't just a "chip", it's a full computer that contains lots of memory and the capability to run a specific model. Even if the model is baked in the chip, the memory constraints are just too important. And it gets even worse if you don't plan on attaching the memory to the chip, then you're relying on the communication port's bandwidth... Ugh.
Furthermore, the KV cache grows linearly with context length (it's the attention compute that grows quadratically). For instance, the LFM2-8B-A1B model has a training window of 32768 tokens, and when I tell it to use that, it requires about 300MB of memory. Sure, that's not much by today's standards. If it could handle four times more tokens in its context window (131k), it would need about 4 × 300MB = 1.2GB of memory, on top of the weights. It adds up fast.
For reference, an NVIDIA RTX 5090 has 1.8TB/s of bandwidth, while the AMD Radeon AI Pro R9700 has 640GB/s. PCIe 5.0 x16 offers 128GB/s. The fastest USB interface (3.2 Gen 2x2) tops out at a "measly" 1.875 GB/s. Thunderbolt 4 is purported to achieve 5 GB/s, but since it's PCIe underneath I'd expect some encoding shenanigans eating into that, so call it 4GB/s in practice. Nowhere near what a GPU can do, to start with anyway.
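The back-of-envelope above can be written down directly. The numbers are the ones from this comment, and the formula deliberately ignores KV-cache reads and attention overhead:

```python
# Bandwidth-bound ceiling on decode speed: each generated token streams the
# full weight set through memory, so max tok/s ≈ bandwidth / model size.
def max_decode_tps(model_gb: float, bandwidth_gbs: float) -> float:
    """Upper limit on tokens/sec, ignoring KV-cache reads and attention cost."""
    return bandwidth_gbs / model_gb

print(max_decode_tps(7, 700))    # 7B @ Q8 on a 700 GB/s bus -> 100.0 tok/s
print(max_decode_tps(7, 1800))   # RTX 5090-class bandwidth  -> ~257 tok/s
print(max_decode_tps(7, 1.875))  # USB 3.2 Gen 2x2           -> ~0.27 tok/s
```

The last line is why a USB stick form factor is hopeless unless the weights live on the stick itself: over the wire you'd get well under one token per second.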
ShelZuuz@reddit
KV cache.
Long_comment_san@reddit
HBM memory seems to be the crux. We need something like 64 gb HBM memory as a "booster card" and that will be our end goal for home use (and DDR6 with 256 gb sticks).
So speaking realistically, this will come in a decade or so.
SnooStories2864@reddit (OP)
So you are saying computing is not the issue but it's all about very high end ram?
Aphid_red@reddit
Yep, it's all about memory and memory bandwidth. You can't stack enough 3090s to run big models without triggering your power breaker.
Yet you could theoretically easily run a 1T MoE model on a 5070 or 5080 for one user. But it won't fit in its memory.
Even just AMD releasing a GPU comparable in memory to the 6000 pro at the 2000-3000 price range would already be a big step. Doesn't even have to be HBM, could be LPDDR or LPDDR/GDDR or high capacity GDDR. It just needs a fast connection to the card, PCI-e is too slow, so it needs to be integrated.
kaisurniwurer@reddit
That's not true. Prompt processing speed is pretty much purely compute-bound, and with the recent "agentic" wave it is becoming more and more important: processing time can be longer than inference time even on fast cards, especially with SWA models.
Long_comment_san@reddit
In home use we are (artificially) limited by VRAM capacity, yes. DDR6, which should easily reach speeds around 18000 MT/s, would be perfectly suitable for running a 300B MoE. But then there's the issue of VRAM: 16GB is $600, 32GB is $3000, 48GB is $5000.
BuildAQuad@reddit
Didn't Intel just release the B70 with 32GB for $950?
Long_comment_san@reddit
Support is terrible for now; the R9700 is a better buy for sure, but again, it's a lot of work compared to CUDA.
BuildAQuad@reddit
True. At the price points I see here locally, I prefer the 5090 since it has GDDR7.
Important_Coach9717@reddit
So keep investing in Micron … gotcha!
Long_comment_san@reddit
did you see micron valuation? better to invest in literally anything else
Important_Coach9717@reddit
Dude what are you on about ???
i_am__not_a_robot@reddit
You've answered your own question.
The industry "vision" is that you'll rent everything and own nothing.
no-name-here@reddit
No:
i_am__not_a_robot@reddit
That's true, but the average traveler in 1890 would only drive a few yards in an automobile.
no-name-here@reddit
Is the argument that Microsoft, Apple, and most other PC and smartphone makers won't continue forcing even more AI hardware onto all consumers in the year(s) to come?
moofunk@reddit
I don't think 1 and 2 are functions for selling inference chips any more or less than running games are functions for selling GPUs. Also 1 and 2 will change a lot in the coming years.
Tensor hardware is a fundamental function for generalized AI over just running LLMs.
no-name-here@reddit
moofunk@reddit
I think, again, your idea of "forcing" tensor hardware onto users isn't any different than gamers being "forced" to buy GPUs or buying gaming PCs that have GPUs built in or phones or tablets with powerful GPUs to run games.
It comes with the territory.
I think there is a much greater nuance to it. You can't run the huge frontier models on local hardware, and probably never will be able to, hence you must subscribe to a service to use it.
Then also, small models are becoming useful enough to run on that local hardware that is "forced" on users. Those models will eventually become necessary to interface with the hardware.
Inference is inference. It's just tensor math, applicable to LLMs, speech/image recognition, computational photography, fuzzy logic, OCR, generative AI, any kind of neural network.
Build a chip that performs tensor math at high performance, and you have an AI inference chip that can do all of the above.
It's as ubiquitous as using a GPU to render 3D.
no-name-here@reddit
I think that's the big difference - GPUs have been around for decades, but it isn't just AI enthusiasts ("gamers" in your analogy) whose new devices are getting AI hardware. Even if Grandma doesn't want to use AI, it's likely her next device is going to include a significant amount of AI-specific computing capability whether she wants it or not.
But the original commenter's point was that this is because the "industry 'vision' is that you'll rent everything and own nothing", which is complete misinformation or disinformation about whether and why you'll be able to run AI models (and huge models specifically like the ones you mentioned?) on a local device, if someone specifically wants to, agreed?
moofunk@reddit
I think that assumes that grandma thinks that AI delivers low quality work, is a burden and reduces performance of the device.
I think in most cases, grandma will be completely unaware that the device is using tensor hardware for tensor calculations to perform AI functions. I purposely use "tensor calculations" rather than "AI", because the fundamental function for AI hardware is generic tensor calculations, not the whimsical task of counting the Rs in "strawberry". What the chip may do is do hardware level power management or OS-level process management next to providing better image/audio compression and quality for Teams meetings. Eventually, such tasks may be as ubiquitous as using an FPU for floating point math, which once was optional hardware you had to purchase.
Eventually, everyone was "forced" to buy an FPU, but nobody really noticed.
Thus, the discussion on "forcing AI hardware onto consumers" doesn't make sense to me any more than you are forced to use a device with more than 512 MB RAM to run a modern OS or forced to use a GPU to run any modern game.
Agree on that.
mellenger@reddit
Is that because they used to walk to the car wash?
i_am__not_a_robot@reddit
Yes.
SnooStories2864@reddit (OP)
Yes, totally agree. However, I'm still wondering why no one is disrupting this status quo given the potential of having AI at home... it's not like you can't have some optional subscription services attached.
CorpusculantCortex@reddit
The real answer is that the industry moves too fast and there hasn't been time. Development starts today on a Qwen 3.6 chip; go-to-market is likely 1.5 years MINIMUM for an off-the-shelf consumer product, and that's generous. By that point we are potentially 2+ model generations down the line. Something like this can't be updated.

There is also the fact that the market is small. We already have relatively affordable GPUs that can run a small model. Why buy something that is locked in when you can get something flexible, particularly given the above? The number of people who want a dedicated single-model card is much smaller than those who want GPUs.

$200 is absurdly impossible. The R&D and production costs aren't that different from GPUs, except less established. This will not be that cheap, which reinforces the above.

And the biggest thing: Taalas and others have been working on solutions like this, but they are still in development, because open-weight LLMs that are actually useful have only been around for a few years in the public's mind. So things are in development, but there is an obvious lag before go-to-market, and PoCs/MVPs have been shown; they just aren't marketable yet because of old models.
gh0stwriter1234@reddit
Taalas has a turnaround of like a month or something crazy like that, because they basically have a predesigned framework that the AI model gets compiled into for the hardware ASIC version of the model.
I expect their hardware to cost a lot though, because it's supremely fast... I mean, if they put Qwen 3.6 in their ASIC it would be insane...
upsidedownshaggy@reddit
Because the capital required to do that is tied up in the "rent everything, own nothing" scheme. This has been the modus operandi of VC investing for decades now, especially in the tech scene.
Far-Chest-8821@reddit
Your stick becomes outdated in a month. Very niche product. High customer acquisition cost and slow onboarding.
Customers would need to know about your product, you'd need to ship globally, and the software needs to work on Linux, Windows, etc. Then there's waiting 3 weeks for it to arrive after customs. All for $200.
Vs. Oh an API for 1$ to play with now.
kaisurniwurer@reddit
Sounds like a great business plan.
Likeatr3b@reddit
This is coming to Macs. I’m building an app that takes advantage of this for consumers to use local (optionally cloud) AI on Apple machines. The mainstream view is that Apple has failed here, but I’ve seen proof they haven’t. They’re just extremely patient
mesasone@reddit
Because the demand for the high margin data center products leaves little capacity left over for low margin consumer products.
pab_guy@reddit
No, this is conspiracy brained nonsense. The only question is whether such a product would be sufficiently profitable. No one can stop a company from building something like that. TSMC will gladly take your money if you can get the capital together.
Luneriazz@reddit
as expected....
ZenaMeTepe@reddit
And with large corporations, they might own the product but if it is a black box they still need to be "subscribed" to maintenance.
uutnt@reddit
Data centers get far higher utilization out of the same hardware than a consumer running a model a few hours a day, so they outbid everyone for both logic silicon and HBM. The price of producing such an ASIC would not make sense.
Not everything is a conspiracy.
Innomen@reddit
You're basically asking why the bank that rules the english speaking world amounts to an evil AI deployed by cultists. That's the end of the logic chain starting from your question and then adding the obvious follow up questions to every accurate answer received. Sounds crazy, isn't.
New_Alps_5655@reddit
I'm predicting Google will buy Taalas and put their tech into pixel phones first, then later other devices.
gh0stwriter1234@reddit
Because you can already do what you're asking for with a regular PC... wasting even more silicon on such a lackluster product would be dumb. That's exactly why Taalas is doing what they're doing...
Frankly I'm fine with this so long as competition between model companies and hosting remains good.
Puzzleheaded_Local40@reddit
Funny thing is GPUs aren't even the ideal solution for half of this but money has to money in order to money for those with money.
psxndc@reddit
You and I must get different Reddit ads, because I get served “AI in a box” ads all the time.
FullOf_Bad_Ideas@reddit
Tiiny did this.
iansaul@reddit
Tiiny AI Pocket Lab: The First Pocket-Size AI Supercomputer by Tiiny AI — Kickstarter https://www.kickstarter.com/projects/tiinyai/tiiny-ai-pocket-lab
All these comments, nobody pointed out this "little" ($$$$) guy?
Or maybe I just missed it - first thread of the day.
SunstoneFV@reddit
You seem to be the first. Probably because the device was crowdfunded, plus the cost, people have largely forgotten about it. I vaguely remember people being skeptical when it was announced, or expecting it to be far cheaper than it is.
SkyFeistyLlama8@reddit
We're not. NPUs can run smaller models like 4B and below but they're useless for larger models. You still need lots of crunching power through CPUs or GPUs for the MOEs that make it worthwhile to run local inference.
I can't go back to using small lobotomized 7B models when a 35B or 80B is just so much smarter. Unified RAM is good for that.
Dell was supposed to ship a Qualcomm AI 100 discrete NPU with some workstation laptops last year but I haven't heard anything after the initial announcement. I think the AI industry is focusing either on NPUs for simpler on-device tasks or big iron GPUs for cloud LLMs, with no room in the middle for running smarter workloads on device.
VickWildman@reddit
Phones, tablets and notebooks with Qualcomm chips have NPUs with all the bells and whistles and a llama.cpp backend; you can knock yourself out running larger models if you have enough memory. I have a OnePlus 13 with 24 GB, it does its thing using a few watts, and I'm running Gemma 4 26B like a champ (if a champ runs it at Q4_0 and 8K context, but give me a break, it's a phone).
SkyFeistyLlama8@reddit
Not on X Elite laptops. Nexa AI SDK used to be able to run smaller 7B and 4B models on NPU but now the whole thing's been nerfed after the company was acquired by Qualcomm. I can't even get a new license key to run NPU models.
I prefer using the X Elite CPU and Adreno GPU for much larger models. Qwen 35B and Gemma 26B are usable with CPU inference but they make the fans spin like jet engines.
boutell@reddit
The product you want is called a MacBook. No seriously, a MacBook with 16 GB of RAM can run a current Gemma model in the ~4GB range, which will probably outperform Llama 3.1 due to general improvement since then. And no, it won't be out of date as soon as it's made.
I think a better question is why other laptop manufacturers are not shipping unified RAM and serious competitors to the lower end M chips. That comes down to foresight, which they and Intel did not have.
But we're all assuming these models are actually adequate for a typical person's needs. What are a typical person's needs? At first it seems like mostly a better Google. Yes, these models can give you that when given a search tool. But then people start asking for medical advice, and if they don't find it precisely in their searches, the models start confabulating. The larger cloud models are significantly better at this. But of course no model is perfect at it.
Maybe that's another reason not to sell this dedicated hardware. If it gives bad advice, it's very clear who to sue. Laws regarding the quality and safety of physical products may be more strict.
Zyj@reddit
The cheapest GPU with >1200GB/s memory bandwidth is the RTX5090 at 3400€.
SnooStories2864@reddit (OP)
Does it have to be a GPU tho?
Zyj@reddit
If you want good performance …
SnooStories2864@reddit (OP)
I mean, GPU is a generalist way (less so now obviously) to run inference but we could get specialized chips, don't you think? Just like decades ago CPU would render GFX but as it was extremely inefficient, GPU took off?
Orolol@reddit
GPUs are already quite specialized in matrix multiplication, which is the core of the inference compute. The other intensive operations depend on memory bandwidth. The only way to make it more specialized is either to commit to a certain architecture, which makes it model-dependent, or to use HBM, which is very expensive. This is what Cerebras and Groq did, and it's only efficient for datacenters.
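To make the "matrix multiplication is the core" point concrete, here's a rough FLOP count for one decode step of a dense transformer. The layer shapes are illustrative (roughly Llama-8B-like), not taken from any specific model card, and GQA/norms/activations are ignored:

```python
# Per generated token, a dense transformer does roughly 2 * N FLOPs for
# N parameters, and essentially all of it is matrix-vector multiplies.
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 128_000

def matmul_flops_per_token() -> int:
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 3 * d_model * d_ff       # gate, up and down projections
    # factor 2 = one multiply + one add per weight; plus the LM head
    return 2 * n_layers * (attn + mlp) + 2 * d_model * vocab

print(f"{matmul_flops_per_token() / 1e9:.1f} GFLOPs per token")  # 16.6 GFLOPs
```

That ~16.6 GFLOPs ≈ 2 × 8B parameters, which is why "a big matmul unit next to fast memory" describes both a GPU and every inference ASIC so far.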
BuildAQuad@reddit
Maybe, but it sucks when the hardware isn't compatible with architecture changes. Who knows what changes will be made in the future.
ShelZuuz@reddit
The problem is the memory rather than the GPU itself. That need doesn't go away with an ASIC.
"Hey, I can sell you something that's 75% of the cost of a 5090, but you can't do any gaming or graphics processing on it" won't have much consumer appeal.
rosstafarien@reddit
They're in your phone.
Current generation phones can run useful models using the onboard TPU. And every year, the performance of phone AI processing has increased by at least 50%, so if it's not enough for your use case this year, it will be soon.
The other awesome trend is better and better small models. Qwen3.5, Gemma4, a few others are doing more and more with ~2b weights.
Mister__Mediocre@reddit
GPUs become very expensive very quickly, so they're targeted towards the people who're willing to pay. But any GPU is a consumer GPU if you're willing to pay enough, i.e. you can get 48GB cards for a few thousand dollars.
Consumer inference chips will not be GPUs, but instead hardware very customized to running a specific model. There are many startups trying this, but the nature of hardware is how slow it is to iterate. I'd say give it 5-10 years before such a thing becomes practical for the consumer. Taalas is a very good example, but they'll need time to build up scale and reduce costs enough for a consumer offering.
I personally consider it inevitable that in a few years, we'll have a bunch of devices getting shipped with very capable local models. That's already what Microsoft and Apple have in mind, but it'll take time for their goals to be realized. At some point, they'll commit to some model being "it", and then iterate hardware to be brutally optimized for that model. But neither is willing to make that commitment just yet.
SnooStories2864@reddit (OP)
hardware is slow and expensive, sure. But money is just flowing in at a stupidly stupid rate. We have had cryptomining asics with yearly iterations from startup with way less funding. Am i missing the AI inference consumer asics?
LowerEntropy@reddit
Bitcoin is based on SHA-256. That's 256 bits of state; the circuits needed to calculate it are tiny, the problem is embarrassingly parallel, minimal storage and memory are needed, and the algorithm has been locked down for years. Even the tiniest AI models need gigabytes of memory, gigabytes of storage, and their structure is still changing constantly.
You're missing a very basic understanding of how both crypto mining and AI-model inference works.
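The scale gap is easy to put numbers on. The byte counts below are rough, and the 8 GB figure assumes an 8B-parameter model at 8-bit weights:

```python
import hashlib

# A SHA-256 mining core works on about a hundred bytes of fixed-function
# state; an LLM streams its entire weight set per generated token.
sha_working_set = 32 + 64        # 256-bit hash state + one 512-bit block
llm_working_set = 8 * 10**9      # ~8 GB: an 8B model at 8-bit weights

print(llm_working_set // sha_working_set)  # 83333333: ~83 million times larger

# The fixed, frozen function that mining ASICs bake into silicon:
print(hashlib.sha256(b"example block header").hexdigest())
```

A mining ASIC replicates that tiny fixed circuit thousands of times with no external memory; an inference chip has to attach gigabytes of fast DRAM, which is exactly the cost driver discussed elsewhere in this thread.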
wotoan@reddit
And even then you’d have hardware become totally obsolete on the next node shrink and it was still a race.
Mister__Mediocre@reddit
Every tensor core or neural core you see advertised is precisely this, though: a regular GPU with an AI ASIC tacked on. The problem is that things are still very fluid, so nobody wants to commit to anything beyond a matrix multiplication unit just yet. We'll have to enter the diminishing-returns phase to give manufacturers the confidence to do more.
The equivalent of the dominant crypto ASICs is Google's TPU. I haven't been following crypto closely, but surely all the dominant ASICs are placed in datacenters and not households? Anyway, I think the one big difference is that crypto had a pre-determined workload that everybody could optimize for, while AI moves fast; if you make hardware tweaks assuming the current state of affairs, you may entirely miss out on a new software optimization released a year from now.
Middle_Bullfrog_6173@reddit
I'm not saying it won't happen but it would be a quick way to produce e waste. Any model from 12 months ago is obsolete with the speed things are going.
ResidentPositive4122@reddit
Obsolete as in raw performance increase? Sure. But there comes a point where something is "good enough" for certain tasks, and will continue to be "good enough". That's where model on a chip might work. Especially if it comes as a hardware extension that you can plug into your laptop/workstation/etc. You can have the "latest" model do some high-level stuff, and leave the tedious repetitive tasks to the "good enough" model. W/ some guidance it could work.
Middle_Bullfrog_6173@reddit
At some point yes, but I don't think we are there yet for any task that would be worth baking models into hardware for.
Like, sure current embedding models are good enough, but that's not something you'd run on a separate device.
JollyJoker3@reddit
Live audio and video embedding? Not sure how that even works tbh
Mister__Mediocre@reddit
There are lots of use cases where a model can take an unsolved problem and just solve it 100%. Then you no longer care about how things can improve in the future.
For example, a camera in your fridge that has a hardcoded model to simply check which fruits and vegetables are there and count them. Or something in your security cameras to check whether the person it sees is you or a stranger. Or a bit more general purpose, a tiny model whose only job is to summarize every document in your file system so that you can search over them, that apple includes in the chip.
Middle_Bullfrog_6173@reddit
I don't think any of those is a use case for a separate model specific inference chip. They could all run on a mobile SoC adapted from a phone.
Mister__Mediocre@reddit
Of course every model can be run by a strong enough general purpose chip. But at scale, costs and latencies go down significantly if you optimize for a single model (like etch the weights directly into the silicon).
Middle_Bullfrog_6173@reddit
But anything special purpose needs to be designed and fabbed separately, instead of benefiting from the scale of existing manufacturing.
I just can't see it ever being worth it for a fridge or other narrow use case.
starkruzr@reddit
I don't think this speed of obsolescence is going to remain the same forever.
Middle_Bullfrog_6173@reddit
Maybe not forever, but I don't see an immediate slowdown either. Or in the next couple of years.
sibilischtic@reddit
I imagine a low profile accelerator card,
24GB of unified memory on chip. Plugs into an M.2 slot. Has cold storage, loaded on boot, containing specific model layers. You can re-flash it with new models.
wingsinvoid@reddit
You made precisely the point with "commit to some model being 'it'". The initial models and the software stack to run them were terribly inefficient. Look at the progress that has been made: currently a 32B-parameter model is equal if not better in performance to a former SOTA like 4o, and can be run locally.
We will start to see integrated solutions to run AI at the edge once there is limited performance left to squeeze from the stack and baking the compute into hardware makes sense. The hardware lifecycle is years, and by the time Taalas finishes designing and taping out a model baked into hardware, it is already obsolete.
Baking a model's compute into hardware is more or less designing an ASIC. This scenario played out in crypto mining: CPUs were replaced by GPUs, only to be later replaced by ASICs once the algorithms were sufficiently mature and it became economically feasible.
Unfortunately for local inference, the economics of it, specifically the economies of scale, made it cheaper for large operators to squeeze out hobbyists. This also means that hardware vendors will target those big players, which have big budgets and predictable demand. Can you buy a local PCIe bitcoin ASIC to plug into your computer? No, all ASICs are designed for datacenter operation, with kilowatt-level power requirements and screaming fans or liquid cooling.
So, for now you only see hybrid solutions like Groq (now NVIDIA) and SambaNova.
Our only hope for local inference is for the Chinese, and maybe Google and Meta, to keep releasing open-weight models, and for hardware vendors to develop AI inference solutions for the "edge" that we can repurpose for local AI. Think Tenstorrent, Axelera, Hailo, etc. The main demand will be edge AI for video surveillance, voice-to-text, text-to-voice and other use cases where running at the edge means a PCIe or M.2 interface, which we mere mortals can repurpose.
Or, of course, pay the NVIDIA GPU tax and run it on programmable GPUs while also running games or CFD simulations.
Mister__Mediocre@reddit
I agree more or less.
Firms want all the money they can get their hands on. So unless they think releasing a consumer product will cannibalize on their data center operations, they should do both.
The main problem today is the supply constraint. If you sign a contract with a supplier for x units, it'll always make sense today to dedicate all of that to the datacenter operations, which have much higher margins. But eventually (think a decade out), supply WILL relax. That's when there's hope for us.
wingsinvoid@reddit
Dim hope! A decade out I will be too old and too tired to give a shit. I guess we will ultimately own nothing and be ... miserable in my case, but too tired to care.
Dore_le_Jeune@reddit
I personally always thought that in the future, we will have to subscribe old school Cable TV style : Math expert, Home Renovation expert, Style expert etc. These will either run in the cloud or in a PC. Maybe throw in a master AI to orchestrate. Maybe some companies will sell their experts and provide updates for a few years, I can see that if people are willing to pay for models every year or few years like they are doing with cell phones now and laptops in the past (or it seemed anyway).
Xyrus2000@reddit
This isn't about "the whole industry is just trying to milk consumers through API subscriptions forever." Nothing is preventing you or someone else from securing venture capital and building such a device.
Questions like yours pop up in other subs about things like this, and they always follow the same theme. "When are we getting X? Big bad market preventing me from getting cheap 'something I want'."
That isn't how things work. If there is profit to be had, then someone will step in and try to make that profit. There are a million ideas out there. There are millions of R&D developments. However, there is a very, very large gap between having a good idea or something created in a lab or a prototype and turning that into something marketable and profitable.
There are plenty of people smarter than you (and me) with access to more resources who have looked into this. Companies like Taalas are making some big promises along these lines, but so far, all they have is a small model on a chip (and it isn't cheap). They have stated goals of consumer-priced hardware, but they are nowhere near that at the moment.
Maybe they'll get there. Maybe they won't. Maybe someone else will come along and do it, or maybe the market just isn't there. Maybe printed older models will be fine, or maybe people will say "technology is moving too fast".
Even the best ideas and superior products can fall flat in the market.
ihexx@reddit
it's a supply and demand thing.
if a chip was made that could do fast inference for consumers, what's stopping it from being used by data centers? Suddenly you're competing with Sam Altman on supply; they all get bought up, and the price goes up.
ProfessionalJackals@reddit
The fact that right now there are more "normal" chips already produced than datacenter capacity...
https://www.wheresyoured.at/four-horsemen-of-the-aipocalypse/
People keep forgetting that all these AI GPUs suck power in ways most datacenters are not equipped to handle. Power in > heat out...
And even more important, space / power / heat extraction come at a premium. Sure, you can stack some servers with B200s, but everybody is drooling over Rubins because they offer multiple times the performance for roughly the same footprint.
Remember what I said about datacenters not being able to keep up... Reality is, NVIDIA is going to need to find an outlet for their production if stocks keep stacking up, while at the same time other competitors are entering the market... even some unexpected ones like Google.
So you're eventually going to see a need to push more of those AI chips down into the lower-end corporate market, as in business and consumer markets.
NVIDIA has been planning for this for a long time, as you can see in some of their development: first flood the large AI companies with specialized hardware, then it goes to business, and later on to more normal consumers.
I mean, have we forgotten the times when it was normal to have a GPU, an audio card and a PhysX PPU in your PC? I see no reason why this cycle won't come back when the upper market slows down, prices increase, and local becomes more the choice.
And with smaller models becoming more capable, the need for "frontier" cloud-only models starts to reduce...
SnooStories2864@reddit (OP)
I'm talking about very capable OS models (in an "at home" context), not closed-source high-end models like what OpenAI and Anthropic are doing.
ihexx@reddit
you'd find there isn't a clean separation of the two; if it works well for use case A, there's very little stopping the hyperscalers from just buying 1000 of them for use case B.
case in point: gaming gpus. they could easily serve that use case, and 10 years ago, you could get them cheap.
the gpus today that cost 800 dollars were the SKUs that used to sell for 200 dollars 10 years ago.
data center gpus made the price of gaming gpus skyrocket.
same deal with ram; an M2 Mac can easily run 27-35B models (which easily outperform gpt-4 and are good enough for 'at home' context work these days), but you're looking at 1000 dollars+ for getting one with enough ram.
same deal on the windows side with things like ryzen ai cpu+npu deals.
Grayly@reddit
When model development slows down, to be honest.
Right now, by the time you’ve designed it, validated it, produced it, shipped it, and distributed it, your target market (those interested in a local model) is already downloading the next release. And the casual market is just using the free chatbot.
It’s a great idea. It just doesn’t have a market yet.
ego100trique@reddit
Consumer does not pay as much as companies hence why every new tech focuses on companies and then consumers
One-Employment3759@reddit
All we need is for Nvidia to take fractionally less margin.
nothingInteresting@reddit
Nvidia makes around $29k of profit per Blackwell card vs $2k of profit for two 5090s (which require roughly similar resources to produce).
It wouldn’t be taking “fractionally” less, it would be taking 15x less because consumers will never pay anywhere near as much for equipment as a data center. Not even in the same stratosphere
ihexx@reddit
listen, jensen's leather jacket needs its own personal masseuse. they aren't running a charity out here
pulse77@reddit
I hope they will make a chip which stores LLM parameters in EEPROM instead of RAM: parameters are not changed during inference, so RAM is overkill here and also very expensive. RAM is needed only for context. And inference can start instantly if you don't need to load parameters into RAM on startup. I hope the people from Taalas (https://taalas.com) are reading this. I know they store parameters in ROM, but EEPROM would be nicer: if the architecture does not change, we could update the chip with the newest parameters (Qwen3 -> Qwen 3.5 -> Qwen 3.6 -> ...).
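Back-of-envelope, the split described here is real: weights are read-only and dwarf the read/write context state. A rough sketch (layer/head shapes are Llama 3 8B's public config; the 4-bit quantization and fp16 KV cache are my illustrative assumptions, not anything Taalas has published):

```python
# Illustrative back-of-envelope: how much on-chip storage weights
# (read-only) vs. context (read/write) actually need.

def weight_bytes(params_b: float, bits_per_weight: int) -> float:
    """Storage for model parameters at a given quantization."""
    return params_b * 1e9 * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache -- the only part that genuinely needs writable memory."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem  # 2x: keys + values

# Llama-3-8B-ish shapes: 32 layers, 8 KV heads (GQA), head_dim 128
w = weight_bytes(8, 4)                  # 4-bit weights (assumed)
kv = kv_cache_bytes(32, 8, 128, 8192)   # 8k context, fp16 KV (assumed)

print(f"weights:  {w / 1e9:.1f} GB (read-only, ROM/EEPROM candidate)")
print(f"kv cache: {kv / 1e9:.2f} GB (read/write RAM)")
```

So under those assumptions the read-only portion is ~4x the writable portion, which is the whole argument for putting weights in cheap non-volatile storage.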
d41_fpflabs@reddit
I think its a capacity issue
Atom_101@reddit
They likely use mask ROM, which has better power and space efficiency. EEPROM can't store on-die data at LLM-weight scale. Taalas already has trouble storing more than ~20B weights per die; there is only so much physical space on a chip. They can already change weights at the tapeout level, btw. The weights only require a small part of the die to change. The consumer can't do it, but at the chip level going from Qwen 3.5 to 3.6 weights is trivial for them.
pulse77@reddit
I guess/hope they will first implement the whole architecture on the chip and then - at the last moment - drop in the latest Qwen3 update. If they make it this way, then they can start "printing money" with it...
Middle_Bullfrog_6173@reddit
Isn't eeprom more expensive and slower than RAM?
ImproveYourMeatSack@reddit
I so hope that taalas takes off, but I wouldn't be surprised if it was bought out and shelved, or if the core members suddenly died in a freak helicopter accident or something like that.
ihexx@reddit
why would they? there's soooooooooooo many chip companies in the ai space, and taalas is such a niche one. would any of the big players really care?
SnooStories2864@reddit (OP)
That's the thing! So many players and yet all gating their products behind APIs....
ihexx@reddit
yeah, it's not a 'they don't want to sell' thing, it's more of a 'they don't want to deal with the billion issues of selling discrete units to end users'.
Software is hard.
consumer setups have soooooooo many configurations, oses, power envelopes that you need to support.
vs:
if you just have your thing behind an api, you only have to get it working on 1 environment: yours.
SnooStories2864@reddit (OP)
yeah, taalas have a card to play but their product page is just "Request API access" ... I'd easily spend $1000 on a capable OS model chip right now...
Content_Mission5154@reddit
I believe the R9700 from AMD is aimed at local LLMs and consumers. The problem is, as with anything that uses RAM nowadays, the price. If this was priced under $1000, local LLMs would be a feasible thing for any PC owner.
05032-MendicantBias@reddit
Right now we are in the "get as much money from people with more money than sense!" phase. That requires promising absurd and impossible things, like ASI, replacing engineers with LLMs, and humanoid robots that are people in suits or remote controlled by AI (Actual Indians).
Once the hype dies down, and the senseless piles of money are burned to ash, that's when AI has to make money by offering great, value-adding products, and that's when we'll see such products taking off.
I have a date for that: June 9, when Elon Musk takes SpaceX+xAI+Twitter public. They burn over a billion a month at a loss but are somehow valued at 2 trillion dollars, and pension funds are going to buy in day one without the typical one-year wait period to let price discovery take place.
That, I predict, is the event that will reprice AI ventures more accurately.
Aggeloz@reddit
Literally everything is moving towards "service" and "renting", so how do you expect to suddenly have something like this? It's not going to happen, and if it does, it won't be any time soon. As others said in the comments, and I'm sure you've seen elsewhere online: you will own nothing and you will be happy.
comatrices@reddit
That's basically what Hailo is doing. Their chip is available with 8GB of onboard RAM in the AI HAT+ 2 for Raspberry Pi. HP also made an M.2 module with one of Hailo's chips but only paired it with 4GB of RAM https://h20195.www2.hp.com/v2/GetDocument.aspx?docname=4AA8-4879ENW
Speeds aren't very impressive though.
Herr_Drosselmeyer@reddit
Locking yourself to a baked-in model at this point in time, where new and better models are released every week, is just silly.
mateszhun@reddit
We actually are getting them, they're called NPUs, and they were pushed as a thing in 2023-2024. Every consumer chip and laptop/PC announcement was about how well it runs AI. But consumers widely laughed at it and said "this is not what we want".
Best-Total7445@reddit
You already have multiple devices available to run local LLMs on. Cell phone, laptop, desktop, tablet: they all have hardware that can run local LLMs, and you don't have to pay anything because you already have them.
This device you speak of would, at this stage, be obsolete before anyone got the first order.
jikilan_@reddit
Wait until the current generation of server hardware is phased out. Then you can buy it off the used market.
Substantial-Ebb-584@reddit
It depends what your expectations are. There is this "Tiiny" idea that seems to work if TDP is your concern. IMHO they will emerge eventually, after reaching some plateau in the creation of open LLMs - when they stop changing so much/fast. Ooor it won't happen, since reaching that plateau makes the idea of creating such chips obsolete in a way.
RoomyRoots@reddit
When the bubble burst and companies start selling used hardware they were hoarding. It's crypto all over again.
FastDecode1@reddit
Not gonna happen.
Widespread consumer proliferation will happen through integration with either iGPUs or dedicated NPUs. Eventually the former, I'd say. It's already happening, you just don't notice it because it takes years and years for hardware designed 5 years ago to become ubiquitous. In 10 years you'll notice one day that basically everyone has a machine that can run basic AI/ML tasks with little power use.
Gamers/enthusiasts will always have dGPUs, and the more demanding tasks will always require one.
send-moobs-pls@reddit
Because the market of people capable of and interested in running models at home, and willing to spend significant money on it, is about 5 people relative to almost any other market in the world that a business could target. Not only is that a generally true fundamental of business, it's made worse in the context of hardware because it's much less efficient to produce a small number of something niche. No imaginative "they want us to rent" fantasy required: Nvidia has even dialed back on consumer GPUs, because why use the same materials and same machinery to produce something that's less profitable? It's just math.
shovepiggyshove_@reddit
This arrogance will bite Nvidia in the ass. The same happened to Intel after AMD blew right past it with their Ryzen processors.
I'm betting China is already hell-bent on bypassing Nvidia within 10-15 years. It's hard to beat their level of determination. They've already proved they can do it by becoming the experts in EVs and solar panels, upgrading their infrastructure and moving to renewables. This is a major blow to the petrodollar.
b3081a@reddit
Autoregressive LLMs run at their highest efficiency when you batch a lot of requests together. It is impossible for local devices to compete with the cloud on performance per dollar unless you're a really large company that can keep the device fully loaded at all times. By the time such a $300 device is technically feasible, the cloud vendors will already be providing the same API at a much lower cost. It's a fundamental problem for local personal users that they don't fully load their devices.
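A toy utilization model makes the point concrete. Every number below is a made-up assumption for illustration (hypothetical $300 chip, 100 tok/s single-stream, 3-year lifetime), not real pricing:

```python
# Toy model: a chip's amortized cost per token depends almost entirely
# on how busy you keep it. All figures are illustrative assumptions.

def cost_per_mtok(card_cost: float, lifetime_years: float,
                  tok_per_s: float, utilization: float) -> float:
    """Amortized hardware cost per million tokens generated."""
    seconds = lifetime_years * 365 * 24 * 3600
    tokens = tok_per_s * utilization * seconds
    return card_cost / tokens * 1e6

# Same hypothetical $300 chip, 100 tok/s per stream:
cloud = cost_per_mtok(300, 3, 100 * 64, 0.7)  # batch of 64, 70% busy
local = cost_per_mtok(300, 3, 100, 0.02)      # one user, ~30 min/day

print(f"cloud: ${cloud:.4f}/Mtok   local: ${local:.2f}/Mtok")
```

Under these assumptions the cloud side comes out thousands of times cheaper per token, purely from batching and utilization, before any real economics enter the picture.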
fantasticsid@reddit
You're both completely right and completely missing the point. Yes, 1-wide decoding is inefficient, but if we're talking about some end-user-targeted 30W edge device (like OP suggested), the inefficiency simply does not matter; all that matters is generation speed and the cost to purchase the thing in the first place.
The AI datacentre bubble is 100% unsustainable and built on 1999 economics. People like LLMs. LLMs, therefore, are gonna move to the edge. Google is already testing the waters here with "AI Edge Gallery" and the E2B/E4B Gemmas.
Something like Qwen 3.6 35/A3 or even Qwen 3.5 9B in a Taalas-style ASIC would be cheap as hell to make millions of; the only hard part would be the design and up-front commitment (fabbing non-cutting-edge-process ICs has a very large, fixed up-front cost offset by a typically tiny per-unit cost). Put that IC in a SFF PC with an Ethernet port or wifi, front it with the OpenAI Completions API, and you've got something that hobbyists would buy in great numbers. Put a shiny UI on it (in, idk, tablet form factor or something) and non-technologists would probably buy a crapload of them as well.
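The fixed-vs-marginal cost shape described above is easy to sketch. The NRE and per-unit figures below are pure placeholders for illustration, not real fab quotes:

```python
# Sketch of ASIC economics: huge fixed NRE (design, validation, masks)
# amortized over the production run, plus a tiny per-die cost.
# Both numbers are hypothetical, chosen only to show the curve's shape.

def unit_cost(nre: float, per_unit: float, volume: int) -> float:
    """All-in cost per chip once NRE is spread over the run."""
    return nre / volume + per_unit

NRE = 25_000_000   # assumed design + mask-set cost on a mature node
PER_UNIT = 40      # assumed wafer/packaging/test cost per die

for volume in (10_000, 100_000, 1_000_000):
    print(f"{volume:>9,} units -> ${unit_cost(NRE, PER_UNIT, volume):,.0f}/chip")
```

The point being: at hobbyist volumes the NRE dominates and each chip is brutally expensive, but at consumer-electronics volumes the per-chip cost collapses toward the marginal cost, which is why "millions of units" is the part that matters.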
Middle_Bullfrog_6173@reddit
There are consumer inference chips. They are called "GPUs" and "NPUs". Both have been shipped to a lot of people. The latter have pretty much no use outside inference.
I think a separate inference dongle with its own memory would be either extremely underpowered or very expensive. It would also require its own software stack.
Betadoggo_@reddit
Pricing will be relative to how much money they could make being used for commercial hosting, so they wouldn't be cheaper than cloud hosts. It's the same reason memory is so expensive right now. There's a big industry which can directly convert these chips into cash, so they will always outbid consumers for them so long as there is large enough demand from the enterprises.
SnooStories2864@reddit (OP)
why would datacenters outbid consumers on a chip that runs a small OS model? The only current big player I'd see doing that is Cloudflare, with its Workers AI free tier that offers stupid-but-very-useful OS LLM access for free. Do you think that kind of demand is enough?
Betadoggo_@reddit
There might be very specific use cases that make a small model running quickly on a chip more valuable to a local user than an enterprise, but if there's demand for the small fast model enterprises will also look to capture that demand by offering a hosted version.
SnooStories2864@reddit (OP)
Obviously can't disagree, i'm just hoping that in a crowded market, someone will play the disrupter ...
GMerton@reddit
I think you are basically looking for Jetson Nano.
SnooStories2864@reddit (OP)
I'll give it a look, thanks a lot.
Worldly_Expression43@reddit
Because there's a very small market for this
missingno_85@reddit
probably never.
The majority of end users want the best model for free. They will even turn a blind eye to data privacy and security concerns to get it. No one is interested in paying for local hardware to run models that are not on par with what a free subscription can offer. Instead of maintaining hardware and software stacks, they just want to fire off their queries, get answers and carry on with living their lives.
SnooStories2864@reddit (OP)
I kinda agree with that, but I recently talked with friends about the potential of having AI at home in terms of home automation and eldercare and, believe it or not, none of them were willing to share their deep intimacy with any third party, even though they all agreed the technical potential was huge. (Not saying this point closes the discussion, it's just one use case I have extensively discussed with friends and family recently)
Happy_Brilliant7827@reddit
We've got consumer inference chips, it's the VRAM that's expensive.
SnooStories2864@reddit (OP)
Please, can you link me to an existing consumer inference chip I can buy?