MoMoneyMoStudy

Is inference possible on a 16GB phone? At 5 tokens/sec? Is an Unsloth-tuned version, optimized for memory and performance, the sweet spot? What stack is best for that config?

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

[-]

MoMoneyMoStudy@reddit

Your OSS GitHub PR code reviewer agent is "shocked". The AI Agent arguments over code superiority will now melt the GPUs, worse than a Discord human mocking by Linus or Hotz.

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

[-]

Cursor CEO bro now pushing BFF Sam's LLM over Sonnet for his customers. Follow the money - not always purely a tech choice, especially when a startup needs to start moving to profitability and OpenAI's investment side gig owns a lot of shares and influence. Cursor: $500Mil in ARR, $1Bil spend rate on Claude API.

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

[-]

MoMoneyMoStudy@reddit

Everything is a trade off between cost savings vs. time. If the paid tool and/or LLM API usage is under $100 a month but saves u at least a couple hours when factoring in accuracy, then it's a no brainer. Getting to the quantitative comparison w your choices out there is what can be hard when emotions are involved. But beware the 1 button does all Vibe coders like Replit and Bolt. YC bro Paul Graham really pushing his Replit investment on the AI buzz crowd.
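One way to take the emotion out of it is to put the trade-off in numbers. A minimal sketch, assuming you can estimate the hours saved and put a price on your time; the function names and the $50/hour rate are made up for illustration, not taken from any tool's pricing:

```python
def net_monthly_value(tool_cost: float, hours_saved: float, hourly_rate: float) -> float:
    """Dollar value of the time saved in a month, minus what the tool costs."""
    return hours_saved * hourly_rate - tool_cost


def break_even_hours(tool_cost: float, hourly_rate: float) -> float:
    """Hours the tool has to save each month just to pay for itself."""
    return tool_cost / hourly_rate


if __name__ == "__main__":
    # Hypothetical inputs: $100/month tool, 2 hours saved, time valued at $50/hour.
    print(net_monthly_value(100.0, 2.0, 50.0))
    print(break_even_hours(100.0, 50.0))
```

Anything above the break-even hours is net gain; the hard part is an honest estimate of hours saved once you factor in fixing the tool's mistakes.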

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

[-]

MoMoneyMoStudy@reddit

Know anyone that Vibe Coded a React Native mobile app? Advice for best stack and best approaches?

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

[-]

MoMoneyMoStudy@reddit

Would like to see a comparison of usage volume (tokens, etc.) across LLMs for all coding use, including CLIs, code editing GUIs, etc. Cursor alone was at a $1Bil annual Sonnet API spending rate based on usage, much of that from customers using the free limit budget allowed by Cursor's subscription plans.

OpenAI GPT-OSS-120b is an excellent model

Posted by xxPoLyGLoTxx@reddit | LocalLLaMA | View on Reddit | 149 comments

[-]

MoMoneyMoStudy@reddit

No one can match OSS with GG, Linus, or GeoHot. And Linus and Geo will just mock you on Discord.

OpenAI GPT-OSS-120b is an excellent model

Posted by xxPoLyGLoTxx@reddit | LocalLLaMA | View on Reddit | 149 comments

[-]

MoMoneyMoStudy@reddit

Collab w GG himself - seems most devs would want this.

OpenAI GPT-OSS-120b is an excellent model

Posted by xxPoLyGLoTxx@reddit | LocalLLaMA | View on Reddit | 149 comments

[-]

MoMoneyMoStudy@reddit

Tiny box hardware w custom inference/training framework, but more like $15K. Search on GitHub. They are also enhancing the framework to work on AMD datacenter GPUs to replace the expensive Nvidia GPU/CUDA stack with AMD's full support

OpenAI GPT-OSS-120b is an excellent model

Posted by xxPoLyGLoTxx@reddit | LocalLLaMA | View on Reddit | 149 comments

[-]

MoMoneyMoStudy@reddit

Next: anyone vibe coded a React Native mobile app? What are the best practices vs. a React website?

Bye bye, Meta AI, it was good while it lasted.

Posted by absolooot1@reddit | LocalLLaMA | View on Reddit | 432 comments

[-]

MoMoneyMoStudy@reddit

At one time, before big money entered the picture, OpenAI contributed groundbreaking research papers at conferences. Mostly in RL, similar to Levine's lab at Berkeley. It all went South when fundraising became an issue, and initial fundraiser Elon's move was to insist Tesla acquire it. Elon left, Karpathy poached, VC guy Altman moves on it, etc.

Bye bye, Meta AI, it was good while it lasted.

Posted by absolooot1@reddit | LocalLLaMA | View on Reddit | 432 comments

[-]

MoMoneyMoStudy@reddit

Google had no financial use for generative AI at the time. OpenAI was still young and scrappy and experimenting. But Google and Deepmind have been the leaders in most AI innovations for over a decade - it's all there on arxiv.org and the top AI conference presentation archives.

Bye bye, Meta AI, it was good while it lasted.

Posted by absolooot1@reddit | LocalLLaMA | View on Reddit | 432 comments

[-]

MoMoneyMoStudy@reddit

No, it was all Schmidhuber all the time. He has the citations to prove it all. You've been warned.

Google massively slashes Gemini Flash pricing in response to GPT-4o mini

Posted by Vivid_Dot_6405@reddit | LocalLLaMA | View on Reddit | 72 comments

[-]

MoMoneyMoStudy@reddit

Small businesses that can afford 1 IT guy do a cost analysis vs. Cloud. Biggest factor for choosing cloud is fast, unplanned, and spikey growth -- e.g. spikes in inference demand as new products/features are released. Businesses that can afford it, understand the value of local models finetuned on customer's data for the customer's domain and use cases - accuracy is everything.

Meta just pushed a new Llama 3.1 405B to HF

Posted by Accomplished_Ad9530@reddit | LocalLLaMA | View on Reddit | 52 comments

[-]

MoMoneyMoStudy@reddit

The Diffs are "Zuck X Posts", bro. Allows Z-Man to virtue signal fast improvements on Social Media instead of unglamorous, technical info for the OSS devs - gotta recoup those $Bil+ training costs in branding equity value, or somethin' somethin'

Meta just pushed a new Llama 3.1 405B to HF

Posted by Accomplished_Ad9530@reddit | LocalLLaMA | View on Reddit | 52 comments

[-]

MoMoneyMoStudy@reddit

VRAM reduced for fine-tuning as well as inference? What percent?

AMD hopes to unlock MI300’s full potential with fresh code

Posted by No_Training9444@reddit | LocalLLaMA | View on Reddit | 31 comments

[-]

MoMoneyMoStudy@reddit

For fine-tuning, look into the sw performance enhancements to ROCm by Lamini AI. Engage CTO Gregory Diamos on Twitter.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

He'll never do Datacenter A100s, H100s, B100s or AMD ones. Motto is Petaflops for the people, so that is 1 or 2 quiet boxes in your home or office. If more needed, it fits in racks in a datacenter or home basement w lots of power piped in. He claims the 32 fans in each box keep the GPUs cool enough. You should engage George on Twitter.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

Industry chip color coding: Red is AMD, Green is Nvidia, Blue is Intel. He uses 6 4090s or 7900XTXs. If u want more compute, u link boxes together. They are planning to link 300 together for training at Comma AI. The network port is in the spec.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

You're a good match for the new TinyBoxes for fine-tuning 70B models. Also get to try out the new SOTA framework that should surpass PyTorch training performance in near future. Only $15K. See hw specs and framework design approach: tinygrad.org. George Hotz/TinyGrad is on Twitter as well w tech debates (w/ AMD, Jim Keller, etc).

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

What's your monthly electric bill for training? I find it amusing that some bigger startups don't realize the costs, and alternatives. For example, Comma AI has been training ConvNets for years on hundreds of used A100s. Now that they're planning on switching to custom server boxes (see TinyBox), George started looking at elec bills and high Cali elec rates and finally realized there are alternatives. Solar didn't make sense, so now they're searching for a datacenter in a state w cheap elec and cool temps, w plans to really scale up their training compute w bigger models. Fan cooled AMD 7900XTXs punching above their weight class. Hope he read the Llama 405B paper on all the problems encountered w the various hw components of a multi-thousand GPU training Datacenter. Grok in Memphis is beginning to encounter its own challenges (slight power surges, etc).

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

[-]

MoMoneyMoStudy@reddit

Elon approved. Also, Elon acquires Meta.

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

[-]

MoMoneyMoStudy@reddit

Future news update: all Nvidia consumer GPUs have been discontinued because the profit margins just don't make sense (silent voice: ... for our shareholders). Somewhere George Hotz is still raging about AMD's software devotion to their Datacenter GPUs over his boxed consumer ones.

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

[-]

MoMoneyMoStudy@reddit

W that compute u coulda post-trained it (along with screaming int8 quantization) to remove the guardrails and poor accuracy. Some bros are only "GPU Rich playas". Narrator: You mean Open Source LLMs implies contributing improvements to the free base models? Okay, nevermind mmmkay thanx bye...

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

[-]

MoMoneyMoStudy@reddit

Extra big tax on upgrade hopes.

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

[-]

MoMoneyMoStudy@reddit

U just don't know .000001 int quantization. Narrator: yes, we know that would be fp. And then you're just starting down the hobbyist rabbit hole. q2? Come on, man!

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

Just in time for the M4 laptops to run larger models with better accuracy and usefulness. Narrator: For $1K more, every hobbyist can become a tech startup. $15K for a TinyGrad TinyBox w 144GB VRAM if u got big LLM training/quantization entrepreneur ballz. See the AI Angel investors behind Datology AI.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

What would be your annual cloud bill for inference if skipping your network?

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

Excellent analysis. So, Qualcomm hw is for small Q4 models - maybe just target smartphones for most entrepreneurial uses. See George Hotz w his Comma AI driving assist inference on old Androids.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

[-]

MoMoneyMoStudy@reddit

Performance tests on Q4 without a thought on accuracy loss vs. Q8. But it's $690 cheaper hw, bro! She got the gold mine / And I got the shaft -- 'Murican Country song

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

MoMoneyMoStudy@reddit

What do you expect tok/sec performance to be on 2-way 3090 vs. 64GB universal RAM Mac M3?
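A rough way to frame this question: single-stream decoding is usually memory-bandwidth bound, so a ceiling on tok/sec is bandwidth divided by the bytes read per token (roughly the size of the quantized weights). A minimal sketch under that assumption; the bandwidth numbers below are placeholders, so plug in the spec-sheet figure for your own hardware, and note that with layers split across 2 GPUs each token still walks through both cards in sequence:

```python
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_gb_per_s / model_size_gb


if __name__ == "__main__":
    model_gb = 35.0  # quantized size from the thread title
    # Placeholder bandwidths (GB/s), not measured or quoted specs.
    for name, bandwidth in [("device A", 900.0), ("device B", 400.0)]:
        print(name, round(max_tokens_per_sec(model_gb, bandwidth), 1), "tok/sec ceiling")
```

Real numbers land below this ceiling once KV cache reads, kernel overhead, and inter-GPU transfers are counted.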

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

MoMoneyMoStudy@reddit

> 1. Yep. EfficientQAT require training. It took me nearly 47 hours to quantized Mistral-large-instruct 123B.

How much VRAM required for this? How many TFLOPS utilized? Wondering about the usefulness of a TinyGrad TinyBox for training/quantization of 100B+ models. Has close to a Petaflop and 144GB VRAM.
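For a sense of how aggressive the quantization in the thread title is, the average bits per weight can be backed out from the two headline numbers. A minimal sketch, assuming "35 GB" means 35e9 bytes and that the file is almost entirely weights (both are assumptions, not stated in the thread):

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter (GB = 1e9 bytes cancels 1e9 params)."""
    return file_size_gb * 8 / params_billions


if __name__ == "__main__":
    # 123B parameters squeezed into 35 GB, per the thread title.
    print(round(bits_per_weight(35.0, 123.0), 2), "bits per weight")
```

That lands a little above 2 bits per weight, which would be consistent with a 2-bit scheme plus scales and other overhead.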

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

MoMoneyMoStudy@reddit

How much demand for 2bit vs 4/8? Mostly hobbyists trying things out w minimal hw investment until ready to scale up? What is mix between Mac and Nvidia users?

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

MoMoneyMoStudy@reddit

How about a comparison w 4 and 8 bit quantization so the Mac local crowd can do an accuracy vs. hw cost vs tok/sec analysis. For inference, not sure 2-way GPU/VRAM is the future.

New medical and financial 70b 32k Writer models

Posted by mindwip@reddit | LocalLLaMA | View on Reddit | 100 comments

[-]

MoMoneyMoStudy@reddit

Have you performed benchmark tests w quantized vs. non-quantized versions, e.g. int8?

The software-pain of running local LLM finally got to me - so I made my own inferencing server that you don't need to compile or update anytime a new model/tokenizer drops; you don't need to quantize or even download your LLMs - just give it a name & run LLMs the moment they're posted on HuggingFace

Posted by AbheekG@reddit | LocalLLaMA | View on Reddit | 89 comments

[-]

MoMoneyMoStudy@reddit

"T-Panda"
Rocky raccoon
Checked into the chatroom
Only to find an LLM Bible
Narrator: i.e. your dogma may vary

Is the new DDR6 the era of CPU-powered LLMs?

Posted by AlexBefest@reddit | LocalLLaMA | View on Reddit | 146 comments

[-]

MoMoneyMoStudy@reddit

Self mutating Llamas is scary stuff - no way to quiet the psychopath "genes" like human prisons.

70b here I come!

Posted by Mr_Impossibro@reddit | LocalLLaMA | View on Reddit | 65 comments

[-]

MoMoneyMoStudy@reddit

PCIe is the way for combining compute and VRAM. See specs for the TinyBox w 6 GPUs (Nvidia or AMD) yielding 6X24GB VRAM w close to a Petaflop of compute for inference and training. www.tinygrad.org

70b here I come!

Posted by Mr_Impossibro@reddit | LocalLLaMA | View on Reddit | 65 comments

[-]

MoMoneyMoStudy@reddit

Even on a top spec M3?

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

MoMoneyMoStudy@reddit

Q8 may be the sweet spot and CPUs (M3, etc) will be too slow on tok/sec. What's the price for used 400GB VRAM of A100s, or for 3 TinyBoxes (TinyGrad Corp.) of AMDs? For usable local llama at reasonable price you're limited to quantized 70B until high fidelity distillation + LoRA improvements are developed to "shrink" 400B models without losing any of the "power", or finetune specialized, smaller models from the 400B for non-general use.
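The arithmetic behind the "400GB" and "3 TinyBoxes" figures, as a back-of-envelope sketch. It counts weights only (no KV cache, activations, or framework overhead), treats GB as 1e9 bytes, and takes 144GB per box from the 6 x 24GB spec mentioned in other comments; the helper names are made up for illustration:

```python
import math

TINYBOX_VRAM_GB = 6 * 24  # 6 GPUs x 24GB each = 144GB per box


def weight_gb(params_billions: float, bits: int) -> float:
    """Memory for the weights alone at a given bit width."""
    return params_billions * bits / 8


def boxes_needed(params_billions: float, bits: int,
                 vram_per_box_gb: float = TINYBOX_VRAM_GB) -> int:
    """Smallest number of boxes whose combined VRAM holds the weights."""
    return math.ceil(weight_gb(params_billions, bits) / vram_per_box_gb)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"405B @ {bits}-bit: {weight_gb(405, bits):.1f} GB weights, "
              f"{boxes_needed(405, bits)} box(es)")
```

Since this is a floor on memory, a config that only just fits the weights leaves very little room for context.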

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

MoMoneyMoStudy@reddit

What was the 70B secret sauce? Finely curated synthetic training data? Is there a risk of targeting the training at just the benchmarks, rather than customer's use cases? (the market)

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

MoMoneyMoStudy@reddit

What training datasets did you use, and how much manual curation involved? Nous Research has been posting impressive accuracy results on Hugging Face w their synthetic datasets.

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

MoMoneyMoStudy@reddit

Need detailed benchmark accuracy comparisons - q8 probably worth the extra cost for VRAM over q6, and fp16 not necessary for inference, but handy for further post-training with domain-specific data (healthcare, financial, legal, etc) and/or fine-tuning w customer data if u have the compute, or cloud budget.