MoMoneyMoStudy

NVIDIA has 72GB VRAM version now

Posted by decentralize999@reddit | LocalLLaMA | View on Reddit | 154 comments

GPT-OSS-20B is in the sweet spot for building Agents

Posted by sunpazed@reddit | LocalLLaMA | View on Reddit | 99 comments

MoMoneyMoStudy@reddit

Possible for inference on a 16GB phone? 5 tokens/sec? Is the Unsloth-tuned version, optimized for memory and performance, the sweet spot? What stack is best for that config?

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

MoMoneyMoStudy@reddit

Your OSS GitHub PR code reviewer agent is "shocked". The AI Agent arguments over code superiority will now melt the GPUs, worse than a Discord human mocking by Linus or Hotz.

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

MoMoneyMoStudy@reddit

Cursor CEO bro now pushing BFF Sam's LLM over Sonnet for his customers. Follow the money - not always purely a tech choice, especially when a startup needs to start moving to profitability and OpenAI's investment side gig owns a lot of shares and influence. Cursor: $500Mil in ARR, $1Bil spend rate on Claude API.

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

MoMoneyMoStudy@reddit

Everything is a trade-off between cost savings and time. If the paid tool and/or LLM API usage is under $100 a month but saves u at least a couple of hours when factoring in accuracy, then it's a no-brainer. Getting to a quantitative comparison of the choices out there is what can be hard when emotions are involved. But beware the one-button-does-it-all vibe coding tools like Replit and Bolt. YC bro Paul Graham is really pushing his Replit investment on the AI buzz crowd.
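
A back-of-envelope sketch of that cost vs. time comparison. The $100/month figure comes from the comment; the hourly rate and hours saved are hypothetical inputs, so plug in your own.

```python
def monthly_net_benefit(tool_cost_usd: float, hours_saved: float, hourly_rate_usd: float) -> float:
    """Dollar value of the time saved in a month, minus what the tool costs."""
    return hours_saved * hourly_rate_usd - tool_cost_usd


def break_even_hours(tool_cost_usd: float, hourly_rate_usd: float) -> float:
    """Hours the tool has to save per month just to pay for itself."""
    return tool_cost_usd / hourly_rate_usd


# Hypothetical example: $100/month tool, 2 hours saved, time valued at $75/hour.
print(monthly_net_benefit(100.0, 2.0, 75.0))
print(break_even_hours(100.0, 75.0))
```

If the net benefit is positive after you discount the hours for time spent fixing wrong output, the tool pays for itself.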

Wow anthropic and Google losing coding share bc of qwen 3 coder

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 128 comments

MoMoneyMoStudy@reddit

Would like to see a comparison of usage volume (tokens, etc.) across the LLMs for all coding use, including CLIs, code editing GUIs, etc. Cursor alone was at a $1Bil annual Sonnet API spending rate based on usage, much of that from customers using the free usage budget allowed by Cursor's subscription plans.

OpenAI GPT-OSS-120b is an excellent model

Posted by xxPoLyGLoTxx@reddit | LocalLLaMA | View on Reddit | 149 comments

MoMoneyMoStudy@reddit

TinyBox hardware w custom inference/training framework, but more like $15K. Search on GitHub. They are also enhancing the framework to work on AMD datacenter GPUs, with AMD's full support, to replace the expensive Nvidia GPU/CUDA stack.

Bye bye, Meta AI, it was good while it lasted.

Posted by absolooot1@reddit | LocalLLaMA | View on Reddit | 432 comments

MoMoneyMoStudy@reddit

At one time, before big money entered the picture, OpenAI contributed groundbreaking research papers at conferences. Mostly in RL, similar to Levine's lab at Berkeley. It all went south when fundraising became an issue, and initial fundraiser Elon's move was to insist Tesla acquire it. Elon left, Karpathy got poached, VC guy Altman moved in on it, etc.

Bye bye, Meta AI, it was good while it lasted.

Posted by absolooot1@reddit | LocalLLaMA | View on Reddit | 432 comments

MoMoneyMoStudy@reddit

Google had no financial use for generative AI at the time. OpenAI was still young and scrappy and experimenting. But Google and DeepMind have been the leaders in most AI innovations for over a decade - it's all there on arxiv.org and the top AI conference presentation archives.

Google massively slashes Gemini Flash pricing in response to GPT-4o mini

Posted by Vivid_Dot_6405@reddit | LocalLLaMA | View on Reddit | 72 comments

MoMoneyMoStudy@reddit

Small businesses that can afford 1 IT guy do a cost analysis vs. cloud. The biggest factor for choosing cloud is fast, unplanned, and spiky growth -- e.g. spikes in inference demand as new products/features are released. Businesses that can afford it understand the value of local models finetuned on the customer's data for the customer's domain and use cases - accuracy is everything.

Meta just pushed a new Llama 3.1 405B to HF

Posted by Accomplished_Ad9530@reddit | LocalLLaMA | View on Reddit | 52 comments

MoMoneyMoStudy@reddit

The Diffs are "Zuck X Posts", bro. Allows Z-Man to virtue signal fast improvements on Social Media instead of unglamorous, technical info for the OSS devs - gotta recoup those $Bil+ training costs in branding equity value, or somethin' somethin'

AMD hopes to unlock MI300’s full potential with fresh code

Posted by No_Training9444@reddit | LocalLLaMA | View on Reddit | 31 comments

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

He'll never do Datacenter A100s, H100s, B100s or AMD ones. Motto is Petaflops for the people, so that is 1 or 2 quiet boxes in your home or office. If more are needed, it fits in racks in a datacenter or home basement w lots of power piped in. He claims the 32 fans in each box keep the GPUs cool enough. You should engage George on Twitter.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

Industry chip color coding: Red is AMD, Green is Nvidia, Blue is Intel. He uses 6 4090s or 7900XTXs. If u want more compute, u link boxes together. They are planning to link 300 together for training at Comma AI. The network port is in the spec.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

You're a good match for the new TinyBoxes for fine-tuning 70B models. Also get to try out the new SOTA framework that should surpass PyTorch training performance in the near future. Only $15K. See hw specs and framework design approach: tinygrad.org. George Hotz/TinyGrad are on Twitter as well w tech debates (w/ AMD, Jim Keller, etc).

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

What's your monthly electric bill for training? I find it amusing that some bigger startups don't realize the costs and the alternatives. For example, Comma AI has been training ConvNets for years on hundreds of used A100s. Now that they're planning on switching to custom server boxes (see TinyBox), George started looking at elec bills and high Cali elec rates and finally realized there are alternatives. Solar didn't make sense, so now they're searching for a datacenter in a state w cheap elec and cool temps, w plans to really scale up their training compute w bigger models. Fan-cooled AMD 7900XTXs punching above their weight class. Hope he read the Llama 405B paper on all the problems encountered w the various hw components of a multi-thousand GPU training datacenter. Grok in Memphis is beginning to encounter its own challenges (slight power surges, etc).
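
For the electric bill question, a rough way to compare rates. Every number below (3 kW average draw, $0.30 vs. $0.10 per kWh, running 24/7) is a hypothetical placeholder, not a measured figure for any real box or utility.

```python
def monthly_power_cost_usd(avg_draw_kw: float, usd_per_kwh: float,
                           hours_per_day: float = 24.0, days: float = 30.0) -> float:
    """Energy used in a month (kWh) times the electricity rate."""
    return avg_draw_kw * hours_per_day * days * usd_per_kwh


# Hypothetical: one box averaging 3 kW, high-rate state vs. cheap-rate state.
for label, rate in (("high rate", 0.30), ("cheap rate", 0.10)):
    print(f"{label}: ${monthly_power_cost_usd(3.0, rate):.0f}/month")
```

Multiply by the number of boxes and add cooling overhead, and the rate difference becomes the main argument for moving the training cluster.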

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

MoMoneyMoStudy@reddit

Future news update: all Nvidia consumer GPUs have been discontinued because the profit margins just don't make sense (silent voice: ... for our shareholders). Somewhere George Hotz is still raging about AMD's software devotion to their Datacenter GPUs over his boxed consumer ones.

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

MoMoneyMoStudy@reddit

W that compute u coulda post-trained it (along with screaming int8 quantization) to remove the guardrails and poor accuracy. Some bros are only "GPU Rich playas". Narrator: You mean Open Source LLMs implies contributing improvements to the free base models? Okay, nevermind mmmkay thanx bye...

Anyone else find Llama 4 models kinda underwhelming?

Posted by Conutu@reddit | LocalLLaMA | View on Reddit | 103 comments

MoMoneyMoStudy@reddit

U just don't know .000001 int quantization. Narrator: yes, we know that would be fp. And then you're just starting down the hobbyist rabbit hole. q2? Come on, man!

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

Just in time for the M4 laptops to run larger models with better accuracy and usefulness. Narrator: For $1K more, every hobbyist can become a tech startup. $15K for a TinyGrad TinyBox w 144GB VRAM if u got big LLM training/quantization entrepreneur ballz. See the AI Angel investors behind Datology AI.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

Excellent analysis. So, Qualcomm hw is for small Q4 models - maybe just target smartphones for most entrepreneurial uses. See George Hotz w his Comma AI driving assist inference on old Androids.

Snapdragon X CPU inference is fast! (Q_4_0_4_8 quantization)

Posted by Some_Endian_FP17@reddit | LocalLLaMA | View on Reddit | 76 comments

MoMoneyMoStudy@reddit

Performance tests on Q4 without a thought on accuracy loss vs. Q8. But it's $690 cheaper hw, bro! She got the gold mine / And I got the shaft -- 'Murican Country song

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

MoMoneyMoStudy@reddit

> 1. Yep. EfficientQAT require training. It took me nearly 47 hours to quantized Mistral-large-instruct 123B.

How much VRAM required for this? How many TFLOPS utilized? Wondering about the usefulness of a TinyGrad TinyBox for training/quantization of 100B+ models. Has close to a Petaflop and 144GB VRAM.

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

MoMoneyMoStudy@reddit

How much demand is there for 2-bit vs. 4/8? Mostly hobbyists trying things out w minimal hw investment until ready to scale up? What is the mix between Mac and Nvidia users?
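
For context on 2-bit vs. 4/8, a quick sketch of weight storage for a 123B model at each bit-width. It counts weights only (no KV cache, activations, or quantization scales) and uses 1 GB = 1e9 bytes; both helper names are made up for illustration.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(n_params: float, size_gb: float) -> float:
    """Average bits per weight implied by a given file size."""
    return size_gb * 1e9 * 8 / n_params


for bits in (2, 4, 8, 16):
    print(f"{bits:>2}-bit: {weight_size_gb(123e9, bits):.2f} GB")

# The 35 GB figure from the post title works out to a bit over 2 bits per weight.
print(f"{effective_bits(123e9, 35):.2f} bits/weight")
```

The gap between a pure 2-bit size and 35 GB is presumably per-group scales and any layers kept at higher precision, but that is a guess rather than something the post spells out.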

Quantize 123B Mistral-Large-Instruct-2407 to 35 GB with only 4% accuracy degeneration.

Posted by RelationshipWeekly78@reddit | LocalLLaMA | View on Reddit | 116 comments

MoMoneyMoStudy@reddit

How about a comparison w 4 and 8 bit quantization so the Mac local crowd can do an accuracy vs. hw cost vs. tok/sec analysis? For inference, not sure 2-way GPU/VRAM is the future.

New medical and financial 70b 32k Writer models

Posted by mindwip@reddit | LocalLLaMA | View on Reddit | 100 comments

The software-pain of running local LLM finally got to me - so I made my own inferencing server that you don't need to compile or update anytime a new model/tokenizer drops; you don't need to quantize or even download your LLMs - just give it a name & run LLMs the moment they're posted on HuggingFace

Posted by AbheekG@reddit | LocalLLaMA | View on Reddit | 89 comments

Is the new DDR6 the era of CPU-powered LLMs?

Posted by AlexBefest@reddit | LocalLLaMA | View on Reddit | 146 comments

70b here I come!

Posted by Mr_Impossibro@reddit | LocalLLaMA | View on Reddit | 65 comments

MoMoneyMoStudy@reddit

PCIe is the way for combining compute and VRAM. See specs for the TinyBox w 6 GPUs (Nvidia or AMD) yielding 6X24GB VRAM w close to a Petaflop of compute for inference and training. www.tinygrad.org

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

MoMoneyMoStudy@reddit

Q8 may be the sweet spot and CPUs (M3, etc) will be too slow on tok/sec. What's the price for used 400GB VRAM of A100s, or for 3 TinyBoxes (TinyGrad Corp.) of AMDs? For usable local llama at a reasonable price you're limited to quantized 70B until high fidelity distillation + LoRA improvements are developed to "shrink" 400B models without losing any of the "power", or to finetune specialized, smaller models from the 400B for non-general use.
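
A rough capacity check for the 3-box question, assuming 144GB VRAM per TinyBox (the 6x24GB config) and a hypothetical 10% headroom for KV cache and runtime buffers. The headroom number is a placeholder, not a measurement.

```python
import math


def boxes_needed(n_params: float, bits_per_weight: float,
                 vram_per_box_gb: float = 144.0,
                 overhead_fraction: float = 0.10) -> int:
    """Boxes required to hold the weights plus a fractional headroom."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    total_gb = weights_gb * (1 + overhead_fraction)
    return math.ceil(total_gb / vram_per_box_gb)


print(boxes_needed(405e9, 8, overhead_fraction=0.0))  # weights only
print(boxes_needed(405e9, 8))                         # with 10% headroom
print(boxes_needed(70e9, 8))                          # quantized 70B
```

Weights alone at Q8 squeeze into 3 boxes, but any realistic headroom pushes 405B to a 4th, while a Q8 70B sits comfortably in one.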

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

MoMoneyMoStudy@reddit

What was the 70B secret sauce? Finely curated synthetic training data? Is there a risk of targeting the training at just the benchmarks, rather than customer's use cases? (the market)

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

MoMoneyMoStudy@reddit

What training datasets did you use, and how much manual curation was involved? Nous Research has been posting impressive accuracy results on Hugging Face w their synthetic datasets.

Llama 3.1 405B EXL2 quant results

Posted by Grimulkan@reddit | LocalLLaMA | View on Reddit | 62 comments

MoMoneyMoStudy@reddit

Need detailed benchmark accuracy comparisons - q8 probably worth the extra cost for VRAM over q6, and fp16 not necessary for inference, but handy for further post-training with domain-specific data (healthcare, financial, legal, etc) and/or fine-tuning w customer data if u have the compute, or cloud budget.