RMCPhoto

All of the major open weight labs have shifted to large params general models instead of smaller, more focused models. By this time next year, there won’t be much “local” about this sub unless the paradigm shifts to smaller models good at specific domains.

Posted by LocoMod@reddit | LocalLLaMA | View on Reddit | 247 comments

RMCPhoto@reddit

It's probably the prompting strategy. I have no doubt it's very smart, but my results have also been inconsistent. My guess is that it's the same old story: the training data instills a certain syntax / language / prompt structure that differs slightly from the norm. It could even be a very tiny variation that propagates an error. Newer models have been more tolerant of this than the earlier Llamas...where adding a space before the first word would increase the error by 40%, along with other similar black-box quirks. This is honestly my biggest frustration. I'm very thankful that OpenAI released such clear cookbook content for prompt formatting. Truly, every model designer should take note. Clear documentation is such a massive booster for adoption, public opinion and end user success. Even better if that documentation is instilled into a meta prompt for prompt refinement.

Best coding model under 40B

Posted by tombino104@reddit | LocalLLaMA | View on Reddit | 66 comments

RMCPhoto@reddit

It's like resolution and asking if 4k is really any different than 1080p... For my grandma? Hell no... I mean she's dead but...she'd still know what show she's watching and get the plot. But if you're inches from the screen wondering if that tennis ball was inside or outside the line... Yes, it is critical. For coding, with complex syntax etc - you really don't want to gut a massive chunk of whatever unknown knowledge you're blindly assuming it doesn't need.

Anthropic’s ‘anti-China’ stance triggers exit of star AI researcher

Posted by balianone@reddit | LocalLLaMA | View on Reddit | 363 comments

RMCPhoto@reddit

While I am also impressed by China's growth, they are no more a beacon of "goodness" than the US tech elite. The tech elite is at least competitive with itself. China is driven by more singular authoritarian exploitation and optimization, while avoiding risky direct confrontation. Reading your message and knowing that China is also peppering the fabric of the entire internet with generative propaganda...it's hard to know if you are a real individual human, one who has fallen for the propaganda, or the propaganda itself. Let's not forget Hong Kong, Taiwan, South China, Africa, and in general how they burn super hot and often overshoot and self destruct. Ghost cities, millions of cars and no roads, massive pollution and then correction, cultural revolution, and the possibly high risk dive head first into AI...drones...etc... The authoritarian nature of the Chinese control model is high risk...it's good when it's aligned with the will of the people, but it can turn hellish on a dime.

nano-banana is a MASSIVE jump forward in image editing

Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 133 comments

deepseek-ai/DeepSeek-V3.1-Base · Hugging Face

Posted by xLionel775@reddit | LocalLLaMA | View on Reddit | 196 comments

I tried the Jan-v1 model released today and here are the results

Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 57 comments

RMCPhoto@reddit

This is why we need to know more about how it was trained and any specific prompts we should use - think the GPT-5 and GPT-4.1 cookbooks from OpenAI. It's much more important with small narrow models.

Jan v1: 4B model for web search with 91% SimpleQA, slightly outperforms Perplexity Pro

Posted by Delicious_Focus3465@reddit | LocalLLaMA | View on Reddit | 223 comments

RMCPhoto@reddit

Been a big fan of the last 3 models you've released, think this is the right direction. Can you share some of the prompts / system prompts that were used? Especially, not having access to the training data, I have been unable to replicate the results in other contexts.

gpt-oss-120b ranks 16th place on lmarena.ai (20b model is ranked 38th)

Posted by chikengunya@reddit | LocalLLaMA | View on Reddit | 97 comments

RMCPhoto@reddit

imo this is the most cursed benchmark of all time. We have no idea how manipulated any of it is. You should also all know that it's the primary site used for 'sports betting' pages.

🚀 Qwen3-4B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 137 comments

RMCPhoto@reddit

100% That is the future. I see so many bad takes on the daily that don't take this into account, even though it's clearly the roadmap. It's just going to take time to optimize a system like this. But it is the way, for both cost reduction (primary driver for industry - see GPT-5), and for the ability to tune each part of the system independently.... which is much better... Like... a billion times better lol. It's why gorilla could beat GPT4 in function calling "way back when". The problem I see is that there hasn't been a consolidation around a framework or methodology to accomplish this. MCP gets us somewhat there. But we need a bit more of an "agent" framework that's closer to metal and is a bit beyond "agent" and supports a more general concept of networking.

Take away:
- smaller narrower models will always be better and more efficient at specific tasks.
- find specific tasks that account for a high % of volume (we have plenty)
- find the smallest model that, via training with task specific data, performs at the target success rate.
- Need for a framework we agree on
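The "smallest model that hits the target success rate" step can be sketched roughly like this. Everything here is an assumption for illustration: the model labels, parameter counts and success rates are made up, and the success rate is presumed to come from your own task-specific eval set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str            # hypothetical model label, not a real release
    params_b: float      # parameter count in billions
    success_rate: float  # measured on your own task-specific eval set


def smallest_passing(candidates: Sequence[Candidate], target: float) -> Optional[Candidate]:
    """Return the smallest candidate whose success rate meets the target, or None."""
    passing = [c for c in candidates if c.success_rate >= target]
    return min(passing, key=lambda c: c.params_b) if passing else None


# Made-up numbers, purely to show the selection logic.
CANDIDATES = [
    Candidate("tiny-1b", 1.0, 0.71),
    Candidate("small-4b", 4.0, 0.93),
    Candidate("mid-8b", 8.0, 0.95),
    Candidate("large-70b", 70.0, 0.97),
]

choice = smallest_passing(CANDIDATES, target=0.90)
print(choice.name if choice else "no candidate meets the target")
```

The point of picking by size among the passing set, rather than by top score, is that the extra couple of points from the largest model rarely justify the cost for a narrow, high-volume task.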

I'm sure it's a small win, but I have a local model now!

Posted by LAKnerd@reddit | LocalLLaMA | View on Reddit | 116 comments

RMCPhoto@reddit

I think it was a really good decision. I think Kimi v2, mostly in the pattern that it is trained to reason, is probably more or less where it should go. Still really like that model. It's not as overcooked as qwen 3. They just keep re-training it and distilling it more and more. And it gets higher benchmarks, but it still has a very high error rate in the output in my tests. And I think it's because they lost some essential pre-training data somewhere along the line of fine tuning and reinforcement. I think we're going to see more and more of that. You don't need to have a distinct "forced" reasoning stage. It's better for the reasoning to be allowed to occur in the middle of a prompt too. Or just for a second before calling a tool.

I'm sure it's a small win, but I have a local model now!

Posted by LAKnerd@reddit | LocalLLaMA | View on Reddit | 116 comments

RMCPhoto@reddit

Exactly, it's from the time before distillation and reinforcement learning on its own thoughts, getting high on its own supply...just me and julio down by the school yard.

Uncensored rp models

Posted by Imaginary_Bread9711@reddit | LocalLLaMA | View on Reddit | 10 comments

RMCPhoto@reddit

My feed said, "now in r/LocalLLaMA:: Uncensored rp models" I mean...always in r/LocalLLaMA...uncensored soulless goon squad in dark bedrooms with nothing but the dim light of their dark-mode chat interface lighting their throbbing vei.... wait wtf, how'd you do this to me?

🚀 Qwen3-4B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 137 comments

RMCPhoto@reddit

Small thinking models are really only good in the areas they've received explicit reinforcement learning in. They don't generalize very well. Which is fine. But they should be targeted at a limited number of use cases. Small general models are no good. Small narrow models can be amazing.

GPT-OSS looks more like a publicity stunt as more independent test results come out :(

Posted by mvp525@reddit | LocalLLaMA | View on Reddit | 228 comments

RMCPhoto@reddit

I'm just playing devil's advocate here. But the way they approached this, the "safety" etc., will allow large corporations to adopt local models where previously there would be too much liability. I.e., they aren't going to run Qwen A3B in mail trucks.

🚀 Qwen3-4B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 137 comments

RMCPhoto@reddit

He means that sometimes it's better to use an 8B model that can get to the right answer much faster. Long long chain reasoning is an inherent problem with reinforcement learning if not tuned correctly. You can let the reinforcement learning cook forever and the reasoning ends up getting longer and longer on average. You can see DeepSeek did the same thing. A lot of Qwen models are falling into this trap. It makes them look great on benchmarks though.

🚀 Qwen3-4B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 137 comments

RMCPhoto@reddit

Yes, you're absolutely right. Frankly I'm not a fan of this either, ie I don't think the newest deepseek revision was worth the token cost. But check out non thinking models trained for function calling if you want efficiency - this is much closer to a router type model used in agent scenarios: [https://huggingface.co/watt-ai/watt-tool-8B](https://huggingface.co/watt-ai/watt-tool-8B) In my mind, the whole point of the expansion into "agents" is to enable multi-model systems where the most efficient tool for the job is used.

🚀 Qwen3-4B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 137 comments

RMCPhoto@reddit

They have completely different target use-cases despite being the same size. This is really going to be primarily a tool calling model where the optimization is more about pathfinding. Gemma 3n is designed to be more of a generative / data extraction / translation type model. I wouldn't weigh them side by side. Plus, Google's tool calling is some of the worst. 2.5 pro ranks like 40th on BFCL.

🚀 Qwen3-4B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 137 comments

RMCPhoto@reddit

[https://gorilla.cs.berkeley.edu/leaderboard.html](https://gorilla.cs.berkeley.edu/leaderboard.html) Definitely impressive. Puts it up near [https://huggingface.co/Salesforce/Llama-xLAM-2-8b-fc-r](https://huggingface.co/Salesforce/Llama-xLAM-2-8b-fc-r) at half the size. Wonder what the multi-turn is like though. That's usually where the small models struggle.

"What, you don't like your new SOTA model?"

Posted by Friendly_Willingness@reddit | LocalLLaMA | View on Reddit | 128 comments

RMCPhoto@reddit

It's not a bad model, it's just focused on a broader consumer market; something that many of the other open-source models have not managed to accomplish. This is more in line with google's true edge model philosophy. Primarily, the accomplishment of this release is not a high score on benchmarks - but a high score on inference efficiency. It would be interesting to compare it more directly to models with similar computational demands.

GPT-OSS looks more like a publicity stunt as more independent test results come out :(

Posted by mvp525@reddit | LocalLLaMA | View on Reddit | 228 comments

RMCPhoto@reddit

I think we need to manage expectations and see the real use case. Unless you built an AI rig, this is probably the best model you can run on your computer. It runs fine on CPU. (Cerebras is serving it at something like 3k tps.) It's very sensible and allows for integration into software consumers can actually use.

I made a comparison chart for Qwen3-Coder-30B-A3B vs. Qwen3-Coder-480B-A35B

Posted by Dr_Karminski@reddit | LocalLLaMA | View on Reddit | 42 comments

RMCPhoto@reddit

With essentially anything agentic, the gaps are exponential as errors compound. Just something to keep in mind. But no need to spoil this really cool release, hopefully it will be motivating to Google and OpenAI. They better stay frosty, or these Chinese teams are going to eat their lunch. Then their only business will be the industries they monopolize through regulatory capture. And the great drone wars of course.
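A rough way to see the compounding, assuming every step in an agentic chain succeeds independently with the same probability (a simplification, since real steps are correlated, and the per-step rates below are made up rather than taken from the chart):

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds,
    assuming independent steps with an identical success rate."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Made-up per-step success rates for three hypothetical models.
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"{n} steps: {chain_success(p, n):.1%}" for n in (1, 10, 50))
    print(f"per-step {p:.0%} -> {row}")
```

Under that assumption a 4 point gap per step (99% vs 95%) turns into roughly 90% vs 60% end-to-end success over ten steps, which is why small benchmark differences matter more for agents than for one-shot use.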

🚀 Qwen3-30B-A3B-Thinking-2507

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 141 comments

RMCPhoto@reddit

I knew I was being inexact and lazy there. Thanks for calling me out. If I'm honest, I couldn't objectively figure out exactly what it was. Which is one of the problems with language models / AI in general - it is inexact and hard to measure. In my own use, it hallucinated a lot more on the same data extraction / understanding tasks, from only moderate context (4k tokens max), and failed to use the structured data output as often (via pydantic\_ai's telemetry). With thinking turned off it was clearly inferior to the v2.5 equivalent, and I didn't personally have good reasoning tasks for it at the time. I think a much-much better adaptation of qwen 3 is jan-nano. Whereas if you look at the Open LLM Leaderboard, qwen3 variants do not hold up for generalized world knowledge tasks. [https://huggingface.co/spaces/open-llm-leaderboard/open\_llm\_leaderboard#/?params=7%2C65](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?params=7%2C65) Qwen3 isn't even up there.

🚀 Qwen3-30B-A3B-Thinking-2507

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 141 comments

RMCPhoto@reddit

I don't quite believe this benchmark after using it a few times after release, and I definitely wouldn't take away from this that it's a better model than its much larger sibling, or more useful and consistent than flash 2.5. I'd really have to see how these were done. It has some strange quirks imo, and I couldn't put it into any system I needed to rely on.

Bye bye, Meta AI, it was good while it lasted.

Posted by absolooot1@reddit | LocalLLaMA | View on Reddit | 432 comments

RMCPhoto@reddit

I mean, he's right. But also, neither Meta nor Zuck should be the ones in charge of super intelligence. The problem is, who should? Governments...the "trust us" people. Until we have true democracy it's going to be tough. Just keeping my fingers crossed for the true abundance path.

GLM4.5 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 266 comments

New AI architecture delivers 100x faster reasoning than LLMs with just 1,000 training examples

Posted by Accomplished-Copy332@reddit | LocalLLaMA | View on Reddit | 127 comments

RMCPhoto@reddit

The role of the LLM in the tool call scenario is selecting the right tool, providing the correct input, and parsing the response. If the tool doesn't require natural language understanding then it's a bit of a waste to use an LLM. You're right though, gorilla or Jan-nano is not "complete". Jan can manage a few steps, but what is better is to have an orchestrator that is focused only on reasoning and planning and consolidating the data Jan retrieves. This fits best in a multi agent architecture as an even smarter search tool that shields the large model from junk tokens.

UIGEN-X-0727 Runs Locally and Crushes It. Reasoning for UI, Mobile, Software and Frontend design.

Posted by smirkishere@reddit | LocalLLaMA | View on Reddit | 77 comments

RMCPhoto@reddit

What would you say are the frameworks that this model does best with, and which frameworks does it struggle with? Very cool models btw, been watching you guys and think you're on the right path. More narrow AI. More better AI.

Appreciation Post - Thank you unsloth team, and thank you bartowski

Posted by fuutott@reddit | LocalLLaMA | View on Reddit | 75 comments

New AI architecture delivers 100x faster reasoning than LLMs with just 1,000 training examples

Posted by Accomplished-Copy332@reddit | LocalLLaMA | View on Reddit | 127 comments

RMCPhoto@reddit

This is my belief too. I was convinced when we saw Berkeley release gorilla https://gorilla.cs.berkeley.edu/ in Oct 2023. Gorilla is a 7B model specialized in calling functions. It scored better than GPT-4 at the time.

More recently, everyone should really see the work at Menlo Research. Jan-nano-128k is basically the spiritual successor, a 3B model specialized in agentic research. I use Jan-nano daily as part of workflows that find and process information from all sorts of sources. I feel I haven't even scratched the surface on how creatively it could be used. They've also released Lucy, an even smaller model in the same vein that can run on edge devices. https://huggingface.co/Menlo

Or the Nous Research attempts https://huggingface.co/NousResearch/DeepHermes-ToolCalling-Specialist-Atropos

Other majorly impressive specialized small models: jina ReaderLM V2 - long context formatting / extraction. Another model I use daily. Then there's uigen https://huggingface.co/Tesslate/UIGEN-X-8B a small model for assembling front end. Wildly cool.

Within my coding agents, I use several small models fine tuned on code to extract and compress context from large code bases. Small, domain specific reasoning models are also very useful.

I think the future is agentic and a collection of specialized, domain specific small models. It just makes more sense. Large models will still have their place, but they won't be the hammer for everything.

Qwen3-235B-A22B-Thinking-2507 released!

Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 190 comments

RMCPhoto@reddit

I love what the Qwen team cooks up, the 2.5 series will always have a place in the trophy room of open LLMs. But I can't help but feel that the 3 series has some fundamental flaws that aren't getting fixed in these revisions and don't show up on benchmarks. Most of the serious engineers focused on fine tuning have more consistent results with 2.5. The big coder model tested way higher than Kimi, but in practice I think most of us feel the opposite. I just wish they wouldn't inflate the scores, or would focus on some more real world targets.

Tested Kimi K2 vs Qwen-3 Coder on 15 Coding tasks - here's what I found

Posted by West-Chocolate2977@reddit | LocalLLaMA | View on Reddit | 63 comments

RMCPhoto@reddit

Haha, it's so true... They get on a confidence kick and then the autoregressive nature kicks in and they build into a manic state where everything is fixed and perfect while the whole code base burns around them.

Tested Kimi K2 vs Qwen-3 Coder on 15 Coding tasks - here's what I found

Posted by West-Chocolate2977@reddit | LocalLLaMA | View on Reddit | 63 comments

RMCPhoto@reddit

Kimi uses sparse routing (halved heads - 50% FLOP reduction), qwen3 uses wider attention and a deeper KV cache. It's not as straightforward as parameters.

Tested Kimi K2 vs Qwen-3 Coder on 15 Coding tasks - here's what I found

Posted by West-Chocolate2977@reddit | LocalLLaMA | View on Reddit | 63 comments

RMCPhoto@reddit

This is not quite true, it is trained in reasoning, it just needs to be enacted in a different way. A good quick way to exercise the reasoning ability (without making your own complex prompt) is to use an MCP like Sequential-Thinking, or Clear-Thought. These create a structured approach to reasoning and are imo superior in token efficiency to the traditional reasoning + output model dynamic and give you far more control over the process. It also makes the model's architecture as a whole more efficient. Ever try to use the qwen3 models with think turned off? They're so much worse than qwen 2.5 at the same size. That's a big downside. I think this will be the new way and that the current reasoning paradigm will go away.

Qwen3-235B-A22B-2507 Released!

Posted by pseudoreddituser@reddit | LocalLLaMA | View on Reddit | 265 comments

RMCPhoto@reddit

That said, I wonder how well it really handles long context comprehension / without losing output quality. Looking at parasail on openrouter (and the price could just be intro) it's 1/5 the token cost and has a context window twice as large. I think these might just be very different models and not necessarily in direct competition... though they sure did take the gloves off with that bar chart... (so sick of benchmarks)

Qwen3-235B-A22B-2507

Posted by Mysterious_Finish543@reddit | LocalLLaMA | View on Reddit | 92 comments

RMCPhoto@reddit

Such good news, this is the way. The whole "reasoning" baked into the architecture as a front-loaded process doesn't make so much sense. The reasoning should be continuous throughout the generation on an as needed basis. The way reasoning models work...you could just have a separate model that ONLY does reasoning and no generation. That would be a more compelling architecture.

Qwen3-235B-A22B-2507

Posted by Mysterious_Finish543@reddit | LocalLLaMA | View on Reddit | 92 comments

RMCPhoto@reddit

The difference is in the agentic behavior and dynamic reasoning. I am exercising qwen's new bench king here - but what's special about Kimi is that it is not a reasoning model, but it is post-trained in reasoning before acting and planning. Which seems to work out better from a token conservation and logging point of view. Now the decisions are not obfuscated by "think" etc, you simply write instructions to reflect / plan / act as needed. And kimi is very very good at working like an agent. So, one shot code generation is one thing, but being an effective, capable coding partner is something else.

Meta says it won't sign Europe AI agreement, calling it an overreach that will stunt growth

Posted by ttkciar@reddit | LocalLLaMA | View on Reddit | 102 comments

RMCPhoto@reddit

Tbf, I am in Sweden and I would say the exact same thing. Most people in this space would agree if they read the bill. It seems like whatever you produce can be classified as a risk and shut down in almost all of the scenarios where AI may actually be useful and not just a toy. Yet there is no specificity as to what exactly the criteria are...it's just signing over absolute authority to arbitrary decisions made by future out-of-touch bureaucrats.

Semantic code search for local directory

Posted by codingjaguar@reddit | LocalLLaMA | View on Reddit | 15 comments

RMCPhoto@reddit

I think you may be focused too narrowly. There are so many use cases for semantic code search that are not nearly as feasible with grep. For example, identifying coding patterns / reducing complexity by identifying redundant processes / finding code by what it does rather than explicit values (can also approach this by having an agent write a mock function as the query).

T5Gemma: A new collection of encoder-decoder Gemma models- Google Developers Blog

Posted by DeltaSqueezer@reddit | LocalLLaMA | View on Reddit | 22 comments

RMCPhoto@reddit

It's definitely interesting. I'm not sure it improves normal text gen use cases - but they cited that it did improve "safety" and control methods. Wondering what other unique use cases it might serve.

Skywork/Skywork-R1V3-38B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 37 comments

RMCPhoto@reddit

Tbh, this is a plague across the entire scientific / academic community. I just spent 3 weeks poring over literally thousands of computer vision papers from 2023-2025 (tracking, segmentation, action identification, classification, video encoders, and others). Literally almost every single paper claimed that their solution beat the state of the art - and it was shown via select benchmarks. The problem with this academic bullshit is that most of the time it only works in the lab...or it is otherwise very fragile. I recreated at least 100 different solutions and none generalized to problems in the wild. And many were complete crap and I'm not sure how they got their results in the first place.

Here is how we beat ChatGPT at classification with 1 dollar in cloud compute

Posted by iamMess@reddit | LocalLLaMA | View on Reddit | 43 comments

New study from Cohere shows Lmarena (formerly known as Lmsys Chatbot Arena) is heavily rigged against smaller open source model providers and favors big companies like Google, OpenAI and Meta

Posted by obvithrowaway34434@reddit | LocalLLaMA | View on Reddit | 100 comments

RMCPhoto@reddit

I don't, but I also don't know how big it is. Gemini flash 2.5 is probably the most cost effective model for long context data extraction and summarization. Nothing is close here. Unfortunately, Gemini pro and flash both absolutely suck at using tools (including search).

Created an Open Source Conversation Response Path Exploration System using Monte Carlo Tree Search

Posted by ManavTheWorld@reddit | LocalLLaMA | View on Reddit | 15 comments

RMCPhoto@reddit

I think OP's implementation (and clean code that the community can use) is excellent. However, there is a near mountain of research papers on using MCTS with LLM as judge. (Just a very very quick skim) [https://arxiv.org/abs/2505.23229](https://arxiv.org/abs/2505.23229) [https://arxiv.org/abs/2504.02426](https://arxiv.org/abs/2504.02426) [https://arxiv.org/abs/2504.11009](https://arxiv.org/abs/2504.11009) [https://arxiv.org/abs/2502.13428](https://arxiv.org/abs/2502.13428) [https://arxiv.org/abs/2503.19309](https://arxiv.org/abs/2503.19309)

Created an Open Source Conversation Response Path Exploration System using Monte Carlo Tree Search

Posted by ManavTheWorld@reddit | LocalLLaMA | View on Reddit | 15 comments

RMCPhoto@reddit

Cool project, love algorithmic approaches like this and it looks clean and actually usable. One option to get a better idea of how a user might respond is to lean on some free datasets:

- Create an embedding and find similar conversational objects / perform sentiment analysis etc. [https://github.com/PolyAI-LDN/conversational-datasets](https://github.com/PolyAI-LDN/conversational-datasets) [https://huggingface.co/datasets/allenai/WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M)

Otherwise, I would definitely recommend creating mechanisms for self improvement - if not in a live agentic loop, then by collecting the right data over time (assuming that's the goal and we don't want to actually run 5x chats for every message). In which case it can be helpful to perform a clustering or statistical semantic analysis on the winners and losers and identify patterns (and/or expand on the LLM as a judge and additionally export structured information that can be used to improve the prompt).

Any updates on Llama models from Meta?

Posted by True_Requirement_891@reddit | LocalLLaMA | View on Reddit | 21 comments

RMCPhoto@reddit

They also may have been a LITTLE lucky. Considering the regressions and the underwhelming performance of 2.5 - especially with tool use.

GLM-4.1V-Thinking

Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 48 comments

RMCPhoto@reddit

Yeah, it does seem strange doesn't it... Some of this abstraction related confusion would be resolved by moving towards character level tokens, but this would reduce the throughput and require significantly more predictions. The tokens have also been adjusted over time to improve comprehension of specific content. Like tabbed codeblocks. I believe various tab/space combinations were explicitly added to improve code comprehension, as it was previously a bit unpredictable and would vary depending on the first characters in the code blocks. The error rate of early llama models would also vary WILDLY with very small changes to tokens. Something as simple as starting the user query with a space would swing error 40%. This is still a major issue all over the place. Small changes to text can have unpredictable impacts on the resulting prediction even though to a person it would mean the same thing.
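To make the leading-space point concrete, here is a toy sketch. It is not a real tokenizer: the greedy longest-match function and the tiny vocabulary are invented for illustration, and real BPE vocabularies behave differently in detail. It only shows that a leading space changes the first token the model sees, and that character-level tokens mean many more predictions for the same text.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy longest-match tokenizer; unknown text falls back to single characters."""
    tokens = []
    max_len = max(len(t) for t in vocab)
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary: "What" and " What" are separate entries,
# the way word-initial and space-prefixed pieces often are in subword vocabularies.
VOCAB = {"What", " What", " is", " the", " capital", " of", " France", "?"}

prompt = "What is the capital of France?"
print(greedy_tokenize(prompt, VOCAB))
print(greedy_tokenize(" " + prompt, VOCAB))
print("character-level predictions needed:", len(prompt))
```

To a person both prompts are the same question, but the model starts from a different token in each case, and whichever variant was rarer in training is the one more likely to go sideways.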

GLM-4.1V-Thinking

Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 48 comments

RMCPhoto@reddit

If it's in the dataset and is important enough to be known verbatim, then yes, it would work. Think of it this way, LLMs are also not good at counting the words in a paragraph, the number of periods in "..........", or other similar methods of evaluating the numerical nature of the prompt via prediction. It can get close because of its exposure in training data to labeled paragraphs of certain word counts, or similar, to make a rough inference, but there is no reasoning / reinforcement learning method that can be used to do this accurately. In essence, the language model is not self aware and does not know that the prompt / context is tokens instead of text...I think they should instead ensure that RL/fine tuning instills knowledge of its own limitations rather than wasting parameter configurations on fruitlessly 🍓 trying to solve this low value issue. In fact, even the dumbest language models can easily solve all of the problems above...very easily... I'm sure even a 3b model could. The solution is to ask it to write a python script to provide the answer. Most models / agents will hopefully have this capability. (Python in sandbox). And this is the right approach.

1. Use an LLM for what it is good for.
2. Identify its blind spots, and understand why those blind spots exist.
3. Teach the model about those blind spots in fine tuning and provide the correct tool to answer those problems.
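For reference, the kind of script I mean is trivial. This is just a sketch of what a model might emit into a sandbox; the function names are made up for illustration.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def count_char(text: str, char: str) -> int:
    """Count non-overlapping occurrences of a character (or substring)."""
    return text.count(char)


print(count_char("strawberry", "r"))  # 3
print(count_char("..........", "."))
print(count_words("LLMs are also not good at counting the words in a paragraph"))
```

The code works on the actual characters, so it never has the tokens-vs-text problem. The model only needs to know when to reach for it.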