llama 3.1 70B is absolutely awful at tool usage
Posted by fireKido@reddit | LocalLLaMA | View on Reddit | 41 comments
Hi guys,
I am experimenting with a langgraph multi-agent setup, and I tested it with GPT-4o: everything works well and the results are pretty impressive.
After that, I tested it with ollama and llama3.1:70b-instruct-q8_0, and the results are absolutely disappointing: it can't correctly structure a single tool call, it completely ignores the info I give it, it forgets parameters for the function calls, and it makes other similar silly mistakes.
My question is: is this your experience as well? I'm afraid I'm doing something wrong, because I generally read positive stuff about llama3...
VulcanizadorTTL@reddit
Same. I cannot offer anything but 4o-mini for cheap tool usage, in Spanish, for any kind of professional work.
_errant_monkey_@reddit
One thing I've noticed (with both llama 8B and 70B) is that they perform much better without the "Environment: ipython" line in the system prompt. That line makes the model pretty much refuse to reply even to 2+2 without calling a function. Plus, from https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling I don't understand the added value of it.
Plus, IMO there are a few mistakes in how they handle function calling for llama 3.1 8B in the gorilla codebase. There are a couple of missing spaces in the system prompt they fed to the model.
llama 3.1 8B instruct is still the base model of ToolACE, which is one of the best 8B (and overall) models on the leaderboard.
Mushoz@reddit
Did you set the correct context size? Ollama uses a default of 2048. If you don't change it and you exceed the context, your LLM might not be getting (part of) your prompt, since it gets truncated, which explains why it's not following your instructions (it has no access to them).
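A minimal sketch of raising the limit when driving ollama from langchain/langgraph (assuming the langchain-ollama package and its num_ctx option; the same setting can also be passed through ollama's request options or a Modelfile):

```python
# Hedged sketch: raise Ollama's context window above the 2048-token default so
# long tool-calling prompts aren't silently truncated. Assumes the
# langchain-ollama package; num_ctx maps to Ollama's context-length option.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:70b-instruct-q8_0",
    temperature=0,
    num_ctx=16384,  # context window in tokens; size it to fit prompts + tool schemas
)
```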
fireKido@reddit (OP)
OH MY GOD, this was it… this is so dumb, why would you set such a short context by default? It was just being forgetful ahahah
Mushoz@reddit
See the top comment posted in this thread: https://www.reddit.com/r/LocalLLaMA/comments/1gcgptz/what_are_your_most_unpopular_llm_opinions/
It's been a major gripe for a long time now. Very unfortunate choice on Ollama's part.
fireKido@reddit (OP)
Uh, this is interesting, and could very well be the reason, thank you I’ll test that out
astralDangers@reddit
Yes, this is common. The issue is that the quantization destroys the accuracy needed to structure the function call.
ResidentPositive4122@reddit
langchain/graph/whatever is based on "it works on my machine but only if you use oai gpts" prompts hidden under 5 levels of abstraction. No wonder they don't work ootb with other models.
mwmercury@reddit
This is the answer. Langchain is a poison to the LLM community.
No-Detective-5352@reddit
Take a look at the Berkeley Function Calling Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html There are some great smaller models with permissive licenses (e.g., Qwen2.5-7B).
The key metrics to consider are Abstract Syntax Tree evaluation, relevance detection (at least one of the function choices provided is relevant to the user query and should be invoked), and irrelevance detection (none of the function choices provided are relevant to the user query and none should be invoked).
AdditionalWeb107@reddit
https://huggingface.co/katanemo/Arch-Function-3B - while not published on the leaderboard, this is the best price/performance model for this use case.
No-Detective-5352@reddit
Thanks! That looks very promising, I will check it out.
Huanghe_undefined@reddit
This is not accurate at all. Their ground truth is hallucinated to some extent, and I do not think they are using the correct function-calling template for every model. Llama 3 8B gives 95+% in my evaluation with their data (excluding the multi-turn part).
cyan2k@reddit
Feel free to share your data/templates/inference code. Would also love to compare. Also you can surely create a pull request because improvements in methodology are always appreciated!
Key_Extension_6003@reddit
Great link
gentlecucumber@reddit
Are you using an inference framework and tool template that supports llama tool calling, or just trying to prompt engineer it into doing tool calling like it's 2023?
OKArchon@reddit
What framework can you recommend for someone who's still in 2023?
gentlecucumber@reddit
I use vLLM as my backend for running models - both for work and on my personal Windows machine with WSL. vLLM serves an OpenAI-compatible endpoint, so you can use any Python client that you would with GPT-4. Enabling tool calling is a simple parameter you set when launching the vLLM server. You'll also need a llama or Mistral tool-calling Jinja template, both of which can be found in the vLLM GitHub repo; download one and pass its path as another parameter to the vLLM serve command. Then BAM, all done. You can use your local model to do tool calling for any app or project that worked with OpenAI. I use Mistral Nemo 12b at FP8 for fast and simple agentic stuff in langgraph, and Mistral Small 22b at full precision if the tools are complicated or have a lot of nuance.
novel_market_21@reddit
Can you give your full command? It would be very appreciated.
gentlecucumber@reddit
Here's the command I run at home for Mistral Nemo 12b. It's looking at a local copy in my 'models' folder, running at FP8 precision with a quantized KV cache on both of my 3090 GPUs, enabling tool use with a Jinja template, and spoofing the 'gpt-4o-mini' model name to make some langgraph stuff a little simpler:
vllm serve models/Mistral-Nemo-Instruct-2407-FP8 --chat-template ~/chat-templates/tool_chat_template_mistral.jinja --max-model-len 10000 --gpu-memory-utilization 0.99 --kv-cache-dtype fp8_e4m3 --port 5111 --tensor-parallel-size 2 --enforce-eager --served-model-name localmodel gpt-4o-mini --enable-auto-tool-choice --tool-call-parser mistral --api-key foo
Use with the regular OpenAI client SDK and just set base URL to localhost.
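For example, a minimal sketch of the client side against the command above (port 5111, api key "foo", the spoofed "gpt-4o-mini" name; get_weather is just a made-up illustrative tool):

```python
# Hedged sketch: the standard OpenAI Python SDK pointed at the local vLLM server
# launched with the command above. The tool definition is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5111/v1", api_key="foo")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # matches --served-model-name in the vllm serve command
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```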
fireKido@reddit (OP)
I’m using langgraph “OllamaChat” interface, but I’ll test with vLLM as you describe
Slimxshadyx@reddit
I used Ollama and connected it to Llama 3.1 8b in Python and tool calling worked well. With that smaller model it would tend to call tools a lot more than it should have, but the calls were right
fireKido@reddit (OP)
What did you use? I’m using langgraph’s interface, maybe that’s the problem
It just can't get the tool parameters right… also, it doesn't seem to be capable of responding without calling a tool, which is annoying.
drunnells@reddit
Ha, I like the way you phrased that! I've had luck with my simple stuff using JSON examples in the prompt with 3.1 70B and the "like it's 2023" method, calling llama.cpp's llama-server. The post-2023 method sounds advanced, proprietary and scary to me.
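Roughly what that looks like, as a hedged sketch against llama-server's OpenAI-compatible endpoint (default port 8080 assumed; the get_weather schema in the prompt and the JSON parsing are hand-rolled rather than framework-managed):

```python
# Hedged sketch of the "describe the tool in the prompt and parse the JSON yourself"
# approach against llama.cpp's llama-server.
import json
import requests

SYSTEM = (
    "You can call one tool: get_weather(city: str). "
    'To call it, reply with ONLY a JSON object like {"tool": "get_weather", "arguments": {"city": "Rome"}}. '
    "Otherwise answer normally."
)

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What's the weather in Rome?"},
        ],
        "temperature": 0,
    },
)
content = resp.json()["choices"][0]["message"]["content"]

try:
    call = json.loads(content)  # did the model emit a tool call?
    print("tool call:", call)
except json.JSONDecodeError:
    print("plain answer:", content)
```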
glow_storm@reddit
Try llama3-groq-tool-use on Ollama; it's the best model I've tried with Ollama and langgraph for tool use.
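A hedged sketch of trying it with the ollama Python package's tools parameter (the exact response shape can vary between package versions; get_weather is a made-up example tool):

```python
# Hedged sketch: native tool calling through the ollama Python package with the
# llama3-groq-tool-use model suggested above. get_weather is illustrative only.
import ollama

response = ollama.chat(
    model="llama3-groq-tool-use",
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    options={"num_ctx": 8192},  # again, don't rely on the 2048-token default
)
print(response["message"])  # inspect content and any tool_calls the model emitted
```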
fireKido@reddit (OP)
Nice tip, I’ll try it out
ithkuil@reddit
Did you use temperature 0?
fireKido@reddit (OP)
Yes I did… still no luck
gittb@reddit
Hello, try tabbyAPI and use the tool-calling template included by default in the repo.
I wrote the tools integration for it, and it enforces a JSON schema so tool calls parse correctly. Works quite well.
____vladrad@reddit
I use the 70B AWQ 4-bit quant. I ran into these problems during my testing even with GPT-4o at longer contexts. In the OpenAI docs they mention it's not 100% accurate and to use structured extraction.
What I do is detect whether there is a function call in the string, and if there is, I extract the JSON with a library called instructor. Works 100% of the time. From the OpenAI docs: "In August 2024, we launched Structured Outputs. When you turn it on by setting strict: true in your function definition, Structured Outputs ensures that the arguments generated by the model for a function call exactly match the JSON Schema you provided in the function definition." This library is so good!
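As a rough sketch of that fallback (assuming instructor's from_openai wrapper and a pydantic schema; WeatherCall, the endpoint and the model name are made up for illustration):

```python
# Hedged sketch: pull a clean, schema-validated function call out of messy model
# output with instructor + pydantic. Schema, endpoint and model name are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class WeatherCall(BaseModel):
    function_name: str
    city: str


client = instructor.from_openai(
    OpenAI(base_url="http://localhost:8000/v1", api_key="foo")
)

raw_output = 'Sure! Calling get_weather({"city": "Madrid"}) now.'  # messy LLM output

call = client.chat.completions.create(
    model="llama-3.1-70b-awq",  # placeholder model name
    response_model=WeatherCall,  # instructor validates (and retries) against this schema
    messages=[{
        "role": "user",
        "content": f"Extract the function call from this text: {raw_output}",
    }],
)
print(call.function_name, call.city)
```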
Also give lmdeploy a go for a llama 70B tool API. Their OpenAI-compatible API does not support tool calling with streaming, but it works otherwise. Also, I used to come in here telling people I was getting 35 t/s, but after the cache warms up I'm getting like 100-200 t/s on my A100. Whatever they do with lmdeploy is amazing.
BigYoSpeck@reddit
I've used Semantic Kernel in .NET, which calls ollama through the OpenAI-compatible API, and it calls tools OK even with just the 3B model.
The only issue I've had is with multiple tool calls, where it wants to use the output of one as input to another: it hands the tool call itself as a string to the native function.
AdditionalWeb107@reddit
That's because llama 3.1 70B isn't trained on that use case. This might be unrelated to your specific llama 3.1 70B usage, but you might want to give this a look: https://github.com/katanemo/arch. This is the framework around https://huggingface.co/katanemo/Arch-Function-3B. In this case, your APIs are the tool calls, and the project determines when to call which API based on the user prompt. A different approach.
ggone20@reddit
I've not had the same experience at all. I find that 3.1 70b is the best open model for most tasks. 3.2 11b and 90b seem like trash to me - META claimed they were drop-in replacements, but I find the vision variants to be pretty shit. They just echo statements and commands continuously. I had a conversation with both llama 3.1 405b and ChatGPT 4o about it, and the consensus was that adding vision capabilities potentially made the model 'more creative' - so despite setting temperature to 0, it still gives wild outputs and hallucinates all the time.
I literally thought the most egregious hallucinations were a thing of the past, but the 3.2 series of LLaMA proved me wrong. I stick with 3.1 70b and use 405b, ChatGPT 4o, or o1 for advanced queries.
fireKido@reddit (OP)
I'm starting to suspect the issue is not the llama model, but rather langgraph's ollama integration… I'll try using vLLM, which exposes an OpenAI-style interface, and hopefully it will work.
rustedrobot@reddit
Have you tested any other models? Do you get the same results?
fireKido@reddit (OP)
I didn't test too many; mostly some fine-tuned versions of llama and some other smaller models, but nothing satisfactory so far.
Enough-Meringue4745@reddit
Qwen2.5 is wicked good at tools
gamesntech@reddit
Llama 3 models in general are terrible at tool usage. I have done a lot of tests with it. If you try to do anything more than a toy sample or example, it just performs poorly in this area. You're better off finding something more reliable.
ChernobogDan@reddit
Are you setting the context size, or relying on the default of 2048?
zra184@reddit
Has not been my experience, but I'm using my own implementation of Llama 3.1. Even 8b seems remarkably adept at calling tools. Do you have any prompts you'd be willing to share?
Linkpharm2@reddit
Try qwen