llama 3.1 70B is absolutely awful at tool usage
Posted by fireKido@reddit | LocalLLaMA | View on Reddit | 41 comments
Hi guys,
I am experimenting with a langgraph multi-agent setup, and I tested it with GPT-4o: everything works well and the results are pretty impressive.
After that, I tested it with ollama and llama3.1:70b-instruct-q8_0, and the results are absolutely disappointing: it can't correctly structure a single tool call, it completely ignores the info I give it, it forgets parameters for the function calls, and it makes other similar silly mistakes.
My question is: is this your experience as well? I'm afraid I'm doing something wrong, because I generally read positive stuff about llama3...
VulcanizadorTTL@reddit
Same. I cannot offer anything but 4o-mini for cheap tool usage, in Spanish, for any kind of professional work.
_errant_monkey_@reddit
One thing I've noticed (with both llama 8B and 70B) is that they perform much better without the "Environment: ipython" line in the system prompt. That line makes the model pretty much refuse to reply even to 2+2 without calling a function. Plus, from https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling I don't understand the added value of it.
Plus, IMO there are a few mistakes in how they handle function calling for llama 3.1 8B in the gorilla codebase. There are a couple of missing spaces in the system prompt they fed to the model.
llama 3.1 8B instruct is still the base model of ToolACE, which is one of the best 8B (and overall) models on the leaderboard.
Mushoz@reddit
Did you set the correct context size? Ollama uses a default of 2048. If you don't change it and you exceed the context, your LLM might not be getting (part of) your prompt, since it gets truncated, which explains why it's not following your instructions (it has no access to them).
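A minimal sketch of raising the limit when driving ollama from langchain/langgraph (assuming the langchain-ollama package and its num_ctx option; the same setting can also be passed through ollama's request options or a Modelfile):

```python
# Hedged sketch: raise Ollama's context window above the 2048-token default so
# long tool-calling prompts aren't silently truncated. Assumes the
# langchain-ollama package; num_ctx maps to Ollama's context-length option.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:70b-instruct-q8_0",
    temperature=0,
    num_ctx=16384,  # context window in tokens; size it to fit prompts + tool schemas
)
```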
fireKido@reddit (OP)
OH MY GOD, this was it… this is so dumb, why would you set such a short context by default? It was just being forgetful ahahah
Mushoz@reddit
See the top comment posted in this thread: https://www.reddit.com/r/LocalLLaMA/comments/1gcgptz/what_are_your_most_unpopular_llm_opinions/
It's been a major gripe for a long time now. Very unfortunate choice on Ollama's part.
fireKido@reddit (OP)
Uh, this is interesting, and could very well be the reason, thank you I’ll test that out
astralDangers@reddit
Yes, this is common. The issue is that the quantization destroys the accuracy needed to structure the function call.
ResidentPositive4122@reddit
langchain/graph/whatever is based on "it works on my machine but only if you use oai gpts" prompts hidden under 5 levels of abstraction. No wonder they don't work ootb with other models.
mwmercury@reddit
This is the answer. Langchain is a poison to the LLM community.
No-Detective-5352@reddit
Take a look at the Berkeley Function Calling Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html There are some great smaller models with permissive licenses (e.g., Qwen2.5-7B).
The key metrics to consider are Abstract Syntax Tree evaluation, relevance detection (at least one of the function choices provided is relevant to the user query and should be invoked), and irrelevance detection (none of the function choices provided are relevant to the user query and none should be invoked).
AdditionalWeb107@reddit
https://huggingface.co/katanemo/Arch-Function-3B - while not published on the leaderboard, this is the best price/performance model for this use case.
No-Detective-5352@reddit
Thanks! That looks very promising, I will check it out.
Huanghe_undefined@reddit
This is not accurate at all. Their ground truth is hallucinated to some extent, and I do not think they are using the correct function-calling template for every model. Llama 3 8B gives 95+% in my evaluation with their data (excluding the multi-turn part).
cyan2k@reddit
Feel free to share your data/templates/inference code. Would also love to compare. Also you can surely create a pull request because improvements in methodology are always appreciated!
Key_Extension_6003@reddit
Great link
gentlecucumber@reddit
Are you using an inference framework and tool template that supports llama tool calling, or just trying to prompt engineer it into doing tool calling like it's 2023?
OKArchon@reddit
What framework can you recommend for someone who's still in 2023?
gentlecucumber@reddit
I use vLLM as my backend for running models - both for work and on my personal Windows machine with WSL. vLLM serves an OpenAI-compatible endpoint, so you can use any Python client that you would with GPT-4. Enabling tool calling is a simple parameter you set when launching the vLLM server. You'll also need a llama or Mistral tool-calling Jinja template, both of which can be found in the vLLM GitHub repo; download one and pass its path as another parameter to the vLLM serve command. Then BAM, all done. You can use your local model to do tool calling for any app or project that worked with OpenAI. I use Mistral Nemo 12b at FP8 for fast and simple agentic stuff in langgraph, and Mistral Small 22b at full precision if the tools are complicated or have a lot of nuance.
novel_market_21@reddit
Can you give your full command? It would be very appreciated.
gentlecucumber@reddit
Here's the command I run at home for Mistral Nemo 12b. It's looking at a local copy in my 'models' folder, running at FP8 precision with a quantized KV cache on both of my 3090 GPUs, enabling tool use with a Jinja template, and spoofing the 'gpt-4o-mini' model name to make some langgraph stuff a little simpler:
vllm serve models/Mistral-Nemo-Instruct-2407-FP8 --chat-template ~/chat-templates/tool_chat_template_mistral.jinja --max-model-len 10000 --gpu-memory-utilization 0.99 --kv-cache-dtype fp8_e4m3 --port 5111 --tensor-parallel-size 2 --enforce-eager --served-model-name localmodel gpt-4o-mini --enable-auto-tool-choice --tool-call-parser mistral --api-key foo
Use with the regular OpenAI client SDK and just set base URL to localhost.
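For example, a minimal sketch of the client side against the command above (port 5111, api key "foo", the spoofed "gpt-4o-mini" name; get_weather is just a made-up illustrative tool):

```python
# Hedged sketch: the standard OpenAI Python SDK pointed at the local vLLM server
# launched with the command above. The tool definition is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5111/v1", api_key="foo")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # matches --served-model-name in the vllm serve command
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```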
fireKido@reddit (OP)
I’m using langgraph “OllamaChat” interface, but I’ll test with vLLM as you describe
Slimxshadyx@reddit
I used Ollama and connected it to Llama 3.1 8b in Python and tool calling worked well. With that smaller model it would tend to call tools a lot more than it should have, but the calls were right
fireKido@reddit (OP)
What did you use? I’m using langgraph’s interface, maybe that’s the problem
It just can't get the tool parameters right… also, it doesn't seem to be capable of responding without calling a tool, which is annoying.
drunnells@reddit
Ha, I like the way you phrased that! I've had luck with my simple stuff using JSON examples in the prompt with 3.1 70B and the "like it's 2023" method, calling llama.cpp's llama-server. The post-2023 method sounds advanced, proprietary and scary to me.
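Roughly what that looks like, as a hedged sketch against llama-server's OpenAI-compatible endpoint (default port 8080 assumed; the get_weather schema in the prompt and the JSON parsing are hand-rolled rather than framework-managed):

```python
# Hedged sketch of the "describe the tool in the prompt and parse the JSON yourself"
# approach against llama.cpp's llama-server.
import json
import requests

SYSTEM = (
    "You can call one tool: get_weather(city: str). "
    'To call it, reply with ONLY a JSON object like {"tool": "get_weather", "arguments": {"city": "Rome"}}. '
    "Otherwise answer normally."
)

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What's the weather in Rome?"},
        ],
        "temperature": 0,
    },
)
content = resp.json()["choices"][0]["message"]["content"]

try:
    call = json.loads(content)  # did the model emit a tool call?
    print("tool call:", call)
except json.JSONDecodeError:
    print("plain answer:", content)
```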
glow_storm@reddit
Try llama3-groq-tool-use on Ollama; it's the best model I've tried with Ollama and langgraph for tool use.
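A hedged sketch of trying it with the ollama Python package's tools parameter (the exact response shape can vary between package versions; get_weather is a made-up example tool):

```python
# Hedged sketch: native tool calling through the ollama Python package with the
# llama3-groq-tool-use model suggested above. get_weather is illustrative only.
import ollama

response = ollama.chat(
    model="llama3-groq-tool-use",
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    options={"num_ctx": 8192},  # again, don't rely on the 2048-token default
)
print(response["message"])  # inspect content and any tool_calls the model emitted
```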
fireKido@reddit (OP)
Nice tip, I’ll try it out
ithkuil@reddit
Did you use temperature 0?
fireKido@reddit (OP)
Yes I did… still no luck
gittb@reddit
Hello, try tabbyAPI and use the tool-calling template included by default in the repo.
I wrote the tools integration for it, and it enforces a JSON schema so tool calls parse correctly. Works quite well.
____vladrad@reddit
I use the 70B AWQ 4-bit quant. I ran into these problems during my testing even with GPT-4o at longer contexts. In the OpenAI docs they mention it's not 100% accurate and to use structured extraction.
What I do is detect whether there is a function call in the string, and if there is, I extract the JSON with a library called instructor. Works 100% of the time. From the OpenAI docs: "In August 2024, we launched Structured Outputs. When you turn it on by setting strict: true in your function definition, Structured Outputs ensures that the arguments generated by the model for a function call exactly match the JSON Schema you provided in the function definition." This library is so good!
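As a rough sketch of that fallback (assuming instructor's from_openai wrapper and a pydantic schema; WeatherCall, the endpoint and the model name are made up for illustration):

```python
# Hedged sketch: pull a clean, schema-validated function call out of messy model
# output with instructor + pydantic. Schema, endpoint and model name are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class WeatherCall(BaseModel):
    function_name: str
    city: str


client = instructor.from_openai(
    OpenAI(base_url="http://localhost:8000/v1", api_key="foo")
)

raw_output = 'Sure! Calling get_weather({"city": "Madrid"}) now.'  # messy LLM output

call = client.chat.completions.create(
    model="llama-3.1-70b-awq",  # placeholder model name
    response_model=WeatherCall,  # instructor validates (and retries) against this schema
    messages=[{
        "role": "user",
        "content": f"Extract the function call from this text: {raw_output}",
    }],
)
print(call.function_name, call.city)
```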
Also give lmdeploy a go for a llama 70B tool API. Their OpenAI-compatible API does not support tool calling with streaming, but it works otherwise. Also, I used to come in here telling people I was getting 35 t/s, but after the cache warms up I'm getting like 100-200 t/s on my A100. Whatever they do with lmdeploy is amazing.
BigYoSpeck@reddit
I've used Semantic Kernel in .NET, which calls ollama through the OpenAI-compatible API, and it calls tools OK even with just the 3B model.
The only issue I've had is with multiple tool calls, where it wants to use the output of one as input to another: it hands the tool call itself as a string to the native function.
AdditionalWeb107@reddit
That's because llama 3.1 70B isn't trained on that use case. This might be unrelated to your specific llama 3.1 70B usage, but you might want to give this a look: https://github.com/katanemo/arch. This is the framework around https://huggingface.co/katanemo/Arch-Function-3B. In this case, your APIs are the tool calls, and the project determines when to call which API based on the user prompt. A different approach.
ggone20@reddit
I've not had the same experience at all. I find that 3.1 70b is the best open model for most tasks. 3.2 11b and 90b seem like trash to me - META claimed they were drop-in replacements, but I find the vision variants to be pretty shit. They just echo statements and commands continuously. I had a conversation with both llama 3.1 405b and ChatGPT 4o about it, and the consensus was that adding vision capabilities potentially made the model 'more creative' - so despite setting temperature to 0, it still gives wild outputs and hallucinates all the time.
I literally thought the most egregious hallucinations were a thing of the past, but the 3.2 series of LLaMA proved me wrong. I stick with 3.1 70b and use 405b, ChatGPT 4o, or o1 for advanced queries.
fireKido@reddit (OP)
I'm starting to suspect the issue is not the llama model, but rather langgraph's ollama integration… I'll try using vLLM, which exposes an OpenAI-style interface, and hopefully it will work.
rustedrobot@reddit
Have you tested any other models? Do you get the same results?
fireKido@reddit (OP)
I didn't test too many; mostly some fine-tuned versions of llama and some other smaller models, but nothing satisfactory so far.
Enough-Meringue4745@reddit
Qwen2.5 is wicked good at tools
gamesntech@reddit
Llama 3 models in general are terrible at tool usage. I have done a lot of tests with it. If you try to do anything more than a toy sample or example, it just performs poorly in this area. You're better off finding something more reliable.
ChernobogDan@reddit
Are you setting the context size, or relying on the default of 2048?
zra184@reddit
Has not been my experience, but I'm using my own implementation of Llama 3.1. Even 8b seems remarkably adept at calling tools. Do you have any prompts you'd be willing to share?
Linkpharm2@reddit
Try qwen