Small and fast function calling models

[-]

Omnic19@reddit

wait let me get this straight. your graphics card is amd 7900xt and the model you're running is tiny llama?

Reply

[-]

no, 7900x cpu. my gpu is a 6800 xt but the library wont compile with rocm. tinyllama should still be faster than it is though, testing shows it is 10x slower on llama-cpp-python than it is on llama.cpp

Reply

[-]

Omnic19@reddit

which library won't compile?

Reply

[-]

Jumper775-2@reddit (OP)

Llama-cpp-python with rocm enabled

Reply

[-]

Omnic19@reddit

which library won't compile? llama.cpp python?

Reply

[-]

Jumper775-2@reddit (OP)

https://github.com/abetlen/llama-cpp-python

Reply

[-]

Omnic19@reddit

maybe its a fedora issue. have you tried with any other linux distro? some distros have better gpu support than others. have you tried pop os?

Reply

[-]

Jumper775-2@reddit (OP)

I don’t want to reinstall my whole OS, I’ve tried distroboxes to no avail so I don’t imagine it would work.

Reply

[-]

Omnic19@reddit

man it would be a waste to let go of gpu acceleration since you have a really great gpu on hand. let's see if we can do something about this

Reply

[-]

Jumper775-2@reddit (OP)

That’s my thought in making this post. I don’t think the llama-cpp-python route is gonna work for me, vllm could but it doesn’t have native function calling yet. It should be possible to implement myself it if I can find documentation on the chatml-function-calling format. This honestly might be the way to go because cpu support seems to be too slow even on a top of the line processor, so supporting it shouldn’t really matter. I can fall back to some if else’s instead.

Reply

[-]

Omnic19@reddit

have you done a proper install of rocm as given here? https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick

Reply

[-]

Jumper775-2@reddit (OP)

no, rocm on fedora is only available from the fedora repos. those onlyy support rhel and ubuntu. thats why i think its a fedora only issue. Ubuntu distrobox with that install method still failed though so who really knows.

Reply

[-]

Omnic19@reddit

these things don't work well with distrobox and stuff since you don't want to reinstall your entire os you could dual boot fedora with Ubuntu. you could reserve a small 100 gb fpr development purposes like these. what's your disk capacity? if you decide to dual boot try pop os instead of Ubuntu. pop os is based on Ubuntu but has much better support for gpu drivers.

Reply

[-]

Jumper775-2@reddit (OP)

To be honest i think im just going to go the api route and require users to use OpenAI compatible. I can compile bare llama.cpp with rocm so it should be fine.

Reply

[-]

Omnic19@reddit

try dual booting with ubuntu/pop os. give that a try . maybe that'll work and would be useful for other projects as well. but you're choice anyways.

Reply

[-]

Jumper775-2@reddit (OP)

Yeah, ime rock has worked fine on everything up to this point so I’m hesitant to do all that just now, especially when api support will allow the use of OpenAI models with much better support. Its definitely something I’ll consider if I hit another roadblock there, though.

Reply

[-]

Omnic19@reddit

seems like you have to run through quite a lot of hoops to get gpus working on linux have a look at [this](https://www.reddit.com/r/archlinux/s/mmgHrqFFr5) and see if it might be of help

Reply

[-]

Jumper775-2@reddit (OP)

I mean it *works* now. Problem is the software that works with both function calling and rocm doesn’t work with my specific setup and I don’t want to reinstall. Something’s gotta budge, and I really hope it’s getting the software that supports both. I found a development llama.cpp branch that has function calling support and I’ve been using it with some success.

Reply

[-]

Omnic19@reddit

don't reinstall. do a dual boot keeping aside a small amount of hdd space

Reply

[-]

Jumper775-2@reddit (OP)

Yeah, that’s an option. It’s somewhat last resort though because I would to ultimately support my current setup in the software I’m making one way or another. Perhaps a docker/podman container could achieve the same thing?

Reply

[-]

Paulonemillionand3@reddit

get a GPU

Reply

[-]

Jumper775-2@reddit (OP)

I have one, but the problem is llama-cpp-python fails to compile for rocm. If there’s an alternative python library that I can do function calling + text generation in that would work with rocm that might also work.

Reply

[-]

Paulonemillionand3@reddit

Accept the fact that CUDA is dominant for a reason. If your project is intended for production use by real end users it'll be running on CUDA anyway.

Reply

[-]

Jumper775-2@reddit (OP)

I already support cuda, but I personally don’t have a cuda gpu to run it on to develop so I need to get rocm working. NVidia GPUs are just too expensive.

Reply

[-]

ramzeez88@reddit

For simple inference a 3060 12gb is enough.

Reply

[-]

Jumper775-2@reddit (OP)

Yeah sure, but my 6800xt is faster and I already have it. I really am not interested in buying a new gpu, sorry.

Reply

[-]

Pedalnomica@reddit

You're running a 1.1B param model on CPU and it isn't fast enough. Your problem isn't going to be solved with a different model. Models at that size just aren't very good. You absolutely need to find a way to run a model on a GPU. Your current GPU or another are both options.

Reply

[-]

Jumper775-2@reddit (OP)

Another is not an option. My focus has shifted to trying to get it to run on my 6800xt. Do you know the best way to set up function calling with vllm? I got that working with my gpu, but it seems to not be supported natively. I can’t find any documentation on chatml function calling, but I think I can scale up to something like mistral 7b 0.3 and implement it myself if I can find documentation. Do you know where I might find anything useful?

Reply

[-]

jackshec@reddit

have a look at some of the older Nvidia GPU’s like the P100

Reply

[-]

Jumper775-2@reddit (OP)

I already have a very expensive and capable AMD gpu. Rocm exists and is supported so I’m not spending 100+ dollars on another gpu just because NVidia can work better in a few cases.

Reply

[-]

phree_radical@reddit

There isn't enough information to undestand how function calling is involved or which step is slow

Reply

[-]

Jumper775-2@reddit (OP)

Sorry, I’ll elaborate. As of right now I need the LLM to call one of a few functions and then I won’t try to get a response from it, and there is one function that when called will return a list of things that meet the criteria the user provided and it will then ask the user to choose one of them. Then call another function and stop. The slow part is the LLM generation for both the function call and text. ~a minute per prompt. This seems slower than it should be, but llama-cpp-python is the fastest library I’ve found so instead I’m looking at downsizing the model.

Reply

[-]

phree_radical@reddit

if the first function call generation takes a while, it sounds like the initial input is large? 1. turn some large `initial_input` into function call 2. results are fetched, user chooses one `choice_text` 3. turn `initial_input` + `choice_text` into another function call Is that right? Do both function calls require arguments, or just classification? Does the second one depend on both inputs? And both inputs are large, causing long processing time?

Reply

[-]

Jumper775-2@reddit (OP)

that flow is correct, but the initial input is not particularly large (except perhaps for the system prompt telling it about what functions it can call?). The first user input it is given is max 20 tokens.

Reply

[-]

phree_radical@reddit

llama-cpp-python seems to enable llama.cpp's prompt caching (kv cache cache) by default. If you used llama.cpp directly you have to enable it explicitly. So I would expect llama-cpp-python to be faster after the first generation, once the prefix is cached including the long system prompt If gpu is available, make sure you're using it (n_gpu_layers=-1 in Llama constructor) Does the both functions require arguments? How many tokens does LLM need to generate? If one doesn't take arguments, you could turn it into a classification that takes only 1 token Once you move on to the second generation, do you still need the previous prompt and everything?

Reply

[-]

Jumper775-2@reddit (OP)

I have gpu enabled, but it fails to compile for rocm on fedora Linux so I have to use cpu. There are 5 functions, only one of which takes arguments, and I probably don’t need the original prompt to run the follow up function. My main concern is that llama.cpp is like 10x as fast as llama-cpp-python is.

Reply

[-]

ramzeez88@reddit

Maybe try open cl or vulkan. It will be faster than cpu

Reply

[-]

Jumper775-2@reddit (OP)

Opencl works and is faster, but only by like 10 seconds, which is not enough. Vulkan compiles but doesn’t work.

Reply

[-]

KaiwenKHB@reddit

.... It sounds like you're using an artillery to hit a mosquito. Have you tried just writing a Python script? Takes like 0.1 seconds to run

Reply

[-]

Jumper775-2@reddit (OP)

Yeah, I have it implemented and it works but I plan on expanding the LLM to do more than just the basics, I just need to cover the basics before I can do that.

Reply

[-]

remyxai@reddit

[This video](https://www.youtube.com/watch?v=Xe51b30PWxE) covers some experiments using [llama2.c](https://github.com/karpathy/llama2.c) to train models with tens or hundreds of millions of parameters for function calling.

Reply

[-]

vasileer@reddit

why not using a hosted one like TogetherAI or DeepInfra? TogetherAI gives $25 in credit which means 100 millions tokens request/response with Mistral-7B ($0.2/million tokens)

Reply

[-]

Jumper775-2@reddit (OP)

I have that implemented as well, I am working on a fully-local setup currently. Plain llama.cpp is fast enough, it’s just not when using llama-cpp-python

Reply

Reply to Post

43 Comments