llama.cpp server have built-in native tools (exec_shell, edit_file, etc.)

Posted by srigi@reddit | LocalLLaMA | View on Reddit | 48 comments

I was messing around with running local models recently, and while digging through the llama.cpp server docs, I noticed this experimental flag just sitting right there:

--tools TOOL1,TOOL2,...

It natively supports read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff, and get_datetime. That is a battery of tools that basically turns llama-server into a mini agent harness. You really don't need anything more than your trusty .gguf file and the llama.cpp binary for basic AI assistance in your projects.

Note that file operations are relative to folder from which you started the server. There also isn't any security sandboxing yet, like a whitelist of allowed commands or strict denial of file operations outside the original folder. So, be very cautious with what you expose!

But still, I'm pretty amazed that llama.cpp is gaining these abilities natively. It completely eliminates the need to rig up MCPs or heavy wrappers just for things like getting the current date/time or reading the contents of a file.

[-]

AsliReddington@reddit

I want a very minimal coding or tool calling harness which is just a single python or bash file. Don't even want to both with these extensive code bases for no reason.

[-]

kivaougu@reddit

This is pretty simple to do by yourself as a tool call schemas are very intuitive to parse. Then you just add nudging for smaller models and some instructions. Core tools like read and write are pretty easy with python.

[-]

MoneyPowerNexis@reddit

Yep I have a one page tool use example that just uses json and rests.

I prefer to put tools in their own modules and load them from a directory. Once I decided on a format for a tool (I use a class with a spec function and a run function) its easy to show that to an llm and have it crank out another tool.

I dont use my harness for long running unsupervised jobs though its still just a chatbot. I still need a way to deal with context limits if I am doing long runs, for now I just start a new session if I get near my limit asking to make a plan for the next session if needed.

[-]

CheatCodesOfLife@reddit

Thanks, that simple script is cool.

[-]

MoneyPowerNexis@reddit

Not exactly one page but here is what I would consider a minimal harness with search, fileio, and url reading:

https://pastebin.com/pPLjbqqa

Optional command line arguments:

--llama_server LLM server URL. Defaults to http://localhost:8080.

--prompt_file File containing the initial user prompt.

--system_prompt File containing the system prompt. Replaces default.

--working_dir Base directory for file operations. Defaults to current directory.

--exit Terminate program after processing --prompt_file.

Hopefully the chat loop and whats going on with tools is still simple enough to still understand fully. This is obviously all vibe coded with minimal testing so take that for what its worth.

[-]

CheatCodesOfLife@reddit

Yep, still simple enough. It doesn't have the over-engineered look to it that other vibe coded scripts do. It works well with Gemma-4, thanks!

[-]

MoneyPowerNexis@reddit

The version I actually use is somewhat over engendered: https://imgur.com/a/28Fpx9l

Its so easy to just keep adding features. For me it basically has to be behind a web interface and that leans towards a more event driven loop but I get that that isnt a great starting point for someone who wants full understanding of the code to build their own harness.

[-]

AsliReddington@reddit

Thx will try this

[-]

MoneyPowerNexis@reddit

It's about as simple as I could make it with a chat loop that switches to responding to tool results but I would immediately change how the tool parameters are handled. I would not get a tool specific parameter in the chat loop but instead I would pass the entire parameter object to the tool function as well as other things a tool might need like a cancel event object, limits to the amount of data the tool should return etc:

import os
import asyncio
import urllib.request
import urllib.error
import socket
import ssl
import re

class URLFetch:
    def __init__(self, *args, **kwargs):
        pass

    def spec(self):
        return {
            "type": "function",
            "function": {
                "name": "urlfetch",
                "description": "Fetch data from a URL. Optionally save to a file.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "url": {
                            "type": "string",
                            "description": "The URL to fetch data from."
                        },
                        "save_to_file": {
                            "type": "string",
                            "description": "Optional path to save the fetched data. If not provided, data is returned."
                        },
                        "timeout": {
                            "type": "number",
                            "description": "Timeout in seconds for the request. Defaults to 60."
                        }
                    },
                    "required": ["url"]
                },
            },
        }

    def _is_within_directory(self, directory, target_path):
        abs_directory = os.path.abspath(directory)
        abs_target = os.path.abspath(target_path)
        return abs_target.startswith(abs_directory + os.sep) or abs_target == abs_directory

    async def run(self, *args, **kwargs):
        arguments = kwargs.get('arguments', args[0] if args else {})
        stop_event = kwargs.get('stop_event', asyncio.Event())
        byte_limit = kwargs.get('byte_limit', None)
        working_directory = kwargs.get('working_directory', None)

        url = arguments.get('url')
        save_to_file = arguments.get('save_to_file')
        timeout = arguments.get('timeout', 60)

        if not url:
            return "Error: URL is required."

        original_timeout = socket.getdefaulttimeout()
        socket.setdefaulttimeout(timeout)

        try:
            loop = asyncio.get_event_loop()

            def fetch_data():
                try:
                    parsed_url = url
                    extra_headers = {}

                    if 'imgur.com' in url:
                        match = re.search(r'(?:i\.)?imgur\.com/([a-zA-Z0-9]+)', url)
                        if match:
                            image_hash = match.group(1).split('.')[0]
                            parsed_url = f"https://i.imgur.com/{image_hash}.png"
                            extra_headers = {
                                "Referer": "https://imgur.com/",
                                "Origin": "https://imgur.com"
                            }

                    headers = {
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
                        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
                        "Accept-Language": "en-US,en;q=0.9",
                        "Accept-Encoding": "gzip, deflate, br",
                        "Connection": "keep-alive",
                        "Upgrade-Insecure-Requests": "1",
                        "Sec-Fetch-Dest": "document",
                        "Sec-Fetch-Mode": "navigate",
                        "Sec-Fetch-Site": "none",
                        "Sec-Fetch-User": "?1",
                        "Cache-Control": "max-age=0",
                    }
                    headers.update(extra_headers)

                    req = urllib.request.Request(parsed_url, headers=headers)

                    ssl_context = ssl.create_default_context()
                    ssl_context.check_hostname = True
                    ssl_context.verify_mode = ssl.CERT_REQUIRED

                    with urllib.request.urlopen(req, context=ssl_context, timeout=timeout) as response:
                        chunk_size = 8192
                        data_chunks = []
                        while True:
                            if stop_event.is_set():
                                raise asyncio.CancelledError("Request cancelled by user.")
                            chunk = response.read(chunk_size)
                            if not chunk:
                                break
                            data_chunks.append(chunk)
                        return b''.join(data_chunks)
                except urllib.error.URLError as e:
                    return f"Error fetching URL: {e.reason}"
                except Exception as e:
                    return f"Error fetching URL: {str(e)}"

            try:
                data_or_error = await loop.run_in_executor(None, fetch_data)
            except asyncio.CancelledError:
                return "Request cancelled by user."
            except Exception as e:
                return f"Error during fetch: {str(e)}"

            if isinstance(data_or_error, str) and data_or_error.startswith("Error"):
                return data_or_error

            data = data_or_error

        except Exception as e:
            return f"Error fetching URL: {str(e)}"
        finally:
            socket.setdefaulttimeout(original_timeout)

        if save_to_file:
            if working_directory:
                save_to_file = os.path.join(working_directory, save_to_file)
                if not self._is_within_directory(working_directory, save_to_file):
                    return f"Error: Save path '{save_to_file}' is outside the working directory."
            try:
                with open(save_to_file, 'wb') as f:
                    f.write(data)
                return f"Successfully saved data to {save_to_file} ({len(data)} bytes)"
            except Exception as e:
                return f"Error saving to file: {str(e)}"
        else:
            if byte_limit is not None and len(data) > byte_limit:
                return f"Error: Data size ({len(data)} bytes) exceeds byte limit ({byte_limit} bytes)."
            try:
                return data.decode('utf-8', errors='replace')
            except Exception as e:
                return f"Error decoding data: {str(e)}"

As you add more complexity the surface area for bugs and for more features also expands and sometimes that might mean overhauling the entire thing. I have started to look at other harnesses and notice a lot of them build the tools themselves as state machines so for example a file io tool might have an option to set the path and then the next time the tool is called it is still in that path where as with my tool it has to specify the full path each time. With file io thats fine but models are getting good enough that if you give them a python sandbox they will try to run scripts and then interact with the running script they made so having some way to store the id of running scripts and to get new console output enter new input means the bot can test things like building webservers and testing their APIs. Qwen 27b is good enough for all that but it also means properly sandboxing everything becomes a priority.

[-]

ForestHubAI@reddit

The unsandboxed concern is real. Once you're shipping to actual devices in the field, "just trust the model with shell" stops being romantic — we wrap tool dispatch in a whitelist + per-tool capability check, so a broken model call can't reach anything it doesn't already have. Same idea, just paranoid about which tool reaches which subsystem.

Building agent lifecycle on edge at — happy to compare notes if you're going production with this.

[-]

Parzival_3110@reddit

Native tools are the right direction, but I think the missing primitive is receipts.

For files and shell, the question is what was allowed and what changed. For browser tools it gets worse because the agent has login state, tabs, forms, and possible duplicate submits.

I am building FSB for that browser side: https://github.com/LakshmanTurlapati/FSB

The lesson so far is that tool calling needs scoped ownership, approvals, and verification after the action. A model saying it clicked something is not enough. The harness has to prove which tab it touched and what changed.

[-]

Enough-Astronaut9278@reddit

Most agent tasks boil down to running commands and editing files anyway.

[-]

Badger-Purple@reddit

very cool discovery thanks for pointing it out. I always wondered why simple tools like a web scraper and a shell command could not be part of the runtime itself

[-]

srigi@reddit (OP)

I'm hoping that they add an option to add own native tool(s) - really the only thing missing is web_fetch and web_search.

If they don't provide these (I guess they don't - it is soo much outside of the scope of llama.cpp), there should be an easy way to add own implementation.

[-]

Parzival_3110@reddit

That split matters a lot. web_fetch is great for docs and static pages, but it falls apart once the task needs a logged in browser, redirects, cookies, forms, downloads, or evidence after a submit.

I am building FSB for that browser tool layer: https://github.com/LakshmanTurlapati/FSB

The part I would want llama.cpp to standardize is receipts. Not just call a tool, but return what tab changed, what action completed, and enough state for the model to avoid doing the same thing twice.

[-]

10F1@reddit

You can add mcp servers in the web UI.

[-]

yes_its_that_bad@reddit

Yes this seems to be a reasonable guide: https://old.reddit.com/r/LocalLLaMA/comments/1rnyz75/how_i_got_mcp_working_in_the_llamaserver_web_ui_a/

[-]

Karyo_Ten@reddit

A web scraper is not a simple tool.

Once in a blue moon you need the "About us" link for debugging but more often than not you need just the "main content" to avoid polluting the LLM context.

[-]

srigi@reddit (OP)

That's why I'm begging for "add/define your own tool" functionality. There are a couple of good web_fetch projects out there on GitHub (search "language:Rust web_fetch" on their page), so there is no need to reinvent the wheel.

[-]

annodomini@reddit

But there is "add/define your own tool" functionality. It's called MCP servers. There's a link for it right in the sidebar of the UI.

This new functionality is simply more convenient because it's built in so you don't have to manage separate MCP server processes, but the ability to add your own tools has existed for a while.

[-]

gladfelter@reddit

Mcp servers are inferior to skills that reference cli binaries that are optimized for the task. Progressive discovery is better than dumping giant API definitions into the context, and models are great at using bash to filter and search the output of cli binaries, whereas the entire contents of the mcp call are dumped into context.

[-]

BlobbyMcBlobber@reddit

Web scraping is not a simple tool. Web search is not simple either.

Shell command is very simple to implement but also very dangerous. It's good that it's not an option by default.

[-]

Far-Low-4705@reddit

these tools are not sandboxed, so be very careful with them, they run directly on your computer.

[-]

redditpad@reddit

what's the current standard approach for this? I have tested it with OpenCode and it works

[-]

yeah-ok@reddit

Another seldomly talked about feature flag is (after the hf integration) the "--offline" param, worth it for us taking-local-seriously ppl!

[-]

johnnaliu@reddit

cases that bit me weren't "rm -rf", it was the agent "cleaning up" the working dir after finishing the task. what are people using to bound what these tools can touch? if not in a sandbox

[-]

CatTwoYes@reddit

Love that this is landing in the binary, but the security gap is real — exec_shell_command with no sandboxing is one prompt injection away from disaster. For read-only operations (read_file, file_glob_search) this is genuinely useful and covers 80% of what people reach for MCP to do. Hope they add a basic command whitelist before this ships as non-experimental.

[-]

postitnote@reddit

When did you decide to start being a full AI bot?

[-]

llama-impersonator@reddit

after may 14, 2026, his account has an em-dash in nearly every post. before this, looks human generated.

[-]

CheatCodesOfLife@reddit

Can you tell this one https://old.reddit.com/r/LocalLLaMA/comments/1tluma3/llamacpp_server_have_builtin_native_tools_exec/onkq875/ ?

(More obvious if you click the profile)

Seems like a lot of them write like this now. Once you notice it, you see them everywhere.

[-]

llama-impersonator@reddit

yeah these ones annoy me as it's like a bad imitation of my own style.

[-]

Foreign_Risk_2031@reddit

—

[-]

lioffproxy1233@reddit

You can just specify what tools in a list instead of --all

[-]

Agreeable_System_785@reddit

So this would only work if your inference machine is also your dev machine? If you have a separate inference server, you still need to mount or something to make it work.

[-]

annodomini@reddit

Yeah. If you want to run tools on a different machine, you need to use an MCP server. This just gives you a convenient way to run a few built-in tools on the same machine as inference.

[-]

srigi@reddit (OP)

Unfortunately, yes. I use a MacBook for developing and a gaming rig for running local LLMs. Invoking llama-server on a gaming machine makes it "see" files there, and not on my MacBook where I access the llama-server's web UI.

This is really annoying, but maybe it is actually good, since my gaming machine is more disposable (eventual damage done there will hurt less) - so I can just fork projects there and keep it working outside of my notebook.

[-]

Napster3301@reddit

the approval ux is the surface issue. auto-approve requires trust in the tool calls and llama.cpp doesnt give you that yet. embedded chat templates on most ggufs still emit bracket variants ([function=X], function=NAME) instead of clean openai tool_calls arrays, so your auto-approver random-parses garbage. fix is override with --chat-template-file pointing at the upstream fixed template (unsloth has them on hf).

the other half is the model itself. a censored model running exec_shell decides your rm temp.tmp "looks dangerous" at step 47 of your loop and aborts the task. abliterated/uncensored weights remove that failure mode but most public llama.cpp tutorials skip that part.

tool list is great. infrastructure for auto-running an agent is still diy.

[-]

AdmirablePresence216@reddit

the exec_shell one being native is kinda wild to think about for client deployments, like the security implications alone probably need a lot of thought before you'd hand that to an unvetted model, even locally. are you running this behind any kind of permission layer or just letting it go free range?

[-]

srigi@reddit (OP)

Only reasonable thing to do for now is to manually inspect every `exec_shell` command, never ever to configure it to auto-approve. Also the “approving UI” hasn’t the optimal UX right now (see image). You must manually expand the tool call block to see the command which is requested. In most modern harnesses this is auto rendered for you.

[-]

VoiceApprehensive893@reddit

some clueless user is gonna get that unsandboxed rm -rf

[-]

Ok-Measurement-1575@reddit

I also noticed this the other day but couldn't get any of it to work.

[-]

srigi@reddit (OP)

It's really no rocket science, just start your (updated) llama-server with a list of tools you want in a folder where you want to operate:

llama-server --api-key secret --metrics --threads "$(sysctl -n hw.ncpu)" \
  --models-max 1 --models-preset "$HOME/.config/llms.ini" \
  --tools file_glob_search,get_datetime,grep_search,read_file

Then you'll see the configuration in the Settings panel.

[-]

mantafloppy@reddit

No.

https://github.com/ggml-org/llama.cpp/discussions/22132#discussioncomment-16865312

[-]

CommonPurpose1969@reddit

Built-in tools are definitely a thing. Just tested it.

[-]

annodomini@reddit

That answer is just wrong. It absolutely does enable built-in tools.

I saw this flag a few days ago in the docs, but was a bit skeptical of it as it doesn't do any sandboxing. However, I just realized that I already run my llama.cpp in an ephemeral container with just a few directories mounted for models and settings, so it already has at least some sandboxing so I figure YOLO, and tried it out. Yep, it works.

[-]