How to get realtime logging of LLM activity?

Posted by dtdisapointingresult@reddit | LocalLLaMA | 17 comments

(yes this is a long post and I used some markdown formatting, like I always did to organize my comments long before the invention of LLMs. [For example in 2021](https://reddit.com/r/selfhosted/comments/rcvih1/you_should_know_about_using_zerotier_or_tailscale/). I'm gonna block any tidepod-eating zoomer who calls me a bot, like what happened on my last long post)

I'm a local LLM user, and I want to achieve 2 things:

1. **Realtime** monitoring of LLM's activity: I need a real-time view, whether by web UI or bash command, where I see the token stream. What I really want here is to spy on what the LLM is doing when I'm using some random AI app. Many apps hide what's going on because it's more "user friendly." That's usually fine for fast & intelligent cloud models, not so much when you're using a small, almost-regarded local model like Qwen that is completely off-base half the time. I don't want to wait 5 minutes for a bad response to finish, I'm a micro-manager and want to interrupt early when I see the LLM is on the wrong path.
2. History of all prompts + responses: if I have the realtime monitoring, might as well log the data so I can analyze it for educational purposes later on.

Various options I thought about:

1. Using an established logging engine like LiteLLM. I haven't been able to find one with realtime monitoring. They wait for the response to finish before it's saved + observable.
2. Vibecode my own realtime monitoring/logging proxy to put in front of vllm/llama-server, + web dashboard
3. Use someone else's version of #2. These are impossible to find. I know they exist because I've seen a couple linked in random comments over the years, but they're not showing up in search.

I'd appreciate any advice here.

GaryDUnicorn@reddit

\*Whistles innocently\* [https://imgur.com/a/QJKsciV](https://imgur.com/a/QJKsciV)

dtdisapointingresult@reddit (OP)

Oh, is this something you'd like to share with the rest of us? I'm about a third of the way through vibecoding my own, but I'm not married to it, I only started cause I got no recommendations here. The main issue is waiting on Claude limits to reset, I think it's going to take me a week at this rate (I'm trying to do this properly, with tests, QoL features, and when I finish the backend, add a web UI).

wombweed@reddit

i dont track token stream specifically, but i do have observability set up for latency and usage tracking. litellm emits prometheus and otel metrics. ingest them in prometheus and use something like grafana for display. this is the industry standard/production-ready way afaik. i have tried various other/newer/more specialized solutions and have always come away disappointed.

DeProgrammer99@reddit

Just for the record: llama-server also has Prometheus-compatible metrics available via an API. [https://github.com/willjoha/llama.cpp/blob/master/tools/server/README.md#get-metrics-prometheus-compatible-metrics-exporter](https://github.com/willjoha/llama.cpp/blob/master/tools/server/README.md#get-metrics-prometheus-compatible-metrics-exporter)
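
For anyone who wants to eyeball those counters without standing up Prometheus and Grafana, here is a minimal sketch that polls the endpoint and parses the Prometheus text format with the standard library only. It assumes llama-server was started with `--metrics` and is listening on `http://127.0.0.1:8080`; the helper names `parse_prometheus_text` and `fetch_metrics` are made up for this example. Note that this gives aggregate counters only, not the token stream the OP is asking about.

```python
import urllib.request


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {sample_name: value}."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{label="..."} value [timestamp]
            head, _, tail = line.rpartition("}")
            name = head + "}"
        else:
            # name value [timestamp]
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if not fields:
            continue
        try:
            metrics[name] = float(fields[0])
        except ValueError:
            continue  # ignore samples we cannot parse
    return metrics


def fetch_metrics(base_url: str = "http://127.0.0.1:8080") -> dict[str, float]:
    """Fetch and parse /metrics. The base URL is an assumption; adjust to your setup."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` in a loop with a short sleep and printing the differences between polls would give a rough live view of throughput.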

dtdisapointingresult@reddit (OP)

Claude Code hasn't shown you the output stream for a few months now. Qwen Code shows live reasoning but will truncate, also you can't see what subagents are doing. Crush is the same as Qwen Code. They're OK-ish, better than Claude Code, but still not as nice as I want. It makes sense, no application designer wants their UI to be a mix of UI elements + raw outputs, because all agentic apps are designed for cloud models. Only slow local users like me would care so much about realtime viewing. (I'm sure Pi can be customized to do this if I spend 2 weeks on it, but: 1. I don't have the time to effectively learn Pi, and 2. A general solution is more generally useful to me while allowing me to stay on superior alternatives to Pi.)

wombweed@reddit

fwiw i switched from claude and crush to opencode and pi, it was very painless. takes like an afternoon if yr already familiar with other harnesses. i just had my agent read and apply this: [https://github.com/can1357/oh-my-pi](https://github.com/can1357/oh-my-pi) and it does show token stream out of the box...

kaliku@reddit

> Vibecode my own realtime monitoring/logging proxy to put in front of vllm/llama-server, + web dashboard

this plus lots of care to not break the sse stream. responses are streamed. you'll probably introduce some latency. this could be an interesting weekend project.

ive set up a reverse proxy with nginx to log all requests/responses to elastic and it's finicky sometimes. at the same time it's a weird setup with some lua script... maybe not the best. For example hermes via telegram doesn't work well when it goes through that.

also an idea would be to go straight to the source and tee it from there somehow. but I think the proxy one is better in some regards because you can hit the stop button to interrupt the connection and stop the token generation.

I think mitmproxy can do something like this, too (without the interrupt part)

also fyi nobody's impressed with your account's age or cares about blocks. chill.
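
The "don't break the SSE stream" part can be sketched as a pass-through generator: every chunk is forwarded byte-for-byte, and the logger only ever sees a copy. This is a minimal illustration, not a full proxy. `tee_stream` and `observe` are hypothetical names, and `upstream` stands in for whatever body iterator your HTTP client gives you for the llama-server/vLLM response.

```python
from typing import Callable, Iterable, Iterator


def tee_stream(
    upstream: Iterable[bytes],
    observe: Callable[[bytes], None],
) -> Iterator[bytes]:
    """Yield every upstream chunk unchanged while handing a copy to `observe`."""
    try:
        for chunk in upstream:
            try:
                observe(chunk)
            except Exception:
                # A broken logger must never break the client's stream.
                pass
            yield chunk
    finally:
        # Runs when the stream ends or when the consumer closes this
        # generator (for example after the client disconnects).
        close = getattr(upstream, "close", None)
        if callable(close):
            close()
```

Closing the generator early also closes the upstream object if it has a `close()` method, which is the hook for the "stop button" idea. Whether the backend actually stops generating when the connection drops depends on the server, so treat that as something to verify.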

dtdisapointingresult@reddit (OP)

I looked into your solution, since I already use nginx on my VPS...Claude seems a big fan of using OpenResty (nginx + lua) to achieve this. I'll consider this, thanks for the advice.

Liquos@reddit

I’m using llama-swap which is just a wrapper to manage multiple llama-server instances and it does exactly this. The output is in json so I added an additional view that prettifies the text and shows it as a chat dialog.

dtdisapointingresult@reddit (OP)

Never tried llama-swap. Is it logging DURING the request (realtime), or AFTER the request finishes? If my LLM is generating an essay, I need to see the response stream as it's being generated, not when the essay is finished.

qubridInc@reddit

You’re probably looking at this the wrong way if you’re trying to find a “logging platform” first. What you actually want is visibility into the streaming layer itself. Most local stacks already expose token streaming over SSE/websocket/OpenAI-compatible chunked responses. The issue is the UI apps consume the stream internally and only render the final “friendly” output.

So the cleanest solution is usually:

* put a thin proxy in front of vLLM / llama-server / Ollama
* intercept the streamed chunks in realtime
* tee them to:
    * terminal (`tail -f` style)
    * websocket dashboard
    * structured logs/db

Then forward the stream unchanged to the client app.

That gives you:

* realtime token visibility
* interruption when the model starts hallucinating
* full prompt/response history
* timings / TPS / latency
* multi-app observability

Honestly this is like 200-300 lines in FastAPI/Node these days, not some giant infra project.

A couple implementation details that matter:

* Don’t wait for completion events. Log chunks as they arrive.
* Store both:
    * raw streamed deltas
    * reconstructed final response
* Capture system prompts too. Half the weirdness comes from hidden prompts in apps.
* Add request IDs because concurrent generations become unreadable fast.

If you want something quick-and-dirty *today*:

* `mitmproxy` works surprisingly well for spying on OpenAI-compatible traffic
* `litellm --detailed_debug` helps a bit, but it’s not really built for realtime introspection
* Open WebUI has decent history visibility but not true low-level token observability

Also: if you’re using local Qwen variants specifically, realtime monitoring is genuinely useful because you can usually tell within ~20 tokens whether the model has “locked onto” the wrong trajectory. Huge time saver on slower rigs.

Feels like there’s still a gap in tooling here tbh. Most observability stacks are optimized for API billing analytics, not “I want to watch my slightly-unhinged local model think in realtime before it wastes 5 minutes of my GPU time.”
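
The "store both raw deltas and the reconstructed response, keyed by request ID" detail could look roughly like the sketch below. It assumes the backend emits OpenAI-compatible chat completion chunks as SSE `data:` lines with text under `choices[].delta.content`; the `StreamRecord` class is hypothetical, and servers that put reasoning tokens in a separate field would need an extra lookup.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class StreamRecord:
    """Accumulates one streamed response as it arrives."""

    request_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    raw_deltas: list[str] = field(default_factory=list)
    done: bool = False
    _buffer: bytes = field(default=b"", repr=False)

    def feed(self, chunk: bytes) -> str:
        """Consume raw bytes from the stream; return any newly decoded text."""
        self._buffer += chunk
        new_text: list[str] = []
        # Network chunks can split an SSE line anywhere, so only complete
        # lines are parsed and the remainder stays in the buffer.
        while b"\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                self.done = True
                continue
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue
            if not isinstance(event, dict):
                continue
            for choice in event.get("choices", []):
                text = (choice.get("delta") or {}).get("content") or ""
                if text:
                    self.raw_deltas.append(text)
                    new_text.append(text)
        return "".join(new_text)

    @property
    def text(self) -> str:
        """Reconstructed response so far."""
        return "".join(self.raw_deltas)
```

A proxy would create one `StreamRecord` per incoming request, call `feed()` on every chunk it forwards, and print or log the returned text prefixed with `request_id` so concurrent generations stay readable.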

dtdisapointingresult@reddit (OP)

Thanks Claude.

JaapieTech@reddit

I have full logging enabled on LMStudio, which does exactly this - shows you every prompt and tokens in+out. What tooling are you using? That may make it easier to recommend a fix.

funding__secured@reddit

litellm

Juulk9087@reddit

Set up a watcher script that watches the vLLM instance. I had Opus do that for me and it just parses the data into an easy-to-read format. From there you could easily have it show up in a web view, if you wanted, that updates in real time as the data flows into the vLLM instance. Just tell Opus to do exactly what I said and you'll have it done in 5 minutes.

Shot-Ad8790@reddit

Setting up a reverse proxy that streams tokens as they're generated might work. You could use websockets to send live output to a simple dashboard or even your terminal with a custom script.
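
The fan-out half of that idea (one token stream, several live viewers) can be sketched with plain `asyncio` queues. The websocket or terminal transport is deliberately left out; `Broadcaster` is a hypothetical helper, and each dashboard connection would own one queue and forward whatever it receives.

```python
import asyncio


class Broadcaster:
    """Fan out streamed text to any number of live viewers."""

    def __init__(self, max_queue: int = 1000) -> None:
        self._subscribers: set[asyncio.Queue] = set()
        self._max_queue = max_queue

    def subscribe(self) -> asyncio.Queue:
        """Register a viewer; call from inside the running event loop."""
        queue: asyncio.Queue = asyncio.Queue(maxsize=self._max_queue)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def publish(self, item: str) -> int:
        """Send `item` to every viewer; return how many received it."""
        delivered = 0
        for queue in list(self._subscribers):
            try:
                queue.put_nowait(item)
                delivered += 1
            except asyncio.QueueFull:
                # Drop for a slow viewer rather than stall the proxy.
                pass
        return delivered
```

Using `put_nowait` and dropping on a full queue is a design choice for this sketch: a stalled dashboard loses some tokens, but the client app's stream is never held up by it.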

HVACcontrolsGuru@reddit

Are you wanting the actual LLM output stream with the context, tool calls etc? More on the harness side, OTel is what I’ve been using, as well as some other self-coded projects for runtime tracking and provenance. There's a YouTube channel I’ll have to find again: I watched it last night, and he was doing a lot of under-the-hood work. Found the channel from a post in this subreddit yesterday, I think.