TextGen (formerly text-generation-webui) is now a native desktop app. An open-source alternative to LM Studio.
Posted by oobabooga4@reddit | LocalLLaMA | View on Reddit | 76 comments
Hi all,
I have been making a lot of updates to my project, and I wanted to share them here.
TextGen (previously text-generation-webui, also known as my username oobabooga or ooba) has been in development since December 2022, before LLaMa and llama.cpp existed.
In the last two months, the project has evolved from a web UI to a no-install desktop app for Windows, Linux, and macOS with a polished UI. I have created a very minimal and elegant Electron integration for that. (Did you know LM Studio is also a web UI running over Electron? Not sure many people know that.)

It works like this:
- You download a portable build from the releases page
- Unzip it
- Double-click textgen
- A window appears
There is no installation, and no files are ever created outside the extracted folder. It's fully self-contained. All your chat histories and settings are stored in a user_data folder shipped with the build.
There are builds for CUDA, Vulkan, CPU-only, Mac (Apple Silicon and Intel), and ROCm.
Some differentiating features:
- Full privacy. Unlike LM Studio, it doesn't phone home on every launch with your OS, CPU architecture, app version, and inference backend choices. Zero outbound requests.
- ik_llama.cpp builds (LM Studio and Ollama only ship vanilla llama.cpp). ik_llama.cpp has new quant types like IQ4_KS and IQ5_KS with SOTA quantization accuracy.
- Built-in web search via the `ddgs` Python library, either through tool-calling with the built-in `web_search` tool (works flawlessly with Qwen 3.6 and Gemma 4), or through an "Activate web search" checkbox that fetches search results as text attachments.
- Tool-calling support through 3 options: single-file .py tools (very easy to create your own custom functions), HTTP MCP servers, and stdio MCP servers. You can enable confirmations so that each tool call shows up with approve/reject buttons before it executes. I have written a guide here.
- The ability to create custom characters for casual chats, in addition to regular instruction-following conversations:

- OpenAI and Anthropic compliant API with very strict spec compliance. It works with Claude Code: you can load a model and run `ANTHROPIC_BASE_URL=http://127.0.0.1:5000 claude` and it will work (see the short example after this list).
- Accurate PDF text extraction using the `PyMuPDF` Python library, and `trafilatura` for web page fetching, which strips navigation and boilerplate from pages, saving a lot of tokens on agentic tool loops.
- Chat templates are rendered through Python's Jinja2 library, which works for templates where llama.cpp's C++ reimplementation of Jinja sometimes crashes.
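For programmatic use, here is a minimal sketch with the official `openai` Python client, assuming the default port 5000 and the standard `/v1` routes (the model name below is just a placeholder for whatever model is loaded):

```python
# Minimal sketch: talk to the local OpenAI-compatible API.
# Assumptions: server on port 5000, standard /v1 routes exposed.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder name; the locally loaded model responds
    messages=[{"role": "user", "content": "Summarize what a GGUF file is in one sentence."}],
)
print(response.choices[0].message.content)
```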
I write this as a passion project/hobby. It's free and open source (AGPLv3) as always.
Succubus-Empress@reddit
Are you really that oobabooga?
oobabooga4@reddit (OP)
The one and only lol
kulchacop@reddit
I don't want to believe that. The frog is missing from the pfp.
iamapizza@reddit
I remember this name being mentioned quite a bit in some discussion threads... but in relation to Stable Diffusion in the early days, and I can't remember why.
No_Afternoon_4260@reddit
Because he did for LLMs what automatic1111 did for Stable Diffusion?
Inevitable-Start-653@reddit
Yeass! Thank you frog person <3
cafedude@reddit
Can you just point it to where your LMStudio models are stored?
Silver-Champion-4846@reddit
Did you ever consider compliance with the WCAG for screenreader accessibility?
SolemnFuture@reddit
I tried it a week ago but I couldn't find a system prompt. I couldn't get my character(s) to work either, the model was just base and didn't use my character descriptions. Also no group chat with multiple characters at once feature. Spent like 2 hours looking for solutions but failed. I get this is a new project, but I need an accessible system prompt function.
Succubus-Empress@reddit
In textgen, how do I install the latest llama.cpp from their repo?
oobabooga4@reddit (OP)
You can replace the contents of `app/portable_env/Lib/site-packages/llama_cpp_binaries/bin/` with your own llama.cpp. The binaries shipped with the portable builds are compiled on https://github.com/oobabooga/llama-cpp-binaries and are very aligned with the upstream workflows.
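Roughly, the swap comes down to something like this (a hedged sketch; the source path is an example, the destination is the folder mentioned above):

```python
# Hedged sketch: back up the bundled llama.cpp binaries and drop in your own build.
# Paths are illustrative; adjust them to your portable build and compile output.
import shutil
from pathlib import Path

bundled = Path("app/portable_env/Lib/site-packages/llama_cpp_binaries/bin")
my_build = Path("/home/you/llama.cpp/build/bin")  # wherever your own binaries live

shutil.move(str(bundled), str(bundled.with_name("bin.bak")))  # keep the originals around
shutil.copytree(my_build, bundled)                            # install your build
```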
mintybadgerme@reddit
Does it cope with MTP models out of the box then?
oobabooga4@reddit (OP)
If you compile the MTP PR branch on llama.cpp and replace the files it should work, yes.
doc-acula@reddit
Very cool!
norcom@reddit
Caught my eye with the "alternative to LM Studio", unfortunately not what I was looking for.
I've been wanting a native, simple macOS app GUI that would allow me to either select a local inference engine executable I want to run, set a path, options how to execute and run it with one click. Or to add a remote API. I like the simplicity of llama-server but I don't like using a browser UI and it doesn't work with other engines.
Example of what I wanted ie: I clone the latest llama.cpp, mlx-lm, mlx-vlm, vllm or whatever fork, compile it and setup the GUI to run it. The models stay where I want, and I just have the option to click-run engine/model, instead of what's built into the other apps.
So I vibed something sloppy to let me do just that. And for the most part, it works. Multiple engines, multiple windows, multiple chats. But at some point I went too vibertastic with it, and the thing sidetracked into having too many cooks in the kitchen. lol (need to simplify and standardize some options) It wasn't supposed to go past mlx-lm and remote API but with newer models, stuff had to be added.
Here's a screenshot if the above didn't make sense. I've been too busy and lazy to fix it up.
If anyone knows of something similar for a macOS native app project, please tell. I guess I just need one with the API interface really.
Caelarch@reddit
If I am running (and enjoying, thank you!!) the webui, is there any real advantage to using it as an app?
oobabooga4@reddit (OP)
Just the feeling of having something self-contained that you control (it doesn't even require a browser). If you keep the zip, it will work even 10 years from now.
blastcat4@reddit
This is neat! I've been wanting to have an easy-to-set up portable inference engine that I can use on my friend's PC. I've set it up on a flash drive with Gemma 4 e4b and it works!
The only hitch so far is that I can't get multimodal working. I've put the associated mmproj for Gemma 4 in the /user_data/mmproj folder and I can see and select it in the multimodal section in the Model settings. However, when I attach a file, like an image, the system seems to hang.
----Val----@reddit
I won't lie, I absolutely despised ooba's old web UI and dropped it years ago.
This however is an unexpected surprise, will be checking it out!
Ok_Procedure_5414@reddit
Amazing, hell to the yeayuh. Oobabooga did you ever look into Tauri to drive what Electron currently does in your codebase?
waywardspooky@reddit
we're so back!
Blackmarou@reddit
The only thing pushing me to lm studio is their new beta feature lm link, so I could use my machine locally from another one… does this have any similar feature, or an alternative?
oobabooga4@reddit (OP)
Yes, if you use the `--listen` flag, you can access it from another computer on the local network. I do it all the time. For instance, if you also want a password: `--listen --gradio-auth youruser:yourpassword`
Blackmarou@reddit
I'll try it later, but just to make sure: it's not just opening a port to send requests to, it's really using one LM Studio instance to connect to another running LM Studio instance so you can manipulate it (and monitor it) as if it were on the same host. Makes it easier to kind of manage deployed models and all.
oobabooga4@reddit (OP)
Ah I see, that's something I want to implement but haven't gotten around to yet.
NineThreeTilNow@reddit
Very nice work dude.
The one thing I still can't get Gemma 4 31b to do properly in LM Studio chat is use its thinking mode. It's infuriating. I tried every tip I found across reddit or whatever. Nothing. The correct tags and jinja and adding it to the system prompt. It works 50% of the time.
Any luck with the thinking mode for Gemma 4 operating properly with your build?
I appreciate the "No phone home" stuff. Even if they want to track "anonymous" telemetry it's super hard to trust that stuff.
oobabooga4@reddit (OP)
Thinking with Gemma 4 works fine in the UI; it also alternates between thinking and calling tools automatically if you have tools enabled. I have tested this model very extensively.
nickless07@reddit
"Select a file that matches your model. Must be placed in ...user_data/mmproj/" Where are the settings to change the default path for models, mmproj and so on?
oobabooga4@reddit (OP)
You can customize the models folder, see here: https://www.reddit.com/r/LocalLLaMA/comments/1tbyyee/comment/olkwd6a/
But there isn't a `--mmproj-dir` flag right now. If on Linux, you can remove the folder and replace it with a symlink as a workaround.
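As a rough sketch of that workaround (paths are examples; adjust them to your own setup):

```python
# Hedged sketch of the Linux symlink workaround described above.
from pathlib import Path

app_mmproj = Path("user_data/mmproj")            # folder inside the portable build
my_mmproj = Path("/models/mmproj").resolve()     # where your mmproj files actually live

if app_mmproj.is_symlink():
    app_mmproj.unlink()                          # drop an old link if one exists
elif app_mmproj.exists():
    app_mmproj.rename(app_mmproj.with_name("mmproj.bak"))  # keep the original folder as a backup
app_mmproj.symlink_to(my_mmproj, target_is_directory=True)
```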
ai_without_borders@reddit
used the old text-generation-webui back in early 2023. gradio update hell was real: the UI would randomly break after pip installs and debugging it was miserable. electron was the right call. curious how `--fit on` handles kv cache overhead: is it just fitting weights, or does it account for cache at the current context length?
oobabooga4@reddit (OP)
It does account for context length, and also for MoE layers (what the old `--cpu-moe` flag used to do is now done automatically). It's a great feature in llama.cpp really.
siege72a@reddit
I'm currently using LM Studio, but I'm always interested in options. I have some (hopefully) quick questions:
I'm running two mismatched GPUs (16GB 5060 Ti and 8GB 4060). If I select "tensor", will it correctly balance between them? Is there a way to set the 5060 to have higher priority?
Is there a way to use my LM Studio model directory, without having to duplicate files?
My PC is running Windows 11, if that makes a difference.
oobabooga4@reddit (OP)
Setting `split-mode` to tensor raises the tokens/second by 60% for generation when using Qwen 3.6 27b, but it also creates compute buffers that may cause OOM errors. You can work around that by setting `tensor-split` to `60,40`, for instance, if the second GPU is OOMing.
You can use the `--model-dir` flag to load models from the existing LM Studio models folder. To make it automatic on every launch, you can edit `user_data/CMD_flags.txt` once, as described here: https://github.com/oobabooga/textgen#loading-a-model-automatically
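For reference, a hypothetical `user_data/CMD_flags.txt` for this would just contain a line along these lines (the path is a placeholder; substitute your actual LM Studio models folder):

```text
--model-dir C:\path\to\your\lmstudio\models
```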
siege72a@reddit
Thank you!
pmttyji@reddit
That's nice to have! Thanks for this big update!
sine120@reddit
I started on LM Studio and got kind of turned off of it in the past couple months, switched fully to llama.cpp and Openwebui/ Pi. I still have a couple of less techy friends I drag with me in the local LLM scene, and LM Studio was my entry point for them. I feel a lot better about recommending an actually local UI.
msitarzewski@reddit
Can't wait to try it. Downloaded the Apple Silicon version - macOS Tahoe said "No."
oobabooga4@reddit (OP)
See this issue, the build isn't signed, so you may need to run a command to tell macOS to stop blocking it: https://github.com/oobabooga/text-generation-webui/issues/7305 (I won't hack you I promise)
Sabin_Stargem@reddit
Hopefully, an addition can be made to the notebook: A collapsible tree structure, so that we can add discrete entries, alongside enabling or disabling them individually. That would be handy for my translation handbook rules, RPG lore, and so forth.
I am guessing the app doesn't support MTP models, as it failed to load LLMFan's 35b Heretic+MTP.
When trying to load a model in a multi-GPU setup with split-mode of 'tensor', it fails. I have a 3060 and a 4090.
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 12151.23 MiB on device 1: cudaMalloc failed: out of memory
D:\a\llama-cpp-binaries\llama-cpp-binaries\llama.cpp\ggml\src\ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
alloc_tensor_range: failed to allocate CUDA1 buffer of size 12741484032
07:59:11-325274 ERROR Error loading the model with llama.cpp: Server process terminated unexpectedly with exit code: 3221226505
Also, it would be nice if TheTom's TurboQuant+ were added to the KV settings. It should be noted that KV settings should be asymmetric if implemented.
boredquince@reddit
any plans for memory-like feature, or project memory or similar? like chatgpt or Claude? most if not all local apps don't have support for this. why? is it very hard to implement?
i know most have mcp support and there are MCP servers for that, but they're not included, which adds to the complexity
Dany0@reddit
Since vibe coding has solved programming, why won't you let it write the ui in high performance c or assembly. Why another clunky oversized electron app
I heard ralph can debug the app for you too
AnOnlineHandle@reddit
LLMs are good but they're nowhere near that good.
Even the leading models from Google and Anthropic regularly make simple mistakes that need human review to correct. e.g. They all still struggle with remembering that PyTorch 2's optimized attention function uses inverted mask logic compared to the older methods, and ML programming is the one area where the model creators can absolutely test the model, spot mistakes themselves, and would likely want the models to be very good.
Pleasant-Shallot-707@reddit
Harness better
Pleasant-Shallot-707@reddit
Rust, you uncouth swine, not c or assembly.
iamapizza@reddit
Wow, even Dario Amodei is commenting here.
iamapizza@reddit
I remember trying this project a year or so ago but it looks like it's come a long way since then. I like that you said portable build and Linux. I will try this tonight with llama.cpp, cheers for that.
EncampedMars801@reddit
Just wanna say, I remember trying your UI yeeaars ago back when it used that default orange gradio theme. Wasn't particularly impressed at the time, but finally tried it again a couple weeks ago and it's genuinely a great UI now. Great work! I'm glad it hasn't stagnated like maaaany other UIs
Merchant_Lawrence@reddit
Thanks for making a comeback, I hope you're well and have a good day
jamaalwakamaal@reddit
Thank you
Borkato@reddit
Finally, a private alternative to LM studio!! Thank you <3
Loved ooba from its beginnings!
noneabove1182@reddit
I think you mean an open alternative ;)
Due-Function-4877@reddit
Any hope of allowing power users to link an external build of llama.cpp in the future? It was a long time ago, but the main reason I shifted over to running my own backend directly was to get access to bleeding-edge builds. I always appreciated the way text-gen-web-ui/textgen let me configure my backend from a GUI. The command line is obtuse. Always has been and always will be.
oobabooga4@reddit (OP)
You can already do it. See this comment: https://www.reddit.com/r/LocalLLaMA/comments/1tbyyee/comment/olk7wl3/
pl201@reddit
Can you go into more detail on the ability to create custom characters for casual chats? How do you handle long-term memory? Is it possible to load a character card? What's the default system prompt for character chat?
ComplexType568@reddit
THANK YOU SO MUCH!! MORE COMPETITION TO LM STUDIO, PLEASE! I'M GETTING SICK OF IT.
apologies for the caps lock, i could write a whole essay about why LM Studio... well, pisses me off, to say the least.
Succubus-Empress@reddit
It doesn’t even start on my windows system
ComplexType568@reddit
Did you click start.bat? It works fine for me and I'm running pretttty vanilla Windows 11 (the ik_llama.cpp version on CUDA 12.4)
brickout@reddit
Mine neither
Visual-Afternoon-541@reddit
Great thanks, looking forward to seeing your project grow
seccondchance@reddit
Og bro
thatoneshadowclone@reddit
this is a great step forward, but GOD i'm SO sick of electron apps.
silenceimpaired@reddit
Does this version have EXL3 built in?
I really wish you could save and use different model loading setups. KoboldCPP does, and it works well for adjusting settings to ideally fit specific context sizes.
oobabooga4@reddit (OP)
No, for EXL3 you need to use the old installer described here: https://github.com/oobabooga/textgen#full-installation
This also unlocks LoRA training (I have completely refactored it and it's very aligned with axolotl now, with good defaults) and image generation with diffusers.
silenceimpaired@reddit
Not possible to have both GGUF and EXL3 in the software? I primarily have used your software for EXL3 since I’m used to other platforms for GGUF.
oobabooga4@reddit (OP)
Not in the portable builds, as EXL3 depends on Pytorch which is a ~10 GB dependency. But the full install does include EXL3, llama.cpp, and ik_llama.cpp all in one install.
AltruisticList6000@reddit
Yeah textgen is very nice, I use it all the time. It's like the A1111 of text generation, it's easy to use but also up to date. It both works as an app now and still can be run like a regular webui from browser (which I prefer), from the same ZIP without install.
-p-e-w-@reddit
Great to see this project improving continuously over the years!
Are you planning to get off your Gradio fork and upgrade to Gradio 6? There are some very noticeable performance improvements in recent versions, and the number of dependencies has been substantially reduced.
oobabooga4@reddit (OP)
Gradio has this issue where each time you update, the UI breaks completely. stable-diffusion-webui never updated to Gradio 4 for this reason, for instance.
I chose a third route (neither updating nor moving away), which was to fork Gradio and optimize it from the inside. The performance gains are truly huge and I'm at a point where I can't find things to optimize anymore. I also removed unused large requirements like matplotlib. Source is here: https://github.com/oobabooga/gradio/commits/main/
LMTLS5@reddit
damn the og is back. seriously easy app based text generation was such a huge gap. no real foss alternative so far. nice to see you back
dinerburgeryum@reddit
Hot damn dude, amazing work, as always.
thereisonlythedance@reddit
Congrats, looks very nice.
Is RAG functional these days? It being broken is why I drifted away from your otherwise excellent project.
oobabooga4@reddit (OP)
Text/pdf/docx attachments work but are put in full in the chat history. Models are loaded with `--fit on`, so the context length is automatically maximized given the available memory.
I haven't heard much of RAG these days, but it's something I could add on a future release.
jacek2023@reddit
nice to see this project is progressing. I was using it in 2023, and later it was also useful for running exl2 models, for example
Herr_Drosselmeyer@reddit
Thanks, it's a great app, works fine for me when running Gemma 4 31-B. It does what I need it to do and, to me, it's intuitive to use. I now prefer it over KoboldCPP (no shade on them, it's also great).
No_Afternoon_4260@reddit
Love that, oobabooga! Reminds me of my beginnings, it was the best webui to start with! Then I understood everything is an OpenAI-compatible API lol