GLM-4-9B(Q5_K_L) Heptagon Balls sim (multi-prompt)
Posted by danihend@reddit | LocalLLaMA | 45 comments
Title pretty much says it but just to clarify - it wasn't one-shot. It was prompt->response->error, then this:
Here is an error after running the sim:
<error>
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\username\anaconda3\Lib\tkinter_init_.py", line 1967, in call
return self.func(*args)
^^^^^^^^^^^^^^^^
File "C:\Users\username\anaconda3\Lib\tkinter_init_.py", line 861, in callit
func(*args)
File "c:\Users\username\VSCodeProjects\model_tests\balls\GLM49B_Q5KL_balls.py", line 140, in update
current_time_ms = float(current_time)
^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'after#2'
</error>
Now think as hard as you can about why this is happening. Look at the entire script and consider how the parts work together. You are free to think as long as you need if you use thinking tags like this:
<think>thoughts here</think>.
Once finished thinking, just provide the patch to the code. No need to rewrite it all.
Then I applied the fix, got another error, replaced the original assistant code block with the new code, and presented the new error as if it were the first error by editing my message. I think that resulted in the working version.
So TL;DR - couple of prompts to get it working.
Simply pasting error after error did not work, but structured prompting with a bit of thinking seems to bring out some more potential.
Just thought I'd share in case it helps people with prompting it, and just to show that it is not a bad model for its size. The result is very similar to the 32B version.
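For context, the `'after#2'` error above is the classic Tkinter pitfall of treating the ID string returned by `widget.after()` as a timestamp. The original script isn't pasted here, so this is just an illustrative sketch of the pattern and the usual fix:

```python
import time
import tkinter as tk

root = tk.Tk()

def update():
    # Take the current time from a real clock, not from after():
    current_time_ms = time.perf_counter() * 1000.0

    # after() returns a timer ID string such as 'after#2', which is only
    # useful for after_cancel(). Passing that ID to float() is exactly what
    # raised "could not convert string to float: 'after#2'" above.
    root.after(16, update)  # schedule the next frame (~60 FPS)

root.after(16, update)
root.mainloop()
```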
Loud-Insurance4438@reddit
Okay, that's weird. Can you try retesting it using q6_k quant if possible? Cause in my case, it can zero-shot this prompt.
danihend@reddit (OP)
GLM-4-9B-Q6K_one-shot
Settings:
Temp: 0.5
Top K: 35
Repeat Penalty: 1 (off)
Top P: 0.89
Min P: 0.01
System Prompt: empty
I tried a few other settings with 0.6 temp, 40 top k I think - they had errors.
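For anyone who wants to reproduce this, here's a minimal sketch of those settings via llama-cpp-python (the GGUF file name, context size, and GPU offload below are placeholders, not part of the settings listed above):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4-9B-0414-Q6_K.gguf",  # placeholder path to whichever Bartowski quant you use
    n_ctx=8192,
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    # No system prompt, matching the settings above.
    messages=[{"role": "user", "content": "Write a sim of 20 balls bouncing inside a spinning heptagon, in Python."}],
    temperature=0.5,
    top_k=35,
    top_p=0.89,
    min_p=0.01,
    repeat_penalty=1.0,  # 1.0 = disabled
)
print(out["choices"][0]["message"]["content"])
```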
Loud-Insurance4438@reddit
You using quant from Bartowski?
danihend@reddit (OP)
Ya I always use his when they're available.
Loud-Insurance4438@reddit
You're right. I tried out q6_k and gave it 3 shots, and yeah, out of those 3 tries it could fail once, maybe even twice... I just hope the latest llama.cpp build improves the model's performance.
danihend@reddit (OP)
Maybe I was lucky with my one shot using that quant, but I would never have suspected that there was anything wrong with the model tbh. Will be happy if it can be even better though, of course.
It's really a step change to have a model this good and this small. Can't imagine how good they will be in a couple of years.
Loud-Insurance4438@reddit
Yeah.
Loud-Insurance4438@reddit
I might test the same prompt using the Q6_K quant. If I don't forget, I will drop the output here.
ilintar@reddit
Nice! I did some tests with various parameters here: https://github.com/pwilkin/glm4-quant-tests, I'll have to test your optimal setting since it's pretty close to mine (tk40, temp 0.6, tp 0.8).
danihend@reddit (OP)
Will check them out thanks!
opi098514@reddit
It should be able to one-shot it, or at least get really close. It's in the training data for the model.
danihend@reddit (OP)
You don't know what exactly is in the training data tbf. But it can one-shot it - see here:
GLM-4-9B-Q6K_one-shot
opi098514@reddit
I think there is no way it's not in the training data. Even if it's not in it intentionally, the internet scraping would have pulled it, and with the amount it's used as a benchmark, it should be able to do it.
danihend@reddit (OP)
the 9b?
phenotype001@reddit
Why does it refuse so much? One time it said something was too complicated, and another time it just refused to comply.
danihend@reddit (OP)
Only happened to me once, otherwise it's surprisingly willing to write a lot of code which is unusual for a small model. It attempts complicated things that no others will.
What system prompt do you use and settings etc?
phenotype001@reddit
No system prompt, temp=0.2, topk=40, topp=0.95, no minp.
Here's what I asked the first time:
Write a 3D demo with Python where I should continuously fly through an infinite procedurally generated tunnel. You should use native OpenGL with shaders. Assume I have a ./texture.jpg file with the texture for the inner walls of the tunnel.
It said that's too complicated. But when I led the answer so it would continue ("Here's the ....") it wrote the code - but it didn't work. The next thing I wanted was to draw a satanic pentagram in thick red - and it outright refused that. It worked again when I led the answer.
danihend@reddit (OP)
I tried with your settings + no sys prompt. Same as you - "quite extensive", "Here's the basic structure", etc.
I asked Gemini, then showed the result to GLM. Asked what it thought was impossible, and what I could have said to make it more confident. Its suggested prompt for itself:
"Write a Python program using the PyOpenGL library and the pygame windowing system to create a 3D demo where the user continuously flies through an infinitely long, procedurally generated tunnel.
**Mandatory Technologies:**
* Use PyOpenGL for all OpenGL calls (native bindings).
* Use pygame for window creation and event handling.
* Implement shaders using the `shaders` module from PyOpenGL. Do not use higher-level abstractions.
**Tunnel Characteristics:**
* **Procedural Generation:** The tunnel geometry must be generated algorithmically.
* The tunnel path should wobble vertically and horizontally using sine/cosine functions with configurable parameters (e.g., `PATH_FREQ_X`, `PATH_AMP_X`, `PATH_FREQ_Y`, `PATH_AMP_Y`).
* The tunnel radius should vary along its length using another sine function (e.g., `RADIUS_BASE`, `RADIUS_FREQ`, `RADIUS_AMP`).
* The tunnel walls should be represented as quad strips connecting rings of vertices.
* **Infinite Illusion:** Implement logic to generate new tunnel segments continuously ahead of the camera as the player moves forward. Discard old segments behind the camera to maintain performance. Use a data structure like `collections.deque` to efficiently manage the active segments.
* **Texture:** Load a texture from a file named `texture.jpg` using PyOpenGL and apply it to the tunnel walls. Ensure the texture wraps correctly (e.g., using `GL_REPEAT` parameters). Generate vertex texture coordinates (`t_S`) that repeat the texture appropriately along the tunnel length (`t_L`) and around the tunnel circumference (`t_C`).
* **Camera:** The camera should move continuously forward along the centerline of the generated tunnel. Implement a smooth 'up' vector calculation to prevent the camera from rolling as the path curves.
* **Rendering Loop:** Create a main loop that handles events, updates the camera position and tunnel geometry, and renders the scene.
* **Shaders:**
* **Vertex Shader:** Pass vertex position (`vPosition`) and texture coordinates (`vTexCoord`) to the fragment shader. Transform positions by the model-view-projection matrix.
* **Fragment Shader:**
* Implement a fog effect. The fog density should decrease with distance from the camera (e.g., using a linear or exponential fog formula). Pass a fog color and density parameter from the CPU.
* Implement a very basic ambient light source (e.g., a constant color added to the fragment color).
**Performance Considerations:**
* Use Vertex Buffer Objects (VBOs) to store the static geometry data (vertices, normals, texture coordinates) for the tunnel walls. Update the VBO efficiently as new segments are generated and old ones are discarded.
* Use Vertex Array Objects (VAOs) if performance is an issue.
* Use `glCullFace` with `GL_BACK` for back-face culling.
**Structure:**
* Organize the code logically with functions for initialization, updating, rendering, and potentially utility functions for matrix math (though using numpy is acceptable).
* Include comments explaining the purpose of key sections of code.
**Assumptions:**
* The required Python libraries (`pygame`, `PyOpenGL`, `numpy`) are installed.
* A file named `texture.jpg` exists in the same directory as the script.
**Output:** Generate the Python code as the final output."
I will not bother polluting the chat with the resulting code, as you can do that yourself, and I could not test it because I don't want to install the required dependencies, but let me know if it is any better or even close to working. Sounds like a complicated task for a model this size, even given its impressive abilities.
The code Gemini 2.5 Pro wrote was over twice as long, so I am guessing it would not have worked.
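That said, the core "infinite tunnel" bookkeeping in that prompt is pretty small on its own. Here's a rough sketch of just the path/segment idea (wobbling centerline, varying radius, deque of rings) with no OpenGL - the constants are made up, not taken from any generated code:

```python
# Minimal sketch of the procedural-path and segment-management idea from the
# prompt above: a wobbling centerline, a varying radius, and a deque that
# discards rings behind the camera. No rendering; all values are illustrative.
import math
from collections import deque

PATH_FREQ_X, PATH_AMP_X = 0.31, 2.0
PATH_FREQ_Y, PATH_AMP_Y = 0.17, 1.5
RADIUS_BASE, RADIUS_FREQ, RADIUS_AMP = 3.0, 0.23, 0.8
RING_SIDES = 24          # vertices per ring
SEGMENT_SPACING = 0.5    # distance between rings along z
ACTIVE_SEGMENTS = 64     # rings kept alive around the camera

def ring(z):
    """Return the vertices of one tunnel ring at depth z."""
    cx = PATH_AMP_X * math.sin(PATH_FREQ_X * z)
    cy = PATH_AMP_Y * math.cos(PATH_FREQ_Y * z)
    r = RADIUS_BASE + RADIUS_AMP * math.sin(RADIUS_FREQ * z)
    return [(cx + r * math.cos(a), cy + r * math.sin(a), z)
            for a in (2 * math.pi * i / RING_SIDES for i in range(RING_SIDES))]

segments = deque(maxlen=ACTIVE_SEGMENTS)  # old rings fall off automatically
next_z = 0.0

def advance(camera_z):
    """Generate rings ahead of the camera; the deque maxlen drops old ones."""
    global next_z
    while next_z < camera_z + ACTIVE_SEGMENTS * SEGMENT_SPACING:
        segments.append(ring(next_z))
        next_z += SEGMENT_SPACING

advance(camera_z=0.0)
print(f"{len(segments)} rings, first vertex: {segments[0][0]}")
```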
danihend@reddit (OP)
I had one time with a snake game yesterday where it left LOADS of placeholders 😆. Just an unlucky roll of the dice I guess. I have not had any issues otherwise.
Do you have issues with refusals a lot? What are your settings and system prompt and user prompt?
matteogeniaccio@reddit
I found a template bug that is causing degraded performance of the model in llama.cpp and ollama. I submitted the bugfix to both projects and I'm waiting for the merge before uploading the updated GGUF files.
After the fix the model is able to oneshot entire games like this without resorting to multi prompt requests: https://www.reddit.com/r/LocalLLaMA/comments/1k6nuo3/glm432b_missile_command/
segmond@reddit
link to llama.cpp PR?
matteogeniaccio@reddit
https://github.com/ggml-org/llama.cpp/pull/13099
Glittering-Bag-4662@reddit
Is the model on ollama? Where would I pull?
tengo_harambe@reddit
I don't fully understand the technical details of this bug. Why does the GGUF need to be regenerated if the inference engine itself is applying the wrong chat template for the model? Using koboldcpp (layer over llama.cpp), I manually format the prompt like below and have no apparent performance degradation.
matteogeniaccio@reddit
Then you are probably not affected.
The bug is triggered when using the json template baked into the gguf or the legacy template in llama.cpp.
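For reference, "not using the baked-in template" looks roughly like this: format the prompt yourself and send it to llama.cpp's raw /completion endpoint. The GLM-4 special tokens below are from memory of the model card, so double-check them against the model's chat_template before relying on this:

```python
# Sketch of bypassing the built-in chat template by pre-formatting the prompt.
# The [gMASK]<sop> / <|user|> / <|assistant|> tokens are an assumption taken
# from memory -- verify them against the GGUF's chat_template.
import requests

def glm4_prompt(user_msg, system_msg=""):
    prompt = "[gMASK]<sop>"
    if system_msg:
        prompt += f"<|system|>\n{system_msg}"
    prompt += f"<|user|>\n{user_msg}<|assistant|>\n"
    return prompt

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp server default port
    json={
        "prompt": glm4_prompt("Write a snake game in Python."),
        "temperature": 0.5,
        "top_k": 35,
        "top_p": 0.89,
        "min_p": 0.01,
        "n_predict": 2048,
    },
)
print(resp.json()["content"])
```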
Admirable-Star7088@reddit
I have waited a couple of days to try this model because people have commented that it's very buggy. Just to be clear, should it now perform bug-free if I use this in Koboldcpp / LM Studio with a recently created quant and set the prompt template manually?
matteogeniaccio@reddit
All other known bugs have been fixed in llama.cpp based backends. A recent quant with manual template should perform properly.
danihend@reddit (OP)
All other bugs but this one? I'm not 100% clear if this bug is fixed or not.
matteogeniaccio@reddit
This one is not merged yet. The other ones are merged in llama.cpp.
danihend@reddit (OP)
ok thanks. So we should expect it to get even better then, nice :)
danihend@reddit (OP)
My understanding was that it was fixed but I may be wrong.
danihend@reddit (OP)
Those look great! They are by the 32B model though right? Mine was with the 9B model. Does it have the same issue?
matteogeniaccio@reddit
Yes. And yes.
It was made with the 32b chat model. The 9b model has the same issue.
Dr_Karminski@reddit
Nice work 👍
offlinesir@reddit
It's impressive, but we do need new tests. This test is now definitely in the training data.
danihend@reddit (OP)
I am 100% in favor of new and evolving tests/benchmarks, but, I also think it's interesting to just use these tests as a barometer to just get a kind of feeling for the things the model finds easy or challenging compared to other ones.
NNN_Throwaway2@reddit
I don't think it's particularly interesting, as the test only tells you how the model handles that particular task, not anything else.
LLM performance is determined mostly by training data. Either a model needs to have seen a similar pattern or it needs to have been trained to solve similar kinds of problems, because they can't handle novel inputs and thus can't generalize or apply their knowledge as well as a human can.
imo any benchmarks with only dozens or hundreds of prompts are borderline useless and mainly exist to wow investors. Models need to be evaluated on much larger datasets, with tens or hundreds of thousands of test cases, to get a realistic idea of their performance. Moreover, the dataset can't be static, or you run into the same issue of benchmaxxing and stagnation.
danihend@reddit (OP)
I've run an unhealthy number of tests across virtually every model you can think of, and I definitely found it interesting! :)
When asking models to create a Snake game, I use specific prompts that aren't part of standard training data. This does help to evaluate how well models can generalize knowledge I think.
It's similar to how multiple-choice benchmark performance craters when options are mixed up or questions are slightly reworded. Likewise, prompts to create a Snake game, Tetris, or a simulation of balls bouncing in a heptagon aren't all identical. Each one introduces variations and different challenges that effectively make it a different task, requiring more flexibility and generalization to solve. Asking "make the game snake in python" will get you the most boring, bland implementation that looks like one of the 2 or 3 implementations all models produce. Specifying rules, styles, mechanics, etc. suddenly forces it to solve all of these different things in addition to making that Snake game, and some models do this better than others.
There are clear patterns: frontier models typically perform better, with interesting outliers like GPT-4.1 mini, Grok 3 mini, and GLM-4 32B/9B doing much better than expected, while o4-mini (low) / o3 struggle with simple tasks and need additional prompting, similar to smaller models.
I haven't systematically tested how performance on these toy tasks correlates with more complex, novel challenges, but I suspect there's a meaningful relationship there.
The idea that "it's in the training data so don't bother testing" is overly simplistic I think.
NNN_Throwaway2@reddit
I didn't say "don't bother testing" but that a one-shot example does not, by itself, tell you anything conclusive. Do models do better on the types of test you run because they have certain advantages in how they're trained versus what they're trained on, or is it a simple numbers game down to parameter size and training datasets? You can think you've come up with something novel, but we don't ultimately know what the training data looks like if it isn't open source along with the model.
QwQ is a great example of this. It can ace some tests with similar quality to frontier models, but I've had it trip over basic problems that a model with hypothetical real "smarts" should have been able to reason through regardless of specific knowledge of the problem.
To put it succinctly, I doubt there is much of a useful correlation, if any at all, between the results you observed and novel challenges. It's been shown that even frontier models show a huge performance deficit in these situations.
tengo_harambe@reddit
This is definitely in the training data FYI, they even demoed it on the huggingface page. Impressive model either way.
danihend@reddit (OP)
I did notice that - although I'm not convinced it's in the training data, but I would also not be so surprised if it was. But ya, either way, these GLM-4 models are an obvious advancement in intelligence per weight, no doubt.
Speaking of knowing what's in the training data - the Allen Institute for AI lets you see which docs were drawn upon for a response by their models. I found that to be pretty cool and have seen nobody else doing that.
You can see their models here: playground.allenai.org
DinoAmino@reddit
That's interesting! Nobody else does it because not many publish the datasets they use in training the models. AllenAI publishes everything including the training recipes.
danihend@reddit (OP)
Ya it's pretty cool that they do that. Am hoping to see something better from them in the future. They have just been building on Qwen/Llama models for now and they're not that good.
Jarlsvanoid@reddit
I'm impressed by this model. Not only in coding skills, but also in logical reasoning in the legal field. It passes all my tests flawlessly and with excellent language.
Cool-Chemical-5629@reddit
This is why I love these models. They are pushing for that ChatGPT quality in a small package.