I evaluated several small and SOTA LLMs on Python code generation

Posted by spacespacespapce@reddit | LocalLLaMA

Recently I've been experimenting with an agent to produce 3D models with Blender Python code.

Blender is specialized 3D rendering software that can evaluate Python scripts. Most LLMs can produce simple Blender scripts that make pyramids, spheres, etc., but generating complex geometry really puts these models to the test.
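To give a sense of the baseline, a "simple" Blender script is just a couple of `bpy` primitive calls. Something like this (my own illustration, not output from any of the models tested):

```python
import bpy

# Clear the default scene objects
bpy.ops.object.select_all(action='SELECT')
bpy.ops.object.delete()

# A sphere: one primitive call, no real geometry reasoning needed
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0, 0, 0))

# A "pyramid": a 4-sided cone is the usual trick
bpy.ops.mesh.primitive_cone_add(vertices=4, radius1=1.0, depth=1.5, location=(3, 0, 0))
```

Complex models need loops over vertices, modifiers, materials, and careful placement, which is where things fall apart.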

Setup

My architecture splits tasks between a 'coder' LLM, responsible for syntax and code generation, and a 'power' LLM, responsible for reasoning, planning, and the initial code draft. I chose this hybrid approach because I realized early on that 3D modelling scripts are too complex for a model to produce in one shot; they require planning and iteration.
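Roughly, the loop looks like the sketch below. This is a simplified illustration, not my actual code; the prompts and the `RunResult` shape are placeholders.

```python
from typing import Callable, NamedTuple

class RunResult(NamedTuple):
    ok: bool
    error: str

def build_model(
    task: str,
    power_llm: Callable[[str], str],       # reasoning / planning model
    coder_llm: Callable[[str], str],       # syntax / code-fix model
    run_in_blender: Callable[[str], RunResult],
    max_iters: int = 5,
) -> str:
    """Plan and draft with the 'power' model, then let the 'coder' model iterate."""
    plan = power_llm(f"Break this 3D modelling task into steps: {task}")
    script = power_llm(f"Write an initial Blender Python script for this plan:\n{plan}")

    for _ in range(max_iters):
        result = run_in_blender(script)
        if result.ok:
            break
        # The coder model only fixes syntax/API issues; it doesn't re-plan
        script = coder_llm(
            f"Fix this Blender script.\nError: {result.error}\nScript:\n{script}"
        )
    return script
```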

I also developed an MCP server to give the models access to up-to-date documentation for the Blender Python API (it's a dense library).
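A docs-lookup tool along these lines can be built with the MCP Python SDK. The sketch below is a stand-in, not my actual server: the `lookup_bpy_docs` tool and the hard-coded index are assumptions for illustration.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("blender-docs")

# Stand-in for a real index built from the bpy API reference
DOCS_INDEX = {
    "bpy.ops.mesh.primitive_uv_sphere_add": "Construct a UV sphere mesh. Args: radius, segments, ring_count, location, ...",
    "bmesh.ops.bevel": "Bevel edges/vertices of a bmesh. Args: geom, offset, segments, ...",
}

@mcp.tool()
def lookup_bpy_docs(symbol: str) -> str:
    """Return documentation for a Blender Python API symbol."""
    matches = {k: v for k, v in DOCS_INDEX.items() if symbol in k}
    return "\n".join(f"{k}: {v}" for k, v in matches.items()) or "No match found."

if __name__ == "__main__":
    mcp.run()
```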

The models I used:

Experimenting

I ran multiple combinations of models on 3D modelling tasks ranging in difficulty from "a low poly tree" to "a low poly city block".
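The sweep itself is just a grid over power/coder pairs and tasks, roughly like this (model IDs here are placeholders, not the models I actually used):

```python
import itertools

POWER_MODELS = ["power-a", "power-b"]
CODER_MODELS = ["coder-a", "coder-b"]
TASKS = ["a low poly tree", "a low poly house", "a low poly city block"]

# Each config gets run through the plan/draft/fix loop sketched above
configs = [
    {"power": p, "coder": c, "task": t}
    for p, c, t in itertools.product(POWER_MODELS, CODER_MODELS, TASKS)
]
```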

Each model can make tool calls whenever it needs to, but since the same calls may get repeated within a loop, I added a "memory" module to store tool calls. This was also toggled on/off to test its effects.
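One way to implement this, assuming the module is essentially a cache keyed on tool name and arguments (the post above only says it stores tool calls):

```python
import json

class ToolCallMemory:
    """Cache tool results so repeated calls in the same loop aren't re-executed."""

    def __init__(self):
        self._store = {}

    def call(self, tool_name, tool_fn, **kwargs):
        # Key on the tool name plus its JSON-serializable arguments
        key = (tool_name, json.dumps(kwargs, sort_keys=True))
        if key not in self._store:
            self._store[key] = tool_fn(**kwargs)
        return self._store[key]

# Usage: memory.call("lookup_bpy_docs", lookup_bpy_docs, symbol="primitive_uv_sphere_add")
```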

Key Takeaways

Qualitative observations