Qwen3.5-4B-Base-ZitGen-V1
Posted by lolzinventor@reddit | LocalLLaMA
Hello LocalLLamas,
I'd like to share a fine-tuned model I've been working on:
Model: https://huggingface.co/lolzinventor/Qwen3.5-4B-Base-ZitGen-V1
I thought some of you might find it interesting. It is an image captioning fine-tune optimized for Stable Diffusion prompt generation (i.e., image-to-prompt).
What Makes This Unique
What makes this fine-tune unique is that the dataset (images + prompts) was generated entirely by LLMs tasked with regenerating a target image.
The Process
The process is as follows:
- The target image and the last generated image (blank if it's the first step) are provided to an LLM with a comparison prompt.
- The LLM outputs a detailed description of each image and the key differences between them.
- The comparison results and the last generated prompt (empty if it's the first step) are provided to an LLM with an SD generation prompt.
- The output prompt is sent to the ComfyUI API using Z-Image Turbo, and the output image is captured.
- Repeat N times.
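The loop above can be sketched in Python. The function names here (`compare_images`, `generate_prompt`, `render_image`) are placeholders standing in for the actual LLM calls and the ComfyUI API request, not the OP's code:

```python
# Sketch of the iterative comparison-and-correction loop (stub functions
# stand in for the real LLM and ComfyUI / Z-Image Turbo calls).

def compare_images(target, generated):
    """Stub: LLM describes both images and the key differences between them."""
    return f"differences between {target!r} and {generated!r}"

def generate_prompt(comparison, last_prompt):
    """Stub: LLM writes a new SD prompt from the comparison and the last prompt."""
    return (last_prompt + " refined").strip()

def render_image(prompt):
    """Stub: send the prompt to the image model and capture the output image."""
    return f"image({prompt})"

def refine(target, rounds=5):
    prompt, image = "", ""  # empty prompt / blank image on the first step
    for _ in range(rounds):
        comparison = compare_images(target, image)
        prompt = generate_prompt(comparison, prompt)
        image = render_image(prompt)
    return prompt, image
```

Each round feeds the previous prompt and image back in, so the prompt is nudged toward whatever the image model actually does with it.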
Training Details
The system employed between 4 and 6 rounds of comparison and correction to generate each prompt-image pair. In theory, this process adapts the prompt to minimize the difference between the target image and the generated image, thereby tailoring the prompt to the specific SD model being used.
The prompts were then ranked and filtered to remove occasional LLM errors, such as residuals from the original prompt or undesirable artifacts (e.g., watermarks). Finally, the prompts and images were formatted into the ShareGPT dataset format and used to train Qwen 3.5 4B.
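For reference, a ShareGPT record for an image-to-prompt pair commonly looks like the sketch below. The exact field names and the image path convention vary by training framework, so treat this as an assumed shape rather than the OP's exact schema:

```python
import json

# One training pair in a common ShareGPT layout (field names are an
# assumption; the prompt text here is illustrative, not from the dataset).
pair = {
    "image": "images/00008.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nGenerate a detailed prompt to recreate this image."},
        {"from": "gpt",
         "value": "A photorealistic landscape at golden hour, soft side lighting, ..."},
    ],
}

print(json.dumps(pair, indent=2))
```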
Dataset
Given that all the data used to create the fine-tune was created synthetically, is it free from any copyright issues?
verdooft@reddit
Interesting, have you uploaded the model as a GGUF file and the mmproj GGUF anywhere? I only see model.safetensors.
lolzinventor@reddit (OP)
uploading.... BF16 and Q8
verdooft@reddit
Thank you again, I tested it with this photo and this command line:
./build/bin/llama-mtmd-cli -m models/Qwen3.5/Qwen3.5-4B-Base-ZitGen-V1/Qwen3.5-4B-Base-ZitGen-V1-Q8_0.gguf --mmproj models/Qwen3.5/Qwen3.5-4B-Base-ZitGen-V1/Qwen3.5-4B-mmproj-BF16.gguf -c 32768 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-last-n -1 --presence-penalty 0.0 --repeat-penalty 1.0 --jinja --threads 8 --image 00008.jpg -p 'Generate a detailed prompt to recreate this image.'
Result:
For a 4B model this is very cool.
lolzinventor@reddit (OP)
I'm glad it's working. Putting the prompt into Z-Image Turbo:
verdooft@reddit
Putting the prompt into Qwen-Image with the 8-step LoRA (I have no GPU, which is why image generation takes a long time). The last "no ..." I added to the negative prompt without the "no". I will test it with other images in the future.
The original prompt was much shorter, but then the results depend more on the seed. :-)
verdooft@reddit
Thank you, I mostly use the BF16 mmproj and Q8 for the main model. I've tested recreating photos with generated prompts in the past too; I'll test your model.
verdooft@reddit
Works really well:
Executed this:
./build/bin/llama-mtmd-cli -m models/Qwen3.5/Qwen3.5-4B-Base-ZitGen-V1/Qwen3.5-4B-Base-ZitGen-V1-Q8_0.gguf --mmproj models/Qwen3.5/Qwen3.5-4B-Base-ZitGen-V1/Qwen3.5-4B-mmproj-BF16.gguf -c 32768 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-last-n -1 --presence-penalty 0.0 --repeat-penalty 1.0 --jinja --threads 8 --image 00008.jpg -p 'Generate a detailed prompt to recreate this image.'
Thank you for the model; for a 4B model this is very cool.
reto-wyss@reddit
I'm working on something similar, but a bit broader using synthetic (ZiT and Flux2-klein-4b) and real images.
I'm going to make it have multiple modes.
Did you use various aspect resolutions and total pixel counts? How many image-caption pairs did you use? Will you make the dataset available?
lolzinventor@reddit (OP)
It's about a 50/50 split of landscape and portrait (1600x1200). These were then downscaled for LLM training to 768 pixels on the longest side, so I could train within a 768x768 pixel budget. There are about 1,000 pairs. I'm still going through the dataset; it needs some cleaning. However, given that it's locally generated, I assume there are no copyright issues. Is it OK to share the data?
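The downscale arithmetic above (longest side to 768, aspect ratio preserved) can be checked with a few lines of plain Python, no image library needed:

```python
# Aspect-preserving downscale: longest side becomes `longest` pixels.
def downscale(w, h, longest=768):
    scale = longest / max(w, h)
    return round(w * scale), round(h * scale)

print(downscale(1600, 1200))  # landscape -> (768, 576)
print(downscale(1200, 1600))  # portrait  -> (576, 768)
```

So a 1600x1200 source lands at 768x576, comfortably inside a 768x768 training budget.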
reto-wyss@reddit
I'm not a law expert, so this is just my best understanding: