UI Icon Detection with Qwen3.5, Qwen3.6 and Gemma4
Posted by Jian-L@reddit | LocalLLaMA | 2 comments
Hey everyone,
I did a small personal benchmark on using local models to detect UI icons from application screenshots. English is not my first language, so sorry for any grammar mistakes! I just wanted to share what I found in case it helps someone doing similar stuff.
Models tested (no quantization):
- Gemma4-31B-it
- Qwen3.5-27B
- Qwen3.6-35B-A3B
Approach:
I feed the app screenshot into the LLM and ask it to recognize the UI icons and return their bbox_2d coordinates. Once it returns the coordinates, I use supervision to draw red bounding boxes on the image, and then I check the results manually by eye.
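To make the drawing step work, the model's text response has to be parsed into xyxy boxes first. Here is a minimal sketch of that parsing, assuming the model replies with Qwen-style JSON objects containing a `bbox_2d` key (the exact response format and this helper are my own illustration, not OP's actual script):

```python
import json
import re

def parse_bboxes(model_output: str) -> list[list[int]]:
    """Extract bbox_2d coordinate lists from a model response.

    Assumes the model replies with JSON objects like
    {"bbox_2d": [x1, y1, x2, y2], "label": "settings icon"},
    possibly wrapped in markdown fences or extra prose.
    """
    # Strip markdown code fences if present.
    cleaned = re.sub(r"```(?:json)?", "", model_output)
    boxes = []
    # Find every JSON array that follows a "bbox_2d" key.
    for match in re.finditer(r'"bbox_2d"\s*:\s*(\[[^\]]*\])', cleaned):
        coords = json.loads(match.group(1))
        if len(coords) == 4:
            boxes.append([int(c) for c in coords])
    return boxes
```

The resulting xyxy lists can then be handed to supervision (e.g. via `sv.Detections(xyxy=...)` plus a box annotator) to draw the red boxes.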
For the setup, I used the latest vLLM (v0.19.1) for offline inference. I set the initial temperature to 0 because I want the most confident output. If the model returns 0 icons, I gradually increase the temperature: 0 -> 0.3 -> 0.6 -> 0.9.
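The temperature escalation above is just a retry ladder; a minimal sketch, where `detect_icons` is a hypothetical stand-in for the real vLLM generate + bbox-parsing step:

```python
# Temperature escalation: start deterministic, loosen only if nothing is found.
TEMPERATURE_LADDER = [0.0, 0.3, 0.6, 0.9]

def detect_with_retries(detect_icons, image, ladder=TEMPERATURE_LADDER):
    """Call detect_icons(image, temperature) at each rung of the ladder.

    Returns the first non-empty list of boxes plus the temperature that
    produced it; if every rung comes back empty, returns ([], last_temp).
    """
    for temp in ladder:
        boxes = detect_icons(image, temperature=temp)
        if boxes:
            return boxes, temp
    return [], ladder[-1]
```

In an offline-inference setup, `detect_icons` would wrap something like `llm.generate(prompt, SamplingParams(temperature=temp))` followed by parsing the response.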
Overall Results:
Overall, the dense model is much better than the MoE model for this task. My ranking: Qwen3.5 > Qwen3.6 ≈ Gemma4
Some specific findings:
- Gemma4 and Qwen3.6 are tied for last place; both are noticeably worse than Qwen3.5.
- Gemma4 completely failed on the Cursor IDE screenshot. I tried 4 times, each time pushing the temperature all the way up to 0.9, and it still couldn't detect a single icon.
- Qwen3.6 did something really funny on the Photoshop screenshot: it recognized the entire image as one giant icon and drew a massive box around the whole screen. 😅
- For the other app scenarios, you can check the comparison pictures below.
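The "whole screen as one icon" failure mode above can be caught with a cheap post-filter that rejects boxes covering nearly the entire screenshot. A sketch (the 0.9 area threshold is my own arbitrary choice, not something from the benchmark):

```python
def drop_giant_boxes(boxes, img_w, img_h, max_area_frac=0.9):
    """Filter out xyxy boxes whose area covers almost the whole screenshot,
    since a real UI icon is never nearly the full screen."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area < max_area_frac * img_w * img_h:
            kept.append([x1, y1, x2, y2])
    return kept
```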
Here are the detailed vLLM parameters:
- name: gemma-4-31B-it
  family: gemma4
  params_b: 31
  vllm_kwargs:
    model: google/gemma-4-31B-it
    tensor_parallel_size: 8
    max_model_len: 8192
    max_num_seqs: 1
    gpu_memory_utilization: 0.85
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    skip_mm_profiling: true
    mm_processor_kwargs:
      max_soft_tokens: 1120
- name: qwen3.5-27b
  family: qwen3.5
  params_b: 27
  vllm_kwargs:
    model: Qwen/Qwen3.5-27B
    tensor_parallel_size: 8
    max_model_len: 32768
    max_num_seqs: 1
    gpu_memory_utilization: 0.9
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    mm_encoder_tp_mode: data
    skip_mm_profiling: true
- name: qwen3.6-35b-a3b
  family: qwen3.5
  params_b: 35
  vllm_kwargs:
    model: Qwen/Qwen3.6-35B-A3B
    tensor_parallel_size: 8
    max_model_len: 32768
    max_num_seqs: 1
    gpu_memory_utilization: 0.9
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    mm_encoder_tp_mode: data
    skip_mm_profiling: true
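For anyone wanting to reproduce this, each entry's `vllm_kwargs` maps directly onto vLLM's `LLM(...)` constructor for offline inference. A minimal sketch (the config dict below just mirrors the qwen3.5-27b entry above; the loading code is my own illustration, not OP's actual script):

```python
# One entry from the config above, as a plain Python dict.
config = {
    "name": "qwen3.5-27b",
    "vllm_kwargs": {
        "model": "Qwen/Qwen3.5-27B",
        "tensor_parallel_size": 8,
        "max_model_len": 32768,
        "max_num_seqs": 1,
        "gpu_memory_utilization": 0.9,
        "limit_mm_per_prompt": {"image": 1, "audio": 0, "video": 0},
    },
}

def build_engine(entry):
    """Instantiate the vLLM engine from one config entry.

    The import is deferred so the config itself can be inspected on a
    machine without GPUs or vLLM installed.
    """
    from vllm import LLM
    return LLM(**entry["vllm_kwargs"])
```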
Has anyone else tried UI element detection with local models recently? Curious if you guys have any tricks for getting better bounding boxes.
GaryDUnicorn@reddit
I was just testing and comparing a bunch of VL models in UFO2 earlier... for reasons...
Jian-L@reddit (OP)
Makes sense. 122B and 235B are too big for my local machine. 'Driving' an agent is definitely much harder than just finding a static icon like in my benchmark. Thanks for sharing the results!