UI Icon Detection with Qwen3.5, Qwen3.6 and Gemma4
Posted by Jian-L@reddit | LocalLLaMA | 2 comments
Hey everyone,
I did a small personal benchmark on using local models to detect UI icons from application screenshots. English is not my first language, so sorry for any grammar mistakes! I just wanted to share what I found in case it helps someone doing similar stuff.
Models tested (no quantization):
- Gemma4-31B-it
- Qwen3.5-27B
- Qwen3.6-35B-A3B
Approach:
I feed the app screenshot into the LLM and ask it to recognize the UI icons and return their bbox_2d coordinates. Once it returns the coordinates, I use supervision to draw red bounding boxes on the image, and then I check the results manually by eye.
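To make the drawing step work, the model's text response has to be parsed into xyxy boxes first. Here is a minimal sketch of that parsing, assuming the model replies with Qwen-style JSON objects containing a `bbox_2d` key (the exact response format and this helper are my own illustration, not OP's actual script):

```python
import json
import re

def parse_bboxes(model_output: str) -> list[list[int]]:
    """Extract bbox_2d coordinate lists from a model response.

    Assumes the model replies with JSON objects like
    {"bbox_2d": [x1, y1, x2, y2], "label": "settings icon"},
    possibly wrapped in markdown fences or extra prose.
    """
    # Strip markdown code fences if present.
    cleaned = re.sub(r"```(?:json)?", "", model_output)
    boxes = []
    # Find every JSON array that follows a "bbox_2d" key.
    for match in re.finditer(r'"bbox_2d"\s*:\s*(\[[^\]]*\])', cleaned):
        coords = json.loads(match.group(1))
        if len(coords) == 4:
            boxes.append([int(c) for c in coords])
    return boxes
```

The resulting xyxy lists can then be handed to supervision (e.g. via `sv.Detections(xyxy=...)` plus a box annotator) to draw the red boxes.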
For the setup, I used the latest vLLM (v0.19.1) for offline inference. I set the initial temperature to 0 because I want the most confident output. If the model returns 0 icons, I gradually increase the temperature: 0 -> 0.3 -> 0.6 -> 0.9.
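The temperature escalation above is just a retry ladder; a minimal sketch, where `detect_icons` is a hypothetical stand-in for the real vLLM generate + bbox-parsing step:

```python
# Temperature escalation: start deterministic, loosen only if nothing is found.
TEMPERATURE_LADDER = [0.0, 0.3, 0.6, 0.9]

def detect_with_retries(detect_icons, image, ladder=TEMPERATURE_LADDER):
    """Call detect_icons(image, temperature) at each rung of the ladder.

    Returns the first non-empty list of boxes plus the temperature that
    produced it; if every rung comes back empty, returns ([], last_temp).
    """
    for temp in ladder:
        boxes = detect_icons(image, temperature=temp)
        if boxes:
            return boxes, temp
    return [], ladder[-1]
```

In an offline-inference setup, `detect_icons` would wrap something like `llm.generate(prompt, SamplingParams(temperature=temp))` followed by parsing the response.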
Overall Results:
Overall, the dense model is much better than the MoE model for this task. My ranking: Qwen3.5 > Qwen3.6 ≈ Gemma4
Some specific findings:
- Gemma4 and Qwen3.6 are tied for last place; both are noticeably worse than Qwen3.5.
- Gemma4 completely failed on the Cursor IDE screenshot. I tried 4 times, each time pushing the temperature all the way up to 0.9, and it still couldn't detect a single icon.
- Qwen3.6 did something really funny on the Photoshop screenshot: it recognized the entire image as one giant icon and drew a massive box around the whole screen. 😅
- For the other app scenarios, you can check the comparison pictures below.
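The "whole screen as one icon" failure mode above can be caught with a cheap post-filter that rejects boxes covering nearly the entire screenshot. A sketch (the 0.9 area threshold is my own arbitrary choice, not something from the benchmark):

```python
def drop_giant_boxes(boxes, img_w, img_h, max_area_frac=0.9):
    """Filter out xyxy boxes whose area covers almost the whole screenshot,
    since a real UI icon is never nearly the full screen."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area < max_area_frac * img_w * img_h:
            kept.append([x1, y1, x2, y2])
    return kept
```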
Here are the detailed vLLM parameters:
- name: gemma-4-31B-it
  family: gemma4
  params_b: 31
  vllm_kwargs:
    model: google/gemma-4-31B-it
    tensor_parallel_size: 8
    max_model_len: 8192
    max_num_seqs: 1
    gpu_memory_utilization: 0.85
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    skip_mm_profiling: true
    mm_processor_kwargs:
      max_soft_tokens: 1120
- name: qwen3.5-27b
  family: qwen3.5
  params_b: 27
  vllm_kwargs:
    model: Qwen/Qwen3.5-27B
    tensor_parallel_size: 8
    max_model_len: 32768
    max_num_seqs: 1
    gpu_memory_utilization: 0.9
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    mm_encoder_tp_mode: data
    skip_mm_profiling: true
- name: qwen3.6-35b-a3b
  family: qwen3.5
  params_b: 35
  vllm_kwargs:
    model: Qwen/Qwen3.6-35B-A3B
    tensor_parallel_size: 8
    max_model_len: 32768
    max_num_seqs: 1
    gpu_memory_utilization: 0.9
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    mm_encoder_tp_mode: data
    skip_mm_profiling: true
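For anyone wanting to reproduce this, each entry's `vllm_kwargs` maps directly onto vLLM's `LLM(...)` constructor for offline inference. A minimal sketch (the config dict below just mirrors the qwen3.5-27b entry above; the loading code is my own illustration, not OP's actual script):

```python
# One entry from the config above, as a plain Python dict.
config = {
    "name": "qwen3.5-27b",
    "vllm_kwargs": {
        "model": "Qwen/Qwen3.5-27B",
        "tensor_parallel_size": 8,
        "max_model_len": 32768,
        "max_num_seqs": 1,
        "gpu_memory_utilization": 0.9,
        "limit_mm_per_prompt": {"image": 1, "audio": 0, "video": 0},
    },
}

def build_engine(entry):
    """Instantiate the vLLM engine from one config entry.

    The import is deferred so the config itself can be inspected on a
    machine without GPUs or vLLM installed.
    """
    from vllm import LLM
    return LLM(**entry["vllm_kwargs"])
```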
Has anyone else tried UI element detection with local models recently? Curious if you guys have any tricks for getting better bounding boxes.
GaryDUnicorn@reddit
I was just testing and comparing a bunch of VL models in UFO2 earlier... for reasons...
Jian-L@reddit (OP)
Makes sense. 122B and 235B are too big for my local machine. 'Driving' an agent is definitely much harder than just finding a static icon like in my benchmark. Thanks for sharing the results!