UI Icon Detection with Qwen3.5, Qwen3.6 and Gemma4

Posted by Jian-L@reddit | LocalLLaMA | View on Reddit | 2 comments

Hey everyone,

I did a small personal benchmark on using local models to detect UI icons from application screenshots. English is not my first language, so sorry for any grammar mistakes! I just wanted to share what I found in case it helps someone doing similar stuff.

Models used (no quantization):

Approach:

I feed the app screenshot to the LLM and ask it to recognize the UI icons and return bbox_2d coordinates. Once it gives me the coordinates, I use supervision to draw red bounding boxes on the image. Finally, I check the results manually by eye.
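The parsing step can be sketched like this. It assumes the model is prompted to reply with a JSON array of objects shaped like `{"bbox_2d": [x1, y1, x2, y2], "label": "..."}` (that schema is my assumption, not something the post pins down), and it returns plain xyxy boxes that you could then hand to supervision's `sv.Detections` / `sv.BoxAnnotator` for the red-box drawing:

```python
import json

def parse_bbox_response(raw: str):
    """Parse a model reply into a list of (label, [x1, y1, x2, y2]) boxes.

    Assumes a JSON array of {"bbox_2d": [...], "label": "..."} objects --
    the exact output schema is an assumption for this sketch.
    """
    # Models often wrap JSON in a markdown fence; strip it if present.
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 > x1 and y2 > y1:  # drop degenerate / zero-area boxes
            boxes.append((item.get("label", "icon"), [x1, y1, x2, y2]))
    return boxes
```

From there, `supervision` can render the boxes by building `sv.Detections(xyxy=...)` from the parsed coordinates and annotating the screenshot with a red `sv.BoxAnnotator`.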

For the setup, I used the newest vLLM v0.19.1 for offline inference. I start with temperature 0 because I want the most confident output. If the model returns 0 icons, I gradually increase the temperature: 0 -> 0.3 -> 0.6 -> 0.9.
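The temperature escalation can be written as a small retry loop. Here `detect_fn(image, temperature)` is a placeholder for whatever wrapper runs the model (e.g. a vLLM offline-inference call) and returns a list of boxes; the name and signature are mine, not vLLM's API:

```python
def detect_with_escalation(detect_fn, image, temps=(0.0, 0.3, 0.6, 0.9)):
    """Retry detection with increasing temperature until icons come back.

    `detect_fn` is a hypothetical callable standing in for the actual
    model invocation; only the escalation schedule comes from the post.
    """
    for t in temps:
        boxes = detect_fn(image, temperature=t)
        if boxes:  # stop at the lowest temperature that yields anything
            return boxes, t
    return [], temps[-1]  # gave up: no icons even at the highest temp
```

The point of the loop is that you keep the deterministic temperature-0 answer whenever it works, and only pay the randomness cost when the model refuses to emit any boxes.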

Overall Results:

Overall, the dense models are much better than the MoE model for this task. My ranking: Qwen3.5 > Qwen3.6 ≈ Gemma4

Some specific findings:

Here are the detailed vLLM parameters:

  - name: gemma-4-31B-it
    family: gemma4
    params_b: 31
    vllm_kwargs:
      model: google/gemma-4-31B-it
      tensor_parallel_size: 8
      max_model_len: 8192
      max_num_seqs: 1
      gpu_memory_utilization: 0.85
      limit_mm_per_prompt:
        image: 1
        audio: 0
        video: 0
      mm_processor_cache_gb: 0
      skip_mm_profiling: true
      mm_processor_kwargs:
        max_soft_tokens: 1120


  - name: qwen3.5-27b
    family: qwen3.5
    params_b: 27
    vllm_kwargs:
      model: Qwen/Qwen3.5-27B
      tensor_parallel_size: 8
      max_model_len: 32768
      max_num_seqs: 1
      gpu_memory_utilization: 0.9
      limit_mm_per_prompt:
        image: 1
        audio: 0
        video: 0
      mm_processor_cache_gb: 0
      mm_encoder_tp_mode: data
      skip_mm_profiling: true


  - name: qwen3.6-35b-a3b
    family: qwen3.5
    params_b: 35
    vllm_kwargs:
      model: Qwen/Qwen3.6-35B-A3B
      tensor_parallel_size: 8
      max_model_len: 32768
      max_num_seqs: 1
      gpu_memory_utilization: 0.9
      limit_mm_per_prompt:
        image: 1
        audio: 0
        video: 0
      mm_processor_cache_gb: 0
      mm_encoder_tp_mode: data
      skip_mm_profiling: true

Has anyone else tried UI element detection with local models recently? Curious if you guys have any tricks for getting better bounding boxes.