best image classification model for 8GB VRAM
Posted by ashendonep@reddit | LocalLLaMA | View on Reddit | 17 comments
i have an rtx 3060 ti with 8gb vram, trying to use it to classify images, like: 'out of 5k car images tell me which ones are red'. i tried:
- qwen3.5:9b
- moondream:latest
- haervwe/GLM-4.6V-Flash-9B:latest
- llava:7b-v1.6-mistral-q4_K_M
- llava:latest
the best one was qwen3.5:9b but also the slowest one (like 3 minutes per image), so 5k images would take a decade. what can i do? AI did not help ToT
here are my options if it helps:
  options: {
    num_gpu: -1,
    num_ctx: 4096,
    temperature: 0,
    top_k: 1,
    top_p: 1,
    repeat_penalty: 1,
    use_mlock: false,
    use_mmap: true,
    flash_attn: true,
    kv_cache_type: "q4_0",
    num_keep: 0,
  },
  keep_alive: 120,
});
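(Those options look like they come from the Ollama JS client. For reference, a minimal Python equivalent of the batch loop might look like the sketch below, using the official `ollama` Python client; the model name, prompt, and the `is_red()` answer parsing are assumptions, not anything from the post.)

```python
# Hedged sketch of a batch classification loop with the `ollama` Python
# client (pip install ollama). Model name, prompt wording, and is_red()
# are my own assumptions, not the OP's actual script.

def is_red(answer: str) -> bool:
    """Treat the reply as 'red' only if it starts with yes; at
    temperature 0 the prompt below should yield a bare yes/no."""
    return answer.strip().lower().startswith("yes")

def classify_images(paths, model="qwen3-vl:8b"):
    import ollama  # imported here so is_red() stays importable without it
    red = []
    for path in paths:
        resp = ollama.chat(
            model=model,
            messages=[{
                "role": "user",
                "content": "Is the car in this image red? Answer only yes or no.",
                "images": [path],
            }],
            options={"temperature": 0, "top_k": 1, "num_ctx": 4096},
        )
        if is_red(resp["message"]["content"]):
            red.append(path)
    return red
```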
qwen_next_gguf_when@reddit
unsloth/Qwen3-VL-4B-Instruct-GGUF
PhoneOk7721@reddit
3.5, you mean?
ashendonep@reddit (OP)
is it good for my gpu? also, for this task will it get the colors wrong or is it accurate?
qwen_next_gguf_when@reddit
Yes
ashendonep@reddit (OP)
ty
Mashic@reddit
Step 1: Rescale the images to a smaller size using ffmpeg/pyvips/pillow. Step 2: Use Qwen3-VL-8B with a python script to batch process them.
I bet you could do 5000 images in 1.3 hours at most. For example, I used it to OCR hardcoded subtitles in a video, about 700 lines in maybe 20 minutes.
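(Step 1 above can be sketched with Pillow; the 512px cap, paths, and function name below are assumptions, not from the comment. `Image.thumbnail` resizes in place and preserves aspect ratio.)

```python
# Sketch of step 1 with Pillow (pip install pillow): downscale every
# image so the VLM has far fewer pixels to encode. The 512px cap and
# the directory layout are my own choices.
from pathlib import Path
from PIL import Image

def downscale(src_dir: str, dst_dir: str, max_side: int = 512) -> int:
    """Resize each image so its longest side is <= max_side; returns count."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    n = 0
    for p in Path(src_dir).iterdir():
        if p.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        with Image.open(p) as im:
            im.thumbnail((max_side, max_side))  # in place, keeps aspect ratio
            im.convert("RGB").save(out / (p.stem + ".jpg"), quality=90)
        n += 1
    return n
```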
ashendonep@reddit (OP)
wow, looks like a good solution, for sure i will try it, ty
Top-Rub-4670@reddit
Don't use 8B, 4B is perfectly capable of telling car colors and will go at twice the speed.
Nick-Sanchez@reddit
You're trying to use an enormous lathe to convert a tree trunk into a toothpick. If you have to do it with an LLM, try some qwen3-vl model, the smallest one you can find, although a python script with a simpler YOLO model would be ideal.
ashendonep@reddit (OP)
i'm not forced to do it with an llm. what are the better options? you said python script, should i search for that or is there something more efficient?
Nick-Sanchez@reddit
Tell the qwen 9b model to help you design something lighter! It's a decent coding model, I'm sure it can assist you with that. For example:
CLIP is an "image-text" model that is much smaller and faster than a 9B VLM, but smart enough to understand colors. We use YOLO to find the cars first, then let CLIP "look" at the crops to confirm which ones are red.
The "Find & Confirm" Script
This script uses YOLO to find the objects and CLIP to verify the color.
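(The script itself didn't survive the copy. Below is a sketch of the described find-and-confirm pipeline, assuming `ultralytics` for YOLO and Hugging Face `transformers` for CLIP; all function names, model checkpoints, and prompts are assumptions, not the commenter's code.)

```python
# Sketch of the "find & confirm" pipeline: YOLO finds car boxes, CLIP
# scores each crop against text prompts. Assumes `pip install
# ultralytics transformers torch pillow`; names are mine.

def pick_label(scores, labels):
    """Return the label whose CLIP similarity score is highest."""
    return labels[scores.index(max(scores))]

def find_red_cars(image_paths):
    import torch
    from PIL import Image
    from ultralytics import YOLO
    from transformers import CLIPModel, CLIPProcessor

    detector = YOLO("yolov8n.pt")  # tiny COCO model; class 2 = car
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompts = ["a red car", "a car that is not red"]

    red = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        for box in detector(img, classes=[2])[0].boxes:  # cars only
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            crop = img.crop((x1, y1, x2, y2))
            inputs = proc(text=prompts, images=crop,
                          return_tensors="pt", padding=True)
            with torch.no_grad():
                scores = clip(**inputs).logits_per_image[0].tolist()
            if pick_label(scores, prompts) == "a red car":
                red.append(path)
                break  # one red car is enough for this image
    return red
```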
Why this architecture is superior:
It's flexible, too: to find blue trucks instead, just change the YOLO class to 7 (truck) and the CLIP text to "a blue truck". By chaining these two models, you've created a specialized pipeline that is faster, cheaper, and more accurate than the "giant model" approach.
ashendonep@reddit (OP)
ty bro, for sure i will try this, and ty for putting that much effort into helping
comfyui_user_999@reddit
I bet something dumber and less resource intensive than a VLM would do this faster, but that may not be what you're looking for here.
ashendonep@reddit (OP)
i have no idea for real if there's something more efficient for doing that
New_Comfortable7240@reddit
Did you try Florence 2 for the first pass?
ashendonep@reddit (OP)
no i didn't , for sure i will try ty
Forward_Compute001@reddit
It takes you 2 weeks, not a decade, and it does the job basically...
buy a second 3060... it's a pretty good bang for the buck if you get them second hand.