best image classification model for 8GB VRAM
Posted by ashendonep@reddit | LocalLLaMA | View on Reddit | 17 comments
i have an rtx 3060 ti with 8gb vram, trying to use it to classify images, like: 'out of 5k car images tell me which ones are red'. i tried:
- qwen3.5:9b
- moondream:latest
- haervwe/GLM-4.6V-Flash-9B:latest
- llava:7b-v1.6-mistral-q4_K_M
- llava:latest
the best one was qwen3.5:9b but also the slowest one (like 3 minutes per image), so 5k images would take a decade. what can i do? AI did not help ToT
here are my options if it helps:
  options: {
    num_gpu: -1,
    num_ctx: 4096,
    temperature: 0,
    top_k: 1,
    top_p: 1,
    repeat_penalty: 1,
    use_mlock: false,
    use_mmap: true,
    flash_attn: true,
    kv_cache_type: "q4_0",
    num_keep: 0,
  },
  keep_alive: 120,
});
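(Those options look like they come from the Ollama JS client. For reference, a minimal Python equivalent of the batch loop might look like the sketch below, using the official `ollama` Python client; the model name, prompt, and the `is_red()` answer parsing are assumptions, not anything from the post.)

```python
# Hedged sketch of a batch classification loop with the `ollama` Python
# client (pip install ollama). Model name, prompt wording, and is_red()
# are my own assumptions, not the OP's actual script.

def is_red(answer: str) -> bool:
    """Treat the reply as 'red' only if it starts with yes; at
    temperature 0 the prompt below should yield a bare yes/no."""
    return answer.strip().lower().startswith("yes")

def classify_images(paths, model="qwen3-vl:8b"):
    import ollama  # imported here so is_red() stays importable without it
    red = []
    for path in paths:
        resp = ollama.chat(
            model=model,
            messages=[{
                "role": "user",
                "content": "Is the car in this image red? Answer only yes or no.",
                "images": [path],
            }],
            options={"temperature": 0, "top_k": 1, "num_ctx": 4096},
        )
        if is_red(resp["message"]["content"]):
            red.append(path)
    return red
```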
qwen_next_gguf_when@reddit
unsloth/Qwen3-VL-4B-Instruct-GGUF
PhoneOk7721@reddit
3.5, you mean?
ashendonep@reddit (OP)
is it good for my gpu? also, for this task will it get the colors wrong or is it accurate?
qwen_next_gguf_when@reddit
Yes
ashendonep@reddit (OP)
ty
Mashic@reddit
Step 1: Rescale the images to a smaller size using ffmpeg/pyvips/pillow. Step 2: Use Qwen3-VL-8B with a python script to batch process them.
I bet you could do 5000 images in 1.3 hours at most. For example, I used it to OCR hardcoded subtitles in a video, about 700 lines in maybe 20 minutes.
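(Step 1 above can be sketched with Pillow; the 512px cap, paths, and function name below are assumptions, not from the comment. `Image.thumbnail` resizes in place and preserves aspect ratio.)

```python
# Sketch of step 1 with Pillow (pip install pillow): downscale every
# image so the VLM has far fewer pixels to encode. The 512px cap and
# the directory layout are my own choices.
from pathlib import Path
from PIL import Image

def downscale(src_dir: str, dst_dir: str, max_side: int = 512) -> int:
    """Resize each image so its longest side is <= max_side; returns count."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    n = 0
    for p in Path(src_dir).iterdir():
        if p.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        with Image.open(p) as im:
            im.thumbnail((max_side, max_side))  # in place, keeps aspect ratio
            im.convert("RGB").save(out / (p.stem + ".jpg"), quality=90)
        n += 1
    return n
```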
ashendonep@reddit (OP)
wow, looks like a good solution, for sure i will try it, ty
Top-Rub-4670@reddit
Don't use 8B, 4B is perfectly capable of telling car colors and will go at twice the speed.
Nick-Sanchez@reddit
You're trying to use an enormous lathe to convert a tree trunk into a toothpick. If you have to do it with an LLM, try some qwen3-vl model, the smallest one you can find, although a python script with a simpler YOLO model would be ideal.
ashendonep@reddit (OP)
i'm not forced to do it with an llm. what are the better options? you said python script, should i search for that or is there something more efficient?
Nick-Sanchez@reddit
Tell the qwen 9b model to help you design something lighter! It's a decent coding model, I'm sure it can assist you with that. For example:
CLIP is an "image-text" model that is much smaller and faster than a 9B VLM, but smart enough to understand colors. We use YOLO to find the cars first, then let CLIP "look" at the crops to confirm which ones are red.
The "Find & Confirm" Script
This script uses YOLO to find the objects and CLIP to verify the color.
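(The script itself didn't survive the copy. Below is a sketch of the described find-and-confirm pipeline, assuming `ultralytics` for YOLO and Hugging Face `transformers` for CLIP; all function names, model checkpoints, and prompts are assumptions, not the commenter's code.)

```python
# Sketch of the "find & confirm" pipeline: YOLO finds car boxes, CLIP
# scores each crop against text prompts. Assumes `pip install
# ultralytics transformers torch pillow`; names are mine.

def pick_label(scores, labels):
    """Return the label whose CLIP similarity score is highest."""
    return labels[scores.index(max(scores))]

def find_red_cars(image_paths):
    import torch
    from PIL import Image
    from ultralytics import YOLO
    from transformers import CLIPModel, CLIPProcessor

    detector = YOLO("yolov8n.pt")  # tiny COCO model; class 2 = car
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompts = ["a red car", "a car that is not red"]

    red = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        for box in detector(img, classes=[2])[0].boxes:  # cars only
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            crop = img.crop((x1, y1, x2, y2))
            inputs = proc(text=prompts, images=crop,
                          return_tensors="pt", padding=True)
            with torch.no_grad():
                scores = clip(**inputs).logits_per_image[0].tolist()
            if pick_label(scores, prompts) == "a red car":
                red.append(path)
                break  # one red car is enough for this image
    return red
```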
Why this architecture is superior:
It's flexible, too: to find blue trucks instead, just change the YOLO class to 7 (truck) and the CLIP text to "a blue truck". By chaining these two models, you've created a specialized pipeline that is faster, cheaper, and more accurate than the "giant model" approach.
ashendonep@reddit (OP)
ty bro, for sure i will try this, and ty for putting that much effort into helping
comfyui_user_999@reddit
I bet something dumber and less resource intensive than a VLM would do this faster, but that may not be what you're looking for here.
ashendonep@reddit (OP)
i have no idea for real if there's something more efficient for doing that
New_Comfortable7240@reddit
Did you try Florence 2 for the first pass?
ashendonep@reddit (OP)
no i didn't , for sure i will try ty
Forward_Compute001@reddit
It takes you 2 weeks, not a decade, and it does the job basically...
buy a second 3060... it's a pretty good bang for the buck if you get them second hand.