How to improve response times for multimodal requests?

Posted by coolahavoc@reddit | LocalLLaMA

I am running Gemma 3 12B on my local computer. My prompt is about 1000 tokens of text + 3-4 images. My computer is just a regular AMD CPU (no GPU) + 64GB of DDR5 RAM, so understandably the response is slow. In particular, I've noticed that most of the time goes to just processing my input (prompt processing), before any tokens are generated.

My question is what hardware would help improve this:
1. Obviously a GPU would help - but what should I look for in a GPU to get better response times?
2. Would the newer AMD Ryzen™ AI 9 HX 370 APU help, or would I need to go for an AMD Ryzen AI Max+ 395 APU?
3. If I go for the AMD Ryzen™ AI 9 HX 370 APU, some PCs come with upgradeable DDR5 RAM (going up to 96GB), while others come with faster soldered LPDDR5x RAM - but with the caveat that the max RAM is capped at 64GB. I want to be able to run slightly larger models on it (e.g. Gemma 3 27B), but I'm not sure if I need to go for the LPDDR5x versions.
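For point 3, a rough way to reason about RAM speed is that token *generation* on these machines is usually memory-bandwidth-bound: every generated token streams the full set of active weights from RAM, so tokens/s is bounded by bandwidth ÷ model size. Here's a back-of-the-envelope sketch; the bandwidth and model-size figures are my own approximations (theoretical peaks and typical Q4 quantized sizes), not measured numbers, and real throughput is usually well below peak:

```python
GIB = 1024**3

def tokens_per_second(bandwidth_gb_s: float, model_size_gib: float) -> float:
    """Upper-bound tokens/s when generation is limited by memory bandwidth:
    each token requires reading all model weights once from RAM."""
    return (bandwidth_gb_s * 1e9) / (model_size_gib * GIB)

# Approximate Q4-quantized model sizes (assumptions, not exact file sizes).
gemma3_12b_q4 = 8.0   # GiB
gemma3_27b_q4 = 17.0  # GiB

# Approximate theoretical peak bandwidths (assumptions):
configs = {
    "Dual-channel DDR5-5600":                     89.6,
    "Ryzen AI 9 HX 370 w/ LPDDR5X (128-bit)":    120.0,
    "Ryzen AI Max+ 395 w/ LPDDR5X-8000 (256-bit)": 256.0,
}

for name, bw in configs.items():
    tps_12b = tokens_per_second(bw, gemma3_12b_q4)
    tps_27b = tokens_per_second(bw, gemma3_27b_q4)
    print(f"{name}: ~{tps_12b:.0f} tok/s (12B Q4), ~{tps_27b:.0f} tok/s (27B Q4)")
```

Note the caveat: this only bounds *generation* speed. The slow input/image processing you're seeing is compute-bound prompt processing, which RAM speed barely helps with - that's where a GPU (or a strong iGPU) makes the biggest difference.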