How to benchmark a local LLM

Posted by badabimbadabum2@reddit | LocalLLaMA | 1 comment

I'm planning to put AI on my website so that users can chat, summarize content, or do something else. To find a suitable GPU, I want to know how many users one GPU can serve simultaneously. For example, if 20 users ask the local LLM a question at the same time, how fast can a 4090 serve the output to all of them? Is there a test that can simulate user demand the way website load tests do? I might be able to do something with Locust, but it would need some work. Does anyone know?

1 Comment

ReddaHawk@reddit

For a similar scenario, I needed to test how the BGE-M3 embedding model performed as the number of concurrent users varied. To do this, I used the Python library Locust, which let me capture metrics such as latency. I ran the tests on different GPUs hosted on RunPod to find the one best suited to my purpose.
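
The same idea can be sketched without Locust, using only the Python standard library: fire N requests at once and record per-request latency. This is a rough illustration, not the setup used above. The URL, model name, and payload are assumptions (an OpenAI-style completion endpoint on localhost), and `send_request`, `timed_call`, and `run_benchmark` are hypothetical helper names; adapt them to whatever server you actually run.

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: endpoint URL, model name, and payload shape are placeholders.
URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "my-local-model",
    "prompt": "Summarize this paragraph in one sentence.",
    "max_tokens": 128,
}


def send_request(url=URL, payload=PAYLOAD, timeout=120):
    """POST one prompt and block until the full response has arrived."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def timed_call(request_fn):
    """Return the wall-clock latency (seconds) of a single request."""
    start = time.perf_counter()
    request_fn()
    return time.perf_counter() - start


def run_benchmark(request_fn, concurrent_users=20):
    """Send `concurrent_users` requests in parallel and summarize latency."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(timed_call, [request_fn] * concurrent_users))
    wall = time.perf_counter() - wall_start
    return {
        "users": concurrent_users,
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
        "requests_per_s": concurrent_users / wall,
    }

# Example (requires a running server at URL):
#   for users in (1, 5, 10, 20):
#       print(run_benchmark(send_request, users))
```

Stepping the user count up (1, 5, 10, 20) and watching where mean latency starts to climb gives a rough idea of how many simultaneous users one GPU can handle. Locust adds ramp-up, think time, and a web UI on top of this, so it is the better fit for longer or more realistic runs.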
