Blog: AI evals are becoming the new compute bottleneck

Posted by evijit@reddit | LocalLLaMA | 6 comments

Hi! I wanted to share my new blog post on the costs of running AI evals. We dig into how benchmarking frontier systems now routinely costs tens of thousands of dollars per run, why agent evals are especially unpredictable, and what the resulting concentration of validation authority means for the broader research community.

9gxa05s8fa8sh@reddit

I love AI research. The studies and benchmarks are awesome, and the best stuff isn't popular yet.

iMakeSense@reddit

What other best stuff isn't popular?

9gxa05s8fa8sh@reddit

I recently enjoyed the mapcoder-lite study: https://www.reddit.com/r/LocalLLaMA/comments/1symfop/study_2x_coding_performance_of_7b_model_without/

But 500 studies go up on arXiv a day, so you can have at it:

https://arxiv.org/list/cs/recent?skip=0&show=50

abnormal_human@reddit

Evals are brutal, and honestly one of the best arguments for local AI today, since they are a full-utilization, parallel task that can saturate a workstation while also doing valuable work.

lorddumpy@reddit

4xRTX6000

Just a cool $30,000+ lol

abnormal_human@reddit

Business. Expense. :)