bot_exe

Bullshit Benchmark - A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them
Posted by bot_exe@reddit | LocalLLaMA | 33 comments

Grok's system prompt censorship change about Musk and Trump has already degraded its performance.
Posted by bot_exe@reddit | LocalLLaMA | 62 comments

Preliminary LiveBench results for reasoning: o1-mini decisively beats Claude Sonnet 3.5
Posted by bot_exe@reddit | LocalLLaMA | 130 comments

GPT-4 Turbo takes first spot on LMSYS Chatbot Arena Leaderboard
Posted by bot_exe@reddit | LocalLLaMA | 35 comments