bot_exe
-
Bullshit Benchmark - A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them
Posted by bot_exe@reddit | LocalLLaMA | 33 comments
-
Grok's system prompt censorship change about Musk and Trump has already degraded its performance.
Posted by bot_exe@reddit | LocalLLaMA | 62 comments
-
Preliminary LiveBench results for reasoning: o1-mini decisively beats Claude Sonnet 3.5
Posted by bot_exe@reddit | LocalLLaMA | 130 comments
-
GPT-4 Turbo takes first spot on LMSYS Chatbot Arena Leaderboard
Posted by bot_exe@reddit | LocalLLaMA | 35 comments