There is a benchmark that kinda tests that. It’s called the bullshit benchmark. Basically, LLMs are given questions that don’t make sense in various ways, and their answers are judged on how much they pushed back versus bought in. Claude is in a league of its own when it comes to pushing back on nonsense questions.
https://petergpt.github.io/bullshit-benchmark/viewer/index.html
Yes, I saw that benchmark and honestly was not surprised by the results. It seems that Anthropic really focused on those issues, above and beyond what other labs did.