Researchers Say the Most Popular Tool for Grading AIs Unfairly Favors Meta, Google, OpenAI

Chatbot Arena is the most popular AI benchmarking tool, but new research says its scores are misleading and benefit a handful of the biggest companies.

The most popular method for ranking the world's best chatbots is flawed and frequently manipulated by powerful companies like OpenAI and Google to make their products seem better than they actually are, according to a new paper from researchers at the AI company Cohere, as well as Stanford, MIT, and other universities.

The researchers came to this conclusion after reviewing public data from Chatbot Arena (also known as LMArena and LMSYS), which facilitates benchmarking and maintains the leaderboard listing the best large language models, and after scraping Chatbot Arena and conducting their own testing. Chatbot Arena, meanwhile, has responded to the researchers' findings by saying that while it accepts some of the criticisms and plans to address them, some of the numbers the researchers presented are wrong and mischaracterize how Chatbot Arena actually ranks LLMs. The research was published just weeks after Meta was accused of gaming AI benchmarks with one of its recent models.
