The most popular method for measuring which chatbots are the best in the world is flawed and frequently manipulated by powerful companies like OpenAI and Google in order to make their products seem better than they actually are, according to a new paper from researchers at the AI company Cohere, as well as Stanford, MIT, and other universities.
The researchers came to this conclusion after reviewing data made public by Chatbot Arena (also known as LMArena and LMSYS), which facilitates benchmarking and maintains the leaderboard listing the best large language models, as well as by scraping Chatbot Arena and running their own tests. Chatbot Arena, meanwhile, has responded to the researchers’ findings by saying that while it accepts some criticisms and plans to address them, some of the numbers the researchers presented are wrong and mischaracterize how Chatbot Arena actually ranks LLMs. The research was published just weeks after Meta was accused of gaming AI benchmarks with one of its recent models.
If you’re wondering why this beef between the researchers, Chatbot Arena, and others in the AI industry matters at all, consider the fact that the biggest tech companies in the world, as well as a great number of lesser-known startups, are currently in a fierce competition to develop the most advanced AI tools, operating under the belief that these tools will define the future of humanity and enrich the most successful companies in this industry in a way that will make previous technology booms seem minor by comparison.
I should note here that Cohere is an AI company that produces its own models, which don’t appear to rank very highly on the Chatbot Arena leaderboard. The researchers also make the point that proprietary closed models from competing companies appear to have an unfair advantage over open-source models, and Cohere proudly boasts that its model Aya is “one of the largest open science efforts in ML to date.” In other words, the research is coming from a company that Chatbot Arena doesn’t benefit.
Judging which large language model is the best is tricky because different people use different AI models for different purposes and what counts as the “best” result is often subjective, but the desire to compete and compare these models has made the AI industry default to benchmarking, and specifically to Chatbot Arena, which gives a numerical “Arena Score” to models companies submit and maintains a leaderboard listing the highest-scoring models. At the moment, for example, Google’s Gemini 2.5 Pro is in the number one spot, followed by OpenAI’s o3, ChatGPT 4o, and xAI’s Grok 3.
The vast majority of people who use these tools probably have no idea the Chatbot Arena leaderboard exists, but it is a big deal to AI enthusiasts, CEOs, investors, researchers, and anyone who actively works or is invested in the AI industry. The leaderboard remains significant despite the fact that it has been criticized extensively over time for the reasons I list above. The stakes of the AI race are objectively very high in terms of the money being poured into this space and the amount of time and energy people are spending on winning it, and Chatbot Arena, while flawed, is one of the few places that’s keeping score.
“A meaningful benchmark demonstrates the relative merits of new research ideas over existing ones, and thereby heavily influences research directions, funding decisions, and, ultimately, the shape of progress in our field,” the researchers write in their paper, titled “The Leaderboard Illusion.” “The recent meteoric rise of generative AI models—in terms of public attention, commercial adoption, and the scale of compute and funding involved—has substantially increased the stakes and pressure placed on leaderboards.”
The way that Chatbot Arena works is that anyone can go to its site and type in a prompt or question. That prompt is then given to two anonymous models. The user can’t see which models they are, but in theory one could be ChatGPT while the other is Anthropic’s Claude. The user is then presented with the output from each model and votes for the one they think did a better job. Multiply this process by millions of votes and that’s how Chatbot Arena determines which model is placed where on the leaderboard. DeepSeek, the Chinese AI model that rocked the industry when it was released in January, is currently ranked #7 on the leaderboard, and its high score was part of the reason people were so impressed.
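To make the mechanics a little more concrete, here is a minimal sketch of how pairwise votes like these can be turned into a single rating, using a simple Elo-style update. This is illustrative only; Chatbot Arena’s actual Arena Score calculation may differ in its details, and the model names, starting rating, and K-factor below are hypothetical.

```python
from collections import defaultdict

# Minimal sketch of an Elo-style rating built from pairwise votes.
# Illustrative only: Chatbot Arena's actual Arena Score method may differ.
# Model names, the starting rating, and K are hypothetical.
K = 4  # small update step, since each model sees many battles

def expected_win(rating_a, rating_b):
    """Modeled probability that A beats B given current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def score_battles(battles, start=1000.0):
    """battles: iterable of (model_a, model_b, winner), winner is 'a' or 'b'."""
    ratings = defaultdict(lambda: start)
    for a, b, winner in battles:
        e_a = expected_win(ratings[a], ratings[b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Three anonymous head-to-head votes between two hypothetical models
print(score_battles([("model_x", "model_y", "a"),
                     ("model_y", "model_x", "b"),
                     ("model_x", "model_y", "a")]))
```

In practice the same idea is applied over millions of votes, so any individual battle moves a model’s rating only slightly.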
According to the researchers’ paper, the biggest problem with this method is that Chatbot Arena allows the biggest companies in this space, namely Google, Meta, Amazon, and OpenAI, to run “undisclosed private testing” and cherry-pick their best model. The researchers said their systematic review of Chatbot Arena involved combining data sources encompassing 2 million “battles” and auditing 42 providers and 243 models between January 2024 and April 2025.
“This comprehensive analysis reveals that over an extended period, a handful of preferred providers have been granted disproportionate access to data and testing,” the researchers wrote. “In particular, we identify an undisclosed Chatbot Arena policy that allows a small group of preferred model providers to test many model variants in private before releasing only the best-performing checkpoint.”
Basically, the researchers claim that companies test multiple versions of their LLMs on Chatbot Arena to find which ones score best, without those tests counting toward their public score, and then submit only the best-performing variant to the public leaderboard.
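The statistical issue here is a simple selection effect: scores measured from a finite number of battles are noisy, so the best of many privately tested variants will tend to look better than any single variant really is. The toy simulation below uses made-up win rates and battle counts purely to show the trend; the 27-variant case echoes the Meta figure cited in the paper.

```python
import random

# Toy simulation of the selection effect the researchers describe: a provider
# privately scores N variants of identical true quality and publishes only the
# best one. All numbers here are hypothetical, chosen only to show the trend.
random.seed(0)

TRUE_WIN_RATE = 0.50        # every variant is genuinely average
BATTLES_PER_VARIANT = 500   # private battles used to score each variant
TRIALS = 200                # repeat to average out noise

def observed_win_rate():
    """Win rate measured from a finite, noisy sample of private battles."""
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(BATTLES_PER_VARIANT))
    return wins / BATTLES_PER_VARIANT

for n_variants in (1, 5, 27):
    avg_best = sum(max(observed_win_rate() for _ in range(n_variants))
                   for _ in range(TRIALS)) / TRIALS
    print(f"{n_variants:>2} private variants -> average best observed win rate: {avg_best:.3f}")
```

The variants’ true quality never changes in this simulation; only the number of tries does, which is the advantage the paper says undisclosed private testing confers.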
Chatbot Arena says the researchers’ framing here is misleading.
“We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly,” it said on X.
“In a single month, we observe as many as 27 models from Meta being tested privately on Chatbot Arena in the lead up to Llama 4 release,” the researchers said. “Notably, we find that Chatbot Arena does not require all submitted models to be made public, and there is no guarantee that the version appearing on the public leaderboard matches the publicly available API.”
In early April, when Meta’s model Maverick shot up to the second spot on the leaderboard, users were confused because they didn’t find it to be better than other models that ranked below it. As TechCrunch noted at the time, that might be because Meta used a slightly different version of the model “optimized for conversationality” on Chatbot Arena than the one users had access to.
“We helped Meta with pre-release testing for Llama 4, like we have helped many other model providers in the past,” Chatbot Arena said in response to the research paper. “We support open-source development. Our own platform and analysis tools are open source, and we have released millions of open conversations as well. This benefits the whole community.”
The researchers also claim that makers of proprietary models, like OpenAI and Google, collect far more data from their testing on Chatbot Arena than makers of fully open-source models, which allows them to better fine-tune their models to what Chatbot Arena users want.
That last part on its own might be the biggest problem with Chatbot Arena’s leaderboard in the long term, since it incentivizes the people who create AI models to design them in a way that scores well on Chatbot Arena, as opposed to in a way that makes them materially better and safer for users in a real-world environment.
As the researchers write: “the over-reliance on a single leaderboard creates a risk that providers may overfit to the aspects of leaderboard performance, without genuinely advancing the technology in meaningful ways. As Goodhart’s Law states, when a measure becomes a target, it ceases to be a good measure.”
Despite their criticism, the researchers acknowledge Chatbot Arena’s contribution to AI research and that it serves a need, and their paper ends with a list of recommendations on how to make it better, including preventing companies from retracting scores after submission and being more transparent about which models undergo private testing and how much of it they do.
“One might disagree with human preferences—they’re subjective—but that’s exactly why they matter,” Chatbot Arena said on X in response to the paper. “Understanding subjective preference is essential to evaluating real-world performance, as these models are used by people. That’s why we’re working on statistical methods—like style and sentiment control—to decompose human preference into its constituent parts. We are also strengthening our user base to include more diversity. And if pre-release testing and data helps models optimize for millions of people’s preferences, that’s a positive thing!”
“If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly,” it added. “Every model provider makes different choices about how to use and value human preferences.”