Chatbots may be able to pass medical exams, but that doesn’t mean they make good doctors, according to a new, large-scale study of how people get medical advice from large language models.
The controlled study of 1,298 UK-based participants, published today in Nature Medicine by researchers at the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, tested whether LLMs could help people identify underlying conditions and suggest useful courses of action, like going to the hospital or seeking treatment. Participants were randomly assigned an LLM (GPT-4o, Llama 3, or Cohere’s Command R+) or were told to use a source of their choice to “make decisions about a medical scenario as though they had encountered it at home,” according to the study. The scenarios ranged from “a young man developing a severe headache after a night out with friends,” for example, to “a new mother feeling constantly out of breath and exhausted,” the researchers said.
When the researchers tested the LLMs without involving users, by providing the models with the full text of each clinical scenario, the models correctly identified conditions in 94.9 percent of cases. But when talking to the participants about those same conditions, the LLMs identified relevant conditions in less than 34.5 percent of cases. People didn’t know what information the chatbots needed, and in some scenarios the chatbots offered multiple diagnoses and courses of action. Knowing what questions to ask a patient, and what information might be withheld or missing during an examination, are nuanced skills that distinguish great human physicians; based on this study, chatbots can’t reliably replicate that kind of care.
In some cases, the chatbots also generated information that was simply wrong or incomplete: fixating on irrelevant elements of participants’ inputs, giving a partial US phone number to call, or suggesting they call the Australian emergency number.
“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”
“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” Dr. Rebecca Payne, lead medical practitioner on the study, said in a press release. “Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.”

Last year, 404 Media reported on AI chatbots hosted by Meta that posed as therapists, providing users with fake credentials like license numbers and educational backgrounds. Following that reporting, almost two dozen digital rights and consumer protection organizations sent a complaint to the Federal Trade Commission urging regulators to investigate Character.AI and Meta’s “unlicensed practice of medicine facilitated by their product,” through therapy-themed bots that claim to have credentials and confidentiality “with inadequate controls and disclosures.” A group of Democratic senators also urged Meta to investigate and limit the “blatant deception” of its chatbots that lie about being licensed therapists, and 44 attorneys general signed an open letter to 11 chatbot and social media companies urging them to see their products “through the eyes of a parent, not a predator.”
In January, OpenAI announced ChatGPT Health, “a dedicated experience that securely brings your health information and ChatGPT’s intelligence together, to help you feel more informed, prepared, and confident navigating your health,” the company said in a blog post. “Over two years, we’ve worked with more than 260 physicians who have practiced in 60 countries and dozens of specialties to understand what makes an answer to a health question helpful or potentially harmful—this group has now provided feedback on model outputs over 600,000 times across 30 areas of focus,” the company wrote. “This collaboration has shaped not just what Health can do, but how it responds: how urgently to encourage follow-ups with a clinician, how to communicate clearly without oversimplifying, and how to prioritize safety in moments that matter.”
“In our work, we found that none of the tested language models were ready for deployment in direct patient care. Despite strong performance from the LLMs alone, both on existing benchmarks and on our scenarios, medical expertise was insufficient for effective patient care,” the researchers wrote in their paper. “Our work can only provide a lower bound on performance: newer models, models that make use of advanced techniques from chain of thought to reasoning tokens, or fine-tuned specialized models, are likely to provide higher performance on medical benchmarks.” The researchers recommend that developers, policymakers, and regulators test LLMs with real human users before deploying them in the future.