
AI chatbots deliver flawed health advice nearly half the time, in confident language that sounds convincingly right and could harm users who trust it blindly.
Story Snapshot
- BMJ Open study finds 49.6% of AI responses to 50 health questions problematic, including 19.6% rated highly problematic.
- Tested five chatbots (Gemini, DeepSeek, Meta AI, ChatGPT, Grok) on cancer, vaccines, stem cells, nutrition, and athletic performance.
- Responses were confident but lacked caveats; citations were incomplete (40% median completeness score); chatbots fabricated answers rather than admit ignorance.
- 1 in 4 U.S. adults use AI for health advice, per a KFF poll, with low rates of doctor follow-up.
- Grok performed worst; Gemini best among tested models.
Study Tests AI on Misinformation-Prone Health Queries
Researchers posed 50 questions to Google’s Gemini, DeepSeek, Meta AI, ChatGPT, and Grok in February 2025, spanning five categories rife with misinformation: cancer, vaccines, stem cells, nutrition, and athletic performance. BMJ Open published the results last week, ahead of March 2025 media coverage. Independent experts rated 30% of responses somewhat problematic and 19.6% highly problematic, a combined problematic rate of 49.6%.
The chatbots refused to answer only 0.8% of the 250 queries, preferring to fabricate rather than decline. Nutrition scored worst (z-score +4.35), followed by athletic performance (+3.74); vaccines scored best (–2.57). Responses often lacked balance, omitting counter-evidence. Lead author Nick Tiller defined "problematic" as potentially harmful when presented without full evidence context, and experts warn that self-diagnosis without doctor verification risks real patient harm.
AI Chatbots Confidently Deliver Flawed Advice
The chatbots presented errors with unwavering certainty, skipping caveats and sources; median citation completeness was just 40%. Lacking real-time data, they rely on training patterns prone to hallucinations, meaning fabricated facts. Yale’s Lee Schwamm observed that the chatbots were wrong but never in doubt, and Tiller noted that AI tends to trust summaries over primary sources. That overconfidence becomes dangerous when users skip professional verification.
A separate, earlier test found AI failed 80% of differential-diagnosis tasks, the symptom-analysis process doctors use, and pre-2025 warnings flagged similar issues. The KFF poll finds 1 in 4 U.S. adults seek AI health advice, yet only 58% follow up with a doctor for physical issues and just 42% for mental health concerns.
Stakeholders Face Scrutiny Over AI Health Risks
Nick Tiller led the study for BMJ Group, whose British Medical Association-owned journal BMJ Open published it. The developers (xAI for Grok, plus Google, Meta, OpenAI, and DeepSeek) have prioritized usability over accuracy safeguards. Grok showed the highest rate of highly problematic answers; Gemini had the fewest issues overall. The researchers seek public safety through oversight, and although the developers compete fiercely, the study critiqued all models equally, playing no favorites.
Here, academic credibility trumps commercial hype: Tiller warns that highly problematic answers could cause direct harm, and Schwamm has amplified the findings in the media. No conflicts of interest were noted, and the power to expose flaws lies with the publishers. Everyday Health’s coverage tied in the KFF poll data, underscoring widespread use despite the dangers.
Implications Demand Caution and Regulation
Short-term, the study should curb blind reliance on AI and reduce self-diagnosis pitfalls. Long-term, unchecked spread could amplify misinformation in vulnerable areas like cancer and vaccines, and mental health users, with the lowest follow-up rates, face the greatest risk. Social trust in digital tools erodes, political calls for education and oversight grow, and industry pressure mounts for disclaimers and better citations amid rapid evolution.
Caveats apply: adversarial testing may slightly inflate the problem rate, and a sample of only five chatbots limits scope, especially as AI advances fast. Still, the facts align: treat AI as a supplement, not a substitute. Patients in high-risk groups should prioritize doctors to avoid harm from convincing fictions.
Sources:
Poll shows 50% of AI health query answers are problematic
AI chatbots provide poor answers to medical questions half the time, study finds
Asking AI health questions? Use with caution, researchers say
1 In 2 AI Medical Responses Flagged as Problematic In New Analysis
Half of AI Health Answers Are Wrong Even Though They Sound Convincing: New Study