Our tools connect everyday people to clinical trial findings.
We don’t store anything you paste. Submissions run through the evidence layer in the moment and aren’t persisted, logged against an identity, or sent anywhere a third party can re-identify you. No account is required to try it.
We don’t tell you what to do. We surface what clinical trials actually said about people in situations like yours — and where the evidence runs out — so you and your clinician can decide. This is not a substitute for clinical care.
Think Carfax for used cars — but for the medical decisions you’re about to make.
OpenAI grades every flagship release against its own HealthBench medical benchmark — criteria written by 262 physicians across roughly 48,000 rubric points. As of 2026, the picture across their published numbers:
| Model | HealthBench variant | Score |
|---|---|---|
| GPT-5 | HealthBench Hard | 46.2% |
| GPT-5.5 | HealthBench Hard | 31.5% |
| GPT-5.5 | HealthBench Consensus | 95.6% |
| GPT-5.5 | HealthBench Professional | 51.8% |
| ChatGPT for Clinicians (GPT-5.4) | HealthBench Professional | 59.0% |
| Physicians (baseline) | HealthBench Professional | 43.7% |
The original May 2025 paper reported a single rubric-satisfaction rate: o3 = 60%, GPT-4.1 = 48%, o1 = 42%, GPT-4o = 32%, GPT-3.5 = 16%. HealthBench has since been split into the Hard, Consensus, and Professional variants above.
OpenAI’s headline medical claim in 2026 is that ChatGPT for Clinicians beats physicians on HealthBench Professional (59.0 vs 43.7). That number is graded against the benchmark we audit — so the marquee medical-AI claim of the year rides on a scoring rubric whose reliability is itself measurable.
In our audit of HealthBench’s doctor-written gold answers themselves, we found decision-changing errors in 3 of the first 110 claims audited (roughly 3%). A fourth, triple-source-verified fabrication was added 2026-05-29 but is not yet reflected in that count. AI models are not the only ones that can be wrong: the doctors writing the benchmark can be wrong too, and their errors propagate into every model graded against it.
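The arithmetic behind that audit figure can be sketched as follows. The Wilson score interval is an illustrative addition, not a number from the audit itself, and it assumes the 110 audited claims behave like an independent random sample of the gold answers, which may not hold.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Figures stated in the text: 3 decision-changing errors in 110 audited claims.
findings, audited = 3, 110
rate = findings / audited
low, high = wilson_interval(findings, audited)
print(f"{rate:.1%} observed; 95% Wilson interval {low:.1%} to {high:.1%}")
```

With only 110 claims audited, the interval is wide, so the observed rate is best read as an early estimate that will tighten as the audit grows.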
Sources: HealthBench paper (May 2025) · HealthBench Professional paper · OpenAI: Introducing GPT-5 · OpenAI: Introducing GPT-5.5 · GPT-5.5 System Card · TechRepublic: GPT-5 medical benchmarks · Vellum: GPT-5 benchmarks · BenchLM: GPT-5.5 benchmarks 2026
How is our approach different?
What
Doctors and patients make decisions using medical AI claims based on broad summaries of clinical-trial studies.
Why
Broad medical AI summaries run the risk of overgeneralizing and overlooking granular details of clinical-trial findings that are pertinent to your unique personal context.
How
We run deterministic queries to cross-check clinical-trial claims against your personal context, composed via AI elicitation.
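A deterministic cross-check of this kind might look like the minimal sketch below. It is not our production implementation: the `Criterion` class, the `check` and `applicability_report` functions, the field names, and the example values are all hypothetical and chosen only to show the shape of the idea. The same inputs always produce the same report, and a missing value is reported as unknown rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    """One numeric bound describing a trial's enrolled population (hypothetical)."""
    field: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def check(criterion: Criterion, patient: dict) -> str:
    """Return 'inside', 'outside', or 'unknown' for a single criterion."""
    value = patient.get(criterion.field)
    if value is None:
        return "unknown"  # the evidence runs out here; do not guess
    if criterion.minimum is not None and value < criterion.minimum:
        return "outside"
    if criterion.maximum is not None and value > criterion.maximum:
        return "outside"
    return "inside"


def applicability_report(criteria: list, patient: dict) -> dict:
    """Compare a patient's context against every criterion, with no model in the loop."""
    return {c.field: check(c, patient) for c in criteria}


# Invented values for illustration only; they do not describe a real trial.
example_criteria = [
    Criterion("age_years", minimum=40, maximum=75),
    Criterion("egfr", minimum=30),
]
example_patient = {"age_years": 82}
report = applicability_report(example_criteria, example_patient)
print(report)
```

In this sketch the AI's role would be limited to eliciting the patient context and the criteria; the comparison itself is plain, repeatable logic.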
Why this matters now — two recent peer-reviewed studies on how clinicians and patients are already using medical AI:
Clinician-side: OpenEvidence accounted for 98.7% of searches across leading AI-enabled clinical reference tools, with traffic rising to ~1.59 million visits/month by June 2025. [1]
Patient-side: In a study of 617,827 Microsoft Copilot conversations, roughly 1 in 5 involved personal symptom assessment or condition discussion. Microsoft explicitly notes that benchmark performance does not predict real-world reliability for high-stakes health questions. [2]
References
[1] Patel VR, Liu M, Jena AB. Public Interest in an AI-Enabled Clinical Decision Support Tool. JAMA Network Open, Nov 20, 2025.
[2] Costa-Gomes B, Tolmachev P, et al. (Microsoft AI). Public use of a generalist LLM chatbot for health queries. Nature Health, April 16, 2026.
No existing service does this directly. Adjacent services handle pieces of the workflow (clinician-side evidence Q&A, patient-side AI triage, trial matching for enrollment), but none combines patient-supplied health data, a structured clinical-trial corpus, and a personalized applicability audit.
| Service | What they do | What they’re missing |
|---|---|---|
| OpenEvidence, UpToDate AI, AMBOSS AI, ReachRx | Clinician-side evidence Q&A | No patient-data ingestion, no applicability layer |
| Glass Health | Clinician CDS with FHIR context | Clinician-only — not patient-side audit |
| Hippocratic AI, Ada, K Health | Patient-facing AI agents (triage, intake, care management) | No structured trial backend |
| Deep 6 AI, TrialFit, TrialMatchAI | Trial matching for enrollment | Opposite direction — get into trials, not audit existing care |
| ChatGPT, Claude, Perplexity, Consensus, Elicit | General AI medical Q&A / paper search | No structured trial extraction, no patient-data ingestion |
| Cleveland Clinic Express Care, Mayo Clinic 2nd Opinion | Human clinician second opinions | Human-mediated, expensive, not data-driven |
Closest in shape: Glass Health (clinician-side). Closest in audience: Hippocratic AI (B2B health-system patient agents). What’s empty: the patient-side evidence-audit lane.
Long-form essays on the problem we’re working on, the medical AI landscape, and the open-source tooling we ship for builders:
Open-source: evidence-to-person-eval on GitHub — eval-driven design of FHIR Evidence representations, aligned with EBMonFHIR. v0.0.3, design phase.
Science writer and breast cancer survivor. Keeps us creative and grounded.
No B.S. Med retrieves and structures clinical-study evidence. It does not diagnose, prescribe, or replace professional medical judgment. Users should consult a qualified healthcare professional before making medical decisions.