The Evidence-to-Person Fit Problem

Evidence-to-Person Fit asks whether clinical-trial findings actually apply to the person, condition, intervention, comparator, outcome, and care context in question.

Just as Product-Market Fit measures how well a product matches its market, Evidence-to-Person Fit measures how well medical evidence matches the patient asking. A treatment recommendation is only useful if the evidence behind it applies to the person receiving it.

The mismatch happens whether you query AI directly, typing into ChatGPT or Claude yourself, or indirectly, through clinician AI tools (OpenEvidence, UpToDate Expert AI) that shape your doctor’s care plan. Either path can produce overgeneralized advice or miss findings that matter for your situation. Due diligence is what surfaces those facts, the ones that might otherwise slip under your care team’s radar.

Diagram: Medical AI matches two artifacts (clinical-trial findings and a patient FHIR Bundle) through a matching process. If granular trial distinctions are preserved, the result is personalized advice and a care plan. If not, four patient-impact risks emerge: Safety x Overgeneralize (false positive), Safety x Overlook (false negative), Efficacy x Overgeneralize (false positive), Efficacy x Overlook (false negative).

Caption: Whether Medical AI’s matching preserves granular trial distinctions decides between a personalized care plan and four patient-impact risks.

The four patient-impact risks, illustrated

Using No B.S. Med, we identify two error types:

Precision errors — the answer overgeneralizes a cited study, treating it as more personally applicable than the trial’s population or context supports.
Recall errors — the answer overlooks relevant clinical-trial evidence, citing no study and leaving a high-stakes decision unsupported.

Across Safety and Efficacy, that’s four cells — one representative scenario each:

	AI overgeneralizes (loose precision)	AI overlooks (loose recall)
Safety	A sleep-aid trial reports tolerability in adults 18–65, so it enrolled no one over 65. AI applies the result to a 92-year-old at high fall risk.	A query for “kava + anxiety” returns benefit trials, but misses adverse-event reports of panic episodes in patients with comorbid ADHD.
Efficacy	A GLP-1 agonist trial showed weight loss in patients with BMI ≥ 30 and type 2 diabetes. AI applies the same expected efficacy to a BMI 26 user without diabetes.	A user asks about CBT for depression. A narrow query misses CBT trials for PTSD, anxiety, and chronic pain — adjacent conditions where the same intervention is tested.

These are hypothetical scenarios — not documented incidents about specific products. They illustrate the failure space the rest of this post is about preventing.
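
The two error types crossed with the two domains can be written down as a small lookup. This is a minimal sketch in Python; the names `Domain`, `ErrorType`, and `risk_cell` are illustrative assumptions, not part of any published API.

```python
from enum import Enum


class Domain(Enum):
    SAFETY = "Safety"
    EFFICACY = "Efficacy"


class ErrorType(Enum):
    OVERGENERALIZE = "Overgeneralize"  # precision error (false positive)
    OVERLOOK = "Overlook"              # recall error (false negative)


def risk_cell(domain: Domain, error: ErrorType) -> str:
    """Label one of the four patient-impact risk cells."""
    kind = "false positive" if error is ErrorType.OVERGENERALIZE else "false negative"
    return f"{domain.value} x {error.value} ({kind})"


# Enumerate all four cells, row by row, in the same order as the table above.
ALL_CELLS = [risk_cell(d, e) for d in Domain for e in ErrorType]
```

Keeping the cell label as a pure function of domain and error type means a scorecard can tag every failure with exactly one of the four cells.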

Worked example: from a vague question to a patient-grounded answer

A user asks an AI: “Should I take a statin for my LDL of 145?”

Without the MCP (our Model Context Protocol server): The AI searches the literature, surfaces an atorvastatin RCT showing a ~32% LDL reduction, and answers “Yes, statins safely lower LDL.” The cited trial happened to exclude women of childbearing potential, but that fact was stripped in the summary. The user (32, actively trying to conceive) gets a wrong-for-her answer.

Round 1: query, then inspect what's missing

The AI calls our MCP with the user's question. The MCP returns matched findings plus a slot-completeness report showing which PICO slots (Population, Intervention, Comparator, Outcome — the canonical four-part clinical-question framework) are under-specified given the candidate findings:

{
  "findings_count": 12,
  "underspecified_slots": {
    "Population": [
      "age", "sex / childbearing status",
      "comorbidities (diabetes, renal, hepatic)",
      "concurrent medications"
    ],
    "Outcome": ["LDL reduction vs cardiovascular event reduction vs mortality"],
    "Comparator": ["placebo, other statin, ezetimibe, lifestyle"]
  },
  "candidate_subgroups_in_findings": [
    "primary-prevention adults",
    "post-MI patients",
    "diabetic patients",
    "women of childbearing potential — EXCLUDED in 11/12 trials",
    "elderly (>75)"
  ]
}

The AI uses this to ask the user clarifying questions:

“I see 12 statin trials. To narrow to findings that fit you specifically — your age and sex? Are you pregnant or trying to conceive? Diabetes, kidney/liver disease, other meds? Your goal: lower LDL, prevent cardiac events, or both?”
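
One way a slot-completeness report like the one in Round 1 could be derived: a slot is under-specified when the question leaves it open and the candidate findings disagree on it. The sketch below assumes a flat dictionary per finding and a hypothetical `underspecified_slots` helper; the real report format may differ.

```python
PICO_SLOTS = ("Population", "Intervention", "Comparator", "Outcome")


def underspecified_slots(query: dict, findings: list[dict]) -> dict[str, list[str]]:
    """Return PICO slots the query leaves open while the findings disagree on them."""
    report = {}
    for slot in PICO_SLOTS:
        if query.get(slot):  # the user already pinned this slot
            continue
        # Distinct values seen across candidate findings, in first-seen order.
        seen = list(dict.fromkeys(f[slot] for f in findings if slot in f))
        if len(seen) > 1:  # findings differ, so the answer depends on this slot
            report[slot] = seen
    return report


# Illustrative data only: two made-up findings that differ in comparator.
query = {"Intervention": "statin"}
findings = [
    {"Intervention": "statin", "Comparator": "placebo", "Outcome": "LDL reduction"},
    {"Intervention": "statin", "Comparator": "ezetimibe", "Outcome": "LDL reduction"},
]
report = underspecified_slots(query, findings)
```

Here only `Comparator` is reported, because both findings share the same outcome and the intervention is already fixed by the question.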

Round 2: re-query with completed PICO slots

The user answers. Patient context now structured (partial FHIR):

{
  "Patient": { "gender": "female", "age": 32 },
  "Condition": [
    { "code": "Z31.41", "display": "Encounter for procreative counseling" },
    { "code": "E78.0", "display": "Pure hypercholesterolemia" }
  ],
  "Goal": "LDL reduction; safety priority"
}

The MCP re-queries the Finding IR (the intermediate representation of each parsed trial finding) against the patient's eligibility profile:

{
  "finding_id": "pmc-7xx-finding-3",
  "intervention": { "name": "atorvastatin", "dose": "20mg" },
  "comparator": { "name": "placebo" },
  "outcome": { "name": "LDL reduction", "effect": "−32%", "p_value": "<0.001" },
  "eligibility": {
    "exclusion": [
      "Pregnancy",
      "Women of childbearing potential not on contraception"
    ]
  }
}

Match result: applicability mismatch on Z31.41 (procreative counseling) vs eligibility.exclusion. Atorvastatin findings flagged as not-applicable to this patient.
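
The match step can be sketched as a set intersection between the patient's condition codes and the codes implied by each exclusion criterion. The mapping table `EXCLUSION_TO_CODES` and the function `match_finding` are assumptions made for illustration; a real system would need a curated, clinically reviewed terminology mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchResult:
    applicable: bool
    reasons: tuple[str, ...]


# Hypothetical mapping from exclusion-criterion text to patient condition codes.
EXCLUSION_TO_CODES = {
    "Pregnancy": {"Z33.1"},
    "Women of childbearing potential not on contraception": {"Z31.41"},
}


def match_finding(finding: dict, patient: dict) -> MatchResult:
    """Flag a finding as not applicable if any exclusion criterion hits the patient."""
    codes = {c["code"] for c in patient.get("Condition", [])}
    reasons = []
    for criterion in finding.get("eligibility", {}).get("exclusion", []):
        hits = codes & EXCLUSION_TO_CODES.get(criterion, set())
        for code in sorted(hits):
            reasons.append(f"{code} vs eligibility.exclusion: {criterion}")
    return MatchResult(applicable=not reasons, reasons=tuple(reasons))


patient = {"Condition": [{"code": "Z31.41"}, {"code": "E78.0"}]}
finding = {
    "finding_id": "pmc-7xx-finding-3",
    "eligibility": {
        "exclusion": [
            "Pregnancy",
            "Women of childbearing potential not on contraception",
        ]
    },
}
result = match_finding(finding, patient)
```

Because the check is a plain lookup with no model call, the same patient and finding always produce the same verdict, and the `reasons` tuple records which criterion triggered it.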

The MCP also surfaces alternative findings from the structured layer that DO match this patient profile: ezetimibe trials enrolling women of reproductive age, bile-acid sequestrants, lifestyle-only interventions.

The AI's grounded response

“Standard statin trials (including the one most often cited) excluded women trying to conceive, so statins shouldn't be your first option. Evidence that DOES match your situation: ezetimibe trials in women of reproductive age show ~18% LDL reduction with no fetal-safety signals. Lifestyle interventions (Mediterranean diet, soluble fiber, plant sterols) have separate trial evidence in this population. Discuss with your clinician, and revisit statins post-pregnancy if LDL stays elevated.”

What this required structurally

An IR with explicit PICO slots so the MCP can detect which slots are under-specified.
An open Finding IR with parsed eligibility.exclusion per finding (not stripped in summary).
Patient context as structured FHIR so the MCP can match patient state against eligibility deterministically.
A multi-turn API contract — the MCP can return findings and a refinement report, not just a flat answer.
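
The multi-turn contract in the last item might look like the following response shape, where findings and a refinement report travel together and the caller decides whether to ask or answer. The class and field names are hypothetical, not the actual MCP interface.

```python
from dataclasses import dataclass, field


@dataclass
class RefinementReport:
    underspecified_slots: dict[str, list[str]] = field(default_factory=dict)

    @property
    def needs_clarification(self) -> bool:
        # True when at least one slot still lists open sub-questions.
        return any(self.underspecified_slots.values())


@dataclass
class QueryResponse:
    findings: list[dict]
    refinement: RefinementReport


def next_step(response: QueryResponse) -> str:
    """Decide whether the caller should ask the user more or answer now."""
    return "ask_user" if response.refinement.needs_clarification else "answer"


round_1 = QueryResponse(
    findings=[{} for _ in range(12)],
    refinement=RefinementReport({"Population": ["age", "sex / childbearing status"]}),
)
round_2 = QueryResponse(findings=[{}], refinement=RefinementReport())
```

With this shape, `next_step(round_1)` asks for clarification and `next_step(round_2)` proceeds to an answer.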

What makes person-to-finding fit testable

Treating the artifact as an IR, not a schema, exposes three distinct test surfaces. Each can fail independently; each needs its own ground truth. (Some of this is shipped today, some is roadmap; we'll mark which as the test harness opens up.)

Tier 1 — Parser fidelity: paper → IR

Did extraction translate the natural-language paper into the IR correctly? Population stated as "adults 18–75 with LDL ≥ 130" — did that land in the right slots, with the right bounds, without dropping eligibility.exclusion for pregnancy? Parser tests pin a paper to its expected IR (the test fixture) and assert the extractor reproduces it. Failures here include hallucinated subgroups, dropped exclusion criteria, misaligned comparators, and lost follow-up windows.
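
A parser-fidelity check can be sketched as a recursive diff between the fixture IR and the extractor's output, labelling each discrepancy as dropped, hallucinated, or mismatched. The slot names and the `diff_ir` helper below are illustrative assumptions, not the harness's actual format.

```python
def diff_ir(expected: dict, actual: dict, path: str = "") -> list[str]:
    """List every slot where the extracted IR departs from the fixture."""
    problems = []
    for key in sorted(set(expected) | set(actual)):
        here = f"{path}.{key}" if path else key
        if key not in actual:
            problems.append(f"dropped: {here}")
        elif key not in expected:
            problems.append(f"hallucinated: {here}")
        elif isinstance(expected[key], dict) and isinstance(actual[key], dict):
            problems.extend(diff_ir(expected[key], actual[key], here))
        elif expected[key] != actual[key]:
            problems.append(f"mismatch: {here}")
    return problems


# Fixture for "adults 18-75 with LDL >= 130", pregnancy excluded.
expected = {
    "population": {"age_min": 18, "age_max": 75, "ldl_min": 130},
    "eligibility": {"exclusion": ["Pregnancy"]},
}
# A faulty extraction: wrong upper bound, invented subgroup, lost exclusion.
actual = {
    "population": {"age_min": 18, "age_max": 65, "ldl_min": 130, "subgroup": "smokers"},
    "eligibility": {},
}
problems = diff_ir(expected, actual)
```

A passing parser test is simply `diff_ir(expected, actual) == []`; a failing one names the exact slot, which is what lets a contributor point at a specific study and say what was parsed wrong.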

Tier 2 — Question alignment: user question → IR query

Did the user's natural-language question get translated into the right query over the IR? "Can I take this safely while pregnant?" needs to compile to a check against eligibility.exclusion and adverse-event slots, not a fuzzy keyword match on the outcome field. Alignment tests pin a question to the expected IR-query (slots, predicates, bounds) and assert the question compiler produces it. A perfect IR queried wrongly returns the wrong findings.
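
An alignment test can be sketched by representing an IR query as a set of predicates and comparing the compiled set with the expected one. The `compile_question` function here is a toy stand-in with one hand-written rule, and the slot name `adverse_events.population` is hypothetical; only the comparison logic is the point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    slot: str
    op: str
    value: str


def compile_question(question: str) -> frozenset[Predicate]:
    """Toy stand-in for a question compiler: a single hand-written rule."""
    q = question.lower()
    predicates = set()
    if "pregnant" in q and "safe" in q:
        predicates.add(Predicate("eligibility.exclusion", "contains", "Pregnancy"))
        predicates.add(Predicate("adverse_events.population", "includes", "pregnant"))
    return frozenset(predicates)


def alignment_errors(question: str, expected: frozenset[Predicate]) -> dict[str, set[Predicate]]:
    """Compare the compiled query with the expected one, predicate by predicate."""
    actual = compile_question(question)
    return {"missing": set(expected - actual), "unexpected": set(actual - expected)}


expected_query = frozenset({
    Predicate("eligibility.exclusion", "contains", "Pregnancy"),
    Predicate("adverse_events.population", "includes", "pregnant"),
})
errors = alignment_errors("Can I take this safely while pregnant?", expected_query)
```

Reporting missing and unexpected predicates separately distinguishes a compiler that under-constrains the query from one that adds constraints the user never asked for.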

Tier 3 — Semantic adequacy: IR → question coverage

Even with perfect parsing and perfect query compilation, the IR itself can be too poor to answer the questions people actually ask. If no slot encodes follow-up duration, a question about long-term outcomes is unanswerable in principle. Adequacy tests run a curated benchmark of clinically meaningful questions through the full pipeline and measure: of the questions a clinician would ask, how many can the IR answer at all? This is the test that drives schema growth.
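
The adequacy measure can be sketched as the share of benchmark questions whose required slots all exist in the IR, plus a tally of the slots that block the rest. The slot names and the four benchmark questions below are made up for illustration.

```python
from collections import Counter

# Hypothetical set of slots the IR currently encodes.
IR_SLOTS = {"population", "intervention", "comparator", "outcome", "eligibility.exclusion"}

# Hypothetical benchmark: each question lists the slots needed to answer it.
BENCHMARK = [
    {"question": "Does atorvastatin lower LDL versus placebo?",
     "requires": {"intervention", "comparator", "outcome"}},
    {"question": "Can I take this safely while pregnant?",
     "requires": {"eligibility.exclusion"}},
    {"question": "Does the benefit persist after five years?",
     "requires": {"outcome", "follow_up_duration"}},
    {"question": "Does it work for adults over 75?",
     "requires": {"population", "outcome"}},
]


def adequacy(benchmark: list[dict], ir_slots: set[str]) -> tuple[float, Counter]:
    """Return (share of answerable questions, count of blocking slots)."""
    if not benchmark:
        raise ValueError("benchmark must contain at least one question")
    answerable = sum(1 for q in benchmark if q["requires"] <= ir_slots)
    missing = Counter(s for q in benchmark for s in q["requires"] - ir_slots)
    return answerable / len(benchmark), missing


score, missing_slots = adequacy(BENCHMARK, IR_SLOTS)
```

In this toy run three of four questions are answerable, and the tally points at `follow_up_duration` as the slot to add next, which is the sense in which this tier drives schema growth.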

Why three tiers, not one

A single end-to-end accuracy number can't tell you where the system failed: bad extraction, bad question compilation, or an IR that's too thin to express the answer. The three tiers map cleanly to compiler engineering practice (parser tests + semantic-analysis tests + reference-program coverage) and let independent contributors target one layer without owning the whole stack.

We open-source the test harness (reproducible checks at all three tiers) and the test fixtures (parsed findings from real studies — the IR instantiated for each paper, plus expected query compilations and the question benchmark). Anyone can point to a specific study and say “this finding was parsed wrong,” propose a correction, propose a new query alignment, or argue that the IR needs a new slot to cover an unanswerable question.

Citations are provenance; fixtures and tests are accountability. The evidence content and MCP runtime are operated by us; the IR, fixtures, and harness underneath are open. As AI systems increasingly mediate medical evidence (see the Medical AI Landscape), the evidence layer needs independent auditability.

Open repo: `evidence-to-person-eval`

The IR, fixture format, and three-tier benchmark harness live at github.com/borisdev/evidence-to-person-eval. The repo's job is to be a neutral comparison layer: any extractor (open-source pipeline, closed-source vendor, in-house team) can register and get scored on the same fixtures with the same 4-risk scorecard.

Status: v0.0.3, design phase. The IR shape, harness API, and fixture format are actively evolving and not yet stable. The EBMonFHIR alignment audit is complete, with 3 named extensions plus a sub-extension on EBMonFHIR's existing relates-to-with-quotation. The first real ground-truth fixture is the highest-leverage next contribution; issue #1 spells out the criteria.

Three contributor personas, three folders: Tier 1 (extraction folks) author parsed-IR JSON for new papers; Tier 2 (clinical-question folks) author question-to-IR-query alignments; Tier 3 (clinicians and patients) author plain-language expectation YAMLs, with no code required. Every assertion needs a `reason:` field.
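
The rule that every assertion carries a reason can be enforced with a small check over a parsed expectation file. The sketch below works on an already-loaded dictionary and assumes a top-level `assertions` list; that layout is a guess for illustration, since the fixture format is still evolving.

```python
def missing_reasons(expectation: dict) -> list[int]:
    """Return indexes of assertions that lack a non-empty `reason` field."""
    return [
        i
        for i, assertion in enumerate(expectation.get("assertions", []))
        if not str(assertion.get("reason", "")).strip()
    ]


# Hypothetical parsed expectation: the second assertion has no reason.
expectation = {
    "question": "Can I take this safely while pregnant?",
    "assertions": [
        {"expect": "statin findings flagged not-applicable",
         "reason": "trial excluded pregnancy"},
        {"expect": "ezetimibe findings surfaced"},
    ],
}
bad = missing_reasons(expectation)
```

A contribution check could reject any expectation file for which `missing_reasons` returns a non-empty list.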