The First Health AI Built on Good Explanations
The quest for good explanations is, I believe, the basic regulating principle not only of science, but of the Enlightenment generally.
David Deutsch, The Beginning of Infinity
TL;DR
Most healthcare AI systems optimize for prediction accuracy. We optimize for explanation quality.
The difference: predictions can be right for wrong reasons (and silently fail when conditions change). Explanations that are hard to vary, where every component is load-bearing, remain correctable and auditable even when wrong.
- Generate multiple hypotheses, attack each adversarially, keep only survivors
- Quantify how "hard to vary" each explanation is (0.0-1.0)
- 12 specific uncertainty triggers with structured responses
- Every claim specifies what would prove it wrong
- High-risk decisions require human clinician approval
- Overrides actively change future reasoning for that patient
- Medication, nutrition, exercise, sleep, mental health: same principles
- Interaction rules are explicit, versioned, auditable data
- We measure our own predictions against outcomes
The result: an AI that tells you why, admits when it doesn't know, specifies how to prove it wrong, measures whether it was right, and learns from clinician corrections.
Philosophy Foundations
The fundamental failure of current health AI is epistemological, not computational. Most health AI makes predictions based on patterns. But pattern-matching is not understanding.
- Pattern matching finds correlations, not causes
- Interventions based on spurious correlations fail or cause harm
- Rare cases don't match common patterns, and medicine is full of rare cases
- Induction fails precisely on the edge cases that matter most
- Probabilistic outputs are "mushy": you can change details without breaking the theory
- "You might have A, B, or C" explains nothing and helps no one
These are practical manifestations of the deeper Hume/Popper critique: induction (inferring general rules from particular observations) cannot justify knowledge. No amount of data can prove a universal claim, but a single counterexample can refute one.
In "The Beginning of Infinity" (Ch. 1), Deutsch illustrates good vs. bad explanations using the ancient Greek explanation for seasons:
Persephone, goddess of spring, was kidnapped by Hades. Her mother Demeter's grief causes winter. When Persephone returns, spring comes.
Could substitute any gods or emotions. Nothing is load-bearing.
Earth's axis is tilted 23.5 degrees from the perpendicular to its orbital plane around the sun. This causes different hemispheres to receive more direct sunlight at different times of year.
Change the tilt angle and predictions break. Every detail constrains.
This distinction is the foundation of our entire architecture.
We build on two complementary philosophical frameworks:
Science advances not by confirming theories but by attempting to refute them. The demarcation between science and pseudoscience is falsifiability.
Good explanations begin with bad explanations. You get there by criticism: conjecturing variants and keeping the one that survives.
- Multi-agent Generator -> Verifier -> Reasoner pipeline
- 4-dimensional algorithm (interdependence, specificity, non-adhocness, falsifiability)
- Deterministic policy engine enforcing safety boundaries
- Structured honest uncertainty admission
- Every claim includes explicit refutation conditions
- Override tracking with confidence decay
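The 4-dimensional HTV algorithm can be sketched as a simple score object. This is a minimal illustration: the dimension names come from the spec above, but the equal-weight aggregation and the example values are assumptions, not the production scoring algorithm.

```python
from dataclasses import dataclass

@dataclass
class HTVScore:
    """Four HTV dimensions, each in [0.0, 1.0].

    Dimension names are from the spec; the equal-weight mean below is
    an illustrative assumption, not the production weighting.
    """
    interdependence: float   # do the parts constrain each other?
    specificity: float       # are claims concrete enough to test?
    non_adhocness: float     # is anything bolted on just to fit the data?
    falsifiability: float    # does the claim say what would refute it?

    def composite(self) -> float:
        # Equal-weight mean; real weights would be tuned against clinician ratings.
        dims = (self.interdependence, self.specificity,
                self.non_adhocness, self.falsifiability)
        return sum(dims) / len(dims)

score = HTVScore(0.9, 0.8, 0.7, 1.0)
assert abs(score.composite() - 0.85) < 1e-9
```

A tight causal chain scores high on every dimension; "stress causes fatigue" scores near zero on specificity and falsifiability and drags the composite down.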
System Architecture
Deutsch → Hermes → Popper Pipeline
Deutsch: Multi-agent conjecture-refutation creates competing hypotheses, attacks them adversarially, and scores survivors.
Hermes: Packages epistemological metadata (HTV scores, evidence grades, falsification criteria) into a structurally enforced contract.
Popper: Evaluates the contract against safety thresholds and renders a final decision: proceed, escalate, or block.
Key principle: The Hermes contract ensures that epistemological metadata (HTV scores, evidence grades, falsification criteria) is structurally enforced, not optional. Popper cannot evaluate a proposal that doesn't include this data.
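A minimal sketch of what "structurally enforced" can mean in code: a typed contract whose constructor rejects any proposal missing the required metadata. The field names and validation rules here are illustrative, not the actual Hermes schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HermesContract:
    """Typed contract: epistemological metadata is required, not optional."""
    recommendation: str
    htv_score: float                  # composite hard-to-vary score, 0.0-1.0
    evidence_grade: str               # e.g. "A".."D" on a GRADE-style scale
    falsification_criteria: list      # observations that would refute the claim

    def __post_init__(self):
        # Popper cannot evaluate a proposal missing this data,
        # so an invalid proposal cannot even be constructed.
        if not (0.0 <= self.htv_score <= 1.0):
            raise ValueError("htv_score must be in [0.0, 1.0]")
        if not self.falsification_criteria:
            raise ValueError("at least one falsification criterion is required")

# A proposal with no falsification criteria is structurally invalid:
try:
    HermesContract("increase iron intake", 0.82, "B", [])
except ValueError as e:
    print(e)  # at least one falsification criterion is required
```

The design choice is that enforcement happens at the type boundary rather than by convention: downstream code never needs to check whether the metadata is present.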
Component Summary
Deutsch: Knowledge creation via conjecture-refutation
Hermes: Typed epistemological contract
Popper: Safety boundary enforcement
Data Flow
1. Patient message arrives
2. Deutsch Generator creates multiple hypotheses (mechanism diversity required)
3. Deutsch Verifier attacks each hypothesis adversarially
4. Deutsch Reasoner selects survivors based on HTV scores
5. Hermes packages the recommendation with required metadata
6. Popper evaluates safety and makes a decision
7. Output goes to the patient (if approved) or a clinician (if routed)
8. Feedback loop captures clinician overrides for future learning
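The data flow above can be sketched as a toy pipeline. The thresholds, tuple layout, and function names are placeholders for illustration; the real Popper engine is a deterministic, versioned policy, not two `if` statements.

```python
def popper_decide(htv: float, risk: str) -> str:
    """Illustrative policy; thresholds are placeholders, not the versioned v0.1 values."""
    if risk == "high":
        return "escalate"        # high-risk decisions always go to a clinician
    if htv < 0.5:
        return "escalate"        # weak explanation: route, don't act
    return "proceed"

def handle_message(hypotheses):
    """hypotheses: list of (name, htv, risk) tuples surviving Deutsch refutation."""
    survivors = [h for h in hypotheses if h[1] >= 0.5]   # Reasoner keeps high-HTV survivors
    if not survivors:
        return "idk"                                     # no survivor: IDK Protocol fires
    primary = max(survivors, key=lambda h: h[1])
    return popper_decide(primary[1], primary[2])

assert handle_message([("heart failure", 0.86, "high")]) == "escalate"
assert handle_message([("iron deficiency", 0.9, "low")]) == "proceed"
assert handle_message([]) == "idk"
```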
The Separation of Concerns
Core Components
Three interconnected mechanisms form the heart of our epistemological architecture:
Generator: Produces multiple hypotheses spanning different mechanisms. Each conjecture must propose a distinct causal pathway.
Verifier: Attacks each hypothesis adversarially, scores HTV dimensions, and attempts to falsify every claim.
Reasoner: Selects survivors based on the HTV threshold. Only hypotheses that withstand criticism proceed.
Mechanism Diversity Requirement
A good differential:
- Acute coronary syndrome (cardiac)
- Pulmonary embolism (pulmonary)
- Musculoskeletal pain (muscle/bone)
Three distinct causal mechanisms. Hard to vary.
A bad differential:
- ST-elevation heart attack (cardiac)
- Non-ST-elevation heart attack (cardiac)
- Unstable angina (cardiac)
The second list is easy to vary within the cardiac category: all three share the same mechanism. This violates Deutschian epistemology.
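The diversity requirement reduces to a simple structural check, sketched here with hypothetical helper names. In practice the categorization of mechanisms would be richer than a single string label.

```python
def mechanism_diverse(hypotheses, minimum=2):
    """Require that a differential spans at least `minimum` distinct causal mechanisms.

    `hypotheses` is a list of (name, mechanism) pairs; the helper name and
    data layout are illustrative, not the production API.
    """
    mechanisms = {mech for _, mech in hypotheses}
    return len(mechanisms) >= minimum

good = [("acute coronary syndrome", "cardiac"),
        ("pulmonary embolism", "pulmonary"),
        ("musculoskeletal pain", "musculoskeletal")]
bad = [("STEMI", "cardiac"), ("NSTEMI", "cardiac"), ("unstable angina", "cardiac")]

assert mechanism_diverse(good)      # three distinct mechanisms: passes
assert not mechanism_diverse(bad)   # all cardiac: rejected, Generator must regenerate
```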
HTV Thresholds (v0.1)
Worked Example: Fatigue
A bad explanation: "You feel tired because of stress."
No mechanism, no specificity, no falsification criteria. Classic mushy AI output.
A good explanation: "Your fatigue is caused by iron deficiency anemia. Ferritin 8 ng/mL indicates depleted stores. Hemoglobin 10.2 g/dL confirms anemia."
Tight causal chain: iron -> hemoglobin -> oxygen -> fatigue. Every piece constrains.
The IDK (I Don't Know) Protocol formalizes how our system handles situations where it cannot make a confident recommendation. When we trigger IDK, we're not claiming the problem is unsolvable. Deutsch's optimism states that all problems are soluble given the right knowledge.
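A sketch of what a structured IDK response might look like. The three trigger names below are invented for illustration (the real taxonomy has 12 triggers), and the response fields are assumptions, not the production schema.

```python
from enum import Enum

class IDKTrigger(Enum):
    # Three example trigger types, named here for illustration only;
    # the full taxonomy defines 12.
    CONFLICTING_EVIDENCE = "conflicting_evidence"
    INSUFFICIENT_DATA = "insufficient_data"
    OUT_OF_DISTRIBUTION = "out_of_distribution"

def admit_uncertainty(trigger: IDKTrigger, what_would_help: str) -> dict:
    """An IDK response is structured: it names its trigger and the knowledge that would resolve it."""
    return {
        "status": "idk",
        "trigger": trigger.value,
        "soluble": True,   # Deutschian optimism: the problem is soluble; we lack knowledge, not hope
        "what_would_help": what_would_help,
    }

resp = admit_uncertainty(IDKTrigger.INSUFFICIENT_DATA, "a recent ferritin and CBC panel")
assert resp["status"] == "idk" and resp["soluble"]
```

The point of the structure is that "I don't know" is never a dead end: every IDK names the missing knowledge that would convert it into a recommendation.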
The Discriminator: Breaking Ties Between Theories
When multiple theories survive with equal HTV scores, we don't guess. We identify the discriminator: the single test that would kill one theory but not the other.
| Competing Theories | Discriminator | Logic |
|---|---|---|
| Iron vs. B12 Deficiency | MCV | Iron: low MCV; B12: high MCV |
| Heart Failure vs. Venous Insufficiency | BNP Level | HF: elevated; Venous: normal |
| Hypothyroidism vs. Depression | TSH Level | Hypothyroid: elevated; Depression: normal |
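Discriminator selection can be sketched as a search for the first prediction on which two theories disagree. The dictionary layout is a simplification for illustration; real predictions carry thresholds, units, and reference ranges.

```python
def find_discriminator(theory_a, theory_b):
    """Return the first test whose predicted result differs between the two theories,
    i.e. a test that can kill one theory while leaving the other intact."""
    for test, predicted in theory_a["predictions"].items():
        if test in theory_b["predictions"] and theory_b["predictions"][test] != predicted:
            return test
    return None

iron = {"name": "iron deficiency", "predictions": {"MCV": "low", "ferritin": "low"}}
b12  = {"name": "B12 deficiency",  "predictions": {"MCV": "high", "ferritin": "normal"}}

assert find_discriminator(iron, b12) == "MCV"   # low vs. high MCV separates the theories
```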
End-to-End Example
To make this concrete, here's how the system handles a real patient interaction.
Patient Message
"I've been feeling more tired than usual and my ankles are swollen. Should I be worried?"
Bold Conjectures
Generator
Creates multiple hypotheses spanning different mechanisms:
Attempted Refutation
Verifier
Attacks each hypothesis adversarially, scoring HTV dimensions:
Survivor Selection
Reasoner
Selects primary and secondary hypotheses based on refutation results:
H1 (heart failure) and H3 (medication side effect) survived aggressive refutation with high HTV scores.
H2 (venous insufficiency) was retained with disclosed uncertainty due to a moderate score.
Package & Evaluate
Hermes + Popper
Hermes packages the response with metadata. Popper evaluates safety:
What the Patient Receives
Your symptoms of fatigue and ankle swelling could indicate that your heart is working harder to manage fluid. This is worth checking.
I recommend:
- Checking your weight trend over the past week
- Getting a cardiac marker blood test if you haven't recently
If your cardiac marker is normal and your weight is stable, these symptoms may have a different cause (like your blood pressure medication).
Would you like me to help you log your weight or schedule a lab check?
What Makes This Different
- Why symptoms could indicate heart strain. Patient understands the reasoning, not just the recommendation.
- Weight trend, cardiac marker test. Clear action path instead of vague advice.
- "If cardiac marker normal, reconsider." Patient knows exactly when to update their understanding.
- Medication side effect disclosed as a competing explanation. Honest about uncertainty.
- Offer to help log weight or schedule a lab. Empowers patient action instead of passive waiting.
What We're NOT Claiming
Intellectual honesty requires acknowledging limitations:
We're not claiming AI can "create explanatory knowledge"
In Deutsch's philosophical sense, explanatory knowledge emerges through genuine understanding: the ability to creatively vary conjectures and recognize when variations break the explanation. Large language models operate through sophisticated pattern matching on training data.
What we implement is structured reasoning that approximates conjecture-refutation. The Generator doesn't truly "understand" why a hypothesis explains the data; it produces outputs that structurally resemble good explanations.
The epistemological structure is real. The underlying cognition is not Deutschian knowledge creation.
Human oversight remains essential
The Popper routing mechanism exists precisely because we don't trust the AI to handle all cases. This is not a temporary limitation to be engineered away; it's a principled design choice.
High-risk decisions require human judgment because:
- Large language models can produce confident, well-structured, wrong outputs
- Medical decisions involve values and trade-offs beyond optimization
- Accountability requires a human decision-maker
Routing to clinicians is a feature, not a bug.
HTV doesn't fix model hallucination
A subtle point: a large language model can hallucinate a high-HTV explanation. It can fabricate specific lab values, invent plausible mechanisms, and generate falsifiable predictions, all of which are false.
HTV measures structural quality of explanation, not correspondence to reality.
Red-Team Example: Fabricated Specificity
"Patient has hypokalemia-induced arrhythmia risk due to K+ of 2.9 mEq/L from recent diarrheal illness, exacerbated by concurrent furosemide 80mg daily."
The problem: The K+ value was fabricated. Patient's actual K+ is 4.1 mEq/L.
This is why HTV operates alongside provenance verification, snapshot grounding, and clinician oversight, not as a standalone safety measure.
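A sketch of the provenance check that would catch the fabricated K+ above: every lab value cited in an explanation is compared against the grounded patient snapshot. Function and field names are hypothetical.

```python
def verify_claimed_values(claim_values: dict, snapshot: dict, tolerance=0.0) -> list:
    """Flag any lab value cited in an explanation that doesn't match the patient snapshot.

    Returns a list of (lab, cited, actual) discrepancies; empty means all
    cited values are grounded. Names and layout are illustrative.
    """
    fabricated = []
    for lab, cited in claim_values.items():
        actual = snapshot.get(lab)
        if actual is None or abs(actual - cited) > tolerance:
            fabricated.append((lab, cited, actual))
    return fabricated

snapshot = {"K+": 4.1}   # ground-truth patient data
claim = {"K+": 2.9}      # value cited in the model's high-HTV explanation
assert verify_claimed_values(claim, snapshot) == [("K+", 2.9, 4.1)]   # fabrication caught
```

A high HTV score on an explanation that fails this check is evidence of gaming, not quality, which is why the two checks run together.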
This is a methodological commitment
We're making a bet: that AI systems structured around epistemological principles will be more reliable, more auditable, and more correctable than systems optimized purely for prediction accuracy. This is not a claim about machine consciousness, understanding, or intelligence. It's a claim about architecture.
A bet on structure, not a claim about sentience.
The Value Proposition
The value is not that our AI "thinks like Deutsch". It doesn't. The value is that by structuring outputs to include HTV scores, falsification criteria, evidence grading, and honest uncertainty, we create systems that are:
- When wrong, we know why and how to fix it
- Every decision has a traceable reasoning chain
- Low confidence triggers routing, not overconfident action
- Error patterns can be identified and addressed
Why This Matters
- Medication changes are always reviewed and approved by your clinician
- You understand why a recommendation is made
- The system tells you what it doesn't know
- Decisions are grounded in your specific situation, not generic advice
- AI proposes start/stop/titrate/hold; you decide
- Every decision is reviewable with a full reasoning chain
- No overconfident black boxes: low confidence triggers routing to you
- Rejections and modifications actively change future recommendations for that patient
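The override-driven confidence decay mentioned above can be sketched as a multiplicative penalty per clinician override. The decay factor of 0.8 is an illustrative placeholder, not a calibrated value.

```python
def decayed_confidence(base: float, override_count: int, decay: float = 0.8) -> float:
    """Each clinician override on similar recommendations for this patient
    multiplies future confidence by `decay` (0.8 is a placeholder)."""
    return base * (decay ** override_count)

# After two overrides on similar recommendations for the same patient:
conf = decayed_confidence(0.9, 2)
assert abs(conf - 0.576) < 1e-9   # 0.9 * 0.8 * 0.8
# The lowered confidence then triggers routing to the clinician instead of autonomous action.
```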
The ability to create new explanations is the defining attribute of people.
David Deutsch, The Beginning of Infinity (Ch. 7)
In Chapter 7 ('Artificial Creativity'), Deutsch argues that genuine AI must involve explanation and creativity, not just prediction. This implies that AI systems handling high-stakes domains need mechanisms for generating and evaluating explanations, not just pattern matching.
- We assume we'll be wrong and build in correction mechanisms
- We never claim final answers
- We require every claim to be justifiable
Open Questions
Intellectual honesty requires acknowledging not just what we don't claim, but what we don't yet know. These are active research questions we're working through.
Deutsch argues that genuine knowledge creation involves creativity: the ability to generate new explanations that weren't implicit in the inputs. Large language models operate through pattern matching on training data.
Our current position: We implement structured reasoning that approximates conjecture-refutation. Whether this constitutes 'knowledge creation' in Deutsch's sense is philosophically contested.
Popperian epistemology emphasizes rapid refutation. But medical outcomes often take days (lab results), weeks (symptom resolution), months (disease progression), or years (long-term outcomes).
What would resolve this: A formal framework for 'asynchronous refutation' that maintains epistemic rigor across time delays.
Deutsch is skeptical of Bayesian epistemology as a foundation for knowledge. Yet medicine is deeply Bayesian: pre-test probability, likelihood ratios, and posterior odds are foundational to clinical reasoning.
- Literature-backed refutation: 'This mechanism contradicts the findings in [Paper X]'
- Causal model refutation: 'This intervention has no causal path to the claimed outcome'
- Temporal refutation: 'This timeline is physiologically impossible'
Validate against clinician agreement as a proxy. If high-HTV claims consistently get clinician approval and low-HTV claims get overridden, the thresholds are directionally correct.
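The proxy validation described above can be sketched as a comparison of clinician approval rates across HTV bands. The threshold, record layout, and function name are assumptions for illustration.

```python
def agreement_by_htv_band(records, threshold=0.7):
    """Split decisions into high/low-HTV bands and compare clinician approval rates.

    Returns (high_band_rate, low_band_rate); a clearly higher rate in the
    high band suggests the thresholds are directionally correct.
    """
    def rate(band):
        approved = [r for r in band if r["approved"]]
        return len(approved) / len(band) if band else None

    high = [r for r in records if r["htv"] >= threshold]
    low = [r for r in records if r["htv"] < threshold]
    return rate(high), rate(low)

records = [
    {"htv": 0.9, "approved": True},  {"htv": 0.85, "approved": True},
    {"htv": 0.4, "approved": False}, {"htv": 0.5,  "approved": True},
]
hi, lo = agreement_by_htv_band(records)
assert hi == 1.0 and lo == 0.5   # high-HTV claims track clinician approval
```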
HTV measures structural quality, not truth. A model could learn to fabricate specific lab values (high specificity), invent plausible mechanisms (high interdependence), and include irrelevant but testable predictions (high falsifiability).
Why it matters: If HTV can be gamed, it becomes security theater rather than genuine quality assurance.
Contributing
If you're a researcher interested in any of these questions, we'd welcome collaboration. These aren't rhetorical challenges: they're genuine open problems we're actively working on.
Glossary
Key terms used throughout this document. Click any highlighted term in the text to see its definition.
HTV (Hard-to-Vary)
A score (0.0-1.0) measuring how much each part of an explanation is load-bearing.
Falsification criteria
What observations would prove a claim wrong.
Conjecture-refutation
Popper's method: propose ideas, then try to disprove them.
IDK Protocol
Structured admission of uncertainty with 12 specific trigger types.
Evidence grade
How methodologically rigorous the supporting evidence is.
Epistemological metadata
Information about the quality and basis of a claim.
Discriminator
A test designed to falsify one theory while leaving another intact.
Circuit breaker
A safety mechanism that stops the system when errors become too frequent.
References
Popper, K. (1963). Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.
Deutsch, D. (2011). The Beginning of Infinity: Explanations That Transform the World. Viking.
Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
Hunter, A. & Williams, M. (2012). Aggregating evidence about the positive and negative effects of treatments. Artificial Intelligence in Medicine, 56(3), 173-190.
GRADE Working Group (2004). Grading quality of evidence and strength of recommendations. BMJ, 328(7454), 1490.
Regain Health (2025). Hard to Vary Specification v0.9.0. Internal documentation.
Prior Work & Influences
We have built upon the following work:
- ArgMed-Agents (Hong et al., 2024) -> added HTV scoring + Popperian falsification
- Du et al., 2023 -> applied to the clinical domain with safety routing
- GRADE / Canadian Task Force -> reinterpreted through the Deutschian "hard to vary" lens
- Parascandolo et al., ICLR 2021 -> extended to 4-dimension clinical scoring
- Douglas Walton -> applied to clinical reasoning
- Leibig et al., 2019 -> formalized as the IDK Protocol with a trigger taxonomy
- Kammer et al., 2021 -> enforced via the mechanism-diversity requirement
What We Believe Is Novel
- Unified architecture combining all of the above under Deutschian/Popperian philosophy
- The Non-Trust Principle (epistemological metadata increases but never decreases conservatism)
- Safety rules implementing demarcation as executable policy
- Clinician feedback loop with confidence decay
- Systematic application of Deutsch's philosophy to healthcare AI architecture
Ongoing Validation Work
- Comparing HTV scores against clinician-rated explanation quality
- Tracking correlation between HTV and downstream outcome accuracy
- Red-teaming for "HTV gaming" (superficially specific but fabricated claims)
- Monitoring score drift over model updates
Problems are inevitable. Problems are soluble.
David Deutsch, The Beginning of Infinity

This document set explains how we translate Deutschian optimism into engineering constraints: every problem is addressable given the right knowledge, and our architecture is designed to acquire and correct that knowledge over time.
Anton Kim
CEO, Regain Inc.
Questions about our epistemological approach? Let's discuss.