The Three-Gate Verification Architecture | Audit-Ready AI for Clinical Trials | NexTrial.ai

A note on posture

This is a methodology contribution offered under an engagement, not endorsement, posture. Nothing here claims regulatory approval, certification, or conformance to any named standard. We cite regulatory frameworks to show design alignment, not agency agreement. The methodology is GxP-aligned, not GxP-validated, and NexTrial holds no certification. Where a component is designed but not yet running in production, we say so plainly.

Why a confidence score cannot pass an inspection

Every regulatory framework that governs AI in clinical research rests on the same requirement. A regulated decision has to be reconstructible, from source data, through the logic that acted on it, to the rule that produced it, at the moment an inspector asks. That requirement is in 21 CFR Part 11. It is in ALCOA+. It is in ICH E6(R3). It is in EU AI Act Articles 11, 13, and 14. The wording changes across jurisdictions. The requirement does not.

A confidence score meets none of it. It is not reconstructible, it is not independently reproducible, and it is not inspectable in any way that survives a proceeding two years later. Worse, it is the model grading its own work. A score of 0.94 tells you the model is confident. It does not tell you which rule was checked, against which values, at which version, or what the operation deliberately left for a human. A probability has never been something you can take to an inspection.

So the question is not how to make the model more confident. It is how to produce an artifact a named person can stand behind and an inspector can independently re-verify. That artifact is what the architecture below exists to produce.

The RBQM pre-gate: risk is classified and frozen before anything is verified

Before a proposal reaches the verification gates, it passes through a risk-based quality management pre-gate. The pre-gate does not check conformance, so it is not a fourth gate. It does one thing: it classifies the decision’s risk, and that classification then governs how the rest of the pipeline behaves.

The class comes from two axes, the same two the FDA’s January 2025 draft guidance on AI in regulatory decision-making for drugs and biological products uses to size model risk. Model influence is how much the AI-assisted output actually drives the decision relative to the other evidence in front of the reviewer. Decision consequence is the severity if the decision is wrong, from purely administrative at the low end to participant safety or regulatory standing at the high end. Each axis is scored high, medium, or low, and the pair places the decision into one of four classes: Critical, High, Moderate, or Low.
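
The two-axis placement can be sketched as an explicit lookup table. This is a minimal illustration, not the published taxonomy: the axis and class names come from the text above, but the cell-by-cell mapping in `RISK_MATRIX` and the names `Level`, `RiskClass`, and `classify` are assumptions made for the example.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class RiskClass(Enum):
    LOW = "Low"
    MODERATE = "Moderate"
    HIGH = "High"
    CRITICAL = "Critical"


# HYPOTHETICAL matrix. Keys are (model_influence, decision_consequence).
# The source names the axes and the four classes but not the cell assignments.
RISK_MATRIX = {
    (Level.LOW, Level.LOW): RiskClass.LOW,
    (Level.LOW, Level.MEDIUM): RiskClass.LOW,
    (Level.LOW, Level.HIGH): RiskClass.MODERATE,
    (Level.MEDIUM, Level.LOW): RiskClass.LOW,
    (Level.MEDIUM, Level.MEDIUM): RiskClass.MODERATE,
    (Level.MEDIUM, Level.HIGH): RiskClass.HIGH,
    (Level.HIGH, Level.LOW): RiskClass.MODERATE,
    (Level.HIGH, Level.MEDIUM): RiskClass.HIGH,
    (Level.HIGH, Level.HIGH): RiskClass.CRITICAL,
}


def classify(model_influence: Level, decision_consequence: Level) -> RiskClass:
    """Look up the risk class for one decision; an unmapped pair raises KeyError."""
    return RISK_MATRIX[(model_influence, decision_consequence)]
```

Writing the matrix out cell by cell, rather than computing a score, keeps every placement individually inspectable and versionable.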

The placement is the point. The pre-gate sits at the design layer, upstream of any monitoring dashboard, because the live integrity question in AI-assisted trials is not only catching errors after they happen. It is trusting that the identification of what mattered in the first place was sound. That trust is established or denied at the design layer, not on a dashboard downstream of First Patient In.

The class assigned here is frozen into the proof certificate at decision time, and it parameterizes three things: the rigor applied at each gate, the re-verification cadence, and the level of human attestation required. Because the class is assigned before verification and carried in the record, a reviewer cannot accidentally apply low-risk handling to a high-risk decision. Risk-proportionality becomes structural rather than a matter of anyone’s discretion after the fact.
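
One way to picture a frozen class driving downstream handling is an immutable record plus a read-only policy table. The sketch below is illustrative only: the rigor labels, cadences in days, attestation levels, and every name in it (`HandlingPolicy`, `POLICY_BY_CLASS`, `FrozenRiskClassification`, `try_downgrade`) are hypothetical and not taken from the framework.

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class HandlingPolicy:
    gate_rigor: str
    reverify_every_days: int
    attestation_level: str


# HYPOTHETICAL values, chosen only to show the three parameters the class governs.
POLICY_BY_CLASS = MappingProxyType({
    "Critical": HandlingPolicy("exhaustive", 30, "principal investigator"),
    "High": HandlingPolicy("exhaustive", 90, "principal investigator"),
    "Moderate": HandlingPolicy("standard", 180, "qualified reviewer"),
    "Low": HandlingPolicy("standard", 365, "qualified reviewer"),
})


@dataclass(frozen=True)
class FrozenRiskClassification:
    risk_class: str
    taxonomy_version: str
    assigned_at: str  # ISO 8601 timestamp recorded at decision time

    def policy(self) -> HandlingPolicy:
        """Handling is derived from the recorded class, never chosen separately."""
        return POLICY_BY_CLASS[self.risk_class]


def try_downgrade(record: FrozenRiskClassification, new_class: str) -> bool:
    """Return True only if the record accepted the change (it should not)."""
    try:
        record.risk_class = new_class  # frozen dataclasses reject attribute assignment
    except FrozenInstanceError:
        return False
    return True
```

A frozen dataclass guards against accidental reassignment in application code. It is not tamper-proofing, which would need the signing and versioning the certificate section describes.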

The three gates

Gate 1: Deterministic compliance verification

The first gate applies the applicable rules, in force at the relevant version, to the named source values, and returns a pass-or-fail determination with citation precision. It is deterministic, and it is constrained to compliance checking rather than open-ended text generation. It is also rule-type-agnostic, because a rule is a rule regardless of where it comes from. The same operation checks a regulation cited to section and subsection, a protocol criterion, a standard-operating-procedure step, and a jurisdiction-specific requirement. One evidentiary regime spans all four sources, and jurisdiction is treated as a first-class dimension rather than flattened into a global default.
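
A deterministic, rule-type-agnostic check of this kind might look like the sketch below. It assumes a rule can be reduced to a pure predicate over named source values. The `Rule` and `Determination` shapes, the `verify` function, and the example protocol criterion with its citation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple


@dataclass(frozen=True)
class Rule:
    source_type: str   # e.g. "regulation", "protocol", "sop", "jurisdiction"
    citation: str
    version: str
    jurisdiction: str
    inputs: Tuple[str, ...]  # names of the source values the rule reads
    predicate: Callable[[Mapping[str, object]], bool]


@dataclass(frozen=True)
class Determination:
    passed: bool
    citation: str
    version: str
    jurisdiction: str
    values_checked: Mapping[str, object]


def verify(rule: Rule, source_values: Mapping[str, object]) -> Determination:
    """Apply one rule to the named values; same inputs always give the same result."""
    missing = [name for name in rule.inputs if name not in source_values]
    if missing:
        raise KeyError(f"missing source values: {missing}")
    # Record only the values the rule actually read, so the citation stays precise.
    checked = {name: source_values[name] for name in rule.inputs}
    return Determination(
        passed=bool(rule.predicate(checked)),
        citation=rule.citation,
        version=rule.version,
        jurisdiction=rule.jurisdiction,
        values_checked=checked,
    )


# HYPOTHETICAL protocol criterion, used only to exercise the function.
ADULT_CRITERION = Rule(
    source_type="protocol",
    citation="Protocol DEMO-001, section 4.1 (hypothetical)",
    version="v2.0",
    jurisdiction="US",
    inputs=("age_years",),
    predicate=lambda values: values["age_years"] >= 18,
)
```

Because `verify` never branches on `source_type`, a regulation, a protocol criterion, an SOP step, and a jurisdiction-specific requirement all pass through the same operation and produce the same record shape.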

Gate 2: Structural proof (designed, not in production)

The second gate is a formal, machine-checkable proof of the determination’s structural integrity and logical form. It verifies that the required elements are present, that references resolve, that no structural contradiction exists, and that defined boundaries hold. It does not, and cannot, prove that an output is semantically correct, meaning that the rule it applied captures what the regulation actually intends. A regulation is human-language prose that has to be translated into a computable check, and the proof operates on the translation, not on the regulation. A proof can be flawless while the encoding underneath it is wrong. Gate 2 does not remove that risk. It concentrates it into one inspectable place, the encoding itself. We describe it here as the architecture intends it, and we are careful to say where the line between designed and running currently sits.
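
The four structural properties can be illustrated with ordinary runtime checks. To be clear about the substitution: this is plain validation code, not a formal proof, and it is not the Gate 2 design. The field names, the `structural_findings` function, and the reading of "boundaries hold" as "nothing is both machine-checked and reserved for the human" are all assumptions made for the example.

```python
from typing import Any, Dict, List, Set, Tuple

# HYPOTHETICAL field names for a determination record.
REQUIRED_FIELDS = ("rule_citation", "rule_version", "checks", "result", "reserved_for_human")


def structural_findings(determination: Dict[str, Any],
                        known_rules: Set[Tuple[str, str]]) -> List[str]:
    """Return structural problems found; an empty list means none were found."""
    # 1. Required elements are present.
    missing = [name for name in REQUIRED_FIELDS if name not in determination]
    if missing:
        return [f"missing required field: {name}" for name in missing]

    findings: List[str] = []

    # 2. References resolve against the versioned ruleset.
    reference = (determination["rule_citation"], determination["rule_version"])
    if reference not in known_rules:
        findings.append("rule reference does not resolve")

    # 3. No structural contradiction between the result and the individual checks.
    checks = determination["checks"]  # maps value name -> bool
    result = determination["result"]
    if not checks:
        findings.append("no checks recorded")
    elif result not in ("pass", "fail"):
        findings.append("result is neither pass nor fail")
    elif (result == "pass") != all(checks.values()):
        findings.append("result contradicts the individual checks")

    # 4. Boundaries hold: nothing is both machine-checked and reserved for the human.
    overlap = set(checks) & set(determination["reserved_for_human"])
    if overlap:
        findings.append(f"checked and reserved items overlap: {sorted(overlap)}")

    return findings
```

An empty findings list says only that the record is well formed. As the paragraph above stresses, it says nothing about whether the encoded rule captures what the regulation intends.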

Gate 3: Human oversight and attestation

The third gate is a qualified human reviewer. The reviewer evaluates the proposed determination, the certificate produced by the pre-gate and the first two gates, and the boundary statement, and then accepts, rejects, or asks for revision. The final regulated decision is the human’s. Because the determination arrives carrying its rule citations and source values, the reviewer is verifying a cited determination, not reconstructing the analysis from scratch. A rejection can drive a loop in which the issue is mitigated, the verification is re-run, and the determination either converges or escalates. What the reviewer signs is specific: an account of what was actually checked, not a generic approval. A signature that attests to nothing in particular cannot survive an inspection two years later.
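
The accept, reject, or revise loop can be sketched as follows. The human is represented by a callable that returns a decision and a rationale, so the code records and routes the decision but never makes it. The round limit, the choice to treat rejection and revision requests the same way, and all names (`Decision`, `review_loop`, `mitigate_and_reverify`) are assumptions for illustration.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    REVISE = "revise"


Reviewer = Callable[[dict], Tuple[Decision, str]]   # stands in for the human reviewer
Reverifier = Callable[[dict, str], dict]            # mitigates, then re-runs verification


def review_loop(determination: dict,
                reviewer: Reviewer,
                mitigate_and_reverify: Reverifier,
                max_rounds: int = 3) -> Tuple[str, List[Tuple[Decision, str]]]:
    """Loop until the reviewer accepts, or escalate after max_rounds (hypothetical limit)."""
    history: List[Tuple[Decision, str]] = []
    for _ in range(max_rounds):
        decision, rationale = reviewer(determination)
        history.append((decision, rationale))  # every round is kept for the record
        if decision is Decision.ACCEPT:
            return "accepted", history
        # In this sketch, REJECT and REVISE both send the determination back.
        determination = mitigate_and_reverify(determination, rationale)
    return "escalated", history
```

The returned history is the raw material for the override and escalation record: each decision, with its rationale, in the order it was made.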

Why the three gates are uncorrelated, and why that is the whole point

A deterministic rule check, a formal structural proof, and a human attestation are three different substrates whose errors are unlikely to share a common cause. A rule check can be wrong in a way a structural proof would catch. A structural proof can pass on a determination a human would reject. A human can catch what neither machine operation was scoped to see. Evidence drawn from substrates that fail in different ways is the basis of defensible validation. That independence, not the number of layers, is what makes the architecture hold up.
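
A short worked comparison shows why independence, rather than layer count, carries the weight. The symbols and the 0.05 figure are purely illustrative, not measured rates, and full independence is an idealization of "unlikely to share a common cause."

```latex
% p_i is the (illustrative) probability that gate i misses a given defect.
% Independent gates: all three must miss separately.
P(\text{all three miss}) = p_1 \, p_2 \, p_3

% Fully correlated gates (one shared cause, p_1 = p_2 = p_3 = p): they miss together.
P(\text{all three miss}) = p

% With p = 0.05 for each gate:
0.05^{3} = 1.25 \times 10^{-4} \quad \text{versus} \quad 0.05
```

Three layers that fail together are no stronger than one layer, which is the defect the next paragraph attributes to a confidence score.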

A confidence score is the opposite of this. It is generated by the same model whose output it scores, so it inherits that output’s blind spots. It is correlated evidence wearing the label of a check. The same defect shows up in any arrangement where one model checks another model trained on the same data, whatever the second model is called. And independence is not a property a system has once and keeps. Under continuous learning, substrates retrained on a shared corpus can quietly re-correlate, so the independence of the gates has to be actively preserved and tested over time. We treat that as an open problem, not a solved one.

The proof certificate: eight properties

Everything above exists to produce one artifact. The proof certificate is a machine-readable, signed, versioned object created at the moment an AI-assisted decision is made. It records eight properties, and each one carries an admissibility test, the specific inspection an auditor can run against it. A property that cannot be inspected is not a property of this certificate.

Rule invoked. The specific rule, by source, citation, and version, with the ruleset snapshot and effective date. Citation precision, not "applicable regulation."
Values verified. The exact patient, protocol, site, or operational values checked, listed rather than summarized, each attributable to its source.
Verification operation. The deterministic procedure that returned pass or fail, expressible as a formal predicate, reproducible by an independent verifier.
Boundary statement. What the operation did not check, and the judgment factors reserved for the responsible human. This is the part of the record that makes a signature mean something.
Risk classification. The class assigned by the pre-gate and frozen at decision time, under a named, versioned taxonomy, indexing gate rigor and re-verification cadence.
Human reviewer identity. The identity and role of the human who attested, bound to the attestation, with the responsible principal investigator as primary.
Override and escalation record. Whether the human accepted, rejected, or asked for revision, any challenge raised and how it was resolved, any override rationale, and whether escalation criteria were met.
Evidence, not substitution. An explicit declaration that the operation is evidence presented to the reviewer, not a substitution for the reviewer’s independent judgment.
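
The eight properties can be pictured as one immutable record with a reproducible digest. This is a minimal sketch under stated assumptions: the field types, the canonical-JSON encoding, and the function names are hypothetical, and a SHA-256 content digest stands in for signing, which a real certificate would need from an actual signature scheme.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ProofCertificate:
    rule_invoked: Dict[str, str]             # source, citation, version, snapshot, effective date
    values_verified: Dict[str, Any]          # each value listed, not summarized
    verification_operation: str              # the predicate, stated so it can be re-run
    boundary_statement: List[str]            # what was not checked; reserved for the human
    risk_classification: Dict[str, str]      # class and taxonomy version, frozen at decision time
    human_reviewer: Dict[str, str]           # identity and role of the attesting human
    override_and_escalation: Dict[str, Any]  # decision, challenges, rationale, escalation
    evidence_not_substitution: bool          # explicit declaration


def certificate_digest(cert: ProofCertificate) -> str:
    """Hash a canonical JSON form so any independent verifier gets the same digest."""
    canonical = json.dumps(asdict(cert), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_recorded_digest(cert: ProofCertificate, recorded_digest: str) -> bool:
    """True if the certificate content is unchanged since the digest was recorded."""
    return certificate_digest(cert) == recorded_digest
```

Sorting keys and fixing separators makes the serialization deterministic, so a verifier holding only the certificate and the recorded digest can repeat the check without access to the system that produced it.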

The same object is designed to serve as the technical documentation EU AI Act Article 11 asks for, the basis for human oversight under Article 14, the transparency artifact under Article 13, and the audit trail under 21 CFR Part 11. None of those requirements mandates a particular substrate. All of them require a particular class of artifact. A confidence score satisfies none of them. A certificate that passes its eight admissibility tests is built to address all of them at once, which is what "audit-ready" actually means when an inspector is in the room.

What this architecture deliberately does not do

The precise claim matters here, because the overstated version would be false and a regulator would be right to distrust it. We do not claim a model that never hallucinates. We do not claim to eliminate drift. And Gate 2 does not prove that the encoding of a regulation into a computable check is faithful to what the regulation meant.

That last one is the hardest question in the whole design, and we lead with it rather than smoothing it over. When a regulation becomes a computable rule, someone had to translate it. Who certifies that the encoding is faithful, and who is accountable when the check is structurally perfect and semantically wrong? A proof can be flawless and still certify an error, because it operates on the encoding, not on the regulation the encoding was meant to capture. What the architecture does is concentrate that risk into one inspectable place instead of diffusing it through a probabilistic system where it cannot be found. Concentration is the contribution. It is not a cure. A standard worth adopting is one whose hardest question was named before it was set, and this is ours.

The full treatment of the certificate, the risk taxonomy, the continuous-learning problem, and the encoding question lives in the Regulatory Validation Framework. For the foundational case on why a confidence score cannot pass inspection, and for the DIA 2026 poster and multi-agency engagement, see our research. This piece is the mechanism at a glance. The framework is the argument in full.

The model proposes; the proof disposes. Proof, not probability. Evidence, not substitution. The human decides.