
Decoding Inclusion/Exclusion Criteria at Scale: Why Protocol Language Is Harder Than It Looks

CTO & Co-Founder, December 16, 2024

[Image: Protocol document with highlighted inclusion and exclusion criteria text]

Inclusion and exclusion criteria are written for a specific audience: FDA reviewers, IRB committees, and regulatory affairs teams who need to understand the study population, assess risk, and evaluate scientific rationale. They are not written for databases. The language is precise in a regulatory sense — carefully constructed to withstand scrutiny in a submission package — but that precision is expressed in natural language with embedded logic, implicit temporal references, and clinical ontology concepts that no SQL query understands natively.

Building a system that reliably parses I/E criteria at scale means confronting that gap directly. This article describes the specific linguistic and structural patterns that make protocol parsing harder than it appears — not to suggest the problem is unsolvable, but to explain why a naive approach (regular expression matching, simple keyword extraction, off-the-shelf NLP pipelines) reliably fails in production trial environments.

The temporal constraint problem

Temporal constraints are among the most common and most underestimated challenges in I/E criteria parsing. A criterion like "eGFR ≥60 mL/min/1.73m² within 90 days prior to screening" contains three distinct data requirements: the lab value identity (eGFR, which maps to LOINC 69405-9 for CKD-EPI or 33914-3 for MDRD, depending on the lab), the threshold value (≥60), and the temporal window (within 90 days of the screening date).

Each requires different handling. The lab value identity requires ontology normalization — different labs use different LOINC codes for eGFR variants, and the clinical equivalence of those codes must be established before the query can be written. The threshold is straightforward once the field is identified, but the comparator must be parsed accurately; the difference between eGFR ≥60 and eGFR >60 is clinically meaningful for patients at exactly 60. The temporal window is the most complex: "within 90 days prior to screening" is a relative reference that can only be evaluated against a specific screening date, which at pre-screening time is either unknown or set to today as a proxy.
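To make the three requirements concrete, here is a minimal Python sketch that splits the eGFR example into lab identity, comparator and threshold, and temporal window, then evaluates it against a screening date. The regular expression, the `LabCriterion` class, and the function names are hypothetical and cover only this one sentence shape; treating the 90-day window as inclusive at both ends is also an assumption.

```python
import operator
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Comparator symbols mapped to functions; >= and > stay distinct on purpose.
COMPARATORS = {
    "≥": operator.ge, ">=": operator.ge, ">": operator.gt,
    "≤": operator.le, "<=": operator.le, "<": operator.lt,
}

# Hypothetical pattern: "<lab> <comparator><value> ... within N days prior to <event>".
PATTERN = re.compile(
    r"(?P<lab>[A-Za-z0-9]+)\s*(?P<cmp>≥|>=|≤|<=|>|<)\s*(?P<value>\d+(?:\.\d+)?)"
    r".*?within\s+(?P<days>\d+)\s+days\s+prior\s+to\s+(?P<ref>\w+)"
)


@dataclass(frozen=True)
class LabCriterion:
    lab: str              # raw label; still needs ontology normalization
    comparator: str
    threshold: float
    window_days: int
    reference_event: str


def parse_lab_criterion(text: str) -> LabCriterion:
    match = PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized criterion shape: {text!r}")
    return LabCriterion(
        lab=match["lab"],
        comparator=match["cmp"],
        threshold=float(match["value"]),
        window_days=int(match["days"]),
        reference_event=match["ref"],
    )


def is_met(c: LabCriterion, value: float, observed_on: date, reference_date: date) -> bool:
    # Assumption: the window includes both the reference date and day N before it.
    age = reference_date - observed_on
    in_window = timedelta(0) <= age <= timedelta(days=c.window_days)
    return in_window and COMPARATORS[c.comparator](value, c.threshold)
```

At pre-screening time, `reference_date` would be today's date standing in for the unknown screening date, so a result near the edge of the window can change once the real date is set.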

Temporal language in protocols is not standardized. A single trial might contain all of these formulations across its I/E criteria: "within the 12 weeks prior to enrollment," "at least 3 months before the baseline visit," "no earlier than 6 months prior to study entry," and "not within 30 days of randomization." These expressions reference different time units, different reference points, and different directions. A parser that recognizes only some of these formulations, instead of normalizing all of them to a common structure (duration, unit, direction, reference event), will fail on a significant fraction of real-world criteria.
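One way to picture that normalization is a small Python sketch that maps the four formulations quoted above onto a single structure. Everything here is illustrative: the `TemporalWindow` class, the pattern table, and the 30-day month are assumptions, and a phrase like "within 30 days of" is treated as looking backward only, which a real protocol may not intend.

```python
import re
from dataclasses import dataclass
from typing import Optional

DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}  # assumption: 30-day month

# (leading phrase, relation). Negated forms come before plain "within".
RELATION_PATTERNS = [
    (r"not within", "outside"),       # event must NOT fall inside the window
    (r"at least", "outside"),         # event must be older than the window
    (r"no earlier than", "within"),   # event must fall inside the window
    (r"within(?: the)?", "within"),
]

QUANTITY = (
    r"\s+(?P<n>\d+)\s+(?P<unit>day|week|month)s?"
    r"\s+(?:prior to|before|of)\s+(?:the\s+)?(?P<ref>.+)"
)


@dataclass(frozen=True)
class TemporalWindow:
    relation: str          # "within" or "outside"
    days: int
    reference_event: str


def normalize(expression: str) -> Optional[TemporalWindow]:
    text = expression.strip().rstrip(".,").lower()
    for prefix, relation in RELATION_PATTERNS:
        match = re.fullmatch(prefix + QUANTITY, text)
        if match:
            days = int(match["n"]) * DAYS_PER_UNIT[match["unit"]]
            return TemporalWindow(relation, days, match["ref"])
    return None  # unrecognized wording: escalate instead of guessing
```

Returning `None` for unrecognized wording is deliberate in this sketch, so that a new formulation shows up as a gap to review and is not silently forced into the nearest pattern.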

Compound and nested exclusions

A simple exclusion — "subjects with active malignancy" — is straightforward to parse. The challenge scales rapidly when criteria contain embedded logical operators, hierarchical conditions, or conditional branches.

Consider a common metabolic disease exclusion type: "History of severe hypoglycemia (defined as an event requiring third-party assistance) within the 6 months prior to screening, OR a history of hypoglycemia unawareness, OR currently using a continuous glucose monitor due to hypoglycemia risk." This is one criterion with three independent grounds for exclusion. The first has an embedded definition and a temporal constraint. A parser must recognize the OR structure, extract the embedded definition, parse the temporal window from the first clause, and distinguish the history-based clauses from the current-state check in the third.
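The decomposition of the hypoglycemia example can be sketched in Python as follows. This is a simplified illustration, assuming top-level operators are written as uppercase "OR" and that embedded definitions follow the "(defined as ...)" form; the `Clause` class and `split_or_clauses` function are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clause:
    text: str
    definition: Optional[str]   # embedded "(defined as ...)" text, if any
    window: Optional[str]       # raw temporal phrase, if any
    tense: str                  # "history", "current", or "unspecified"


def split_or_clauses(criterion: str) -> List[Clause]:
    # Assumption: only uppercase OR separates independent grounds for exclusion.
    parts = re.split(r",?\s+OR\s+", criterion.strip().rstrip("."))
    clauses = []
    for part in parts:
        definition = re.search(r"\(defined as ([^)]*)\)", part)
        window = re.search(
            r"within (?:the )?\d+ (?:day|week|month)s? prior to \w+", part
        )
        if re.search(r"\bcurrently\b", part, re.IGNORECASE):
            tense = "current"
        elif re.search(r"\bhistory\b", part, re.IGNORECASE):
            tense = "history"
        else:
            tense = "unspecified"
        clauses.append(
            Clause(
                text=part,
                definition=definition.group(1) if definition else None,
                window=window.group(0) if window else None,
                tense=tense,
            )
        )
    return clauses
```

Each resulting clause can then be evaluated on its own, with the criterion excluding a patient if any one clause matches.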

Nesting goes deeper when criteria reference prior treatment with exception clauses: "Prior use of any GLP-1 receptor agonist within 3 months prior to screening, UNLESS the subject was on stable therapy for at least 6 months AND the dose was unchanged for 90 days prior to screening." This contains a primary exclusion, a drug class (mapped to RxNorm hierarchies), a temporal constraint, and an exception with two sub-conditions with separate time windows. A system that misses the exception branch will over-exclude patients who are appropriately stable on therapy.
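The effect of missing the exception branch is easy to show with a small Python sketch. The `Exclusion` class and the patient field names below are hypothetical, the patient record is reduced to day counts, and months are approximated as 30 days.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Patient = Dict[str, float]  # hypothetical record: field name -> number of days
Predicate = Callable[[Patient], bool]

NEVER = float("inf")  # stands in for "no such event on record"


@dataclass
class Exclusion:
    trigger: Predicate
    # All exception conditions must hold (AND) for the exception to apply.
    exception_conditions: List[Predicate] = field(default_factory=list)

    def excludes(self, patient: Patient) -> bool:
        if not self.trigger(patient):
            return False
        exception_holds = bool(self.exception_conditions) and all(
            condition(patient) for condition in self.exception_conditions
        )
        return not exception_holds


# Assumption: 3 months ~ 90 days, 6 months ~ 180 days.
glp1_rule = Exclusion(
    trigger=lambda p: p.get("days_since_last_glp1_use", NEVER) <= 90,
    exception_conditions=[
        lambda p: p.get("days_on_stable_glp1_therapy", 0) >= 180,
        lambda p: p.get("days_since_glp1_dose_change", 0) >= 90,
    ],
)

# A parser that dropped the UNLESS branch would produce this rule instead.
naive_rule = Exclusion(trigger=glp1_rule.trigger)

stable_patient = {
    "days_since_last_glp1_use": 0,
    "days_on_stable_glp1_therapy": 400,
    "days_since_glp1_dose_change": 200,
}
print(glp1_rule.excludes(stable_patient))   # exception applies
print(naive_rule.excludes(stable_patient))  # over-excludes
```

Keeping the exception as a separate list, instead of folding it into the trigger, also leaves room to record a confidence level for the exception branch independently of the primary exclusion.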

Ontology normalization: protocol language to EHR fields

I/E criteria do not use a single controlled vocabulary. A protocol might reference "type 2 diabetes mellitus," "T2DM," or "non-insulin-dependent diabetes" interchangeably. The EHR stores the condition as an ICD-10 code — one of the E11.x subcategories. Between protocol language and EHR data lies a normalization step: natural language to SNOMED CT concept, SNOMED CT to ICD-10 code set, and in some cases ICD-10 to a family of related codes representing the same clinical concept.
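The chain from protocol wording to EHR codes can be sketched in Python with two lookup tables. These hand-built dictionaries are purely illustrative and stand in for a real terminology service; the function names are hypothetical, and matching on the "E11" prefix is an assumption about how the code family would be represented.

```python
from typing import Iterable, Tuple

# Step 1: protocol wording -> SNOMED CT concept (illustrative entries only).
SYNONYM_TO_SNOMED = {
    "type 2 diabetes mellitus": "44054006",
    "t2dm": "44054006",
    "non-insulin-dependent diabetes": "44054006",
}

# Step 2: SNOMED CT concept -> family of ICD-10 codes, expressed as prefixes.
SNOMED_TO_ICD10_PREFIXES = {
    "44054006": ("E11",),
}


def icd10_prefixes(term: str) -> Tuple[str, ...]:
    concept = SYNONYM_TO_SNOMED.get(term.strip().lower())
    if concept is None:
        # Unknown wording should be escalated, not silently skipped.
        raise KeyError(f"no mapping for protocol term: {term!r}")
    return SNOMED_TO_ICD10_PREFIXES[concept]


def has_condition(term: str, patient_codes: Iterable[str]) -> bool:
    prefixes = icd10_prefixes(term)
    return any(code.upper().startswith(prefixes) for code in patient_codes)
```

For example, `has_condition("T2DM", ["I10", "E11.65"])` matches through the E11 family, while a chart containing only `E10.9` does not.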

Medication criteria add further complexity. A protocol that excludes "strong CYP3A4 inhibitors" is using a pharmacological classification that does not map directly to any RxNorm concept. The parser must maintain a reference list of CYP3A4 inhibitor classifications and match a patient's medication list against that classification — including generic name variants, brand name variants, and combination products containing a classified compound.
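A minimal Python sketch of that matching step follows. The inhibitor set and brand table are deliberately tiny and incomplete, included only to show generic, brand, and combination handling; they are not a clinical reference, and the function names are hypothetical. Dose strings such as "500 mg" are not handled here.

```python
import re
from typing import Iterable, List, Set

# Illustrative subset only; a production list would be curated and versioned.
STRONG_CYP3A4_INHIBITORS = {
    "ketoconazole", "itraconazole", "clarithromycin", "ritonavir",
}

# Brand name -> ingredients, including a combination product.
BRAND_TO_INGREDIENTS = {
    "biaxin": ("clarithromycin",),
    "sporanox": ("itraconazole",),
    "kaletra": ("lopinavir", "ritonavir"),
}


def ingredients(medication: str) -> Set[str]:
    name = medication.strip().lower()
    if name in BRAND_TO_INGREDIENTS:
        return set(BRAND_TO_INGREDIENTS[name])
    # Generic combinations are assumed to be written as "a/b" or "a + b".
    return {part.strip() for part in re.split(r"[/+]", name) if part.strip()}


def flag_strong_cyp3a4_inhibitors(medications: Iterable[str]) -> List[str]:
    # Keep any medication with at least one classified ingredient.
    return [m for m in medications if ingredients(m) & STRONG_CYP3A4_INHIBITORS]
```

Resolving every entry to its ingredients first means the classification list only has to be maintained at the ingredient level, with brand and combination variants handled by the lookup table.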

Ambiguous criteria and the silent guess problem

Some criteria contain ambiguity that cannot be resolved without sponsor clarification: "no recent history of cardiovascular disease," "clinically significant renal impairment," "adequate hepatic function." Each implies a threshold or time window a regulatory reviewer interprets in context, but a database query cannot execute without specific values.

The risk in automated parsing is the silent guess: a system resolves the ambiguity with a default assumption without surfacing it to the coordinator. If the assumption is wrong, patients may be incorrectly included or excluded, and the coordinator may not realize a judgment call was made. The correct behavior for ambiguous criteria is to flag them explicitly: surface the criterion with the parser's proposed interpretation and require coordinator confirmation before applying it to the scoring run. This adds friction, but it is the friction that Good Clinical Practice (GCP) environments require.
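The flag-and-confirm behavior can be sketched in Python as a gate in front of the scoring run. The vague-term list, the `Interpretation` class, and the proposed six-month rule in the usage example are all hypothetical; a real system would detect ambiguity with far more than substring checks.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Illustrative markers taken from the example criteria; not an exhaustive list.
VAGUE_TERMS = ("recent", "clinically significant", "adequate")


@dataclass
class Interpretation:
    criterion: str
    proposed_rule: str                  # the parser's suggestion, never applied silently
    ambiguous_terms: Tuple[str, ...]
    confirmed_by: Optional[str] = None  # coordinator identifier once reviewed

    @property
    def ready_for_scoring(self) -> bool:
        return not self.ambiguous_terms or self.confirmed_by is not None


def interpret(criterion: str, proposed_rule: str) -> Interpretation:
    lowered = criterion.lower()
    found = tuple(term for term in VAGUE_TERMS if term in lowered)
    return Interpretation(criterion, proposed_rule, found)


def pending_review(items: List[Interpretation]) -> List[Interpretation]:
    # Anything returned here must be confirmed before the scoring run starts.
    return [item for item in items if not item.ready_for_scoring]
```

With this shape, "no recent history of cardiovascular disease" paired with a proposed rule such as "no cardiovascular diagnosis within 6 months prior to screening" stays in the pending list until `confirmed_by` is set, while an unambiguous criterion passes straight through.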

What good parsing output looks like

A well-parsed protocol produces a structured representation of each criterion containing: criterion type (inclusion or exclusion), the clinical concept with ontology mappings, comparator and threshold where applicable, temporal window and reference point, exception clauses, the FHIR resource type and element required to evaluate the criterion, and a confidence level for each parsed component.

That last element — confidence per component — enables appropriate escalation. A temporal constraint from unambiguous language carries high confidence. One inferred from vague language should carry low confidence and trigger a coordinator review flag. The system should be explicit about what it knows, what it inferred, and what it guessed — because in a GCP environment, each category has different downstream accountability.
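As a rough illustration of the structured representation described above, the Python sketch below models one parsed criterion with per-component confidence. The field names, the numeric confidence scale, and the 0.8 review cutoff are assumptions; the LOINC codes are the ones cited earlier in this article, and `Observation` with `valueQuantity` is one plausible FHIR target for a lab result.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REVIEW_THRESHOLD = 0.8  # hypothetical cutoff for coordinator review


@dataclass
class ParsedCriterion:
    criterion_type: str                    # "inclusion" or "exclusion"
    concept: str
    ontology_codes: Dict[str, List[str]]   # code system -> codes
    comparator: Optional[str]
    threshold: Optional[float]
    window_days: Optional[int]
    reference_event: Optional[str]
    exceptions: List[str]
    fhir_resource: str
    fhir_element: str
    confidence: Dict[str, float]           # component name -> score in [0, 1]

    def components_needing_review(self) -> List[str]:
        # Any component below the cutoff triggers a coordinator review flag.
        return sorted(
            name for name, score in self.confidence.items()
            if score < REVIEW_THRESHOLD
        )


egfr = ParsedCriterion(
    criterion_type="inclusion",
    concept="eGFR",
    ontology_codes={"LOINC": ["69405-9", "33914-3"]},
    comparator=">=",
    threshold=60.0,
    window_days=90,
    reference_event="screening",
    exceptions=[],
    fhir_resource="Observation",
    fhir_element="valueQuantity",
    confidence={"concept": 0.95, "comparator": 0.99, "temporal_window": 0.97},
)
print(egfr.components_needing_review())  # nothing below the cutoff
```

A criterion parsed from "adequate hepatic function" would carry the same fields but a low score on `threshold`, so the review list names exactly the component that was inferred instead of marking the whole criterion as uncertain.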