Decoding Inclusion/Exclusion Criteria at Scale: Why Protocol Language Is Harder Than It Looks

CTO & Co-Founder, December 16, 2024

Inclusion and exclusion criteria are written for a specific audience: FDA reviewers, IRB committees, and regulatory affairs teams who need to understand the study population, assess risk, and evaluate scientific rationale. They are not written for databases. The language is precise in a regulatory sense — carefully constructed to withstand scrutiny in a submission package — but that precision is expressed in natural language with embedded logic, implicit temporal references, and clinical ontology concepts that no SQL query understands natively.

Building a system that reliably parses I/E criteria at scale means confronting that gap directly. This article describes the specific linguistic and structural patterns that make protocol parsing harder than it appears — not to suggest the problem is unsolvable, but to explain why a naive approach (regular expression matching, simple keyword extraction, off-the-shelf NLP pipelines) reliably fails in production trial environments.

The temporal constraint problem

Temporal constraints are among the most common and most underestimated challenges in I/E criteria parsing. A criterion like "eGFR ≥60 mL/min/1.73m² within 90 days prior to screening" contains three distinct data requirements: the lab value identity (eGFR, which maps to LOINC 69405-9 for CKD-EPI or 33914-3 for MDRD, depending on the lab), the threshold value (≥60), and the temporal window (within 90 days of the screening date).

Each of these requires a different handling strategy. The lab value identity requires ontology normalization — different labs use different LOINC codes for eGFR variants, and the clinical equivalence of those codes must be established before the query can be written. The threshold is straightforward once the field is identified, but the comparator (≥ versus > versus ≤) must be parsed accurately, as the difference between eGFR ≥60 and eGFR >60 is clinically meaningful for patients at exactly 60. The temporal window is the most complex: "within 90 days prior to screening" is a relative reference that can only be evaluated against a specific screening date, which at pre-screening time is either unknown or set to today as a proxy.
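The three requirements can be made concrete with a small sketch. It is illustrative only: the class and function names are hypothetical, the two LOINC codes are the ones cited above and are treated as equivalent purely for the example, and using the most recent in-window result is an assumed policy, not a rule taken from any protocol.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional
import operator

# Comparator symbols mapped to functions; ">=" and ">" stay distinct on purpose.
COMPARATORS = {"≥": operator.ge, ">": operator.gt, "≤": operator.le, "<": operator.lt}

@dataclass(frozen=True)
class LabCriterion:
    loinc_codes: frozenset  # codes accepted as clinically equivalent (assumption)
    comparator: str
    threshold: float
    window_days: int

@dataclass(frozen=True)
class LabResult:
    loinc: str
    value: float
    collected: date

def evaluate(criterion: LabCriterion, results, reference_date: date) -> Optional[bool]:
    """Return True/False from the latest in-window result, or None if there is none."""
    compare = COMPARATORS[criterion.comparator]
    earliest = reference_date - timedelta(days=criterion.window_days)
    in_window = [
        r for r in results
        if r.loinc in criterion.loinc_codes and earliest <= r.collected <= reference_date
    ]
    if not in_window:
        return None  # missing data is reported as unknown, not as a failure
    latest = max(in_window, key=lambda r: r.collected)
    return compare(latest.value, criterion.threshold)

EGFR_CRITERION = LabCriterion(frozenset({"69405-9", "33914-3"}), "≥", 60.0, 90)
screening_proxy = date(2024, 12, 16)  # today used as a stand-in for the screening date
labs = [LabResult("33914-3", 60.0, date(2024, 11, 1))]
print(evaluate(EGFR_CRITERION, labs, screening_proxy))  # True: 60 satisfies ≥60
```

Swapping the comparator to ">" flips the result for this patient, which is the boundary case described above.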

Temporal language in protocols is not standardized. A single trial might contain all of the following formulations across its I/E criteria:

"within the 12 weeks prior to enrollment"
"at least 3 months before the baseline visit"
"no earlier than 6 months prior to study entry"
"documented within the preceding 6 calendar months"
"not within 30 days of randomization"

These expressions refer to different reference points (enrollment, baseline visit, study entry, randomization), different time units (days, weeks, months, calendar months), and different directions: some set an upper bound on elapsed time ("within"), some set a lower bound ("at least 3 months before"), and some exclude a window entirely ("not within"). A parser that handles "within 12 weeks" but has no rule for the structurally different "at least 3 months before" will fail on a significant fraction of criteria across a real trial portfolio.
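One way to picture the target of temporal parsing is a single normalized structure that all five formulations reduce to. The sketch below uses a regular expression only because the five examples are short. It is not a claim that pattern matching is sufficient in general, and the names `TemporalWindow` and `parse_window` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TemporalWindow:
    kind: str              # "within", "at_least", or "not_within"
    amount: int
    unit: str              # "day", "week", "month", "year"
    anchor: Optional[str]  # reference point; None when the text leaves it implicit

# "no earlier than N prior to X" bounds the event to the last N units, i.e. "within".
LEAD_TO_KIND = {
    "within": "within",
    "no earlier than": "within",
    "at least": "at_least",
    "not within": "not_within",
}

PATTERN = re.compile(
    r"(?P<lead>not within|no earlier than|at least|within)\s+"
    r"(?:the\s+)?(?:preceding\s+)?"
    r"(?P<amount>\d+)\s+(?:calendar\s+)?(?P<unit>day|week|month|year)s?"
    r"(?:\s+(?:prior to|before|of)\s+(?:the\s+)?(?P<anchor>[a-z ]+))?",
    re.IGNORECASE,
)

def parse_window(text: str) -> Optional[TemporalWindow]:
    """Return a normalized window, or None so the caller can escalate the criterion."""
    m = PATTERN.search(text)
    if m is None:
        return None
    anchor = m.group("anchor")
    return TemporalWindow(
        kind=LEAD_TO_KIND[m.group("lead").lower()],
        amount=int(m.group("amount")),
        unit=m.group("unit").lower(),
        anchor=anchor.strip().lower() if anchor else None,
    )

for phrase in [
    "within the 12 weeks prior to enrollment",
    "at least 3 months before the baseline visit",
    "no earlier than 6 months prior to study entry",
    "documented within the preceding 6 calendar months",
    "not within 30 days of randomization",
]:
    print(parse_window(phrase))
```

The sketch deliberately keeps the original unit instead of converting to days, because "6 calendar months" and "180 days" are not interchangeable, and it returns `None` for anything it does not recognize so that nothing is silently guessed.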

Compound and nested exclusions

A simple exclusion criterion — "subjects with active malignancy" — is straightforward to parse and map. The challenge scales rapidly when criteria contain embedded logical operators, hierarchical conditions, or conditional branches.

Consider a representative exclusion criterion of the kind found in metabolic disease trials:

"History of severe hypoglycemia (defined as an event requiring third-party assistance) within the 6 months prior to screening, OR a history of hypoglycemia unawareness, OR currently using a continuous glucose monitor due to hypoglycemia risk."

This is one exclusion criterion with three independent grounds for exclusion, the first of which has an embedded definition ("requiring third-party assistance") and a temporal constraint. A parser must recognize the OR structure, identify that the first clause is a defined term (with its definition embedded in parentheses), extract the temporal constraint from that clause, and distinguish the first and second clauses (which reference history) from the third clause (which is a current-state check).
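A minimal sketch of how that single criterion might be held after parsing, assuming a simple tree with one OR node and three clauses. The class names, concept identifiers, and the `findings` dictionary (per-patient booleans resolved upstream, with any temporal window already applied) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clause:
    concept: str                        # hypothetical concept identifier
    state: str                          # "history" or "current"
    definition: Optional[str] = None    # embedded definition, if the text supplies one
    window_months: Optional[int] = None

@dataclass
class AnyOf:
    clauses: list

    def excluded(self, findings: dict) -> bool:
        """True if any clause is satisfied; absent findings count as not satisfied."""
        return any(findings.get(c.concept, False) for c in self.clauses)

HYPOGLYCEMIA_EXCLUSION = AnyOf([
    Clause("severe_hypoglycemia", "history",
           definition="event requiring third-party assistance", window_months=6),
    Clause("hypoglycemia_unawareness", "history"),
    Clause("cgm_for_hypoglycemia_risk", "current"),
])

print(HYPOGLYCEMIA_EXCLUSION.excluded({"hypoglycemia_unawareness": True}))  # True
```

Keeping the definition and the window attached to the first clause only, and not to the criterion as a whole, is what prevents the 6-month window from leaking onto the other two grounds.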

Nesting goes deeper. Criteria that reference prior treatment history often contain sub-conditions:

"Prior use of any GLP-1 receptor agonist, DPP-4 inhibitor, or SGLT-2 inhibitor within 3 months prior to screening, UNLESS the subject was on stable therapy for at least 6 months AND the dose was unchanged for 90 days prior to screening."

This criterion contains a primary exclusion, a list of drug classes (each of which must be mapped to RxNorm concept hierarchies), a temporal constraint, and an exception clause with its own compound conditions. The exception clause contains two sub-conditions joined by AND, with separate temporal windows. A system that does not handle the exception branch correctly will either over-exclude (flagging patients who are appropriately stable on therapy) or under-exclude (allowing patients who changed their dose 60 days ago).
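The exception branch is easiest to check against concrete cases. The following sketch assumes the drug-class mapping has already been done and that months can be approximated as 90 and 180 days. Both are simplifications a sponsor would need to confirm, and the `TherapyHistory` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TherapyHistory:
    days_since_last_use: Optional[int]  # None if no listed drug class was ever used
    days_on_therapy: int = 0
    days_since_dose_change: int = 0

def excluded_for_prior_therapy(h: TherapyHistory) -> bool:
    # Primary exclusion: a listed class used within 3 months (approximated as 90 days).
    used_recently = h.days_since_last_use is not None and h.days_since_last_use <= 90
    if not used_recently:
        return False
    # Exception: stable for at least 6 months (approx. 180 days) AND dose unchanged for 90 days.
    stable = h.days_on_therapy >= 180 and h.days_since_dose_change >= 90
    return not stable

# Long-term user whose dose changed 60 days ago: the exception does not apply.
print(excluded_for_prior_therapy(TherapyHistory(0, 365, 60)))   # True
# Same patient with an unchanged dose for 120 days: the exception applies.
print(excluded_for_prior_therapy(TherapyHistory(0, 365, 120)))  # False
```

Dropping the exception branch turns the second case into an over-exclusion; dropping only the dose-change condition turns the first into an under-exclusion.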

Ontology normalization: from protocol language to EHR fields

I/E criteria do not use a single controlled vocabulary. A protocol might reference a condition as "type 2 diabetes mellitus," "T2DM," or "non-insulin-dependent diabetes" interchangeably within the same document. The EHR stores that condition as an ICD-10 code (one of many E11.x subcategories). Between protocol language and EHR data lies a normalization step that requires mapping clinical concepts through at least three ontology layers: from natural language to SNOMED CT concept, from SNOMED CT to ICD-10 code set, and in some cases from ICD-10 to a family of related codes that represent the same clinical concept.

Consider medication criteria. A protocol that excludes "strong CYP3A4 inhibitors" is using a pharmacological classification that does not map directly to any single RxNorm concept. The parser must maintain a current reference list of CYP3A4 inhibitor classifications (which includes clarithromycin, ketoconazole, itraconazole, ritonavir, and others), and must be able to match a patient's medication list against that classification — including generic name variants, brand name variants, and combination products that contain a classified compound.
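As a rough illustration of the matching step, the sketch below resolves each medication entry to its ingredients before checking the classification list. The inhibitor set contains only the four drugs named above and is not a complete or current reference list, and the brand lookup is a hand-written stand-in for the ingredient relationships a real system would obtain from RxNorm.

```python
import re

# Partial list taken from the examples in the text; a real list must be maintained.
STRONG_CYP3A4_INHIBITORS = {"clarithromycin", "ketoconazole", "itraconazole", "ritonavir"}

# Hypothetical stand-in for a brand-to-ingredient lookup.
BRAND_TO_INGREDIENTS = {
    "biaxin": ["clarithromycin"],
    "paxlovid": ["nirmatrelvir", "ritonavir"],
    "kaletra": ["lopinavir", "ritonavir"],
}

def ingredients(med_name: str) -> list:
    """Resolve a medication entry to ingredient names (brand or generic combination)."""
    name = med_name.strip().lower()
    if name in BRAND_TO_INGREDIENTS:
        return BRAND_TO_INGREDIENTS[name]
    # Generic combination strings such as "lopinavir/ritonavir".
    return [part.strip() for part in re.split(r"[/+]", name)]

def flagged_inhibitors(med_list) -> dict:
    """Map each flagged medication entry to the classified ingredients it contains."""
    flagged = {}
    for med in med_list:
        hits = sorted(set(ingredients(med)) & STRONG_CYP3A4_INHIBITORS)
        if hits:
            flagged[med] = hits
    return flagged

print(flagged_inhibitors(["Paxlovid", "metformin", "Lopinavir/Ritonavir", "clarithromycin"]))
```

Entries that carry dose or form text ("clarithromycin 500 mg tablet") would need normalization before this lookup; the sketch does not attempt it.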

The ontology normalization challenge is compounded by the fact that ICD-10 coding practices vary across institutions, across physicians at the same institution, and even across encounters for the same patient. A patient with T2DM might be coded as E11.9 at one visit and E11.65 at the next, depending on whether complications were documented that day. A parser that queries only for E11.9 will miss patients whose most recent encounter documented a complication. Querying for all E11.x subcategories is the correct approach, but only if the parser knows that the protocol's reference to "T2DM" maps to that full subcategory tree rather than to E11.9 specifically.
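The difference between the two query strategies is small in code and large in effect. This sketch assumes codes are stored with the decimal point, as in the examples above; the function names are hypothetical.

```python
def matches_t2dm_exact(icd10_code: str) -> bool:
    """Naive approach: only the unspecified code."""
    return icd10_code.strip().upper() == "E11.9"

def matches_t2dm_family(icd10_code: str) -> bool:
    """Match the E11 category and every subcategory beneath it."""
    code = icd10_code.strip().upper()
    return code == "E11" or code.startswith("E11.")

latest_encounter_codes = ["I10", "E11.65"]
print(any(matches_t2dm_exact(c) for c in latest_encounter_codes))   # False
print(any(matches_t2dm_family(c) for c in latest_encounter_codes))  # True
```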

Ambiguous criteria and the silent guess problem

Some I/E criteria contain ambiguity that cannot be resolved without sponsor clarification. The classic examples are temporal references without defined windows ("no recent history of cardiovascular disease") and qualitative terms without defined thresholds ("clinically significant renal impairment," "adequate hepatic function"). Each of these phrases implies a threshold or time window that a regulatory reviewer would interpret in context, but a database query cannot execute without specific values.

The risk in automated parsing is the silent guess: a system resolves the ambiguity with a default assumption (recent = 6 months, adequate hepatic function = ALT ≤ 3× ULN) without surfacing that assumption to the coordinator. If the assumption is wrong, patients may be incorrectly included or excluded. More problematically, the coordinator may not realize the assumption was made, because the output looks like a scored candidate list rather than a list of scoring decisions that required human judgment.

The correct behavior for ambiguous criteria is to flag them explicitly: surface the criterion to the coordinator with the parser's proposed interpretation and require a confirmation before applying the criterion to the scoring run. This adds friction to the workflow, but it is the friction that regulatory environments require. An eligibility decision made by an automated system on an ambiguous criterion, without coordinator review, is not a defensible position in a GCP audit.
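A sketch of what "require a confirmation" could look like as a gate in front of the scoring run. The class names, the exception, and the `confirmed_by` field are hypothetical, and the proposed value shown is only the example default mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProposedInterpretation:
    criterion_text: str
    ambiguous_term: str
    proposed_value: str
    confirmed_by: Optional[str] = None  # set only after a coordinator signs off

class UnconfirmedInterpretation(Exception):
    """Raised when a scoring run is attempted with unreviewed assumptions."""

def criteria_ready_for_scoring(items):
    """Return the items unchanged, or refuse if any proposal lacks confirmation."""
    pending = [i.ambiguous_term for i in items if i.confirmed_by is None]
    if pending:
        raise UnconfirmedInterpretation(f"Coordinator review required for: {pending}")
    return items

proposal = ProposedInterpretation(
    criterion_text="no recent history of cardiovascular disease",
    ambiguous_term="recent",
    proposed_value="within 6 months",  # a default the system proposes, not a fact
)
try:
    criteria_ready_for_scoring([proposal])
except UnconfirmedInterpretation as err:
    print(err)
```

Raising an error here, instead of logging a warning and continuing, is the design choice that makes the guess impossible to miss.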

Criteria that reference other criteria

Protocols sometimes define terms in inclusion criteria that are referenced in exclusion criteria, or define categorical labels in a preamble that are invoked throughout the I/E section. Parsing criteria in isolation — as individual sentences extracted from a numbered list — will fail on these cross-references.

A protocol might include a section that defines the study population as "adults with a confirmed diagnosis of treatment-resistant hypertension, defined as SBP ≥140 mmHg despite optimal doses of at least three antihypertensive agents from different classes including a diuretic." That definition is not repeated in each inclusion criterion; later criteria reference "subjects meeting the study population definition" or "as defined above." A parser operating on individual criterion text strings will not resolve those references correctly without access to the full protocol document structure and a cross-reference resolution step.
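A cross-reference resolution step can be sketched as a lookup against definitions collected from the whole document. The two patterns below cover only the phrasings quoted above, and the function and variable names are hypothetical. Positional references such as "as defined above" are reported as unresolved because they cannot be settled without document structure.

```python
import re

# Definitions harvested from the protocol preamble (abbreviated for the example).
DEFINITIONS = {
    "study population": (
        "adults with a confirmed diagnosis of treatment-resistant hypertension, "
        "defined as SBP ≥140 mmHg despite optimal doses of at least three "
        "antihypertensive agents from different classes including a diuretic"
    ),
}

NAMED_REFERENCE = re.compile(r"meeting the (?P<term>[a-z ]+?) definition", re.IGNORECASE)
POSITIONAL_REFERENCE = re.compile(r"as defined (?:above|below)", re.IGNORECASE)

def resolve_references(criterion_text: str, definitions: dict):
    """Return (resolved definitions, references that still need document context)."""
    resolved, unresolved = {}, []
    for m in NAMED_REFERENCE.finditer(criterion_text):
        term = m.group("term").lower()
        if term in definitions:
            resolved[term] = definitions[term]
        else:
            unresolved.append(term)
    positional = POSITIONAL_REFERENCE.search(criterion_text)
    if positional:
        unresolved.append(positional.group(0).lower())
    return resolved, unresolved

print(resolve_references("Subjects meeting the study population definition", DEFINITIONS))
print(resolve_references("Uncontrolled hypertension as defined above", DEFINITIONS))
```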

What good parsing output looks like

A well-parsed protocol produces a structured representation of each criterion that contains, at minimum: the criterion type (inclusion or exclusion), the clinical concept being evaluated (with ontology mappings), the comparator and threshold (where applicable), the temporal window and reference point (where applicable), any exception clauses, the data source required to evaluate the criterion (FHIR resource type and element), and a confidence level for each parsed component.

That last element — confidence level per parsed component — is what enables appropriate escalation. A temporal constraint parsed from unambiguous language ("within 90 days prior to screening") should carry high confidence. A temporal constraint inferred from vague language ("recent history") should carry low confidence and trigger a coordinator review flag. The system should be explicit about what it knows, what it inferred, and what it guessed — because in a GCP environment, each of those categories has different downstream accountability.
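The structured representation and its per-component confidence might look like the following. This is a sketch under stated assumptions: the field names, the 0 to 1 confidence scale, the 0.8 review threshold, and the specific confidence values are all illustrative and are not drawn from any existing system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Component:
    value: object
    confidence: float  # 0.0 to 1.0; the scale is an assumption
    basis: str         # "stated", "inferred", or "guessed"

@dataclass
class ParsedCriterion:
    criterion_type: str                        # "inclusion" or "exclusion"
    concept: Component                         # clinical concept with ontology mapping
    comparator: Optional[Component] = None
    threshold: Optional[Component] = None
    temporal_window: Optional[Component] = None
    exceptions: list = field(default_factory=list)
    data_source: str = ""                      # FHIR resource type and element

    def components(self):
        parts = (self.concept, self.comparator, self.threshold, self.temporal_window)
        return [c for c in parts if c is not None]

    def needs_review(self, minimum: float = 0.8) -> bool:
        """Escalate if any component is low-confidence or was guessed."""
        return any(c.confidence < minimum or c.basis == "guessed" for c in self.components())

egfr = ParsedCriterion(
    criterion_type="inclusion",
    concept=Component("eGFR", 0.95, "stated"),
    comparator=Component("≥", 0.99, "stated"),
    threshold=Component(60, 0.99, "stated"),
    temporal_window=Component("within 90 days prior to screening", 0.95, "stated"),
    data_source="Observation.valueQuantity",
)
cvd = ParsedCriterion(
    criterion_type="exclusion",
    concept=Component("cardiovascular disease", 0.9, "stated"),
    temporal_window=Component("within 6 months", 0.3, "guessed"),  # from "recent history"
    data_source="Condition.code",
)
print(egfr.needs_review(), cvd.needs_review())  # False True
```

Recording the basis separately from the numeric score keeps "known", "inferred", and "guessed" distinguishable even when two components happen to carry the same confidence value.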

Protocol language will not become standardized in the near term. It is written by medical writers optimizing for regulatory communication, not for data interoperability. The parsing challenge is real, and the failure modes of naive approaches are consequential. Building a parser that handles the full range of I/E criteria patterns — temporal constraints, nested logic, ontology normalization, ambiguous references, cross-document references — requires depth that most general-purpose NLP tools do not provide out of the box.

It is also necessary work. The alternative — manual abstraction of every criterion for every trial — scales in exactly the wrong direction as trial volumes increase.

A concrete parsing scenario: oncology I/E criteria complexity

Abstract discussion of parsing challenges is easier to evaluate with a concrete example. Consider the following inclusion criterion from a hypothetical Phase II solid tumor trial involving a PARP inhibitor in BRCA-associated malignancy:

"Histologically or cytologically confirmed advanced or metastatic solid tumor with a documented deleterious or suspected deleterious germline or somatic BRCA1/2 mutation, as determined by a validated assay performed in a CLIA-certified laboratory, with results available within 12 months prior to enrollment."

A parser evaluating this criterion against an EHR population must handle: the OR between "histologically confirmed" and "cytologically confirmed" (different procedure types in the EHR), the compound qualifier "advanced or metastatic" (which maps to stage annotations in the Condition resource but requires understanding that "advanced" is a clinical term without a single SNOMED CT equivalent), the mutation specification ("deleterious or suspected deleterious germline or somatic BRCA1/2 mutation"), the assay qualification ("validated assay" and "CLIA-certified laboratory" — neither of which appears as a structured field in standard FHIR Observation resources), and the temporal constraint ("within 12 months prior to enrollment").

The mutation data alone raises several sub-problems. BRCA1 and BRCA2 are distinct genes with different clinical implications. The distinction between germline and somatic mutations matters for some secondary analyses. The "deleterious or suspected deleterious" qualifier requires the parser to understand variant classifications (ACMG/AMP pathogenicity tiers), which vary by laboratory and are not standardized in a single EHR data element. Genomic test results, when structured at all in clinical EHRs, are typically stored in Observation resources using LOINC codes for genetic variant reporting — but the structure of those reports varies considerably across laboratory reporting systems and EHR implementations.

A parser that extracts "BRCA" as a keyword and queries for any Observation containing BRCA in the text will produce false positives (patients with benign variants, patients with BRCA-related family history flagged but not personally confirmed, patients whose BRCA test was ordered but returned inconclusive) and false negatives (patients whose genomic results were scanned as PDF attachments rather than structured data, or reported under a laboratory's proprietary test code not mapped to a standard LOINC). The specificity required to correctly classify this criterion is beyond keyword matching and requires understanding of genomic data standards — LOINC 81247-9 for genomic variant reporting, HL7 Clinical Genomics Implementation Guide conventions — as well as the clinical nuances of variant interpretation.
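The gap between keyword matching and structured classification shows up even on four toy records. The records below are flattened, hypothetical dictionaries and do not follow real FHIR Observation structure, and treating "deleterious or suspected deleterious" as "pathogenic or likely pathogenic" is an assumption that a sponsor would need to confirm.

```python
# Assumed mapping of the protocol wording onto variant classification labels.
QUALIFYING_CLASSIFICATIONS = {"pathogenic", "likely pathogenic"}

def keyword_match(record: dict) -> bool:
    """Naive approach: any record that mentions BRCA."""
    return "brca" in record["text"].lower()

def structured_match(record: dict) -> bool:
    """Require the gene, the patient as subject, a final result, and a qualifying class."""
    return (
        record.get("gene") in {"BRCA1", "BRCA2"}
        and record.get("subject") == "patient"
        and record.get("status") == "final"
        and (record.get("classification") or "").lower() in QUALIFYING_CLASSIFICATIONS
    )

records = [
    {"text": "BRCA1 variant detected", "gene": "BRCA1",
     "classification": "Pathogenic", "subject": "patient", "status": "final"},
    {"text": "BRCA2 variant, benign", "gene": "BRCA2",
     "classification": "Benign", "subject": "patient", "status": "final"},
    {"text": "Family history: mother BRCA1 positive", "gene": None,
     "classification": None, "subject": "relative", "status": "final"},
    {"text": "BRCA1/2 panel ordered", "gene": None,
     "classification": None, "subject": "patient", "status": "registered"},
]

print(sum(keyword_match(r) for r in records))     # 4
print(sum(structured_match(r) for r in records))  # 1
```

Neither function helps with results that exist only as scanned PDF attachments; those remain false negatives until the data is structured or reviewed manually.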

The spectrum from parseable to unparseable

Not all I/E criteria are equally difficult to parse. Understanding the spectrum helps set appropriate expectations for what automated parsing can and cannot deliver in production environments.

At the parseable end: objective lab value criteria with clear LOINC-mapped tests, defined thresholds, and standard temporal windows. "eGFR ≥60 mL/min/1.73m² within 90 days" has four well-defined components (test, comparator, threshold, and temporal window), all representable in standard FHIR fields, with low ambiguity. A high-quality parser should achieve near-complete accuracy on criteria of this type across a broad trial portfolio.

In the middle of the spectrum: compound medication criteria, disease stage qualifiers, and criteria referencing clinical assessments that are documented variably across sites. "ECOG performance status 0 or 1" is a structured field in some EHR implementations and a free-text note in others. "Adequate bone marrow function, defined as ANC ≥1.5 × 10⁹/L, platelets ≥100 × 10⁹/L, and hemoglobin ≥9 g/dL within 28 days of first dose" is parseable in structure but requires lab value ontology mapping for the specific LOINC codes used at each site for ANC and platelet reporting, which can vary by analyzer. These criteria require the parser to be LOINC-aware at the site level, not just at the abstract level.
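Site-level awareness can be sketched as a per-site code map consulted before any threshold is applied. The site names and the assignment of codes to sites are hypothetical, and the LOINC codes shown should be verified against each site's own lab catalogue before use. The function assumes results have already been filtered to the 28-day window and share the units used in the criterion.

```python
from typing import Optional

# Hypothetical per-site mapping from analyte to the LOINC codes that site reports.
SITE_LOINC = {
    "site_a": {"anc": {"751-8"}, "platelets": {"777-3"}, "hemoglobin": {"718-7"}},
    "site_b": {"anc": {"26499-4"}, "platelets": {"26515-7"}, "hemoglobin": {"718-7"}},
}

# Minimums from the criterion: ANC and platelets in 10^9/L, hemoglobin in g/dL.
THRESHOLDS = {"anc": 1.5, "platelets": 100.0, "hemoglobin": 9.0}

def adequate_bone_marrow(site: str, results: dict) -> Optional[bool]:
    """results maps LOINC code to value. Returns None if an analyte has no mapped result."""
    mapping = SITE_LOINC[site]
    missing = False
    for analyte, minimum in THRESHOLDS.items():
        values = [v for code, v in results.items() if code in mapping[analyte]]
        if not values:
            missing = True
            continue
        if min(values) < minimum:  # conservative: the lowest mapped value must pass
            return False
    return None if missing else True

print(adequate_bone_marrow("site_a", {"751-8": 1.5, "777-3": 150, "718-7": 10.2}))
# The same ANC code at a site that reports under a different code is not found:
print(adequate_bone_marrow("site_b", {"751-8": 2.0, "26515-7": 150, "718-7": 10.2}))
```

Returning `None` when an analyte cannot be found under the site's mapped codes keeps a mapping gap from being mistaken for a failed criterion.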

At the difficult end: criteria requiring clinical judgment, criteria referencing external assessments (prior therapies at outside institutions, imaging findings interpreted by the treating physician), and criteria with embedded definitions that change the meaning of the primary clause. "Investigator-assessed adequate hepatic function" is not parseable without knowing the investigator's interpretation, which is not a structured EHR field. These criteria should be flagged for coordinator review rather than scored automatically — the appropriate output is not a parsed criterion but a structured escalation flag.

A realistic parser for clinical trial I/E criteria should report its own confidence per criterion, not treat all criteria as equally parseable. The output to the coordinator should distinguish between high-confidence parsed criteria (objective lab values, confirmed medication history) and criteria that require human review — whether because the language is ambiguous, the required data is not available in structured form, or the clinical concept requires judgment that falls outside the scope of automated mapping. That transparency is what makes an automated parsing system appropriate for use in a GCP-regulated environment.

Building that transparency in requires more engineering than building a parser that silently makes its best guess on every criterion. It is also the difference between a tool that coordinators trust and one that produces surprises at monitoring visits.