EHR Data Quality and the Enrollment Funnel: What Incomplete Records Cost You

Rebecca Nwosu, January 27, 2025

[Figure: EHR data completeness analysis showing lab value gaps]

One of the underappreciated obstacles in clinical trial enrollment is not that eligible patients don't exist — they usually do — but that their EHR records are incomplete in ways that make automated identification fail silently. A patient with T2DM, an HbA1c that qualified six months ago, and no contraindicated medications may never appear in a pre-screening shortlist because their most recent eGFR is missing from the record. The system finds a data gap where it needs a confirmed value and flags the patient as "data incomplete" rather than "potentially eligible."

This is different from a patient being ineligible. A patient with a missing lab value has unknown eligibility, not confirmed ineligibility. But in a pre-screening workflow built on EHR data, those two states are often treated identically — the patient drops out of the candidate pool and never gets a coordinator call that would have prompted a qualifying lab order.

Understanding how EHR data quality affects the enrollment funnel — specifically where the gaps occur, why they occur, and what they cost in terms of missed candidates — is necessary for any site that wants to use structured pre-screening tools effectively.

The four categories of data gaps that affect pre-screening

Not all data gaps affect pre-screening equally. Some missing fields are easy to work around; others eliminate a patient from consideration entirely. The four most consequential categories are missing lab values, stale lab values, incomplete medication lists, and inconsistent diagnosis coding.

Missing lab values are the most common source of false-negative pre-screening. A patient's EHR contains Observation records for labs drawn during clinical encounters, but those encounters are not evenly distributed. Patients seen only in the emergency department may have complete metabolic panels for acute episodes but no outpatient HbA1c records. Patients transferred from another institution may have a rich history at the prior site that never migrated to the current EHR. Patients who had relevant labs ordered by a specialist at an affiliated practice may have those results in a different system not queried by the FHIR API.

The specific labs most commonly missing or unusable in endocrine and cardiometabolic trials are eGFR (frequently absent for patients who have not had recent kidney function panels), HbA1c (often present but older than 90 days, which is stale by most protocol standards), fasting lipid panels (often only drawn annually), and liver function tests (drawn reactively rather than routinely). Each of these is a common element of inclusion/exclusion (I/E) criteria. A protocol with eGFR and HbA1c inclusion thresholds will fail to confirm eligibility for a meaningful fraction of patients whose most recent values are missing from the accessible record or too old to count.

Stale lab values are a subtler problem than missing values, because the record appears complete. A patient has an eGFR of 67 on file — above the 60 threshold. But that result was drawn 14 months ago, outside the protocol's 90-day window. The pre-screening system correctly identifies the value as outside the temporal window and marks the criterion as unconfirmed. The patient drops to a lower confidence score. Whether that patient should be contacted with a request for a current lab draw depends on whether their clinical trajectory suggests the value is likely to remain qualifying — a judgment the automated system cannot make.
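The distinction between missing, stale, failing, and confirmed values can be made explicit in code. The following is a minimal sketch, not a description of any particular pre-screening product: the `CriterionStatus` enum, the `LabResult` record, and the `evaluate_min_threshold` function are hypothetical names introduced for illustration, and the 60 threshold and 90-day window are taken from the example above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class CriterionStatus(Enum):
    CONFIRMED = "confirmed"  # in-window value meets the threshold
    NOT_MET = "not_met"      # in-window value fails the threshold
    STALE = "stale"          # a value exists but falls outside the window
    MISSING = "missing"      # no value on record at all


@dataclass
class LabResult:
    value: float
    drawn_on: date


def evaluate_min_threshold(result: Optional[LabResult], minimum: float,
                           window_days: int, as_of: date) -> CriterionStatus:
    """Classify one 'value must be >= minimum' criterion into four states."""
    if result is None:
        return CriterionStatus.MISSING
    if as_of - result.drawn_on > timedelta(days=window_days):
        # Report staleness explicitly instead of folding it into MISSING.
        return CriterionStatus.STALE
    if result.value >= minimum:
        return CriterionStatus.CONFIRMED
    return CriterionStatus.NOT_MET


# The eGFR of 67 from the example, drawn 14 months before the evaluation date.
old_egfr = LabResult(value=67, drawn_on=date(2023, 11, 27))
print(evaluate_min_threshold(old_egfr, minimum=60, window_days=90,
                             as_of=date(2025, 1, 27)))
```

Keeping STALE separate from MISSING lets the coordinator see the last known value and its date, which is the information needed to decide whether a repeat lab draw is worth ordering.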

Stale values are particularly problematic for patients with progressive conditions. A patient with CKD whose eGFR was 65 fourteen months ago may now be below 60. Contacting that patient as a high-probability candidate, only to discover a disqualifying lab value at the screening visit, produces a screen failure that could have been avoided if the staleness had been flagged explicitly at the pre-screening stage.

Incomplete medication lists affect exclusion criteria more than inclusion criteria. A coordinator reviewing a medication list for a prior GLP-1 receptor agonist is looking at MedicationStatement and MedicationRequest records in the EHR. Those records reflect what was prescribed and documented within the current EHR. Medications prescribed by out-of-network providers, filled at pharmacies that don't reconcile to the EHR, or simply not documented at recent encounters may be absent from the record.

The consequences run in both directions. A patient who was on a GLP-1 agonist 8 months ago but whose prescription was never entered in the EHR will appear to be medication-eligible when they are not. Conversely, a patient whose prior exclusionary medication was discontinued 18 months ago may have an old MedicationStatement record with no discontinuation date, causing the system to flag them as potentially excluded when the exclusion period has passed. Medication list completeness and currency are among the least reliable elements of EHR data for pre-screening purposes.
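Both failure directions can be represented if the medication check returns more than a yes/no answer. This is a rough sketch under stated assumptions: `MedRecord` is a hypothetical simplified record (not a FHIR resource), the drug class labels are illustrative strings, and the four outcome labels are invented for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional


@dataclass
class MedRecord:
    drug_class: str
    start: date
    end: Optional[date]  # None means no discontinuation date was documented


def washout_status(records: List[MedRecord], excluded_classes: Iterable[str],
                   washout_days: int, as_of: date) -> str:
    """Return 'excluded', 'needs_review', 'washout_elapsed', or 'no_record_found'."""
    excluded = set(excluded_classes)
    cutoff = as_of - timedelta(days=washout_days)
    relevant = [r for r in records if r.drug_class in excluded]
    if not relevant:
        # Absence of a record is not confirmed non-use: the drug may have
        # been prescribed elsewhere and never entered in this EHR.
        return "no_record_found"
    for r in relevant:
        started_in_window = r.start >= cutoff
        ended_in_window = r.end is not None and r.end >= cutoff
        if started_in_window or ended_in_window:
            return "excluded"
    if any(r.end is None for r in relevant):
        # Old record with no end date: could be ongoing, could be an
        # undocumented discontinuation. A coordinator has to check.
        return "needs_review"
    return "washout_elapsed"


as_of = date(2025, 1, 27)
open_ended = [MedRecord("GLP-1 RA", date(2023, 1, 1), None)]
print(washout_status(open_ended, {"GLP-1 RA"}, washout_days=183, as_of=as_of))
```

The point of the design is that "no_record_found" and "needs_review" are surfaced as their own states, so neither one is silently read as eligible or as excluded.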

Inconsistent diagnosis coding is partly a matter of ICD-10 subcategory choice (which E11.x code a physician selected, for example), but the scope of the problem extends beyond that. The same condition may be documented as a confirmed diagnosis (an ICD-10-coded condition with clinical status "active"), a problem list entry (which may or may not be maintained as the patient's clinical picture evolves), a history item (an ICD-10-coded condition with clinical status "resolved"), or a chief complaint (not yet coded). A pre-screening system that queries only for active ICD-10 conditions will miss patients whose diagnosis sits on an outdated problem list entry marked as resolved, or whose diagnosis appears only in free-text encounter notes that were never formally coded.
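One way to avoid silently dropping non-active entries is to sort them into a review bucket. The sketch below assumes FHIR R4 Condition resources already parsed into Python dicts, with ICD-10-CM codings under `code` and the standard clinical status codes under `clinicalStatus`. The function name and the three return labels are hypothetical, and diagnoses that exist only in free-text notes would still come back as "none".

```python
from typing import Dict, List, Optional

ICD10CM = "http://hl7.org/fhir/sid/icd-10-cm"
# Clinical status codes that indicate a currently active problem.
ACTIVE_STATUSES = {"active", "recurrence", "relapse"}


def _codes(concept: Dict, system: Optional[str] = None) -> List[str]:
    """Collect coding.code values, optionally restricted to one code system."""
    return [c.get("code", "") for c in concept.get("coding", [])
            if system is None or c.get("system") == system]


def diagnosis_evidence(conditions: List[Dict], icd_prefix: str = "E11") -> str:
    """Return 'active', 'review', or 'none' for one patient's Condition list."""
    found_non_active = False
    for cond in conditions:
        icd_codes = _codes(cond.get("code", {}), ICD10CM)
        if not any(code.startswith(icd_prefix) for code in icd_codes):
            continue
        statuses = _codes(cond.get("clinicalStatus", {}))
        if any(s in ACTIVE_STATUSES for s in statuses):
            return "active"
        # Resolved, inactive, remission, or missing status: keep for review.
        found_non_active = True
    return "review" if found_non_active else "none"


resolved_only = [{
    "code": {"coding": [{"system": ICD10CM, "code": "E11.9"}]},
    "clinicalStatus": {"coding": [{"code": "resolved"}]},
}]
print(diagnosis_evidence(resolved_only))
```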

How data gaps compound through the enrollment funnel

The enrollment funnel runs from total EHR panel to consented subjects. Broadly: total panel → diagnosis-positive subpopulation → I/E evaluated candidates → pre-screened contacts → screen visit completed → consented. At each stage, patients drop out.

Data quality affects the funnel at the I/E evaluation stage, and the effect compounds. Consider a trial with five inclusion criteria and three exclusion criteria, each requiring a confirmed EHR data element. If each criterion has an 85% data completeness rate (15% of patients are missing the relevant field) and gaps occur independently across criteria (a simplifying assumption), then the probability of having all eight criteria confirmed from EHR data alone is 0.85⁸, or roughly 27%. Nearly three-quarters of potentially eligible patients will have at least one unconfirmed criterion.
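The compounding arithmetic is easy to reproduce for a site's own completeness rates. This is a small sketch that assumes gaps are independent across criteria, which is a simplification; `prob_all_confirmed` is a hypothetical helper name.

```python
import math
from typing import Sequence


def prob_all_confirmed(completeness_rates: Sequence[float]) -> float:
    """Probability that every criterion has a confirmed value,
    assuming data gaps are independent across criteria."""
    return math.prod(completeness_rates)


# Eight criteria, each 85% complete, as in the example above.
p = prob_all_confirmed([0.85] * 8)
print(f"all eight confirmed: {p:.1%}")           # 27.2%
print(f"at least one unconfirmed: {1 - p:.1%}")  # 72.8%
```

Passing a list with different rates per criterion (for example a lower rate for eGFR than for HbA1c) shows which single field contributes most to the drop.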

That does not mean three-quarters of patients are ineligible. It means three-quarters of patients arrive at the coordinator's queue in an "incomplete data" state rather than a "confirmed eligible" state. A well-designed pre-screening system surfaces these patients with a data completeness signal — here are the specific fields missing, here is the last available value and its date — rather than dropping them from the list entirely. The coordinator can then make an informed decision: is it worth ordering a qualifying lab for this patient, or do the other confirmed criteria suggest a lower-probability candidate?

Sites that discard incomplete-data patients from their pre-screening pools are eliminating a substantial fraction of their potentially eligible population. The appropriate response to a missing eGFR is not to exclude the patient — it is to flag the gap, assess the clinical context, and make a decision about whether to pursue a confirming lab draw.

Specific data patterns worth auditing before trial activation

Site teams that run a data quality audit against their EHR population before the site initiation visit (SIV), or in the early post-SIV period, can significantly improve the efficiency of their pre-screening process. The audit does not need to be exhaustive. Targeting the three or four most consequential I/E criteria for a given protocol will identify the largest data gaps.

For a diabetes/metabolic disease trial, the key fields to audit are: HbA1c completeness and recency (how many patients in the T2DM population have an HbA1c result within the past 90 days), eGFR completeness (how many patients have any eGFR result, and how many have one within 90 days), BMI documentation (height and weight recorded as separate Observation elements, required for BMI calculation through FHIR), and insulin prescription history (MedicationRequest records for insulin products, which feed into common exclusion criteria around prior insulin use).

A quick analysis of these four fields against the T2DM population will typically reveal that a meaningful fraction of patients — often 20–35% — are missing at least one of these fields within the relevant time window. That fraction represents patients who will arrive at the pre-screening queue with data gaps, and knowing the shape of those gaps before enrollment begins allows coordinators to plan accordingly: initiate lab orders for high-priority candidates, adjust outreach workflow to accommodate incomplete-data patients, or communicate with the sponsor about data gap prevalence that may affect enrollment projections.
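A field-level audit of this kind can be sketched in a few lines once the last recorded date per field has been extracted for each patient. Everything here is illustrative: the `audit_gaps` function, the field names, and the three sample patients are made up, and a window of `None` means any result on record counts.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Tuple


def audit_gaps(patients: Dict[str, Dict[str, Optional[date]]],
               windows: Dict[str, Optional[int]],
               as_of: date) -> Tuple[Dict[str, int], float]:
    """Count gaps per field and the share of patients with at least one gap.

    patients: patient id -> {field name -> date of last result, or None}
    windows:  field name -> lookback in days (None = any date is acceptable)
    """
    field_gaps = {field: 0 for field in windows}
    with_any_gap = 0
    for last_seen in patients.values():
        has_gap = False
        for field, window_days in windows.items():
            when = last_seen.get(field)
            in_window = when is not None and (
                window_days is None
                or as_of - when <= timedelta(days=window_days))
            if not in_window:
                field_gaps[field] += 1
                has_gap = True
        if has_gap:
            with_any_gap += 1
    share = with_any_gap / len(patients) if patients else 0.0
    return field_gaps, share


# Fabricated sample data for illustration only.
sample = {
    "p1": {"hba1c": date(2025, 1, 2), "egfr": date(2024, 12, 15)},
    "p2": {"hba1c": date(2024, 6, 1), "egfr": None},
    "p3": {"hba1c": date(2025, 1, 10), "egfr": date(2023, 11, 27)},
}
gaps, share = audit_gaps(sample, {"hba1c": 90, "egfr": 90}, date(2025, 1, 27))
print(gaps, f"{share:.0%}")
```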

Working with data gaps rather than around them

The instinct to design pre-screening workflows that only surface fully-confirmed candidates is understandable — coordinators are busy, and every phone call that results in an ineligible patient is time spent on a screen failure. But a pre-screening tool that only shows confirmed candidates at the cost of hiding incomplete-data patients is systematically excluding a large portion of the potentially eligible population.

The more useful design is a tiered pre-screening queue: confirmed-data candidates at the top (all I/E criteria confirmed from EHR data, high eligibility confidence score), incomplete-data candidates in the middle with explicit field-level gap indicators (criteria confirmed except for eGFR, last value 14 months ago), and low-probability candidates at the bottom. Coordinators can then work through the queue in priority order, making explicit decisions about which incomplete-data patients merit follow-up rather than having those decisions made implicitly by a data completeness filter.
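A tiered queue along these lines might be sketched as follows. The tier boundaries are assumptions made for the example, not something the workflow above prescribes: tier 1 has no unconfirmed criteria, tier 2 has at most `max_gaps`, tier 3 has more, and patients with a confirmed failing criterion are left out of the queue. `Candidate` and `build_queue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

CONFIRMED, UNCONFIRMED, NOT_MET = "confirmed", "unconfirmed", "not_met"


@dataclass
class Candidate:
    patient_id: str
    criteria: Dict[str, str]  # criterion name -> one of the three states above

    @property
    def gaps(self) -> List[str]:
        """Names of criteria that could not be confirmed from EHR data."""
        return sorted(k for k, v in self.criteria.items() if v == UNCONFIRMED)


def build_queue(candidates: List[Candidate],
                max_gaps: int = 2) -> List[Tuple[int, Candidate]]:
    """Return (tier, candidate) pairs ordered for coordinator review."""
    queue = []
    for c in candidates:
        if NOT_MET in c.criteria.values():
            continue  # confirmed ineligible, so not a pre-screening target
        n = len(c.gaps)
        tier = 1 if n == 0 else 2 if n <= max_gaps else 3
        queue.append((tier, c))
    # Within a tier, fewer gaps first, then a stable id ordering.
    queue.sort(key=lambda item: (item[0], len(item[1].gaps), item[1].patient_id))
    return queue


pool = [
    Candidate("a", {"hba1c": CONFIRMED, "egfr": UNCONFIRMED, "meds": CONFIRMED}),
    Candidate("b", {"hba1c": CONFIRMED, "egfr": CONFIRMED, "meds": CONFIRMED}),
    Candidate("c", {"hba1c": NOT_MET, "egfr": CONFIRMED, "meds": CONFIRMED}),
    Candidate("d", {"hba1c": UNCONFIRMED, "egfr": UNCONFIRMED, "meds": UNCONFIRMED}),
]
for tier, cand in build_queue(pool):
    print(tier, cand.patient_id, cand.gaps)
```

Because each entry carries its own `gaps` list, the coordinator sees which field is missing rather than just a lower score.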

EHR data quality is not going to improve dramatically in the near term — the structural factors driving incompleteness (fragmented records, inconsistent coding practices, delayed data migration) are systemic and slow to change. The enrollment workflows built on top of EHR data need to account for that reality from the outset, not treat data completeness as a prerequisite that sites need to solve before automated pre-screening can work.

A concrete scenario: cardiometabolic trial, community endocrinology practice

Consider a mid-size endocrinology practice in the upper Midwest activated on a Phase II trial of a novel dual GIP/GLP-1 receptor agonist. The protocol includes HbA1c between 7.5% and 10.5%, eGFR ≥60, BMI between 27 and 45, and no prior use of any GLP-1 or GIP-class therapy within 6 months of screening. The practice has approximately 1,100 patients with active T2DM encounters in the past 18 months.

A FHIR R4 population query against the active panel returns 842 patients matching the ICD-10 E11.x subcategory set. Of those, 614 have at least one HbA1c result in the record. Of the 614, only 289 have an HbA1c result within the 90-day window required by the protocol; the other 325 have results that are too old to count, whether or not the values themselves would qualify. Of the 289 with a current HbA1c, 214 fall within the 7.5%–10.5% range. From those 214, the eGFR filter removes another 41 patients who have values below 60 or whose most recent eGFR result is older than 90 days and therefore unconfirmable. The medication exclusion, which queries MedicationRequest records for all GLP-1 and GIP-class RxNorm concept hierarchies, eliminates another 28 patients on current or recently discontinued therapy.

The residual: 145 patients (214 − 41 − 28) with confirmed or high-confidence eligibility from EHR data. Of the original 842 matching the broad ICD-10 population, that is about 17%. The other 83% were eliminated either by data gap (no qualifying result within the temporal window) or confirmed ineligibility. The coordinator who began this trial expecting to draw on an 800-patient pool is actually working with a much smaller confirmed candidate set, plus a substantial group of incomplete-data patients who may be eligible but require a confirming lab draw before outreach.
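The stage-by-stage arithmetic in this scenario can be laid out as a small table. The counts below are the ones given in the scenario; the `stage_report` helper and the stage labels are illustrative.

```python
from typing import List, Tuple

# (stage label, patients remaining after the stage), using the scenario's counts.
funnel = [
    ("ICD-10 E11.x population", 842),
    ("any HbA1c on record", 614),
    ("HbA1c within 90 days", 289),
    ("HbA1c in 7.5%-10.5% range", 214),
    ("eGFR >= 60 confirmed within 90 days", 214 - 41),
    ("no GLP-1/GIP-class exclusion", 214 - 41 - 28),
]


def stage_report(stages: List[Tuple[str, int]]) -> List[Tuple[str, int, int, float]]:
    """Return (label, remaining, dropped at this stage, share of starting pool)."""
    start = stages[0][1]
    rows = []
    prev = start
    for label, count in stages:
        rows.append((label, count, prev - count, count / start))
        prev = count
    return rows


for label, count, dropped, share in stage_report(funnel):
    print(f"{label:<38} {count:>4}  dropped {dropped:>3}  {share:6.1%}")
```

Running the same report with a site's own counts shows which stage loses the most patients, and whether that loss is a data gap (recency) or a clinical one (value out of range).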

This scenario illustrates why pre-trial data quality audits matter. The practice's eGFR completeness gap — a large fraction of T2DM patients with no recent kidney function panel — is predictable for this population: community endocrinology patients without known CKD do not routinely have eGFR drawn at every visit. A pre-SIV audit that identified the eGFR gap would have prompted the coordinator to flag it to the PI and consider a proactive lab order campaign for high-probability candidates, potentially recovering 20–30 additional confirmable candidates from the incomplete-data pool before the enrollment clock started.

The FHIR data completeness signal: what to ask the API

When a FHIR R4 API is available, the data completeness assessment does not require a custom informatics build. A structured set of queries against standard resource types surfaces the gaps that matter for a given protocol. For the fields most commonly relevant to cardiometabolic and renal trials, the relevant queries are:

HbA1c completeness: Query Observation resources with LOINC 4548-4 (hemoglobin A1c/hemoglobin.total in blood) filtered to effective date within 90 days. The count of patients in the target population without a matching result is the HbA1c gap.
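As a rough sketch, a per-patient search for the most recent in-window HbA1c could be built like this. It uses the standard FHIR search parameters `patient`, `code`, `date`, `_sort`, and `_count` on Observation; the base URL and patient id are placeholders, and the `hba1c_search_url` function name is hypothetical. Authentication and paging are out of scope.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def hba1c_search_url(base_url: str, patient_id: str, as_of: date,
                     window_days: int = 90) -> str:
    """Build a FHIR search URL for the latest HbA1c Observation in the window."""
    since = as_of - timedelta(days=window_days)
    params = {
        "patient": patient_id,
        "code": "http://loinc.org|4548-4",  # system|code token for HbA1c
        "date": f"ge{since.isoformat()}",   # effective date on or after cutoff
        "_sort": "-date",                   # newest first
        "_count": "1",                      # only the latest result is needed
    }
    return f"{base_url.rstrip('/')}/Observation?{urlencode(params)}"


# Placeholder server and patient id, for illustration only.
url = hba1c_search_url("https://fhir.example.org/r4", "123", date(2025, 1, 27))
print(url)
```

An empty result bundle for a patient in the target population counts toward the HbA1c gap.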

eGFR completeness: Query Observation resources with the LOINC codes your laboratories use for eGFR, filtered to effective date within 90 days. Common candidates include 62238-1 (CKD-EPI), 33914-3 (MDRD), and 69405-9 (eGFR with no method specified); confirm the exact set against your laboratory's LOINC mapping. Because different laboratories use different LOINC codes for eGFR variants, every code in use should be included in the query. A patient with a qualifying result from any of these codes within the window is confirmed; a patient with none is flagged as a data gap.
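Once the Observation resources have been retrieved, the gap is a set difference. The sketch below assumes the resources are plain dicts with `code.coding`, `subject.reference`, and a full-date `effectiveDateTime` (observations that use `effectivePeriod` or year-only dates would need extra handling). The `egfr_gap` function name is hypothetical, and the default code set should be extended to match local lab mappings.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Set

# eGFR LOINC codes from the text above; extend to match local lab mappings.
EGFR_CODES = frozenset({"69405-9", "33914-3"})


def egfr_gap(population_ids: Iterable[str], observations: List[Dict],
             as_of: date, window_days: int = 90,
             codes: Iterable[str] = EGFR_CODES) -> Set[str]:
    """Return ids of patients with no eGFR result under any listed code in the window."""
    wanted = set(codes)
    cutoff = as_of - timedelta(days=window_days)
    confirmed: Set[str] = set()
    for obs in observations:
        codings = obs.get("code", {}).get("coding", [])
        if not any(c.get("system") == "http://loinc.org"
                   and c.get("code") in wanted for c in codings):
            continue
        effective = obs.get("effectiveDateTime")
        if not effective:
            continue
        if date.fromisoformat(effective[:10]) >= cutoff:
            ref = obs.get("subject", {}).get("reference", "")
            confirmed.add(ref.split("/")[-1])  # "Patient/123" -> "123"
    return set(population_ids) - confirmed


def _obs(pid: str, code: str, when: str) -> Dict:
    """Build a minimal Observation dict for the example."""
    return {"code": {"coding": [{"system": "http://loinc.org", "code": code}]},
            "subject": {"reference": f"Patient/{pid}"},
            "effectiveDateTime": when}


observations = [
    _obs("1", "33914-3", "2025-01-05T08:30:00Z"),  # in window
    _obs("2", "69405-9", "2023-11-27"),            # stale
]
print(sorted(egfr_gap(["1", "2", "3"], observations, date(2025, 1, 27))))
```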

Medication currency: Query MedicationRequest resources with status "active" and authoredOn within 12 months. For exclusion criteria requiring confirmation of non-use, the absence of an active MedicationRequest is not sufficient — a medication may have been discontinued without the status being updated in the EHR. The data completeness signal here is the ratio of patients with any recent medication record to patients whose medication list appears to have no updates in the past 12 months; stale medication lists are a distinct data quality problem from an empty list.
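The distinction between a stale medication list and an empty one can be computed from the retrieved MedicationRequest resources. This sketch looks at every MedicationRequest for a patient regardless of status, since the goal is to measure how recently the list was touched. It assumes resources are dicts with `subject.reference` and a full-date `authoredOn`, and `medication_currency` is a hypothetical name.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List


def medication_currency(population_ids: Iterable[str], med_requests: List[Dict],
                        as_of: date, window_days: int = 365) -> Dict[str, int]:
    """Count patients whose medication list is recent, stale, or empty."""
    cutoff = as_of - timedelta(days=window_days)
    latest: Dict[str, date] = {}
    for req in med_requests:
        authored = req.get("authoredOn")
        ref = req.get("subject", {}).get("reference", "")
        if not authored or not ref:
            continue
        pid = ref.split("/")[-1]
        when = date.fromisoformat(authored[:10])
        if pid not in latest or when > latest[pid]:
            latest[pid] = when
    counts = {"recent": 0, "stale": 0, "empty": 0}
    for pid in population_ids:
        if pid not in latest:
            counts["empty"] += 1   # no medication records at all
        elif latest[pid] >= cutoff:
            counts["recent"] += 1  # list updated within the window
        else:
            counts["stale"] += 1   # records exist, none within the window
    return counts


requests = [
    {"subject": {"reference": "Patient/1"}, "authoredOn": "2024-11-03"},
    {"subject": {"reference": "Patient/1"}, "authoredOn": "2022-02-10"},
    {"subject": {"reference": "Patient/2"}, "authoredOn": "2022-05-19T10:00:00Z"},
]
print(medication_currency(["1", "2", "3"], requests, date(2025, 1, 27)))
```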

We are not suggesting that running these FHIR queries resolves all data quality problems — it surfaces them. The output of a pre-SIV data completeness audit is not a clean candidate list. It is a gap map: here are the fields with low completeness, here is the size of the population affected, and here is the subset of incomplete-data patients who are otherwise high-probability candidates and might benefit from a proactive lab draw or chart review before outreach begins.

That map changes how coordinators allocate their early-enrollment time. Instead of discovering data gaps piecemeal as they work through the pre-screening queue, the coordinator enters the enrollment period knowing exactly where the gaps are and which patients are worth the additional clinical touchpoint to confirm eligibility.