When the Code Isn’t Enough: Building Diagnostic Algorithms and Proxies in RWD Research

In real-world evidence (RWE) research, misclassification is a persistent methodological challenge, particularly when diagnostic information is incomplete, inconsistently coded, ambiguously recorded, or entirely absent. In secondary data sources such as administrative claims or electronic health records (EHRs), not every condition of interest is captured cleanly (or at all) through a standard coding system such as ICD-10. To bridge this gap, researchers often rely on diagnostic algorithms or proxy measures that infer clinical conditions from available data elements such as prescriptions, procedures, or patterns of healthcare utilization that may indicate the presence of the condition of interest.

 

This article explores this critical methodological practice, drawing on two examples: pediatric growth disorders and ischemic colitis in adults.

 

How Researchers Build Proxy Definitions

The use of ICD codes is foundational to case identification in most secondary data sources. However, these codes often fail to reflect the true clinical complexity of many conditions. When diagnostic codes alone are insufficient, researchers therefore develop composite case definitions that combine several clinical indicators. The process begins by establishing a clear clinical understanding of the condition and identifying features that are likely to be captured in structured data. These may include medication use, procedures commonly performed in patients with the condition of interest, laboratory results, repeated healthcare encounters, or specialist visits. The aim is to triangulate the condition from the data elements that point to its possible presence, balancing sensitivity and specificity while ensuring transparency and reproducibility.

 

This approach requires collaboration between clinical experts and data scientists, especially to ensure that algorithm components reflect real-world practice. Ideally, proxy definitions are validated against gold-standard data such as patient registries, clinician chart reviews, or diagnostic imaging reports. In practice, however, many algorithms remain unvalidated because of data access constraints, which limits transparency and reproducibility.
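When a reference standard is available, validation usually comes down to comparing the algorithm's output with the reference labels patient by patient. The following is a minimal sketch of that comparison. The function name, the result structure, and the ten example patients are all hypothetical and are not taken from any study.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def validate_algorithm(algorithm_flags, reference_flags):
    """Compare algorithm output with a reference standard, patient by patient."""
    if len(algorithm_flags) != len(reference_flags):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(algorithm_flags, reference_flags))
    tp = sum(1 for a, r in pairs if a and r)          # true positives
    fp = sum(1 for a, r in pairs if a and not r)      # false positives
    fn = sum(1 for a, r in pairs if not a and r)      # false negatives
    tn = sum(1 for a, r in pairs if not a and not r)  # true negatives

    def ratio(numerator, denominator):
        # Return NaN instead of failing when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return ValidationResult(
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
        npv=ratio(tn, tn + fn),
    )


# Hypothetical data: algorithm output vs. chart review for 10 patients.
algorithm = [True, True, True, False, False, False, False, True, False, False]
chart_review = [True, True, False, True, False, False, False, True, False, False]
result = validate_algorithm(algorithm, chart_review)
print(result)
```

Positive predictive value depends on how common the condition is in the validation sample, so it may not transfer to a database with a different case mix.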

 

Case Study 1: Growth Impairment in Pediatric Populations

Growth failure may result from a variety of causes, including genetic conditions, chronic diseases, and endocrine abnormalities. It is often challenging to capture in RWD because it may not be carefully or consistently coded.

 

In datasets that include anthropometric data, height and weight captured longitudinally allow computation of body mass index (BMI), height-for-age z-score (HAZ), weight-for-age z-score (WAZ) or growth velocity (if multiple time points exist). However, height and weight may be sparsely or irregularly measured, or incorrectly recorded because measurement procedures are not adapted to the pediatric population or because measurement timing and equipment are not standardized.
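As an illustration of these computations, the sketch below derives a z-score with the LMS method used by common growth references and an annualized height velocity from two measurements. The L, M and S values in the example are placeholders rather than real reference values; in practice they would be looked up by age and sex in a published growth reference. The 180-day minimum interval is also an assumption chosen for the example.

```python
import math
from datetime import date


def lms_z_score(measurement, l, m, s):
    """Z-score from LMS parameters: ((x / M) ** L - 1) / (L * S)."""
    if l == 0:
        # Limiting form of the LMS formula when L is zero.
        return math.log(measurement / m) / s
    return ((measurement / m) ** l - 1) / (l * s)


def height_velocity_cm_per_year(first, second, min_days=180):
    """Annualized height change between two (date, height_cm) measurements.

    Returns None when the interval is shorter than min_days, because
    velocities over short intervals are dominated by measurement error.
    """
    (d1, h1), (d2, h2) = sorted([first, second])
    days = (d2 - d1).days
    if days < min_days:
        return None
    return (h2 - h1) / (days / 365.25)


# Placeholder LMS parameters (NOT real reference values).
haz = lms_z_score(101.2, l=1.0, m=110.0, s=0.04)
velocity = height_velocity_cm_per_year(
    (date(2022, 1, 1), 100.0),
    (date(2023, 1, 1), 104.0),
)
print(round(haz, 2), round(velocity, 2))
```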

 

When direct height and weight measurements are unavailable or insufficiently recorded, diagnostic codes such as short stature (ICD-10-CM: R62.52), growth hormone deficiency (E23.0), delayed puberty (E30.x) or congenital syndromes associated with growth failure (e.g., Q96 for Turner syndrome) might be used. However, diagnostic codes may also under-capture true prevalence, because they tend to be recorded mostly for more severe or treated cases.

 

Other indirect indicators used to infer growth problems include repeated referrals to pediatric endocrinologists, prescriptions of growth hormone therapy, diagnostic testing for hormonal deficiencies, and procedures and tests that suggest growth monitoring, such as bone age assessment.

 

Once researchers identify which alternative diagnostic, prescribing, and procedural codes are most specific to the condition, they can build an algorithm, ideally including a sensitivity analysis to account for variations in code selection.
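One way to structure such a sensitivity analysis is to define the same algorithm at several levels of strictness and compare how many patients each level identifies. The sketch below does this with the diagnosis codes mentioned above. The record structure, the three definition names, and the proxy thresholds (for example, two or more endocrinology visits) are illustrative assumptions, not validated criteria.

```python
from dataclasses import dataclass, field

# Diagnosis codes named in the text; the prefixes match any subcode.
DIAGNOSIS_CODES = {"R62.52", "E23.0"}
DIAGNOSIS_PREFIXES = ("E30", "Q96")


@dataclass
class PatientRecord:
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    growth_hormone_rx: bool = False
    bone_age_assessment: bool = False
    endocrinology_visits: int = 0


def has_growth_diagnosis(patient):
    return any(
        code in DIAGNOSIS_CODES or code.startswith(DIAGNOSIS_PREFIXES)
        for code in patient.diagnosis_codes
    )


def count_proxies(patient):
    # Hypothetical threshold: two or more endocrinology visits count as one proxy.
    return sum([
        patient.growth_hormone_rx,
        patient.bone_age_assessment,
        patient.endocrinology_visits >= 2,
    ])


def is_case(patient, definition):
    """Apply one of three illustrative definitions of growth impairment."""
    if definition == "strict":
        return has_growth_diagnosis(patient)
    if definition == "moderate":
        return has_growth_diagnosis(patient) or count_proxies(patient) >= 2
    if definition == "broad":
        return has_growth_diagnosis(patient) or count_proxies(patient) >= 1
    raise ValueError(f"unknown definition: {definition}")


# Hypothetical cohort of four patients.
cohort = [
    PatientRecord("A", diagnosis_codes={"R62.52"}),
    PatientRecord("B", growth_hormone_rx=True, bone_age_assessment=True),
    PatientRecord("C", endocrinology_visits=3),
    PatientRecord("D", diagnosis_codes={"J45.909"}),
]
counts = {
    name: sum(is_case(p, name) for p in cohort)
    for name in ("strict", "moderate", "broad")
}
print(counts)
```

A large gap between the strict and broad counts suggests that study results could be sensitive to the choice of definition, so the main analysis is worth repeating under each one.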

 

Case Study 2: Ischemic Colitis and the Ambiguity of Gastrointestinal Diagnoses

Ischemic colitis presents another diagnostic challenge in RWD research. It is often undercoded or miscoded due to its overlapping symptoms with other forms of colitis or gastrointestinal distress. Early clinical presentations (e.g., abdominal pain, rectal bleeding, or diarrhea) can be mistaken for infectious colitis, inflammatory bowel disease, or even irritable bowel syndrome. Unless a colonoscopy or imaging is performed and explicitly recorded, the specific diagnosis of ischemic colitis may not appear in structured diagnostic fields.

 

To improve case identification, researchers may turn to hospitalization records for acute gastrointestinal events that include keywords or codes related to colitis, paired with imaging or endoscopy procedures performed during the same episode. The use of antibiotics, antithrombotic agents, or changes in anticoagulant therapy might provide contextual clues, especially if they occur in patients with known vascular risk factors. Additionally, prior documentation of cardiovascular disease, peripheral artery disease, or hypotensive episodes can support the likelihood of an ischemic etiology.

 

While none of these elements is definitive on its own, together they create a composite picture that can approximate true cases of ischemic colitis with a reasonable degree of confidence. However, without validation, such as through linkage to radiology reports or manual chart abstraction, such algorithms should be interpreted cautiously, with sensitivity analyses that test different levels of diagnostic certainty.

 

Table: Example Proxies and Algorithm Components for Case Identification in RWD

Condition: Growth Impairment (Pediatric)
Direct Diagnosis Limitations:
- Specific diagnostic codes (e.g., ICD-10 E34.3 – short stature) underused or inconsistently applied
- Anthropometric data often missing or not standardized
Proxy Indicators / Algorithm Components:
- Longitudinal height z-scores or percentiles (if available)
- Lack of expected height progression over time
- Prescriptions of growth hormone therapy
- Endocrine referrals or diagnostic tests (e.g., IGF-1, GH stimulation tests)
- Repeated specialist visits for growth monitoring or failure to thrive
- Lab testing related to growth disorders (e.g., thyroid function, celiac serologies)

Condition: Ischemic Colitis
Direct Diagnosis Limitations:
- ICD codes may be nonspecific (e.g., coded as colitis NOS, abdominal pain)
- Often misclassified as infectious or inflammatory colitis
Proxy Indicators / Algorithm Components:
- Hospitalization for lower GI symptoms with colonoscopy or imaging performed
- Use of broad-spectrum antibiotics without infection code
- Co-occurrence of hypotension, dehydration, or vascular events
- Older age with vascular comorbidities (e.g., atrial fibrillation, atherosclerosis)
- Procedure codes for colonoscopy with biopsy
- Discharge diagnosis patterns across claims or EHR systems

This table provides a simplified view. In practice, researchers would combine several of these elements into logic-based inclusion criteria (e.g., "at least one hospitalization with a colitis code and colonoscopy performed within 3 days and no infectious colitis code") and validate definitions whenever possible.
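The example rule quoted above can be written as a small function. This is a sketch under stated assumptions: "within 3 days" is read as zero to three days after admission, the infectious colitis exclusion is applied per hospitalization, and the record fields are hypothetical names rather than fields from any specific database.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Hospitalization:
    admission_date: date
    has_colitis_code: bool
    has_infectious_colitis_code: bool
    colonoscopy_dates: list = field(default_factory=list)


def meets_rule(hospitalizations, window_days=3):
    """True if any stay has a colitis code, no infectious colitis code,
    and a colonoscopy 0 to window_days days after admission."""
    window = timedelta(days=window_days)
    for stay in hospitalizations:
        if not stay.has_colitis_code or stay.has_infectious_colitis_code:
            continue
        if any(
            timedelta(0) <= procedure_date - stay.admission_date <= window
            for procedure_date in stay.colonoscopy_dates
        ):
            return True
    return False


# Hypothetical stays for three patients.
early_scope = Hospitalization(date(2024, 3, 1), True, False, [date(2024, 3, 3)])
late_scope = Hospitalization(date(2024, 3, 1), True, False, [date(2024, 3, 6)])
infectious = Hospitalization(date(2024, 3, 1), True, True, [date(2024, 3, 2)])

print(meets_rule([early_scope]))                # colonoscopy on day 2
print(meets_rule([late_scope]))                 # colonoscopy on day 5, outside window
print(meets_rule([late_scope], window_days=5))  # wider window as a sensitivity check
print(meets_rule([infectious]))                 # excluded by infectious colitis code
```

Keeping the window as a parameter makes it straightforward to rerun the case definition with alternative windows as part of a sensitivity analysis.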

 

Evolving Methods: The Role of AI and ML Looking Forward

The approach described above reflects how diagnostic algorithms have historically been developed to mitigate limitations in RWD. However, with the rise of machine learning (ML) and artificial intelligence (AI), researchers are increasingly using data-driven methods to overcome data gaps.

 

Supervised learning techniques such as logistic regression, gradient boosting, and neural networks can be trained on subsets of confirmed cases to uncover predictive patterns across thousands of features, many of which may be too complex to identify otherwise.
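To make the idea concrete, the sketch below fits a logistic regression by gradient descent on a handful of labeled patients, using only the Python standard library. It is a toy illustration: the three binary features, the labels, and the hyperparameters are invented for the example, and a real study would use an established ML library, far more data, and a held-out validation set.

```python
import math


def predict_proba(x, weights, bias):
    """Predicted probability of being a case for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(x, weights, bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Hypothetical features: [colitis code, colonoscopy performed, vascular comorbidity]
# Hypothetical labels: 1 = confirmed case on chart review, 0 = not a case.
X = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
    [1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 0],
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)
p_all_indicators = predict_proba([1, 1, 1], weights, bias)
p_no_indicators = predict_proba([0, 0, 0], weights, bias)
print(round(p_all_indicators, 2), round(p_no_indicators, 2))
```

The fitted weights can be inspected directly, which is one reason simple models like this are often preferred when interpretability matters.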

 

For example, in EHRs, models can learn from complex combinations of symptoms, lab trends, medication changes, and temporal patterns in patient trajectories. These approaches are designed to complement traditional rule-based algorithms. However, they also rely on access to accurately labeled training data and require robust external validation to ensure their generalizability and clinical relevance.

 

It is crucial to avoid over-engineering complex algorithms that lack clinical grounding. While ML methods show great promise in uncovering latent diagnostic signals, their outputs must remain interpretable and clinically meaningful, especially in contexts involving regulatory decisions or high-stakes clinical applications.

 

As RWD becomes increasingly central to decision-making by regulators, payers, and providers, the need to enhance phenotyping methodologies will only grow. Standardized phenotype libraries such as OHDSI’s ATLAS or PheKB provide a useful foundation, but further efforts are needed to ensure that undercoded, complex, or evolving diagnoses are reliably identified and meaningfully represented.

 

Ultimately, RWD research must deal with the intrinsic imperfections of clinical documentation and coding. But through thoughtful proxy design, transparent methods, and rigorous validation where possible, we can still generate reliable insights about patients that might otherwise remain invisible in the data.

 

 

References:

Grote FK, Oostdijk W, De Muinck Keizer-Schrama SM, van Dommelen P, van Buuren S, Dekker FW, Ketel AG, Moll HA, Wit JM. The diagnostic work up of growth failure in secondary health care; an evaluation of consensus guidelines. BMC Pediatr. 2008 May 13;8:21. doi: 10.1186/1471-2431-8-21. PMID: 18477383; PMCID: PMC2422838.

 

Harrall, K.K., Bird, S.M., Muller, K.E. et al. A better performing algorithm for identification of implausible growth data from longitudinal pediatric medical records. Sci Rep 14, 18276 (2024). https://doi.org/10.1038/s41598-024-69161-5

 

Hu C, Zhao X, Jiang B, Jiang X, Ren Y, Guo J. A predictive model for ischemic colitis: Integrating clinical and laboratory parameters. iLABMED. 2024;2(3):157–167. doi: 10.1002/ila2.52

 

Ingrasciotta Y, Isgrò V, Foti SS, Ientile V, Fontana A, L'Abbate L, Benoni R, Fiore ES, Tari M, Alibrandi A, Trifirò G. Testing of Coding Algorithms for Inflammatory Bowel Disease Identification, as Indication for Use of Biological Drugs, Using a Claims Database from Southern Italy. Clin Epidemiol. 2023 Mar 11;15:309-321. doi: 10.2147/CLEP.S383738. PMID: 36936062; PMCID: PMC10015969.

 

Scherdel P, Matczak S, Léger J, Martinez-Vinson C, Goulet O, Brauner R, Nicklaus S, Resche-Rigon M, Chalumeau M, Heude B. Algorithms to Define Abnormal Growth in Children: External Validation and Head-To-Head Comparison. J Clin Endocrinol Metab. 2019 Feb 1;104(2):241-249. doi: 10.1210/jc.2018-00723. Erratum in: J Clin Endocrinol Metab. 2019 Jun 1;104(6):2152. doi: 10.1210/jc.2019-00750. PMID: 30137417.

By Nadia Barozzi

Passionate about data-driven insights and the advancement of Real World Evidence research, drug safety and pharmacovigilance.