The growing use of real-world data (RWD) in oncology helps offset the limitations of clinical trials, which often involve narrowly selected patient populations under specific conditions. In contrast, RWD captures how cancer care is delivered in routine practice across diverse demographics, comorbidities, and healthcare systems. This is particularly critical in oncology, where treatment decisions must account for rapid innovation, variable patient responses, and evolving standards of care.
Expanding the use of RWD enables more inclusive evidence generation, supports regulatory and reimbursement decisions, identifies gaps in care, and ultimately helps deliver more personalized and equitable cancer treatment. However, conducting robust oncology real-world evidence (RWE) research using secondary data sources such as electronic medical records (EMRs), administrative claims, and registries presents a unique and often underestimated set of challenges.
In this article, I will explore the challenges that I encounter most frequently, with a particular focus on the critical variables that are often missing in secondary datasets, and discuss actionable strategies to overcome them.
Why Oncology Is Uniquely Challenging for RWE
Unlike chronic diseases such as diabetes or hypertension, cancer is a highly heterogeneous condition. Each tumor type, and even each patient’s tumor, can differ significantly in its biology, response to treatment, and progression pattern. This complexity demands highly granular, longitudinal, and context-rich data, most of which is difficult to obtain in routine clinical practice.
Common Challenges in Oncology RWE Using Secondary Data
1. Incomplete or Missing Clinical Variables
Secondary data sources, particularly administrative claims and unstructured EMRs, often fail to capture key clinical variables needed for meaningful oncology research. These include:
- Histology and tumor grade
- Stage at diagnosis: critical for stratifying prognosis and treatment decisions
- Performance status: such as ECOG or Karnofsky scores, essential for eligibility in clinical trials and treatment suitability
- Molecular and biomarker status: e.g., HER2, EGFR, ALK, BRCA mutations (often only available in separate lab systems or pathology reports)
- Comorbidity and frailty measures: often captured inconsistently or incompletely
Implication: These are often found only in unstructured pathology reports or molecular lab systems, not in structured fields. Without these variables, it becomes difficult to replicate trial-like cohorts, adjust for confounders, or personalize treatment effect estimates.
2. Ambiguity in Treatment Intent and Lines of Therapy
In oncology, treatment is not only about which drugs are used, but why and when they are used. Secondary data often cannot distinguish:
- Whether a therapy is first-line, second-line, or maintenance
- Whether the treatment was curative, adjuvant, palliative, or compassionate use
- Whether a drug switch was due to progression, intolerance, or patient choice
Because of this, defining lines of therapy (LoT) requires algorithmic assumptions based on drug timing, sequencing, and combinations, which may not always reflect clinical intent.
Implication: Misclassification can severely bias outcome analyses or limit comparability across populations.
3. Limited Capture of Disease Progression, Response, and Patient-Reported Outcomes (PROs)
Clinical trials use standardized criteria such as RECIST to define response and progression. In real-world datasets:
- Radiology reports are often unstructured or missing
- Documentation of response is subjective and scattered in clinical notes
- Progression-free survival (PFS) is rarely measurable; researchers must use proxy endpoints (e.g., treatment change)
- Symptoms, quality of life, and toxicity burden are crucial in oncology but are not captured in most secondary data sources.
Implication: Real-world endpoints like time-to-next-treatment (TTNT) are often used as surrogates, but their validity depends heavily on the clinical context and data completeness.
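To make the TTNT surrogate mentioned above concrete, here is a minimal sketch of how it might be derived per patient. The function name, its inputs, and the choice to count death as an event alongside the start of the next line are assumptions for illustration, not a standard definition; studies differ on this point.

```python
from datetime import date

def time_to_next_treatment(line_start, next_line_start, death_date, last_followup):
    """Return (days, event) for one patient and one line of therapy.

    Assumption: the event is the start of the next line OR death,
    whichever comes first. Patients with neither are censored at
    their last follow-up date (event = 0).
    """
    candidates = [d for d in (next_line_start, death_date) if d is not None]
    if candidates:
        event_date = min(candidates)
        return (event_date - line_start).days, 1
    return (last_followup - line_start).days, 0

# Hypothetical patient: first line starts 5 Jan 2023, next line starts 20 Apr 2023.
print(time_to_next_treatment(date(2023, 1, 5), date(2023, 4, 20), None, date(2023, 6, 30)))
```

The resulting (days, event) pairs can be passed to any standard time-to-event method. Whether a treatment change reflects progression, toxicity, or patient choice still has to be judged from the clinical context.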
4. Incomplete Mortality and Survival Data
Capturing survival endpoints in real-world settings is difficult when:
- Patients die outside the treating institution
- Deaths are not promptly or accurately recorded in EMRs or claims
- Cause of death is unknown or not coded reliably
In some countries, linkages to national death registries are possible, but often with long delays, legal hurdles, or incomplete coverage.
Implication: Survival analyses may be biased or underpowered, particularly in cancers with short expected survival or when evaluating long-term outcomes.
5. Toxicity and Adverse Events
- Many side effects are underreported, especially non-serious ones.
- Claims data may include codes for managing symptoms (e.g., antiemetics), but not the adverse event itself.
- Attribution of the adverse event to a specific therapy is difficult.
Implication: Incomplete or indirect reporting of adverse events in real-world data impedes the accurate assessment of treatment safety and tolerability, limiting insights into the real-world patient experience.
6. Incomplete Data on Palliative and End-of-Life Care
- Timing and use of palliative care, hospice, or DNR orders are not consistently recorded.
- Location and quality of death (e.g., hospital vs. home, ICU use at end-of-life) may not be visible.
Implication: Incomplete data on palliative and end-of-life care limits the ability to evaluate quality, appropriateness, and patient-centeredness of care during the final stages of life.
7. Heterogeneity and Fragmentation of Data
Data are fragmented across:
- Multiple hospital systems and oncology centers
- Laboratory and imaging vendors
- Health insurers with differing coding standards
Even within the same dataset, ICD codes, SNOMED, and local nomenclature may be used inconsistently. Patient journeys may span multiple providers, making longitudinal tracking difficult without a national ID system.
Implication: Data harmonization and cohort continuity are ongoing technical and methodological challenges. Cancer care often spans multiple institutions, especially for complex cases, and linking data across oncology centers, labs, radiology, and primary care is not always possible. The result is fragmented care pathways and incomplete treatment and outcome histories.
Table 1: Summary of Often Missing or Poorly Captured Variables
| Category | Variables Often Missing or Unstructured |
| --- | --- |
| Tumor characteristics | Stage, grade, histology, molecular/genetic markers |
| Functional status | ECOG/Karnofsky |
| Treatment details | Intent, line of therapy, dose changes, oral therapies |
| Outcomes | Disease progression, treatment response, recurrence |
| Toxicity/side effects | Adverse event timing, severity, attribution |
| End-of-life care | Palliative care use, ICU stays, place of death |
| PROs and quality of life | Pain, fatigue, emotional burden |
How to Overcome These Challenges
Despite these limitations, oncology RWE research can be made more robust through the following approaches:
1. Tumor-Specific Variables (Stage, Grade, Biomarkers)
Solution: Data Enrichment via Linkage & NLP
- Link to cancer registries that contain stage and histopathology data
- Use Natural Language Processing (NLP) on pathology reports and clinical notes to extract tumor characteristics from unstructured data
- Collaborate with labs and genomics vendors to access molecular profiling data
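As a rough illustration of the extraction idea above, the sketch below uses plain regular expressions rather than a trained NLP model. The patterns, the function name `extract_tumor_features`, and the sample sentence are all hypothetical. Real pathology reports need far broader pattern coverage, negation handling, and validation against chart review.

```python
import re

# Illustrative patterns only (assumptions): overall stage I-IV with an
# optional A/B/C substage, and a biomarker name followed shortly by a status.
STAGE_RE = re.compile(r"\bstage\s+(IV|III|II|I)([ABC])?\b", re.IGNORECASE)
BIOMARKER_RE = re.compile(
    r"\b(HER2|EGFR|ALK|BRCA[12]?)\b[^.\n]{0,40}?"
    r"\b(positive|negative|detected|not detected)\b",
    re.IGNORECASE,
)

def extract_tumor_features(report_text):
    """Return the first stage mention and any biomarker statuses found."""
    features = {"stage": None, "biomarkers": {}}
    stage_match = STAGE_RE.search(report_text)
    if stage_match:
        stage = stage_match.group(1) + (stage_match.group(2) or "")
        features["stage"] = stage.upper()
    for marker, status in BIOMARKER_RE.findall(report_text):
        features["biomarkers"][marker.upper()] = status.lower()
    return features

sample = "Invasive ductal carcinoma, Stage IIB. HER2 negative. EGFR mutation not detected."
print(extract_tumor_features(sample))
```

A rule-based pass like this is mainly useful as a baseline or as a pre-filter that routes reports to manual abstraction.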
2. Performance Status (ECOG/Karnofsky)
Solution: NLP + Proxy Measures
- Extract ECOG or Karnofsky status from clinical notes using NLP
- Develop proxy algorithms (e.g., based on hospitalization, mobility-related codes, assistive devices) or use validated proxies (e.g., TTNT, time to treatment discontinuation [TTD])
- Use physician chart reviews selectively in smaller cohorts: manual abstraction by trained nurses or coders can be valuable for high-quality datasets or for validating NLP models
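The first bullet above can be sketched with a simple pattern match. This is a minimal, hypothetical example: the regular expression and the function name `extract_ecog` are assumptions, and it takes only the first score it finds, so a documented range such as "0-1" is reduced to its lower value.

```python
import re

# Assumption: the score appears within a few characters of "ECOG" or
# "performance status" and is a single digit from 0 to 4.
ECOG_RE = re.compile(
    r"\b(?:ECOG|performance status)\b[^0-9\n]{0,25}([0-4])\b",
    re.IGNORECASE,
)

def extract_ecog(note_text):
    """Return the first ECOG score mentioned in a note, or None."""
    match = ECOG_RE.search(note_text)
    return int(match.group(1)) if match else None

print(extract_ecog("ECOG PS: 1, tolerating therapy well."))
```

Scores extracted this way are best checked against a manually abstracted sample, which is where the selective chart review described above fits in.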
3. Line of Therapy & Treatment Intent
Solution: Algorithms + Clinical Validation
- Construct line-of-therapy algorithms using treatment start/stop patterns, drug class switches, and time gaps
- Define treatment intent based on setting (e.g., adjuvant vs. metastatic) and clinical pathway modeling
- Use manual abstraction or clinician input for complex regimens
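One way such a line-of-therapy algorithm might look is sketched below. The 28-day regimen window, the 90-day gap rule, and the function name are illustrative assumptions rather than validated parameters. In practice these rules are tumor-specific and should be reviewed with clinicians.

```python
from datetime import date

def assign_lines_of_therapy(administrations, regimen_window_days=28, max_gap_days=90):
    """Group (date, drug) records for one patient into lines of therapy.

    Assumed rules:
    - drugs started within `regimen_window_days` of a line's start
      belong to that line's regimen;
    - a drug outside the regimen given after that window starts a new line;
    - a treatment gap longer than `max_gap_days` also starts a new line.
    Dropping a drug from the regimen does not start a new line.
    """
    lines = []
    current = None
    last_date = None
    for admin_date, drug in sorted(administrations):
        if current is None:
            starts_new_line = True
        else:
            in_window = (admin_date - current["start"]).days <= regimen_window_days
            gap = (admin_date - last_date).days
            new_drug = drug not in current["regimen"]
            starts_new_line = gap > max_gap_days or (new_drug and not in_window)
        if starts_new_line:
            current = {"line": len(lines) + 1, "start": admin_date, "regimen": {drug}}
            lines.append(current)
        else:
            current["regimen"].add(drug)
        last_date = admin_date
    return lines

# Hypothetical patient: platinum doublet, then a switch to docetaxel.
records = [
    (date(2023, 1, 5), "carboplatin"), (date(2023, 1, 5), "pemetrexed"),
    (date(2023, 1, 26), "carboplatin"), (date(2023, 1, 26), "pemetrexed"),
    (date(2023, 3, 9), "pemetrexed"),
    (date(2023, 4, 20), "docetaxel"),
]
for line in assign_lines_of_therapy(records):
    print(line["line"], line["start"], sorted(line["regimen"]))
```

Keeping the window and gap as parameters makes it easy to run sensitivity analyses on how much the line assignments depend on these assumptions.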
4. Progression and Response
Solution: Surrogate Endpoints + NLP
- Use proxy measures: switch to next-line therapy, imaging events, or hospitalization for progression
- Apply NLP on radiology reports to extract text-based response/progression (e.g., “new lesions”, “stable disease”)
- Link to real-world imaging data where available
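The text-based extraction described above could start from a small phrase lexicon with basic negation handling, as in the sketch below. The phrase lists, the negation rule, and the priority order (progression over response over stable) are assumptions for illustration. This is not RECIST and is not a validated classifier.

```python
import re

# Hypothetical phrase lexicon; real reports use far more varied wording.
RESPONSE_PATTERNS = [
    ("progression", re.compile(r"new (?:lesion|metasta)|progressi(?:on|ve)|increas\w* in size", re.I)),
    ("response", re.compile(r"partial response|complete response|decreas\w* in size", re.I)),
    ("stable", re.compile(r"stable disease|no significant change|unchanged", re.I)),
]
# A phrase counts as negated if "no", "without", or "negative for"
# appears at most three words before it in the same sentence.
NEGATION_RE = re.compile(r"\b(?:no|without|negative for)\b(?:\s+\w+){0,3}\s*$", re.I)

def classify_report(report_text):
    """Return 'progression', 'response', 'stable', or 'indeterminate'."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", report_text):
        for label, pattern in RESPONSE_PATTERNS:
            for match in pattern.finditer(sentence):
                if not NEGATION_RE.search(sentence[:match.start()]):
                    found.add(label)
    for label in ("progression", "response", "stable"):
        if label in found:
            return label
    return "indeterminate"

print(classify_report("No new lesions. Stable disease."))
```

Reports classified as indeterminate are natural candidates for clinician review rather than automatic assignment.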
5. Patient-Reported Outcomes (PROs)
Solution: PRO Integration & Digital Health Tools
- Partner with providers who capture PROs in EHR (e.g., symptom checklists, PROMIS tools).
- Use data from wearables, apps, or digital symptom trackers in prospective studies.
- Incorporate PROs into pragmatic trial designs or hybrid RWE+RCT models.
6. Toxicity and Adverse Events
Solution: AE Algorithms + NLP
- Use validated claims-based algorithms (e.g., for neutropenia, nausea, febrile episodes).
- Apply NLP to unstructured notes to capture milder AEs (e.g., fatigue, neuropathy).
- Consider structured AE reporting tools in oncology networks (e.g., ASCO mCODE standard).
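To show the general shape of a claims-based AE rule, here is a sketch that flags possible febrile neutropenia. It is not one of the validated algorithms referred to above. The code prefixes (ICD-10-CM D70 for neutropenia, R50 for fever), the 30-day risk window after chemotherapy, and the one-day co-occurrence rule are assumptions chosen for illustration.

```python
from datetime import date

NEUTROPENIA_PREFIXES = ("D70",)  # ICD-10-CM neutropenia codes (assumed code list)
FEVER_PREFIXES = ("R50",)        # ICD-10-CM fever codes (assumed code list)

def flag_febrile_neutropenia(claims, chemo_dates, risk_window_days=30, co_occurrence_days=1):
    """Return dates of neutropenia claims that look like febrile neutropenia.

    claims: list of (service_date, icd10_code) for one patient.
    A neutropenia claim is flagged when a fever code falls within
    `co_occurrence_days` of it and chemotherapy was given in the
    preceding `risk_window_days`.
    """
    neutropenia = [d for d, code in claims if code.startswith(NEUTROPENIA_PREFIXES)]
    fever = [d for d, code in claims if code.startswith(FEVER_PREFIXES)]
    events = set()
    for n_date in neutropenia:
        has_fever = any(abs((n_date - f).days) <= co_occurrence_days for f in fever)
        after_chemo = any(0 <= (n_date - c).days <= risk_window_days for c in chemo_dates)
        if has_fever and after_chemo:
            events.add(n_date)
    return sorted(events)

claims = [
    (date(2023, 2, 1), "D70.1"), (date(2023, 2, 1), "R50.81"),
    (date(2023, 5, 1), "D70.9"),
]
print(flag_febrile_neutropenia(claims, chemo_dates=[date(2023, 1, 20)]))
```

The time window ties the event to recent treatment, but it does not establish attribution to a specific drug.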
7. End-of-Life and Palliative Care
Solution: Multisource Linkage
- Link EMRs to hospice databases or vital statistics registries for place/cause of death
- Use claims/EMR to identify palliative care consults, hospice enrollment, and ICU use
- Develop composite measures for aggressive care at end-of-life
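A composite measure of aggressive end-of-life care could be built along the following lines. The three indicators and their thresholds (chemotherapy in the last 14 days, ICU admission in the last 30 days, hospice enrollment within 3 days of death or not at all) are assumptions for illustration. The function name and inputs are hypothetical.

```python
from datetime import date

def aggressive_eol_score(death_date, chemo_dates, icu_admit_dates, hospice_start):
    """Return (score, indicators) for one decedent.

    Each indicator is True when the assumed threshold is met;
    the score is the number of indicators that are True (0 to 3).
    """
    indicators = {
        "chemo_last_14d": any(0 <= (death_date - d).days <= 14 for d in chemo_dates),
        "icu_last_30d": any(0 <= (death_date - d).days <= 30 for d in icu_admit_dates),
        "late_or_no_hospice": hospice_start is None
        or (death_date - hospice_start).days <= 3,
    }
    return sum(indicators.values()), indicators

score, detail = aggressive_eol_score(
    death_date=date(2023, 6, 30),
    chemo_dates=[date(2023, 6, 20)],
    icu_admit_dates=[date(2023, 5, 15)],
    hospice_start=date(2023, 6, 29),
)
print(score, detail)
```

Returning the individual indicators alongside the score keeps the measure interpretable when results are reviewed with clinicians.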
8. Fragmented Care Across Institutions
Solution: Multi-institutional Data Networks
- Use data from integrated networks or health system consortia.
- Establish data-sharing agreements and harmonize data dictionaries (e.g., OMOP, PCORnet CDM).
- Combine structured + unstructured data from multiple care sites.
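At the simplest level, harmonizing data dictionaries means mapping each site's field names and code values onto a shared schema. The sketch below is a toy version with two hypothetical sites. The site names, field names, and value mappings are invented for illustration and do not correspond to OMOP or PCORnet table structures.

```python
# Hypothetical per-site mappings onto a small common schema.
SITE_MAPPINGS = {
    "site_a": {
        "fields": {"pt_id": "patient_id", "dx_stage": "stage"},
        "values": {"stage": {"1": "I", "2": "II", "3": "III", "4": "IV"}},
    },
    "site_b": {
        "fields": {"PatientID": "patient_id", "StageGroup": "stage"},
        "values": {"stage": {"Stage I": "I", "Stage II": "II",
                             "Stage III": "III", "Stage IV": "IV"}},
    },
}

def harmonize(record, site):
    """Map one site-specific record to the common schema.

    Returns (harmonized_record, unmapped_fields) so that fields with no
    mapping are surfaced for review instead of being silently dropped.
    """
    mapping = SITE_MAPPINGS[site]
    harmonized, unmapped = {}, []
    for local_field, value in record.items():
        common_field = mapping["fields"].get(local_field)
        if common_field is None:
            unmapped.append(local_field)
            continue
        value_map = mapping["values"].get(common_field, {})
        harmonized[common_field] = value_map.get(value, value)
    return harmonized, unmapped

print(harmonize({"pt_id": "A1", "dx_stage": "4", "note": "x"}, "site_a"))
print(harmonize({"PatientID": "B7", "StageGroup": "Stage IV"}, "site_b"))
```

Tracking unmapped fields per site gives a quick measure of how much of each source is actually covered by the shared data dictionary.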
9. Timing of Diagnosis and Treatment
Solution: Anchor Events and Hybrid Data Use
- Validate index date using multiple anchor points: first diagnostic code, biopsy, pathology, and first treatment.
- Use tumor board documentation or clinician notes to refine diagnosis and treatment timelines.
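The anchor-point check above might be implemented roughly as follows. The anchor names, the rule of taking the earliest diagnosis-related anchor as the index date, and the 60-day disagreement threshold are assumptions for illustration.

```python
from datetime import date

DIAGNOSIS_ANCHORS = ("first_diagnostic_code", "biopsy", "pathology")

def derive_index_date(anchors, max_spread_days=60):
    """Return (index_date, flags) from a dict of anchor name -> date or None.

    Assumed rule: the index date is the earliest available diagnosis
    anchor. Flags mark records that may need manual review.
    """
    diagnosis_dates = [anchors[k] for k in DIAGNOSIS_ANCHORS if anchors.get(k)]
    if not diagnosis_dates:
        return None, ["no_diagnosis_anchor"]
    index_date = min(diagnosis_dates)
    flags = []
    if (max(diagnosis_dates) - index_date).days > max_spread_days:
        flags.append("anchors_disagree")
    treatment = anchors.get("first_treatment")
    if treatment and treatment < index_date:
        flags.append("treatment_before_diagnosis")
    return index_date, flags

print(derive_index_date({
    "first_diagnostic_code": date(2023, 1, 10),
    "biopsy": date(2023, 1, 18),
    "pathology": date(2023, 1, 22),
    "first_treatment": date(2023, 2, 15),
}))
```

Flagged records are the ones where tumor board documentation or clinician notes are most likely to change the timeline.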
Table 2: Strategic Approaches to Implement
| Strategy | Description |
| --- | --- |
| NLP & AI tools | Unlock information from unstructured clinical documents |
| Data linkage | Combine EHR, claims, registries, genomics, death records |
| Standardization | Use common data models (e.g., OMOP), interoperability standards (e.g., FHIR), and oncology-specific standards (e.g., mCODE) |
| Clinician validation | Involve oncologists in validating endpoints and interpreting nuanced data |
| Hybrid study designs | Combine prospective data collection with secondary data use |
| Patient/caregiver involvement | Collect patient-reported data in digital form or via registries |
Final Thoughts
Oncology RWE research is demanding, but it is also essential. As new therapies proliferate and health systems strive to understand real-world effectiveness, safety, and value, the ability to extract meaning from complex, messy data becomes a strategic necessity.
By embracing multi-source integration, advanced analytics, clinician collaboration, and transparent methodology, we can bridge the gap between data and decisions, unlocking new possibilities in RWE cancer research.