Introduction
Data source landscaping is a foundational step in the design and planning of real-world evidence (RWE) studies. It involves identifying, evaluating, and selecting appropriate data sources that align with the study objectives, target population, and operational requirements. With the increasing availability of electronic health records (EHR), claims records, and disease registries, researchers face both opportunities and challenges in choosing the most suitable data assets. A systematic and thoughtful approach to data landscaping not only improves the scientific rigor of a study but also enhances its feasibility and credibility.
This article outlines a strategic framework for data source landscaping based on a structured checklist.
The framework integrates scientific, operational, and governance considerations to help guide RWE researchers and study planners through a repeatable and transparent evaluation process.
1. Framing the Search: Objectives and Research Questions
The starting point for any data landscaping effort is a clear understanding of the research question and study objectives. These foundational elements guide the selection of relevant therapeutic areas, outcomes of interest, and population characteristics.
An effective first step is to review whether similar studies have already been conducted, identifying which data sources were used and what limitations were reported during their execution. This helps to anticipate challenges and build on existing evidence.
Developing a list of keywords based on the study objectives can structure the literature review process. These keywords should be used for targeted searches in databases such as PubMed, Embase, and Web of Science. Additionally, consulting the EMA Catalogue of RWD sources can provide valuable insights into available datasets.
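As an illustration, a keyword list can be turned into a Boolean search string by joining synonyms with OR and distinct concepts with AND. The following is a minimal sketch: the function name `build_query` and the example keywords are hypothetical, and each literature database has its own field tags and syntax, so the resulting string would likely need adapting before use.

```python
def build_query(concept_groups):
    """Join synonyms with OR inside each group, and groups with AND.

    Multi-word terms are wrapped in double quotes so they are searched
    as phrases. This is a generic Boolean string, not a database-specific one.
    """
    clauses = []
    for synonyms in concept_groups:
        quoted = [f'"{term}"' if " " in term else term for term in synonyms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical keyword groups derived from the study objectives:
# one group per concept (disease, study type, data source type).
concept_groups = [
    ["psoriasis"],
    ["real-world evidence", "observational study"],
    ["claims", "registry", "electronic health records"],
]

query = build_query(concept_groups)
print(query)
```

Keeping the keyword groups in a structure like this also makes the search strategy easy to document and rerun later.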
The aim of this initial exploration is to compile a preliminary list of candidate data sources that can serve as the foundation for deeper evaluation and prioritization.
2. Filtering the Field: Generating a Preliminary List of Data Sources
Once the literature search yields a set of potentially relevant studies, extract information on the data sources they used. Look for patterns and recurring data assets used in similar therapeutic areas or study designs. Documenting these sources in a structured table allows for comparative analysis.
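One possible way to build such a table is to record one row per study and then summarize by data source. The sketch below assumes a simple record layout (`study`, `data_source`, `limitation`); the field names, the function `summarize_sources`, and all example entries are hypothetical placeholders, not real studies or databases.

```python
from collections import Counter

# Hypothetical extraction records, one per reviewed study.
extracted = [
    {"study": "Study A", "data_source": "Registry X", "limitation": "short follow-up"},
    {"study": "Study B", "data_source": "Claims DB Y", "limitation": "no lab values"},
    {"study": "Study C", "data_source": "Registry X", "limitation": "small sample"},
]


def summarize_sources(records):
    """Count how often each data source appears and collect its reported limitations."""
    counts = Counter(r["data_source"] for r in records)
    limitations = {}
    for r in records:
        limitations.setdefault(r["data_source"], []).append(r["limitation"])
    # most_common() orders sources from most to least frequently used.
    return [
        {"data_source": name, "n_studies": n, "limitations": limitations[name]}
        for name, n in counts.most_common()
    ]


summary = summarize_sources(extracted)
for row in summary:
    print(row["data_source"], row["n_studies"], "; ".join(row["limitations"]))
```

Sources that recur across several studies rise to the top of the summary, together with the limitations reported for them.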
In addition to literature, draw on institutional memory and past experience. Data sources previously used by your organization, especially if operational relationships or contracts are in place, may provide a head start in terms of feasibility and access.
Be aware of data sources that may not be responsive to collaboration requests or that explicitly avoid working with industry partners. Early identification of such barriers helps avoid unnecessary delays later in the process.
3. Evaluating Data Source Suitability: The Structured Checklist
A comprehensive checklist allows for systematic evaluation across three major domains: study characteristics, data content and structure, and operational feasibility. Each item in the checklist serves as a prompt for assessing suitability and documenting rationale.
A. Product and Study Characteristics
Key considerations in this domain include:
- Type of product: Is the investigational product a biologic, small molecule, or advanced therapy such as a gene therapy?
- Indication and regulatory context: Is it a new indication, or is the drug already approved and in use?
- Prescribing and administration: Understanding who prescribes the drug and where it is administered informs whether such data is likely to be captured in the available sources.
- Study design: Specify whether the study is observational or hybrid, prospective or retrospective, and whether it is pre- or post-authorization.
Addressing these questions helps clarify where your product of interest is most likely to be captured and whether sufficient data are already available for secondary use.
B. Data Source Content and Structure
This is often the most technical and discriminating part of the assessment:
- Population characteristics: Does the database capture the relevant age range, sex, co-morbidities, or treatment behaviors of interest?
- Variable availability: Are exposures, outcomes, and covariates captured consistently and in adequate detail?
- Continuity and completeness: Is the data longitudinal? Are there gaps or changes in data capture that may impact study validity?
- Linkage capabilities: Can the data be linked to other datasets to enrich analysis?
- Latency and follow-up: How recent is the data, and what is the typical follow-up duration per patient?
C. Operational Feasibility and Governance
Practical issues often determine whether a data source can be used despite its scientific relevance:
- Contracting timelines: Do you have existing data sharing agreements in place? How long does it take to access data?
- Experience and relationships: Has your organization previously worked with the data custodian?
- Privacy compliance: Is the data source compliant with GDPR or other relevant privacy regulations?
- Infrastructure needs: For multi-country or multi-institutional studies, is a centralized or federated analysis model needed? Do you have the relevant technology and competency to ingest, process and analyze the data?
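To keep answers and rationale together for each candidate source, the checklist can be held in a small data structure. The following is a minimal sketch, assuming one record per checklist item; the class names `ChecklistItem` and `SourceAssessment`, the source name, and the example answers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    domain: str          # e.g. "Data content", "Operational feasibility"
    question: str        # the checklist prompt
    answer: str = ""     # left empty until assessed
    rationale: str = ""  # why the answer was given


@dataclass
class SourceAssessment:
    source_name: str
    items: list = field(default_factory=list)

    def unanswered(self):
        """Return the checklist items that still need an answer."""
        return [item for item in self.items if not item.answer]


# Hypothetical assessment of one candidate data source.
assessment = SourceAssessment(
    source_name="Registry X",
    items=[
        ChecklistItem(
            domain="Data content",
            question="Can the data be linked to other datasets?",
            answer="Yes",
            rationale="Linkage reported in a published study.",
        ),
        ChecklistItem(
            domain="Operational feasibility",
            question="Is a data sharing agreement already in place?",
        ),
    ],
)

for item in assessment.unanswered():
    print(f"Open item for {assessment.source_name}: {item.question}")
```

Tracking open items this way makes it easy to see which sources have been fully assessed before any scoring takes place.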
4. Scoring and Prioritizing: Building an Evaluation Matrix
Once the checklist has been completed for all candidate data sources, an evaluation matrix makes the results easier to compare. This matrix consolidates findings across domains and helps rank sources based on predefined criteria.
Consider assigning scores or qualitative ratings for categories such as:
- Data relevance (fit to research question)
- Operational readiness (contracting, timelines)
- Data quality and completeness (fitness for use in regulatory submissions)
- Privacy and governance (local regulation compliance)
Sources can then be categorized (e.g., Tier 1 = Ready to use; Tier 2 = Requires additional validation; Tier 3 = Not recommended).
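A simple weighted scoring scheme is one way to implement such a matrix. In the sketch below, the 1 to 5 rating scale, the weights, the tier thresholds, and the example sources are all illustrative assumptions rather than recommended values; a study team would need to agree on its own criteria in advance.

```python
# Assumed weights per category (they sum to 10, so dividing by 10
# keeps the weighted score on the same 1-5 scale as the ratings).
WEIGHTS = {"relevance": 4, "readiness": 2, "quality": 3, "governance": 1}


def weighted_score(ratings):
    """Combine 1-5 ratings into a single weighted score."""
    return sum(WEIGHTS[cat] * ratings[cat] for cat in WEIGHTS) / 10


def assign_tier(score):
    """Map a weighted score to a tier using assumed thresholds."""
    if score >= 4.0:
        return "Tier 1"  # ready to use
    if score >= 3.0:
        return "Tier 2"  # requires additional validation
    return "Tier 3"      # not recommended


# Hypothetical ratings for three candidate sources.
matrix = {
    "Registry X": {"relevance": 5, "readiness": 4, "quality": 4, "governance": 5},
    "Claims DB Y": {"relevance": 3, "readiness": 2, "quality": 4, "governance": 3},
    "EHR Network Z": {"relevance": 2, "readiness": 1, "quality": 2, "governance": 3},
}

# Rank sources from highest to lowest weighted score.
ranked = sorted(
    ((name, weighted_score(r)) for name, r in matrix.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: score {score:.1f}, {assign_tier(score)}")
```

A numeric score should support, not replace, the documented rationale: a source with a high total but a disqualifying governance issue may still belong in a lower tier.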
5. Real-World Application: Limitations and Best Practices
Even with a structured approach, data landscaping requires critical thinking and flexibility. Common challenges include:
- Misalignment between available data and research needs
- Differences in coding standards across data sources or countries
- Delays in contracting or legal review
- Lack of transparency in how data were collected or processed
Best practices to mitigate these include:
- Engaging cross-functional teams early (legal, IT, data governance)
- Documenting all assumptions and limitations
- Validating selected data sources by reviewing published studies that used them
Conclusion
A robust and systematic data landscaping process lays the foundation for successful real-world studies. By combining a structured checklist with practical judgment and organizational experience, study teams can improve both the feasibility and quality of RWE generation.
The checklist introduced in this article can be adapted to suit different therapeutic areas or research contexts, and can serve as a living tool to ensure transparency and consistency in data source selection.