Drug Discovery Data: The Access Problem No One Fixes
Drug discovery data is abundant, but pharma R&D teams still can't reach the cohorts that matter. The problem isn't AI. It's the access architecture.
Moritz Bertold
Mar 18, 2026
7 min
Over the past decade, the pharmaceutical industry has transformed its R&D operations into a data science enterprise. Machine learning models are standard tools in target identification and lead optimization. Artificial intelligence is no longer a pitch deck promise: AI-driven drug discovery is embedded in how pipeline decisions get made, from screening chemical structures in medicinal chemistry to scoring compound libraries for biological activity.

And yet, the drug discovery data that actually matters for those decisions remains locked behind institutional walls.
The problem is not compute power. It is not a shortage of talent in drug discovery data science. It is access: to deeply characterized, longitudinal, multi-omics cohorts with the clinical annotations that make data science for drug discovery productive rather than performative.
Public repositories like ChEMBL and DrugBank are indispensable for data mining in drug discovery and early-stage data processing. But they were never designed to answer the questions R&D teams ask when building a clinical program: how many patients per indication have been profiled at the molecular level? Can we validate our candidate biomarkers against an independent cohort? These are questions of drug discovery data management, not database browsing.
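To make that distinction concrete, here is a minimal sketch of the kind of feasibility query an R&D team needs to run before building a program. Everything in it is hypothetical: the table, the column names, and the three-layer omics threshold are illustrative assumptions, not any real provider's schema.

```python
import pandas as pd

# Hypothetical cohort metadata table; column names are illustrative,
# not drawn from any real provider's schema.
cohort = pd.DataFrame({
    "patient_id":    ["P1", "P2", "P3", "P4", "P5"],
    "indication":    ["DMD", "DMD", "SMA", "SMA", "SMA"],
    "omics_layers":  [3, 4, 2, 4, 4],   # genomics, transcriptomics, ...
    "recontactable": [True, True, False, True, True],
})

# "How many patients per indication have been profiled at the
# molecular level, and can they be recontacted for trials?"
feasibility = (
    cohort[cohort["omics_layers"] >= 3]
    .groupby("indication")
    .agg(profiled=("patient_id", "count"),
         recontactable=("recontactable", "sum"))
)
print(feasibility)
```

Public repositories cannot answer this, because per-patient omics coverage and recontact status are exactly the columns they were never designed to carry.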
A 2025 RAND study published in JAMA Network Open found that the median cost of developing a new drug reached $708 million, with clinical trials accounting for roughly 68% of total R&D expenditure. Much of that cost sits in patient recruitment, cohort design, and evidence generation: precisely the stages where access to the right drug discovery data would compress timelines. The data exists. The access architecture does not.
What Drug Discovery Data Actually Needs to Look Like
When pharma professionals talk about data, they use clinical and pipeline language. The questions are specific: What diseases do you have? How many patients per indication? Can patients be recontacted for trials? The answers shape everything from trial designs to training data for predictive models. These questions reveal what drug discovery data needs to be:
- Multi-omics depth. Genomics alone is not enough. Drug development programs increasingly require transcriptomics, proteomics, and metabolomics layered onto the same patient cohort. Multi-omics captures the molecular footprint of a disease: the full pattern of dysregulated genes, proteins, and metabolites at the pathway level. Understanding this biological activity across modalities is what separates productive data collection from stockpiling.
- Genotype-phenotype alignment. Molecular data without clinical context is noise. Pharma needs cohorts where genomic variants are linked to structured clinical phenotypes, mapped to standard ontologies like HPO and SNOMED-CT. This alignment turns sequencing data into actionable intelligence for target validation and patient stratification, enabling personalized medicine at the pipeline level.
- Longitudinal design. Cross-sectional snapshots tell you what a disease looks like at one moment. Longitudinal data tells you how patients respond to treatment over time. For biomarker panel narrowing, this is non-negotiable. Without it, even the most sophisticated applications of data science in drug discovery and the best data visualization tools cannot distinguish treatment effects from noise.
- Healthy controls. Deeply profiled healthy controls, matched by age and demographics, allow pharma to distinguish disease-specific molecular changes from normal biological variation. This is especially critical in pediatric populations, where developmental changes introduce noise that adult reference ranges cannot account for. A minimal record layout that meets all four requirements is sketched after this list.
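The field names and ontology codes in the sketch below are illustrative assumptions, not a real hospital schema. It shows one way a single participant record can carry genotype, ontology-mapped phenotype, longitudinal multi-omics samples, and control status together.

```python
from dataclasses import dataclass, field

# Illustrative record layout only; field names, variant notation, and
# ontology codes are assumptions, not a real hospital schema.

@dataclass
class Phenotype:
    hpo_id: str            # e.g. "HP:0003560" (HPO: muscular dystrophy)
    snomed_id: str         # e.g. "73297009" (SNOMED-CT concept)
    onset_age_years: float

@dataclass
class OmicsSample:
    modality: str          # "genomics" | "transcriptomics" | "proteomics" | "metabolomics"
    timepoint_days: int    # days since enrollment: longitudinal, not cross-sectional
    assay_id: str

@dataclass
class Participant:
    participant_id: str
    is_healthy_control: bool   # deeply profiled controls live in the same schema
    age_years: float           # enables age-matched comparisons
    variants: list[str] = field(default_factory=list)          # genotype
    phenotypes: list[Phenotype] = field(default_factory=list)  # aligned clinical context
    samples: list[OmicsSample] = field(default_factory=list)   # multi-omics over time

# A hypothetical pediatric participant with two longitudinal timepoints.
p = Participant(
    participant_id="P-001", is_healthy_control=False, age_years=7.5,
    variants=["DMD c.2792G>A"],
    phenotypes=[Phenotype("HP:0003560", "73297009", onset_age_years=3.0)],
    samples=[OmicsSample("transcriptomics", timepoint_days=0, assay_id="A1"),
             OmicsSample("transcriptomics", timepoint_days=180, assay_id="A2")],
)
```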
Most public databases and commercial data providers do not meet these criteria simultaneously. The data that does meet them sits inside hospitals and academic medical centers. Getting to it is where the process breaks down.
The Indications Where the Gap Is Most Painful
The drug discovery data gap hits hardest in rare disease and pediatric indications. Major biobanks like UK Biobank and All of Us skew toward adult populations and common diseases. The latest trends in biological data acquisition emphasize multi-omics depth and longitudinal coverage, but the populations that need that depth most remain underserved.
Research published in May 2025 by the Tufts Center for Biomedical System Design projected that 95% of pediatric-onset rare diseases will still have no approved treatments by 2033. That is not only a drug development failure. It is a data access failure. You cannot build a clinical program around an indication when the molecular characterization of that patient population does not exist in any database you can query.
Rare disease R&D compounds the problem. Biotechs, not big pharma, drive most early-stage programs. These companies lack the bargaining power to negotiate data access agreements with academic hospitals. Big pharma enters later through M&A, after products have been de-risked. The result: the stage where data matters most is precisely the stage where data is hardest to reach. In these low-data settings, some teams experiment with techniques like one-shot learning (sketched below), but sample scarcity still limits how far computational methods can take you.
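For illustration, here is a toy version of one-shot classification over compound fingerprints: one labeled exemplar per class, with a similarity measure doing the rest. The fingerprints are random stand-ins; a real program would use chemical fingerprints such as ECFP bit vectors, and likely a learned embedding rather than raw Tanimoto similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity between two binary fingerprints."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

# One labeled example ("shot") per class: e.g. one known active and
# one known inactive compound. Random bits stand in for real fingerprints.
support = {
    "active":   rng.integers(0, 2, 256).astype(bool),
    "inactive": rng.integers(0, 2, 256).astype(bool),
}

def one_shot_predict(query: np.ndarray) -> str:
    # Assign the query to whichever single exemplar it most resembles.
    return max(support, key=lambda label: tanimoto(query, support[label]))

query = rng.integers(0, 2, 256).astype(bool)
print(one_shot_predict(query))
```

The sketch also shows the limitation the paragraph names: with one exemplar per class, performance is bounded by how representative that single example happens to be.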

Consider what data-driven drug discovery actually requires for a pediatric rare disease program. You need genetically confirmed patients with multi-omics profiling. You need treatment response data captured longitudinally, including side effects. You need enough patients per indication to validate a biomarker panel. The big data paradigm breaks down here. Big data and AI modeling for drug discovery work when you have scale. For rare indications, the challenge is not big data. It is the right data, in the right format, accessible under the right conditions.
Why the Access Architecture Is the Real Bottleneck
The data pharma needs is not missing. It is trapped.
University hospitals and academic medical centers hold exactly the cohorts drug discovery teams need: deeply phenotyped patient populations with longitudinal clinical data and multi-omics profiling. But the governance, legal, and privacy infrastructure around that data was designed for academic collaboration, not commercial access. Traditional methods of data sharing were built for a world where a collaborator sent a request letter and waited months. That model has not scaled.
Getting access typically involves months of negotiation: data use agreements, ethics board reviews, institutional sign-offs, and technical integration work. A 2025 ITIF analysis found that strict data protection regulations led to a roughly 39% decline in R&D spending among pharmaceutical firms within four years of implementation, with smaller companies hit hardest. This explains one of the core challenges in AI-driven drug discovery data acquisition: even the most advanced AI systems and AI algorithms cannot produce results when the training data they depend on is unreachable.
The current landscape offers partial solutions. Data providers sell de-identified datasets, and trusted research environments give pharma teams access while keeping data local. Both serve a purpose for straightforward queries. But for the most complex and commercially valuable use cases, including multi-omics biomarker panel narrowing across genetically confirmed rare disease cohorts, these models fall short. De-identified data strips the clinical context that makes cohorts useful. Trusted research environments still require lengthy governance negotiations and carry residual privacy concerns that limit what analyses can be run. There are few case studies of either approach working at scale for sensitive pediatric or rare disease populations, and for good reason.
This is the structural bottleneck. Not the absence of drug discovery data. Not the lack of analytical capability. The access architecture itself.
The Data Will Never Move. The Algorithms Should.
The industry has spent decades trying to move data to where the analysis happens. That model is finished. Patient data is too sensitive, too regulated, and too institutionally protected to transfer at scale. It will never move freely between organizations. It should not. The question is not how to move data. It is how to move the algorithms to where the data already sits.
Trusted research environments answer that question when built for this purpose. Instead of moving data to the data scientists, they move the models to the data. Pharma submits a model specification or analytical task. The computation runs inside the hospital’s secure environment. Results and aggregate statistics come back. Raw patient data never moves. This is the foundation for effective drug discovery data management in a privacy-first world.
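In code, that loop looks roughly like the sketch below. The function and field names are hypothetical, and a production trusted research environment would add authentication, audit logging, and disclosure controls; this shows only the core contract: task in, aggregates out.

```python
import statistics

# Hypothetical illustration of the trusted-research-environment loop:
# the analyst ships a task definition; only aggregates come back.

def run_inside_tre(task, private_records):
    """Executes entirely inside the hospital's secure environment."""
    values = [task["metric"](r) for r in private_records if task["filter"](r)]
    # Only aggregate statistics leave the environment; row-level data never does.
    return {"n": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values)}

# The pharma side submits a specification, not a data request.
task = {
    "filter": lambda r: r["indication"] == "DMD",
    "metric": lambda r: r["biomarker_level"],
}

# These records stand in for data the analyst never actually sees.
hospital_records = [
    {"indication": "DMD", "biomarker_level": 4.2},
    {"indication": "DMD", "biomarker_level": 3.8},
    {"indication": "SMA", "biomarker_level": 1.1},
]
print(run_inside_tre(task, hospital_records))
```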
Federated learning in drug discovery takes this further by enabling machine learning algorithms to train across multiple institutional datasets without centralizing the data. Each hospital retains full control. Pharma gets the analytical power of a multi-site cohort without ever touching the underlying records. Federated learning with knowledge distillation adds another layer: models trained at one site can transfer learned representations to another, compressing institutional knowledge without moving a single patient record.
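A minimal federated averaging round, run on synthetic stand-ins for three hospitals' cohorts, shows the mechanics. The linear model and the equal-weight average are simplifying assumptions; production systems use richer models and typically weight sites by cohort size.

```python
import numpy as np

rng = np.random.default_rng(42)

def local_update(w, X, y, lr=0.1, epochs=20):
    """One site's gradient steps on its own data; raw data never leaves."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

# Synthetic stand-ins for three hospitals' private cohorts.
true_w = np.array([1.5, -2.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    sites.append((X, y))

# Federated averaging: only model weights cross institutional boundaries.
w_global = np.zeros(2)
for _ in range(10):
    local_ws = [local_update(w_global, X, y) for X, y in sites]
    w_global = np.mean(local_ws, axis=0)   # equal weighting, for simplicity

print(w_global)  # approaches true_w without pooling any site's records
```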
This is already operational. Privacy-enhancing technologies are live in genomics and clinical data networks. The ICH M14 guideline, adopted in September 2025, set a global standard for pharmacoepidemiological studies using real-world data. AI-based analytics for drug discovery are increasingly built to operate within these distributed architectures.
For data science in drug discovery, the implications are practical. A pharma team working on biomarker panel narrowing for a pediatric indication can submit 800 candidate biomarkers to a federated platform. The platform trains models on a hospital’s longitudinal multi-omics cohort and returns ranked results with performance metrics and age-stratified breakdowns. The pharma team gets everything they need for a go/no-go decision. This is trial de-risking at its most direct: validating biomarker panels, calibrating inclusion criteria, and pressure-testing cohort assumptions before committing to a $150M clinical program.
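A stripped-down version of that workflow, run on synthetic data, looks like the sketch below. The univariate AUC ranking and the two age strata are illustrative simplifications; a real platform would train multivariate models inside the hospital environment and return only these kinds of aggregate outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)

# Synthetic stand-in for a hospital cohort: n patients, 800 candidate
# biomarkers, a binary responder label, and age for stratification.
n, n_biomarkers = 300, 800
X = rng.normal(size=(n, n_biomarkers))
y = rng.integers(0, 2, n)
X[:, 0] += 1.2 * y                 # plant one genuinely informative biomarker
age = rng.integers(2, 18, n)       # pediatric cohort

def rank_panel(X, y):
    """Univariate AUC per candidate; a production platform would use
    multivariate models, but the ranking idea is the same."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    order = np.argsort(aucs)[::-1]
    return order, aucs[order]

order, scores = rank_panel(X, y)
print("top 5 candidates:", order[:5], scores[:5].round(3))

# Age-stratified check on the top candidate: does it hold in both strata?
for lo, hi in [(2, 9), (10, 17)]:
    mask = (age >= lo) & (age <= hi)
    auc = roc_auc_score(y[mask], X[mask, order[0]])
    print(f"ages {lo}-{hi}: AUC = {auc:.3f}")
```

The outputs, a ranked list plus per-stratum performance, are exactly the aggregate artifacts a go/no-go decision needs, and nothing row-level leaves the site.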
The shift removes the single biggest friction point in drug discovery data acquisition: the assumption that sensitive data must cross institutional boundaries for analysis to happen. Data mining for drug discovery, model training, and biomarker validation can all happen on-site, under the data custodian's control.
The Data Pharma Needs Exists. The Access Model Was the Missing Piece.
The conversation about drug discovery data has been dominated by volume: bigger databases, more compounds, larger training sets. That framing misses the point. R&D teams do not fail because they lack data. They fail because they cannot access the deeply characterized cohorts that their pipeline questions demand.
The drug development process does not need more data. It needs infrastructure that lets algorithms travel to where the data lives. The organizations that build that infrastructure first will accelerate drug discovery faster than any new model or AI system ever could.
If your R&D team is stuck waiting for data access agreements or working with cohorts that lack the depth your program requires, it may be time to stop trying to move the data and start moving the analysis instead. Get in touch with tracebloc to explore how controlled access to deeply characterized clinical cohorts can move your pipeline forward.



