Your AI paper is behind glass — when peer review becomes peer reading
Lukas Wuttke
Mar 17, 2026
16 min
You do serious research. You work with real data — sensitive data that you’re legally, ethically, and contractually obligated to protect. You train your model, run your experiments, document your methodology carefully, and publish.

And then the paper sits behind glass. Everyone can read it, but nobody can touch it — nobody can rerun your experiment on the same data, probe your results, challenge your preprocessing decisions, or check whether your 94% accuracy holds on a slightly different population. It has the shape of science, but the part that makes science work — the ability for others to test and potentially break your findings — has been quietly taken out of the equation.
This is a structural trap that honest, rigorous researchers fall into every day, and it's quietly undermining entire bodies of work.
The reproducibility conversation has a blind spot
The scientific community has been grappling with reproducibility for years, and the conversation has gained real momentum. A systematic survey across 17 disciplines found reproducibility failures affecting at least 294 ML-based papers, with non-replicable work often cited more than replicable work — a feedback loop that rewards overoptimism [1]. A broader review identified errors in 648 papers across 30 fields [2]. Conferences now require reproducibility checklists. Code sharing has improved. The culture is shifting.
But almost all of this effort targets one kind of problem: technical reproducibility — missing hyperparameters, undocumented random seeds, library version mismatches. These are real issues, and containerisation, better documentation, and open code repositories are helping.
What they don’t help with is the case where the data cannot leave the building.
A clinician-researcher working with patient records cannot post their dataset to GitHub. A team training a fraud detection model on proprietary transactions cannot share the underlying data for peer review. An engineer using industrial sensor logs from a manufacturing partner is bound by NDA. A political scientist working with conflict data under access agreements has no option to open it up.
For these researchers, no amount of documentation solves the problem — the bottleneck isn’t methodology, it’s access. And that bottleneck quietly turns peer review into something closer to peer reading. Reviewers assess whether a paper is plausible, whether the numbers seem reasonable, whether the narrative hangs together, but they can’t run the code, probe the test set, or check for data leakage. In the most literal sense, they’re reading through glass.
What unverifiable science actually looks like
The consequences of this are not hypothetical. In political science, complex ML models were published as dramatically outperforming traditional baselines for predicting civil conflict. When researchers later corrected for data leakage — errors invisible to anyone just reading the paper — the supposed advantage disappeared, and the models performed no better than logistic regression [3].
In neuroscience, a paper reporting 91% classification accuracy turned out to have fundamental leakage flaws: information from the holdout set had contaminated training, the accuracy was a substantial overestimate, and the paper was eventually retracted [4].
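Leakage of this kind is easy to create and nearly impossible to spot from a finished paper. Here is a minimal, self-contained sketch on synthetic data (not the method from any retracted work): if you select a feature using the same data you later evaluate on, the reported accuracy inflates even though every feature is pure noise.

```python
import random

random.seed(0)

# 500 candidate features of pure noise: none truly predicts the label.
n, p = 100, 500
select_X = [[random.randint(0, 1) for _ in range(p)] for _ in range(n)]
select_y = [random.randint(0, 1) for _ in range(n)]
fresh_X = [[random.randint(0, 1) for _ in range(p)] for _ in range(n)]
fresh_y = [random.randint(0, 1) for _ in range(n)]

def accuracy(j, X, y):
    """Accuracy of predicting the label directly from feature j."""
    return sum(xi[j] == yi for xi, yi in zip(X, y)) / len(y)

# The leaky step: pick the feature that looks best on the very data
# you will also evaluate on.
best = max(range(p), key=lambda j: accuracy(j, select_X, select_y))

inflated = accuracy(best, select_X, select_y)  # same data chose the feature
honest = accuracy(best, fresh_X, fresh_y)      # data the selection never saw

print(f"accuracy where selection leaked: {inflated:.2f}")
print(f"accuracy on unseen data:         {honest:.2f}")
```

The inflated number is exactly what a reviewer sees in the paper; the honest number is what only someone with access to fresh data can compute.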
In finance, regulators have started scrutinising claims about proprietary AI trading models — because when a model is trained on private transaction data, external verification is structurally impossible, and the line between genuine performance and marketing becomes unverifiable [5].
And increasingly, major research labs treat publications less as contributions to shared knowledge and more as marketing for product launches. Papers land alongside model announcements, benchmarks are chosen to flatter, and the data and training infrastructure behind the results stay proprietary — preserving the form of science while stripping out the substance: openness, contestability, independent verification.
These are the cases that got caught. They surfaced because someone, somewhere, had enough access to check. For every one of these, there are hundreds of papers in sensitive domains where nobody has that access — and nobody ever will, because the data legally cannot move.
A paper is not the experiment. The data and model are.
There’s a deeper issue worth naming. Science doesn’t advance primarily through papers — it advances through experiments, and specifically through the ability of other scientists to challenge, replicate, extend, and break the work of their peers. The paper is just the record; the experiment is the thing that actually matters.
When data can’t be shared, you can publish the record but not the experiment itself. Other researchers can read what you did, but they can’t do what you did — they can’t test whether your model holds on their population, in their context, with their labelling conventions. What you’ve produced is peer reading, not peer review.
That’s not a minor inconvenience — it’s a fundamental break in how science is supposed to work, and one that’s becoming more common as research moves deeper into domains where sensitive data is the norm.
And here’s the painful irony: those domains — medical research, financial modelling, industrial systems, behavioural science, climate prediction — are exactly where we most need rigorous, testable results. The stakes are highest precisely where verification is hardest.
The wrong solution: pretend the data can be shared
The standard response is some version of: “Just anonymise it.” Or: “Use synthetic data.” Or: “Apply differential privacy and release a noised version.”
These approaches have their place, but they don’t solve the problem for serious research.
Anonymisation is weaker than it sounds. In the 1990s, Latanya Sweeney re-identified the health records of a US governor using publicly available voter registration data — showing that stripping names and addresses isn’t enough when the remaining fields are sufficiently distinctive [6]. The problem has only worsened: in 2013, researchers demonstrated that individuals could be re-identified from supposedly anonymised genomic data [7]. For high-dimensional data like medical imaging or genomic sequences, meaningful anonymisation often destroys the very properties that made the data scientifically valuable.
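The mechanics of a linkage attack are simple enough to sketch in a few lines. Every record and name below is fictional; the point is only that a handful of quasi-identifiers (ZIP code, date of birth, sex) can join a "de-identified" table back to a public one.

```python
# Toy linkage attack: join a "de-identified" health table to a public
# voter roll on quasi-identifiers. All records here are fictional.
health = [
    {"zip": "02138", "dob": "1959-03-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1971-11-15", "sex": "M", "diagnosis": "diabetes"},
]
voters = [
    {"name": "Alice Smith", "zip": "02138", "dob": "1959-03-02", "sex": "F"},
    {"name": "Bob Jones", "zip": "02140", "dob": "1964-01-20", "sex": "M"},
]

def reidentify(health_rows, voter_rows):
    quasi = ("zip", "dob", "sex")
    matches = []
    for h in health_rows:
        key = tuple(h[k] for k in quasi)
        candidates = [v["name"] for v in voter_rows
                      if tuple(v[k] for k in quasi) == key]
        if len(candidates) == 1:  # a unique match pins the record to a person
            matches.append((candidates[0], h["diagnosis"]))
    return matches

print(reidentify(health, voters))
```

No names or addresses were in the health table, yet one join recovers a diagnosis. Sweeney's attack was this, at population scale.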
Synthetic data can be useful for testing pipelines, but a model trained on synthetic data and one trained on real data are different objects, and the scientific question is almost always about the real one. Differential privacy introduces calibrated noise, and for small cohorts or rare events the noise required for a meaningful privacy guarantee can swamp the very effect the research is trying to measure.
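To see why the noise matters, here is a minimal sketch of the Laplace mechanism, the textbook differential-privacy primitive for count queries. The cohort and the count are invented; the pattern is general: a stronger privacy guarantee (smaller epsilon) means a larger noise scale, and for rare-event counts the released number can become unusable.

```python
import random

random.seed(1)

def laplace(scale):
    # The difference of two exponentials is a Laplace(0, scale) draw.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

# An invented cohort statistic: how many of 1,000 patients carry a rare marker.
true_count = 37
sensitivity = 1.0  # adding or removing one patient changes a count by at most 1

for epsilon in (10.0, 1.0, 0.1, 0.01):
    noisy = true_count + laplace(sensitivity / epsilon)
    print(f"epsilon={epsilon:>5}: released count {noisy:8.1f} (true: {true_count})")
```

At epsilon near 0.01 the noise scale is a hundred times the sensitivity, so a count of 37 can come back as a large negative number or several hundred. That is the privacy/utility trade-off in one loop.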
You can use these tools to make data safer to share, but you can’t use them to make sensitive data fully shareable without losing what made it worth studying. The real answer isn’t to move the data — it’s to move the experiment.
The partial solution the field already built — and why it’s not enough
The research community has tried to address data access. Over the past two decades, controlled-access repositories have been built to let researchers deposit sensitive datasets and allow qualified colleagues to apply for access. UK Biobank provides secure cloud access to a 500,000-participant cohort. MIMIC/PhysioNet offers de-identified ICU records under data use agreements. ICPSR hosts over 250,000 social science datasets with restricted-use tiers. Similar infrastructure exists in genomics (dbGaP, EGA), government statistics (US Census Research Data Centres), and finance (WRDS) [8–13].
These are real, serious investments, and they’ve enabled important work — but they come with structural limitations that matter.
First, anonymisation remains unsolved. Re-identification attacks keep getting more powerful, and for high-dimensional data the trade-off between privacy and utility is steep. Review boards are right to be cautious.
Second, access is unequal in practice. UK Biobank requires institutional affiliation, background checks, annual progress reports, and charges compute fees. dbGaP requires navigating the NIH’s eRA Commons system and waiting for committee review. Researchers at well-resourced institutions manage. Early-career researchers, those at smaller institutions, or those in lower-income countries face significant barriers — even when their science is entirely legitimate.
Third — and maybe most corrosive — the incentives work against sharing. Researchers who generate valuable datasets have careers built on what those datasets can produce. Sharing openly means someone else may publish the breakthrough finding. In a culture that rewards priority of discovery over contribution to infrastructure, data holders have a rational, if perverse, incentive to restrict access. Sensitive datasets often stay locked not just by regulation, but by self-interest.
This is amplified by the behaviour of major AI labs, which increasingly use publications as product marketing. Papers arrive alongside model launches and carefully chosen benchmarks, while the data, training infrastructure, and full evaluation pipelines stay proprietary. The community is asked to evaluate claims it has no ability to independently verify.
And for entire domains — financial transactions, industrial sensor data, conflict data, behavioural data under commercial agreements — there are no centralised repositories at all. The controlled-access model has been built almost entirely within biomedical research. For everyone else, the glass is always there.
Open science without open data
What federated infrastructure makes possible is a genuinely different approach: the data stays where it is — inside the institution, behind the regulatory boundary — and the experiment travels to the data instead of the other way around.
A researcher who wants to test a hypothesis brings their model to the data, runs it, and gets results back — all while the data never leaves the institution. No privacy law is violated, no record is exposed, but real science happens: an independent team ran an independent experiment on the actual data and either confirmed a finding or didn’t [14].
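In code, the pattern can be sketched roughly like this. Everything below is hypothetical — the class, method names, and metrics are illustrative, not tracebloc's actual API — but it shows the shape of the exchange: the model crosses the institutional boundary, and only aggregate metrics come back.

```python
from typing import Callable, Sequence

class Institution:
    """Hypothetical data holder: raw records never leave this object."""

    def __init__(self, name: str, records: Sequence[tuple[float, int]]):
        self.name = name
        self._records = records  # (feature, label) pairs, kept private

    def evaluate(self, model: Callable[[float], int]) -> dict:
        # The submitted model runs *here*; only aggregates go back out.
        hits = sum(model(x) == y for x, y in self._records)
        return {"site": self.name, "n": len(self._records),
                "accuracy": hits / len(self._records)}

def model(x: float) -> int:
    # Stand-in for a real trained model: a simple threshold rule.
    return int(x > 0.5)

sites = [
    Institution("berlin", [(0.9, 1), (0.2, 0), (0.7, 1), (0.1, 0)]),
    Institution("seoul", [(0.8, 1), (0.6, 0), (0.3, 0), (0.95, 1)]),
]

results = [site.evaluate(model) for site in sites]
for r in results:
    print(r)
```

The researcher never holds a single record from either site, yet the returned metrics are computed on the real data — which is what verification requires.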
The implications go well beyond simple replication. Researchers with pretrained models can probe published benchmarks and leaderboards, exposing implausibilities before flawed results spread through the literature. If a model performs dramatically better on one cohort than on every other, that gap itself is a signal — something may be wrong with the original evaluation, whether it’s data leakage, overfitting, or a preprocessing artefact. Federated infrastructure turns anomalous results from invisible assumptions into visible, testable patterns.
It also lets researchers collaborate earlier in the research cycle — verifying or falsifying hypotheses before investing years in a dead end, testing models on entirely different patient cohorts or market populations, and stress-testing findings under conditions the original authors never anticipated. It turns the static, untouchable paper into a living, contestable scientific claim.
This isn’t a workaround — it’s what open science looks like when the data is sensitive. The glass comes off, not because you shared the data, but because you created the conditions for other scientists to work with it.
What researchers concretely get — and what this doesn’t solve
It’s worth being specific about what federated infrastructure gives you that a paper does not.
When you read a published result, you get a claim: “Our model achieved 94% accuracy on this dataset.” You have no way to test it. With federated access, you can submit your own model — or the author’s published model — and run it against the same data. You get evaluation metrics computed on the actual population, not a self-reported number. You can run the same model against multiple cohorts at different institutions and compare: does it generalise, or does it collapse? You can submit a competing architecture and see whether it outperforms the published baseline on real data. You can run ablation studies, test different preprocessing pipelines, and check whether performance holds under distributional shift.
To be clear about limits: federated infrastructure doesn’t fix bad methodology within a single lab. If a researcher introduces data leakage in their own training pipeline, that error exists regardless. It doesn’t eliminate domain shift. And it doesn’t magically dissolve career incentives. But it does reshape them. Today, if a research team sits on a valuable dataset and finds nothing, nobody else gets the chance to try. The data stays locked, the potential breakthrough stays buried. Federated infrastructure changes that: multiple teams can work on the same sensitive data earlier, test hypotheses in parallel, and find results together — without anyone giving up control. Sharing access to computation isn’t the same as sharing the data itself, and that distinction turns hoarding from a rational strategy into an unnecessary bottleneck.
What it does solve is the structural barrier. Today, the default is that no one outside the originating institution can touch the experiment. Federated infrastructure changes that default — and that changes the scientific culture around sensitive-data research, even if it doesn’t fix every problem at once.
From paper to proof: what this looks like in clinical research
Here’s a concrete example. A research team in Berlin trains a model on twelve-lead ECG recordings from their hospital to predict early-stage atrial fibrillation and publishes the results. The model performs well on their local cohort — but is that a Berlin-specific result, or does it hold up elsewhere?
The team sends their model to partner hospitals in Seoul, São Paulo, and Toronto. Each hospital runs the Berlin model against its own patient cohort, locally, inside its own infrastructure. No patient data crosses a single institutional boundary. The Berlin team gets evaluation metrics back from every site.
Now it gets interesting. The model scores 91% on the Seoul cohort but drops to 74% in São Paulo. That gap is a finding in itself. Maybe the training data skewed toward a demographic overrepresented in Berlin and Seoul. Maybe a preprocessing step introduced an artefact that only surfaces with different recording equipment. The Berlin team can investigate — adjusting for population differences, retraining on a federated subset, running ablation studies — all without ever seeing a single patient record from any partner site.
Meanwhile, an independent group in Toronto submits their own architecture against the same cohorts. Their model scores more evenly across all three sites. That comparison, run on real clinical data across three continents, is the kind of evidence a single-site paper simply cannot produce.
Over time, the Berlin team doesn’t just validate one model. They cross-validate across dozens of cohorts, identify systematic biases their local data would never reveal, and build models that are genuinely robust — not because they assumed generalisability, but because they proved it. Other researchers do the same in the opposite direction: sending their models to Berlin’s cohort, stress-testing claims, catching weaknesses early. The network becomes a shared proving ground where every institution both contributes and benefits.
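The cross-site comparison step itself is simple to express. A hypothetical helper, fed the accuracy figures from the scenario above, flags any site whose score sits far from the cross-site median — a prompt for investigation, not a verdict.

```python
from statistics import median

def flag_outlier_sites(site_scores: dict[str, float],
                       tolerance: float = 0.10) -> list[str]:
    """Return sites whose score deviates from the cross-site median
    by more than `tolerance`."""
    mid = median(site_scores.values())
    return sorted(s for s, acc in site_scores.items()
                  if abs(acc - mid) > tolerance)

# Per-site accuracies from the multi-site evaluation described above.
scores = {"berlin": 0.93, "seoul": 0.91, "sao_paulo": 0.74, "toronto": 0.89}
print(flag_outlier_sites(scores))
```

Here São Paulo is flagged, which is exactly the signal the Berlin team would follow up on: demographic skew, equipment differences, or a preprocessing artefact.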
This is the infrastructure the tracebloc federated learning network provides. Your data stays in your institution, your models travel to the data, and partner hospitals around the world become collaborators who can challenge and strengthen your findings without either side ever compromising patient privacy.
Your paper stops being a display behind glass. It becomes the starting point for something that actually looks like science: open, contestable, cumulative.
The problem is structural. So is the solution.
If this is the problem you’re facing, the tracebloc network was built for it.
Sources
[1] Kapoor, S. & Narayanan, A. “Leakage and the Reproducibility Crisis in ML-based Science.” Patterns (Cell Press), 2023.
[2] Semmelrock, H. et al. “Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers.” AI Magazine (Wiley), 2025.
[3] Kapoor, S. & Narayanan, A. “Civil War Prediction — reproducibility study.” Patterns (Cell Press), 2023.
[4] Desai, S. et al. “What is reproducibility in artificial intelligence and machine learning research?” AI Magazine (Wiley), 2025.
[5] New York State Bar Association. “Regulating AI Deception in Financial Markets.” NYSBA, 2025.
[6] Sweeney, L. “k-Anonymity: A Model for Protecting Privacy.” Intl J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
[7] Gymrek, M. et al. “Identifying Personal Genomes by Surname Inference.” Science, Vol. 339, 2013.
[8] UK Biobank. “Access to UK Biobank Data.” ukbiobank.ac.uk, 2024.
[9] Johnson, A.E.W. et al. “MIMIC-III, a freely accessible critical care database.” Scientific Data, 2016.
[10] NIH / NCBI. “How to Request and Access Datasets from dbGaP.” NIH Grants & Funding, 2024.
[11] Freeberg, M.A. et al. “The European Genome-phenome Archive in 2021.” Nucleic Acids Research, 2022.
[12] Sudlow, C. et al. “UK Biobank: An Open Access Resource.” PLOS Medicine, 2015.
[13] Murtagh, M.J. et al. “Cautious Care — The SAIL Databank.” Intl J. Population Data Science, 2022.
[14] Luo, Y. et al. “Federated learning for preserving data privacy in collaborative healthcare research.” NPJ Digital Medicine / PMC, 2022.
[15] Hutson, M. “Artificial intelligence faces reproducibility crisis.” Science, Vol. 359, 2018.
[16] Hao, K. “AI is wrestling with a replication crisis.” MIT Technology Review, 2020.
[17] Arvan, M. et al. / Princeton CITP. “The Unreasonable Effectiveness of Open Science in AI.” arXiv:2412.17859, 2024.
[18] Powles, J. & Hodson, H. “Google DeepMind and healthcare in an age of algorithms.” Health and Technology / Springer, 2017.