A central weakness of modern machine-learning systems is that performance tends to deteriorate outside the support of the training distribution. This column uses a tournament-style evaluation of AI-generated and human-written papers to investigate whether this limitation also applies to autonomous LLM research agents. The findings suggest that such agents may be particularly useful for scaling research activities that rely on established templates, but are likely to be weakest precisely where human researchers are most valuable: asking unusual questions, reframing problems, adapting methods to thin-data settings, and making judgements when the relevant template does not already exist.
Since ChatGPT brought large language models (LLMs) to broad public attention in November 2022, there has been active discussion about their potential to reshape the production of research. In economics, this debate ranges from how LLMs can expand the use of text and information in empirical work (Ash and Hansen 2023) to how they may alter the process of scientific experimentation itself, including by generating novel hypotheses, supporting experimental design, and enhancing data analysis through natural language-processing tools that measure dimensions such as sentiment and engagement (Charness et al. 2023).
More recently, with the introduction of easily deployable LLM-based agentic frameworks such as Claude Code or OpenAI Codex, the discussion has shifted from LLMs as tools that assist with particular research tasks to their potential role as autonomous research agents. Many researchers have begun asking whether these systems can not only support researchers, but also generate, implement, and evaluate research with a greater degree of autonomy.
Much of the optimism about autonomous research agents reflects the fact that agentic systems often perform better on complex tasks than single-prompt interactions with standalone models. These gains, however, appear to come less from a dramatic change in the underlying reasoning capabilities of base models than from architectural scaffolding that enables iterative planning, task decomposition, tool use, and intermediate validation. These features can help models approach difficult problems through a more structured divide-and-conquer process, but they do not eliminate a central weakness of modern machine-learning systems: performance often deteriorates outside the support of the training distribution (e.g. Geirhos et al. 2020, Udandarao et al. 2024).
In new work (Zampa 2026), I ask whether this limitation also applies to autonomous LLM research agents. If it does, these systems should perform better when generating papers that follow paradigms, topics, and empirical templates already familiar from the existing literature, and worse when asked to produce research in semantically sparse or genuinely novel areas.
I test this hypothesis using the Autonomous Policy Evaluation (APE) project from the University of Zurich (Social Catalyst Lab 2026). APE is an open platform designed to evaluate whether autonomous workflows based on LLMs can generate, replicate, and iteratively improve observational policy evaluation studies using publicly available data. On the platform, AI agents produce novel empirical policy papers and then compete in a tournament-style evaluation conducted by an LLM judge, with a small sample of human-written papers included as benchmarks. Performance is tracked using Elo ratings and conservative TrueSkill scores (Herbrich et al. 2007), whose distributions are shown in Figure 1. As can be seen, human-authored papers substantially outperform AI-generated papers on average in the current sample.
Figure 1 Distribution of APE tournament scores by paper type
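For readers unfamiliar with the two rating systems, the sketch below illustrates the textbook Elo update for a single pairwise comparison and one common way of forming a conservative TrueSkill score. It is not the APE platform's implementation: the function names, the K-factor of 32, and the choice of three standard deviations for the conservative score are assumptions made for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from APE.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def conservative_skill(mu: float, sigma: float, z: float = 3.0) -> float:
    """Conservative TrueSkill-style score: posterior mean minus z posterior std devs.

    z = 3 is a common convention and is assumed here.
    """
    return mu - z * sigma


# Two equally rated papers; the judge prefers paper A.
paper_a, paper_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

The conservative score penalises papers whose skill estimate is still uncertain, so a paper needs both a high mean and enough comparisons to rank well.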
The broader motivation behind the APE project is that scalable automation of policy evaluation could accelerate the identification of effective interventions by expanding the volume and speed of empirical economic analysis. Although the setting is still evolving, it provides a rare opportunity to study how AI-generated research performs within a common evaluation environment.
In practice, I construct a measure of “literature support” for each APE paper and test whether it helps predict its performance in the tournament. To do this, I source all abstracts flagged as “economics” and published since 2000 from OpenAlex, a large-scale open bibliographic database of scholarly works (Priem et al. 2022), yielding 1.67 million abstracts. I then map these abstracts into a shared semantic space and assess whether each APE abstract lies in a densely populated or relatively isolated region of that space. A paper receives a higher literature-support score when it sits near many semantically similar papers, and a lower score when it lies in a thinner part of the research landscape. Intuitively, the measure captures whether a paper is working with familiar topics, framings, and empirical templates, as opposed to moving into less represented territory.
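The idea of a density-based support score can be sketched as follows. This is a minimal illustration under stated assumptions, not the measure used in the paper: it assumes abstracts have already been converted to embedding vectors by some model, and it scores each paper by the mean cosine similarity to its k nearest corpus abstracts. The function name `literature_support`, the choice of k, and the toy three-dimensional data are all hypothetical.

```python
import numpy as np


def literature_support(paper_vecs: np.ndarray, corpus_vecs: np.ndarray, k: int = 10) -> np.ndarray:
    """Mean cosine similarity between each paper and its k nearest corpus abstracts.

    Higher values mean the paper sits in a more densely populated region.
    """
    # Normalise rows so that dot products equal cosine similarities.
    papers = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    corpus = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = papers @ corpus.T  # shape: (n_papers, n_corpus)
    # Keep the k largest similarities in each row (order within the k is irrelevant).
    top_k = np.partition(sims, -k, axis=1)[:, -k:]
    return top_k.mean(axis=1)


# Toy corpus: a dense cluster around one direction plus a thin random background.
rng = np.random.default_rng(0)
centre = np.array([1.0, 0.0, 0.0])
dense_cluster = centre + 0.05 * rng.normal(size=(200, 3))
thin_background = rng.normal(size=(20, 3))
corpus = np.vstack([dense_cluster, thin_background])

papers = np.array([
    [1.0, 0.0, 0.0],   # sits inside the dense cluster
    [-1.0, 0.0, 0.0],  # points away from it, into thin territory
])
scores = literature_support(papers, corpus, k=10)
```

In this toy setup the first paper receives the higher score. With a corpus of 1.67 million abstracts, the full similarity matrix would not fit comfortably in memory, so a real implementation would likely rely on batching or an approximate nearest-neighbour index.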
The main result is that literature support is positively associated with performance for AI-generated papers, but not for human-written papers. AI-generated papers that are closer to dense areas of the existing economics literature perform better in the APE tournament. In the baseline estimates, moving from a less-supported to a more-supported region of the literature is associated with gains in both Elo ratings and conservative TrueSkill scores. By contrast, the same relationship is close to zero for the human-written benchmark papers. Figure 2 shows that this pattern is not driven by a single way of constructing the support measure: across alternative definitions, the estimated relationship remains positive for AI-generated papers, while the corresponding estimates for human-written papers are much smaller and noisier.
Figure 2 Estimated relationship between literature support and tournament performance for AI-generated and human-written papers
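One simple way to compare the support-performance relationship across the two groups is a linear regression with an interaction term. The sketch below is an illustration only: the specification, the function name `support_slopes`, and the synthetic data (with a planted slope of 40 for AI-generated papers and zero for human-written ones) are assumptions, not the paper's estimates or estimation strategy.

```python
import numpy as np


def support_slopes(score: np.ndarray, support: np.ndarray, is_human: np.ndarray) -> dict:
    """OLS of score on support, a human dummy, and their interaction.

    Returns the implied slope of score on support for each group.
    """
    X = np.column_stack([
        np.ones_like(support),  # intercept
        support,                # slope for AI-generated papers
        is_human,               # level shift for human-written papers
        support * is_human,     # difference in slope for human-written papers
    ])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    return {"ai": beta[1], "human": beta[1] + beta[3]}


# Synthetic data, purely for illustration (not drawn from the APE tournament).
rng = np.random.default_rng(1)
n = 500
is_human = (np.arange(n) % 10 == 0).astype(float)  # one in ten papers is human-written
support = rng.normal(size=n)                       # standardised support score
noise = rng.normal(scale=10.0, size=n)
score = 1200.0 + 100.0 * is_human + 40.0 * support * (1.0 - is_human) + noise

slopes = support_slopes(score, support, is_human)
```

Here `slopes["ai"]` recovers a value close to the planted 40 and `slopes["human"]` a value close to zero. With a small human benchmark sample, the human slope is estimated far less precisely, which mirrors the caution needed when reading the noisier human-written estimates.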
Several caveats qualify the interpretation of these results. The literature-support measure is based on abstracts, not full papers, so it reasonably captures a paper’s topic, framing, and semantic location, but not everything that may matter for performance, such as empirical credibility, robustness, clarity, or the handling of limitations. It is also only an indirect proxy for the training-distribution support available to an LLM: OpenAlex abstracts approximate the density of publicly available economics research, but they do not reveal what a specific model was trained on or how it uses that material. Finally, the evidence remains observational. Since LLMs appear on both sides of the APE setting – as paper producers and as judges – the association between literature support and performance could reflect stronger AI-generated papers in well-supported areas, greater judge familiarity with those areas, or both. The absence of a similar pattern for human-written papers points against a purely judge-side explanation, but the human benchmark sample is small. Future work is therefore needed to validate these results, for example using a larger and more comparable pool of human benchmarks, non-LLM evaluation procedures, and richer support measures based on the full semantic content of papers rather than abstracts alone.
With those caveats in mind, the pattern observed has important implications if it generalises. To begin, it does not imply that AI has little value in research. If these systems perform best in well-mapped intellectual terrain, they may be particularly useful for scaling research activities that already rely on established templates: structured literature reviews, data assembly, replication exercises, robustness extensions, policy-note production, and incremental empirical applications in heavily studied areas. That is a substantial capability, especially for institutions under pressure to produce evidence quickly.
However, the same finding also suggests limits that matter for universities, journals, and public agencies. If autonomous systems are strongest where the literature is already dense, then they are likely to be weakest precisely where human researchers are most valuable: asking unusual questions, reframing problems, adapting methods to thin-data settings, and making judgements when the relevant template does not already exist. The near-term risk, then, is not that AI will fully automate scientific discovery. It is that institutions will overestimate its ability to do work outside familiar distributions and then reorganise research pipelines around a misleading notion of machine autonomy.
If this leads to a reduction in the demand for human researchers, especially at the junior stages where many careers begin, the result could be a narrower pipeline of people developing the skills and independence required for more original work. Even if AI systems raise output in the short run, the longer-run effect could be a reduction in exploratory capacity if human researchers are precisely the margin along which more disruptive forms of innovation emerge. The broader concern, then, is not simply substitution, but selective substitution: a world in which machines become better at reproducing and extending established lines of inquiry, while the human base that generates departures from those lines becomes thinner. The results shown here do not establish that outcome, but they do suggest that it is a possibility worth taking seriously when evaluating the role of LLMs in research and higher education.
Source: VoxEU