
Literature support and the capabilities of autonomous research agents

A central weakness of modern machine-learning systems is that performance tends to deteriorate outside the support of the training distribution. This column uses a tournament-style evaluation of AI-generated and human-written papers to investigate whether this limitation also applies to autonomous LLM research agents. The findings suggest that such agents may be particularly useful for scaling research activities that rely on established templates, but are likely to be weakest precisely where human researchers are most valuable: asking unusual questions, reframing problems, adapting methods to thin-data settings, and making judgements when the relevant template does not already exist.

Since ChatGPT brought large language models (LLMs) to broad public attention in November 2022, there has been active discussion about their potential to reshape the production of research. In economics, this debate ranges from how LLMs can expand the use of text and information in empirical work (Ash and Hansen 2023) to how they may alter the process of scientific experimentation itself, including by generating novel hypotheses, supporting experimental design, and enhancing data analysis through natural language-processing tools that measure dimensions such as sentiment and engagement (Charness et al. 2023).

More recently, however, with the introduction of easily deployable agentic frameworks based on LLMs, such as Claude Code or OpenAI Codex, this discussion has shifted from the use of LLMs as tools that assist particular research tasks to their potential role as autonomous research agents. In particular, many researchers have begun asking whether these systems can not only support researchers, but also generate, implement, and evaluate research with a greater degree of autonomy.

While much of this optimism reflects the fact that agentic systems often perform better on complex tasks than single-prompt interactions with standalone models, these gains appear to come less from a dramatic change in the underlying reasoning capabilities of base models than from architectural scaffolding that enables iterative planning, task decomposition, tool use, and intermediate validation. These features can help models approach difficult problems through a more structured divide-and-conquer process, but they do not eliminate a central weakness of modern machine-learning systems: performance often deteriorates outside the support of the training distribution (e.g. Geirhos et al. 2020, Udandarao et al. 2024).

In new work (Zampa 2026), I ask whether this limitation also applies to autonomous LLM research agents. If it does, these systems should perform better when generating papers that follow paradigms, topics, and empirical templates already familiar from the existing literature, and worse when asked to produce research in semantically sparse or genuinely novel areas.

Do AI research agents perform better in familiar areas of the literature?

I test this hypothesis using the Autonomous Policy Evaluation (APE) project from the University of Zurich (Social Catalyst Lab 2026). APE is an open platform designed to evaluate whether autonomous workflows based on LLMs can generate, replicate, and iteratively improve observational policy evaluation studies using publicly available data. On the platform, AI agents produce novel empirical policy papers and then compete in a tournament-style evaluation conducted by an LLM judge, with a small sample of human-written papers included as benchmarks. Performance is tracked using Elo ratings and conservative TrueSkill scores (Herbrich et al. 2007), whose distributions are shown in Figure 1. As can be seen, human-authored papers substantially outperform AI-generated papers on average in the current sample.

Figure 1 Distribution of APE tournament scores by paper type

Note: The figure reports Elo ratings and conservative TrueSkill scores for AI-generated and human-authored papers in the APE sample.
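
For readers unfamiliar with the two rating systems mentioned above, the sketch below shows a standard Elo update after one pairwise comparison and the usual "conservative" TrueSkill summary, the mean skill estimate minus a multiple of its uncertainty. The K-factor of 32, the starting rating of 1500, and the multiplier of 3 are common defaults assumed here for illustration; the column does not state which settings APE uses.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Probability that paper A beats paper B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one match.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k=32 is an assumed K-factor, not APE's documented setting.
    """
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta


def conservative_skill(mu: float, sigma: float, k: float = 3.0) -> float:
    """Conservative skill estimate: mean minus k standard deviations.

    k=3 is the conventional choice; treat it as an assumption here.
    """
    return mu - k * sigma


# Two papers start at an assumed rating of 1500 and the judge prefers A.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

The conservative score penalises papers that have been compared only a few times: a paper with a high mean but a large sigma ranks below one with the same mean and a tighter estimate.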

The broader motivation behind the APE project is that scalable automation of policy evaluation could accelerate the identification of effective interventions by expanding the volume and speed of empirical economic analysis. Although the setting is still evolving, it provides a rare opportunity to study how AI-generated research performs within a common evaluation environment.

In practice, I construct a measure of “literature support” for each APE paper and test whether it helps predict its performance in the tournament. To do this, I source all abstracts flagged as “economics” and published since 2000 from OpenAlex, a large-scale open bibliographic database of scholarly works (Priem et al. 2022), yielding 1.67 million abstracts. I then map these abstracts into a shared semantic space and assess whether each APE abstract lies in a densely populated or relatively isolated region of that space. A paper receives a higher literature-support score when it sits near many semantically similar papers, and a lower score when it lies in a thinner part of the research landscape. Intuitively, the measure captures whether a paper is working with familiar topics, framings, and empirical templates, as opposed to moving into less represented territory.
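
The idea of scoring a paper by how densely populated its semantic neighbourhood is can be sketched as follows. This is an illustrative construction, not the one used in Zampa (2026): it assumes each abstract has already been turned into an embedding vector, and it defines support as the average of an exponential kernel over cosine distance with a bandwidth `tau`. The function names, the kernel, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def literature_support(query, corpus, tau=0.1):
    """Average kernel weight exp(-(1 - cos) / tau) over the corpus.

    Higher values mean the query sits near many similar abstracts.
    The kernel and the default bandwidth are assumptions for illustration.
    """
    weights = [
        math.exp(-(1.0 - cosine_similarity(query, doc)) / tau)
        for doc in corpus
    ]
    return sum(weights) / len(weights)


# Toy corpus: three abstracts clustered together, one off on its own.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
dense_paper = [1.0, 0.05]    # close to the cluster
sparse_paper = [-1.0, 0.2]   # far from every corpus abstract

print(literature_support(dense_paper, corpus))
print(literature_support(sparse_paper, corpus))
```

A smaller `tau` makes the measure more local, so only very close neighbours contribute; a larger `tau` smooths over a wider region of the semantic space.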

The main result is that literature support is positively associated with performance for AI-generated papers, but not for human-written papers. AI-generated papers that are closer to dense areas of the existing economics literature perform better in the APE tournament. In the baseline estimates, moving from a less-supported to a more-supported region of the literature is associated with gains in both Elo ratings and conservative TrueSkill scores. By contrast, the same relationship is close to zero for the human-written benchmark papers. Figure 2 shows that this pattern is not driven by a single way of constructing the support measure: across alternative definitions, the estimated relationship remains positive for AI-generated papers, while the corresponding estimates for human-written papers are much smaller and noisier.

Figure 2 Estimated relationship between literature support and tournament performance for AI-generated and human-written papers

Note: Points show estimated coefficients and bars indicate 95% confidence intervals. The left panel reports Elo results and the right panel reports conservative TrueSkill results. Different values of 𝝉 on the horizontal axis correspond to alternative bandwidth choices used to construct the literature-support measure.
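
The comparison reported in Figure 2 amounts to estimating how tournament scores vary with literature support separately for each paper type. The column does not spell out the regression specification, so the sketch below assumes the simplest version: a bivariate least-squares slope computed per group. The numbers are placeholders constructed to show the mechanics; they are not APE data and do not reproduce any estimate from the paper.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


# Placeholder inputs only: support scores and made-up ratings per group.
support = [0.0, 1.0, 2.0, 3.0]
scores_by_type = {
    "ai": [1000.0, 1050.0, 1100.0, 1150.0],     # rises with support
    "human": [1400.0, 1390.0, 1410.0, 1400.0],  # roughly flat
}

slopes = {
    paper_type: ols_slope(support, scores)
    for paper_type, scores in scores_by_type.items()
}
print(slopes)
```

In an actual analysis one would also want standard errors and controls, and would repeat the estimation for each bandwidth used to construct the support measure, which is what the horizontal axis of Figure 2 varies.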

Several caveats qualify the interpretation of these results. The literature-support measure is based on abstracts, not full papers, so it reasonably captures a paper’s topic, framing, and semantic location, but not everything that may matter for performance, such as empirical credibility, robustness, clarity, or the handling of limitations. It is also only an indirect proxy for the training-distribution support available to an LLM: OpenAlex abstracts approximate the density of publicly available economics research, but they do not reveal what a specific model was trained on or how it uses that material. Finally, the evidence remains observational. Since LLMs appear on both sides of the APE setting – as paper producers and as judges – the association between literature support and performance could reflect stronger AI-generated papers in well-supported areas, greater judge familiarity with those areas, or both. The absence of a similar pattern for human-written papers points against a purely judge-side explanation, but the human benchmark sample is small. Future work is therefore needed to validate these results, for example using a larger and more comparable pool of human benchmarks, non-LLM evaluation procedures, and richer support measures based on the full semantic content of papers rather than abstracts alone.

Implications for knowledge production

With those caveats in mind, the pattern observed has important implications if it generalises. To begin, it does not imply that AI has little value in research. If these systems perform best in well-mapped intellectual terrain, they may be particularly useful for scaling research activities that already rely on established templates: structured literature reviews, data assembly, replication exercises, robustness extensions, policy-note production, and incremental empirical applications in heavily studied areas. That is a substantial capability, especially for institutions under pressure to produce evidence quickly.

However, the same finding also suggests limits that matter for universities, journals, and public agencies. If autonomous systems are strongest where the literature is already dense, then they are likely to be weakest precisely where human researchers are most valuable: asking unusual questions, reframing problems, adapting methods to thin-data settings, and making judgements when the relevant template does not already exist. The near-term risk, then, is not that AI will fully automate scientific discovery. It is that institutions will overestimate its ability to do work outside familiar distributions and then reorganise research pipelines around a misleading notion of machine autonomy.

If this leads to a reduction in the demand for human researchers, especially at the junior stages where many careers begin, the result could be a narrower pipeline of people developing the skills and independence required for more original work. Even if AI systems raise output in the short run, the longer-run effect could be a reduction in exploratory capacity if human researchers are precisely the margin along which more disruptive forms of innovation emerge. The broader concern, then, is not simply substitution, but selective substitution: a world in which machines become better at reproducing and extending established lines of inquiry, while the human base that generates departures from those lines becomes thinner. The results shown here do not establish that outcome, but they do suggest that it is a possibility worth taking seriously when evaluating the role of LLMs in research and higher education.

Source: VoxEU

GLOBAL BUSINESS AND FINANCE MAGAZINE