
Leveraging large language models for large-scale information retrieval in economics

The flood of new economics research makes it hard for policymakers and researchers to keep up. Traditional keyword searches often fail to capture the nuance of a paper’s causal design or how variables interact. This column introduces a methodology that uses large language models to process and synthesise tens of thousands of economics papers. The authors show how structuring findings into a ‘causal knowledge graph’, which identifies which economic concepts are linked and whether rigorous causal inference methods support these links, has the potential to assist policymakers, researchers, and students in navigating academic research.

Policymakers, researchers, and practitioners face an ever-growing avalanche of new economics research each year, on topics ranging from development and finance to inequality and climate change. Attempting to locate relevant findings – or even to keep up with the pace of publication – can be overwhelming. Keyword searches on platforms such as Google Scholar can help, but these often fall short when it comes to capturing the nuance of a paper’s causal design or how variables interact. As artificial intelligence (AI) advances, particularly with large language models (LLMs), there is growing optimism that these tools can assist researchers by efficiently parsing, summarising, and mapping key findings from academic work at scale (Ash and Hansen 2023, Dell 2024, Korinek 2023).

This column describes a methodology that uses LLMs to process and synthesise tens of thousands of economics papers. By focusing on the causal statements within each paper, we show how to construct a ‘causal knowledge graph’ – a structured representation that identifies which economic concepts are linked and whether rigorous causal inference methods (like randomised controlled trials or difference-in-differences) support these links. We also reflect on how these approaches can be extended to other domains, such as patents, financial filings, or policy reports, with a note of caution about biases and ‘hallucinations’ in LLM outputs.

The challenge of information overload

The sheer volume of publications poses obvious hurdles. Many fields within economics – think of macroeconomics, finance, or development – each feature extensive, rapidly evolving literatures. Even narrowing down to subtopics is no easy feat, since keyword-based searches can be inconsistent (e.g. “carbon tax” versus “emissions levy”) and do not necessarily reveal whether a paper truly identifies a causal effect or just reports correlations. The rising emphasis on data-intensive methods, from randomised trials to quasi-experimental approaches, has only increased the complexity of reading and summarising evidence (Angrist and Pischke 2010, Card and DellaVigna 2013, Pischke 2021).

These challenges motivate the use of advanced AI techniques. LLMs have proved remarkably adept at understanding and generating human-like text. They can digest large amounts of text, classify information according to prompts, and then output structured summaries. Researchers have already experimented with LLMs for tasks such as measuring housing regulation (Bartik et al. 2023), analysing remote work in job postings (Hansen et al. 2023), or mapping granular product networks (Fetzer et al. 2024a, 2024b). Building on these successes, we apply LLMs to systematically extract causal statements from over 44,000 NBER and CEPR papers, focusing on how authors establish relationships between economic concepts and how they demonstrate that these relationships are causal.

Overview of our LLM-powered pipeline

In our approach (summarised in Figure 1), we extract the first 30 or so pages from each PDF, capturing the abstract, introduction, and the main empirical sections, where authors typically present their study design and results.
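As a rough illustration of this extraction step, the snippet below pulls text from the opening pages of a PDF. It is a minimal sketch assuming the pypdf library; the page cutoff mirrors the description above, and the file path is purely illustrative rather than part of the authors' implementation.

```python
# Minimal sketch of the text-extraction step (assumption: pypdf is used;
# the 30-page cutoff follows the description in the text, the path is illustrative).
from pypdf import PdfReader

def extract_leading_pages(pdf_path: str, max_pages: int = 30) -> str:
    """Return the concatenated text of roughly the first `max_pages` pages."""
    reader = PdfReader(pdf_path)
    pages = reader.pages[:max_pages]
    return "\n".join(page.extract_text() or "" for page in pages)

text = extract_leading_pages("paper.pdf")  # hypothetical working-paper file
```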

We designed prompts for a custom fine-tuned LLM that reads this text and generates structured output in JSON format. In the first stage, the model provides a broad summary of a paper’s key elements: research questions, data usage (public data versus proprietary), and main methodological approaches.

In the second stage, the LLM identifies explicit cause-and-effect statements. For example, if a paper states that “a policy reduces unemployment,” the model records a directed link from “policy” to “unemployment,” labels it as causal, and notes the estimation method (such as instrumental variables). We also record details about sample sizes, statistical significance, data context, data ownership, and whether the paper offers replicable public data or depends on proprietary information.
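To make the structured output concrete, the sketch below shows the kind of record such a prompt might request for a single claim. The field names, the prompt wording, and the example values are our own illustrative assumptions, not the authors' exact schema.

```python
import json

# Illustrative second-stage record (field names and values are our own guesses
# at what such a prompt might request, not the authors' exact schema).
example_claim = {
    "cause": "policy",
    "effect": "unemployment",
    "direction": "reduces",
    "is_causal": True,
    "method": "instrumental variables",
    "sample_size": 12000,                # illustrative value
    "statistically_significant": True,
    "data_ownership": "public",
}

prompt = (
    "Read the paper text and return a JSON array of cause-and-effect claims. "
    "For each claim report: cause, effect, direction, is_causal, method, "
    "sample_size, statistically_significant, data_ownership."
)

print(json.dumps(example_claim, indent=2))
```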

Figure 1
Note: The LLM processes academic papers to extract fields like author, institution, method, and data availability. These feed into two branches: Identification and Causal Claims. The Identification branch evaluates methods such as identification strategies and robustness checks, alongside data ownership, measurements, and extrapolated contexts. The Causal Claims branch analyses causal relationships as directed edges, focusing on two levels: (1) individual source (cause) and sink (effect) nodes, both as claimed and measured, with details on data types and owners; (2) source-sink edges, examining the evidentiary method and whether null results were reported.

To unify language across tens of thousands of papers, we then map each free-text variable to standard Journal of Economic Literature (JEL) codes. This is necessary because one author might refer to “household expenditures” while another writes “household consumption.” By encoding each variable and each JEL code into vector embeddings (using OpenAI’s text-embedding-3-large model), we can find the closest semantic match, resulting in a consistent taxonomy of concepts. Once this mapping is complete, each paper’s causal claims form a small directed graph whose nodes are JEL-coded concepts – for instance, “G21” for banks and microfinance or “D12” for consumer economics – and whose edges represent the causal relationships. This process is illustrated in Figure 2.
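A minimal sketch of this matching step is given below, assuming the OpenAI Python client and only two illustrative JEL descriptions; the helper function and the simple nearest-match rule are simplifications rather than the full pipeline.

```python
# Sketch of mapping a free-text variable to its nearest JEL code via embeddings.
# Assumptions: the openai Python client is available; the two JEL descriptions
# shown are a tiny illustrative subset of the full code list.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

jel_codes = {
    "G21": "Banks; Depository Institutions; Micro Finance Institutions; Mortgages",
    "D12": "Consumer Economics: Empirical Analysis",
}

jel_vecs = embed(list(jel_codes.values()))
var_vec = embed(["household expenditures"])[0]

# Cosine similarity between the variable and each JEL description.
sims = jel_vecs @ var_vec / (np.linalg.norm(jel_vecs, axis=1) * np.linalg.norm(var_vec))
best_code = list(jel_codes)[int(np.argmax(sims))]
print(best_code)  # an expenditure-type variable should map to D12
```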

Figure 2
Note: This diagram depicts our AI-driven methodology for mapping causal linkages between economic concepts using JEL codes. Papers are processed via an LLM that extracts causal claims, identifying source (cause) and sink (effect) variables. These claims are encoded as directed links between JEL-coded nodes, forming a cumulative knowledge graph. Semantic embeddings and cosine similarity align variables to their closest JEL codes, enabling a unified taxonomy.

Illustrative example: Banerjee et al. (2015)

Consider Banerjee et al. (2015), a much-cited randomised controlled trial on how microfinance affects household outcomes in India. Our pipeline reviews the introduction and empirical discussion, identifies about eight causal links, and then maps them to standard codes. Specifically, “introduction of microfinance” is linked to “households borrow microcredit”, which in turn affects “business creation,” “monthly expenditures,” and other development indicators. Each arrow is labelled as a causal relationship grounded in RCT evidence. The resulting knowledge graph (Figure 3) reveals a chain of cause and effect: from microfinance introduction (mapped to G21) to outcomes like health and education (mapped to I15) or expenditures (D12). By repeating this for thousands of papers, we build a global view of what economists claim as causal and what methods they use to back these claims.

Figure 3
Note: This figure presents the knowledge graph of Banerjee et al. (2015), illustrating the causal impact of introducing microfinance in India. Nodes represent economic concepts mapped to JEL codes; arrows indicate the direction of claims from source to sink. The graph has 8 causal edges, 12 unique paths, and a longest path length of 3, reflecting a complex causal narrative with multiple interconnected outcomes resulting from the intervention.
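For readers who want to compute such per-paper statistics themselves, the sketch below builds a small directed graph with networkx and reports an edge count, the number of unique simple paths, and the longest path length. The edges listed are a simplified, illustrative subset of the Banerjee et al. (2015) graph, not the authors' exact extraction.

```python
# Sketch of assembling a paper-level causal graph and its summary statistics.
# The edges below are an illustrative subset of the Banerjee et al. (2015) graph.
import itertools
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("G21 introduction of microfinance", "G21 households borrow microcredit"),
    ("G21 households borrow microcredit", "D12 monthly expenditures"),
    ("G21 households borrow microcredit", "L26 business creation"),
    ("G21 households borrow microcredit", "I15 health and education"),
])

n_edges = G.number_of_edges()

# Count every simple directed path between distinct node pairs.
n_paths = sum(
    1
    for u, v in itertools.permutations(G.nodes, 2)
    for _ in nx.all_simple_paths(G, u, v)
)

longest = nx.dag_longest_path_length(G)  # valid because the claim graph is acyclic
print(n_edges, n_paths, longest)
```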

Why this matters, and possible extensions

Mapping out these causal claims helps us see the structure of economic knowledge. We can identify which topics are frequently studied or which remain underexplored. In addition, by tracing the methods used – RCTs, difference-in-differences, or instrumental variables – we get a sense of how the discipline has shifted toward more robust identification strategies over time (Goldsmith-Pinkham 2024). Beyond economics, the same pipeline can be adapted to process legal documents (Ash and Hansen 2023), parse technology patents, or help governments scan policy reports.

There is also an opportunity to integrate more advanced features. By capturing effect sizes or standard errors, an LLM-driven approach might power systematic reviews or meta-analyses, reducing the tedious work of hand-coding. Further potential lies in bridging across policy areas: a single knowledge graph could unify, for instance, labour market policies in health economics with labour market policies in development contexts.

Practical lessons from building the pipeline

Prompt design is critical. In our tests, a few words or examples can change the accuracy of the LLM’s classification. We also emphasise structured output, such as requiring the model to answer in JSON format with specific fields for each claim. Human review remains important, especially because AI can occasionally misclassify or ‘hallucinate’ references. We validated our results by comparing subsets to manually coded data (Brodeur et al. 2020) and checking consistency with known databases of exogenous variations, such as the Plausibly Exogenous Galore repository. Although the pipeline successfully retrieves high-level summaries and causal claims, it is not flawless. Users must be aware of the potential for bias in LLMs and the computational costs of handling large corpora.

Privacy and ethical considerations arise too. We need to confirm that feeding text into an LLM respects usage rights for each working paper. In more sensitive applications, researchers may want to rely on local models or carefully control inputs and outputs to stay compliant with data-usage policies.

Challenges and limitations

LLMs can inherit biases from the data used to train them, which may affect how they classify or summarise topics (Gallegos et al. 2024). They are also not fully transparent in how they arrive at a conclusion. Meanwhile, the cost of running LLMs across tens of thousands of papers can be high, requiring paid API access. Finally, even a well-trained model might overlook some subtlety of identification strategy or incorrectly summarise effect sizes, depending on the complexity of a paper. As such, while LLMs are powerful tools for streamlining information retrieval and synthesis, their outputs should be evaluated with validation datasets to ensure accuracy and contextual relevance (Ludwig et al. 2025).

Conclusion

By applying large language models to a wide corpus of economics working papers, we construct a global map of causal claims that has the potential to assist policymakers, researchers, and students in navigating academic research. This pipeline highlights the rise of rigorous empirical methods in economics, showing how the field has pivoted toward evidence-based causal claims over the past few decades. More generally, LLMs offer a novel way to handle information overload, enabling us to quickly locate relevant studies, trace how variables connect through credible causal pathways, and identify neglected questions that deserve attention.

Source: cepr.org
