Despite its immense potential, text analysis at scale presents significant challenges. This column benchmarks several state-of-the-art large language models against incentivised human coders in performing complex text analysis tasks. The results indicate that large language models consistently outperform outsourced human coders across a broad range of tasks, and thus provide economists with a cost-effective and accessible solution for advanced text analysis.
Historically, economists’ data analysis skills have centered on structured, tabular data. However, the rapid expansion of digitisation has positioned text data as a valuable resource for studying phenomena that traditional quantitative methods often struggle to address (Gentzkow et al. 2019). For instance, text analysis has enabled researchers to explore a wide range of topics, including analysing central bank communications and policy announcements for macroeconomic insights (e.g. Demirel 2012), studying firms’ inflation expectations (e.g. Thwaites 2022), investigating emotional contagion in social media (e.g. Kramer et al. 2014), examining gender stereotypes in movies (e.g. Gálvez et al. 2018), and assessing the impact of media coverage on political outcomes (e.g. Caprini 2023) and stock market behavior (e.g. Dougal et al. 2012).
Despite its immense potential, text analysis at scale presents significant challenges (Barberá et al. 2021). As Ash and Hansen (2023) note, economists have relied on three main approaches to tackle these challenges: (1) manual coding by outsourced human coders, (2) dictionary-based methods, and (3) supervised machine learning models. Each of these, however, has notable limitations. Outsourced manual coding is costly, time-intensive, and often relies on coders without domain-specific expertise. Dictionary-based methods fail to capture contextual nuances, leading to inaccuracies. Meanwhile, supervised machine learning requires considerable technical skill and large, labeled datasets – resources that are not always readily available (Gilardi et al. 2023, Rathje et al. 2024).
Generative large language models (LLMs) present a promising alternative for large-scale text analysis. Unlike traditional supervised learning methods, current LLMs are considered well-suited for tackling complex text analysis tasks without requiring task-specific training, effectively serving as ‘zero-shot learners’ (Kojima et al. 2022). In a recent paper (Bermejo et al. 2024a), we benchmark several state-of-the-art LLMs against incentivised human coders in performing complex text analysis tasks. The results reveal that modern LLMs provide economists with a cost-effective and accessible solution for advanced text analysis, significantly reducing the need for programming expertise or extensive labeled datasets.
The setup
The study examines a corpus of 210 Spanish news articles covering a nationwide fiscal consolidation program that impacted over 3,000 municipalities (see Bermejo et al. 2024b). This corpus is particularly suitable for testing contextual understanding, as the articles present complex political and economic narratives requiring in-depth knowledge of local government structures, political actors, and policy implications. Moreover, the articles frequently include intricate discussions on fiscal policies, political critiques, and institutional relationships, which would be difficult to analyse through simple keyword matching or surface-level reading.
Five tasks of increasing complexity were evaluated across all news articles under each coding strategy, with each task requiring progressively deeper contextual analysis. The tasks, and the metric used to score tagger performance on each, are as follows (a short scoring sketch follows the list):
- T1: Identify all municipalities mentioned in the article, with tagger performance measured by the macro-averaged F1-Score (a metric that balances precision, i.e. avoiding spurious identifications, with recall, i.e. avoiding missed ones).
- T2: Determine the total number of municipalities mentioned, with tagger performance measured by the mean absolute error (lower values indicate better performance).
- T3: Detect whether the municipal government is criticised, with tagger performance measured by accuracy.
- T4: Identify who is making the criticism, with tagger performance measured by accuracy (allowing for multiple correct labels).
- T5: Identify who is being criticised, with tagger performance measured by accuracy (allowing for multiple correct labels).
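To make the scoring concrete, the sketch below shows one way to compute these metrics in Python. It is illustrative only: the data layout (sets of municipality names for T1, integer counts for T2, boolean or categorical answers for T3 to T5) and the function names are assumptions, not the authors' code.

```python
"""Illustrative scoring functions for tasks T1-T3 (a sketch, not the authors' code)."""

def f1_for_sets(gold: set, pred: set) -> float:
    """F1 for one article: harmonic mean of precision and recall over tagged municipalities."""
    if not gold and not pred:
        return 1.0          # nothing to find and nothing tagged
    tp = len(gold & pred)   # municipalities correctly identified
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(golds, preds):
    """T1: macro-average the per-article F1 scores."""
    return sum(f1_for_sets(g, p) for g, p in zip(golds, preds)) / len(golds)

def mean_absolute_error(golds, preds):
    """T2: average absolute gap between the true and the reported counts."""
    return sum(abs(g - p) for g, p in zip(golds, preds)) / len(golds)

def accuracy(golds, preds):
    """T3 (and, with multi-label matching, T4/T5): share of exact matches."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

# Toy example with three articles
gold_sets = [{"Madrid", "Getafe"}, {"Sevilla"}, set()]
pred_sets = [{"Madrid"}, {"Sevilla"}, {"Lugo"}]
print(macro_f1(gold_sets, pred_sets))                     # ≈ 0.56
print(mean_absolute_error([2, 1, 0], [1, 1, 1]))          # ≈ 0.67
print(accuracy([True, False, True], [True, True, True]))  # ≈ 0.67
```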
These tasks were completed following three distinct coding strategies:
- High-skilled human coders (gold standard labels). Gold standard labels were established through a rigorous process involving highly skilled coders (the authors and a trained research assistant). This process included multiple rounds of labelling and deliberation to reach consensus, resulting in high inter-coder agreement rates. Agreement was measured as the proportion of matching tags between first and second coding rounds, reaching >80% across all tasks and exceeding the 70% agreement threshold commonly considered acceptable in the literature (Graham et al. 2012). These labels serve as the benchmark against which other coding strategies are evaluated (Song et al. 2020). In essence, they represent the ‘correct’ responses that other strategies should replicate.
- LLMs as coders. Four leading LLMs – GPT-3.5-turbo, GPT-4-turbo, Claude 3 Opus, and Claude 3.5 Sonnet – were tested using a zero-shot learning approach (see the prompt sketch after this list). Each model analysed every article twice to evaluate performance and consistency across tasks.
- Outsourced human coders. Students from ESADE, a Spanish university, were recruited as outsourced human coders. These students, primarily Spanish nationals with the relevant linguistic and cultural knowledge, participated in an incentivised online study. Each student coded three articles, with quality controls and attention checks embedded to ensure data reliability. The final sample comprised 146 participants. This approach reflects common research practice, in which university students or temporary workers are employed for coding tasks.
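In practice, the zero-shot approach amounts to sending each article to the model together with the task instructions and no labelled examples. The sketch below illustrates the idea with the OpenAI Python client; the prompt wording, the JSON output format, and the model name are illustrative assumptions rather than the exact prompt or settings used in the paper.

```python
"""Zero-shot coding of one article with an LLM API (illustrative sketch).

Requires the `openai` package and an OPENAI_API_KEY environment variable.
"""
import json
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are coding Spanish news articles about a fiscal consolidation programme.
Read the article and answer in JSON with the keys:
  "municipalities": list of all municipalities mentioned,
  "n_municipalities": total number of municipalities mentioned,
  "criticism": true if the municipal government is criticised, else false,
  "critic": who makes the criticism (list, may be empty),
  "criticised": who is criticised (list, may be empty).

Article:
{article}
"""

def code_article(article_text: str, model: str = "gpt-4-turbo") -> dict:
    """Send one article to the model with no task-specific training (zero-shot)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep output as deterministic as possible for reproducibility
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(article=article_text)}],
    )
    # In practice the reply should be validated (e.g. re-prompted on malformed JSON).
    return json.loads(response.choices[0].message.content)

# Hypothetical usage:
# labels = code_article("El alcalde de Getafe criticó el plan de ajuste ...")
# print(labels["municipalities"], labels["criticism"])
```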
Key findings
Coding strategies performance
Figure 1 illustrates the performance of outsourced human coders and LLMs across all tasks. The final panel (‘All correct’) shows the proportion of news articles where the different coders successfully completed all five tasks.
Figure 1 Overall performance, across tasks and coding strategies
Visual inspection of Figure 1 reveals that all LLMs outperform outsourced coders across all tasks. While GPT-3.5-turbo (the oldest and least advanced LLM tested) surpasses the human coders, it falls behind the other LLMs. Among the models compared, Claude 3.5 Sonnet and GPT-4-turbo (the most advanced) achieve the highest overall scores. This result suggests that as LLMs continue to grow more powerful, the performance gap between them and outsourced human coders will likely widen.
The performance advantage of LLMs holds even when task difficulty is taken into account. Figure 2 shows that state-of-the-art LLMs typically outperform outsourced human coders on more challenging tasks, where a task is deemed difficult if at least two authors initially disagreed on the correct answer during the creation of the gold standard labels.
Figure 2 Performance by article difficulty, across tasks and coding strategies
Other findings
- Text length is known to affect the performance of both LLMs and human coders. Classifying news articles as ‘long’ or ‘regular’ based on word count revealed that longer articles pose greater challenges for both coding strategies, with performance generally declining on longer texts. Notably, LLMs outperform human coders on longer articles, even achieving better results on long texts than outsourced human coders do on shorter ones.
- To verify that outsourced human coders performed the tasks correctly and followed the study’s requirements, permutation tests were conducted for tasks T1 through T5, testing whether their performance significantly exceeded random chance (a minimal sketch of such a test follows this list). The results confirmed that the coders provided meaningful responses rather than random ones.
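The sketch below illustrates a permutation test of this kind for a binary task such as T3. It is a simplified stand-in for the authors' procedure, under the assumption that coders' answers are shuffled across articles many times and the p-value is the share of shuffles that match the gold standard at least as well as the real answers.

```python
"""Permutation test: is the coders' accuracy better than chance? (illustrative sketch)"""
import random

def permutation_p_value(gold, coded, n_permutations=10_000, seed=0):
    rng = random.Random(seed)
    observed = sum(g == c for g, c in zip(gold, coded)) / len(gold)
    at_least_as_good = 0
    shuffled = list(coded)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between answers and articles
        perm_acc = sum(g == c for g, c in zip(gold, shuffled)) / len(gold)
        if perm_acc >= observed:
            at_least_as_good += 1
    # A small p-value means the observed accuracy is unlikely under random labelling.
    return (at_least_as_good + 1) / (n_permutations + 1)

# Toy example: 10 articles, coders match the gold label on 8 of them
gold  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
coded = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
print(permutation_p_value(gold, coded))
```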
Cost and implementation advantages
The cost advantages of LLMs are significant. Running all tasks across the entire corpus cost just $0.20 with GPT-3.5-turbo, $3.46 with GPT-4-turbo, $8.53 with Claude 3 Opus, and $2.28 with Claude 3.5 Sonnet. In each case, the complete set of answers was delivered within minutes. In contrast, the outsourced human coding approach required substantial investment: designing the online questionnaire, recruiting and managing 146 participants, and coordinating the entire data collection process, all of which incurred significant time and logistical costs. Collecting data from all participants took about 98 days. Beyond cost and time savings, LLMs also provide operational simplicity through straightforward API calls, removing the need for advanced programming expertise or human-labeled training data.
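Researchers budgeting a similar exercise can make a back-of-the-envelope estimate before any API call. The sketch below uses placeholder per-token prices and a crude words-to-tokens conversion; both are assumptions to be replaced with the provider's current pricing and a proper tokeniser, and the figures it prints are not those reported in the paper.

```python
"""Rough cost estimate for coding a corpus with an LLM API (placeholder numbers)."""

def estimate_cost(n_articles, avg_words_per_article, prompt_words,
                  avg_output_tokens, price_in_per_1k, price_out_per_1k,
                  passes=2):
    """Crude estimate: ~1.5 tokens per Spanish word, each article coded `passes` times."""
    input_tokens = (avg_words_per_article + prompt_words) * 1.5
    per_call = (input_tokens / 1000) * price_in_per_1k \
             + (avg_output_tokens / 1000) * price_out_per_1k
    return n_articles * passes * per_call

# 210 articles, two passes each, with assumed (placeholder) prices of
# $0.01 per 1k input tokens and $0.03 per 1k output tokens
print(f"${estimate_cost(210, 600, 250, 150, 0.01, 0.03):.2f}")
```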
Implications
Our study highlights the growing potential of modern generative LLMs as powerful, cost-effective tools for large-scale text analysis. The results demonstrate that LLMs consistently outperform outsourced human coders across a broad range of tasks. These findings underscore the significant advantages of leveraging LLMs for text analysis, suggesting that current natural language processing technologies have reached a point where researchers and practitioners – regardless of technical expertise – can seamlessly incorporate advanced text analysis methods into their work. Furthermore, as newer generations of LLMs continue to evolve, the performance gap between human coders and these models is likely to widen, making LLMs an increasingly valuable resource for economists.
Source: VoxEU