As generative pre-trained transformer (GPT) and other large language models become more prevalent, their capabilities beyond language processing will be tested. This column examines GPT’s economic rationality through budgetary decisions involving risk, time, social, and food preferences. GPT outperforms human subjects in rationality scores, while its preference parameters differ slightly from those of humans and show less heterogeneity. Rationality scores remain robust to randomness and demographics but are sensitive to contextual framing. These findings highlight GPT’s decision-making potential and the need for further investigation of its limitations.
In 2023, Nature’s 10, the journal’s annual list of ten influential ‘people’ in science, included a special non-human entry: ChatGPT. The choice acknowledged the significant role this human-language-mimicking artificial intelligence has played in the advancement of science. Generative pre-trained transformers (GPT), and large language models (LLMs) more broadly, have impressed the world with their fluency in various tasks, including writing, coding, solving mathematical problems, and acting like humans (Biancotti and Camassa 2023, Noy and Zhang 2023). There is an ongoing discussion about wider applications of this technology, particularly in supporting decision-making at both individual and collaborative levels (Ramge and Mayer-Schönberger 2023). In this context, the question arises: can GPT make high-quality decisions?
Rationality, a classic idea in economics, helps us address this question by capturing the extent to which a decision-maker’s choices can be explained as maximising a well-behaved utility function under given budget constraints (Afriat 1967). In practice, rationality has been widely used to measure the quality of decision-making. It has been applied to human decisions in laboratory environments, surveys, and grocery stores, and linked with occupation, income, and wealth differences across individuals as well as development gaps across countries (Choi et al. 2014, Cappelen et al. 2023). Rationality is used in various fields and has even been studied in other species, such as monkeys (Chen et al. 2006).
In Chen et al. (2023), we study the economic rationality of an increasingly important ‘species’, GPT. We instruct GPT-3.5 Turbo to act as a decision-maker, allocating 100 points between two commodities with different prices. These decisions are then used to calculate a rationality score, the critical cost efficiency index (CCEI), with the maximum value of one indicating the highest level of rationality. By specifying the different natures of the two commodities, we measure GPT’s rationality in four domains: risk, time, social, and food preferences. Correspondingly, the two commodities are set to be two risky assets, an instant payment and a future payment, a payment for the decision-maker and a payment for a randomly paired subject, and two types of food (meat and tomatoes). We repeat each environment 100 times to collect more observations. In addition, we conduct a parallel experiment with a representative US sample of 347 subjects to compare the rationality of GPT with that of humans.
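To make the rationality score concrete, the following is a minimal sketch of one common way to compute the CCEI: find the largest efficiency level e between 0 and 1 at which the choices satisfy the generalised axiom of revealed preference (GARP). It is an illustration under stated assumptions, not the code used in the study. The function names, the bisection tolerance, and the two-observation data sets are all hypothetical.

```python
def expenditure(price, bundle):
    """Cost of a bundle at the given prices."""
    return sum(p * x for p, x in zip(price, bundle))


def satisfies_garp(prices, bundles, e=1.0):
    """Check GARP at efficiency level e (e = 1 is the exact test)."""
    n = len(bundles)
    cost = [[expenditure(prices[i], bundles[j]) for j in range(n)]
            for i in range(n)]
    # Bundle i is directly revealed preferred to j if j was affordable
    # when i was chosen, after shrinking budget i by the factor e.
    rel = [[e * cost[i][i] >= cost[i][j] for j in range(n)] for i in range(n)]
    # Transitive closure of the relation (Warshall's algorithm).
    for k in range(n):
        for i in range(n):
            if rel[i][k]:
                for j in range(n):
                    if rel[k][j]:
                        rel[i][j] = True
    # Violation: i is revealed preferred to j, yet j is strictly
    # directly revealed preferred to i.
    for i in range(n):
        for j in range(n):
            if rel[i][j] and e * cost[j][j] > cost[j][i]:
                return False
    return True


def ccei(prices, bundles, tol=1e-6):
    """Largest e in [0, 1] at which GARP holds, found by bisection."""
    if satisfies_garp(prices, bundles, 1.0):
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical data: prices in points per unit, each bundle costs 100 points.
prices = [(1, 2), (2, 1)]
consistent = [(80, 10), (10, 80)]    # no revealed-preference cycle
inconsistent = [(20, 40), (40, 20)]  # each bundle was affordable at the other budget

print(ccei(prices, consistent))    # 1.0
print(ccei(prices, inconsistent))  # approximately 0.8
```

In the second data set, each chosen bundle would have cost only 80 points at the other observation's prices, so the budgets must be shrunk to 80% before the inconsistency disappears. A score of 0.8 can therefore be read as 20% of the budget being 'wasted' by inconsistent choices.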
Figure 1 shows the cumulative distributions of the rationality score in all four decision-making domains. In each panel, the light dotted line represents hypothetical subjects making random decisions, the dark dashed line represents human subjects, and the solid line represents GPT, with a more rightward position indicating a higher level of rationality. We see a clear pattern: GPT achieves a higher score than hypothetical subjects, confirming the sufficient power of our design to identify economic rationality. More importantly, we find that GPT consistently outperforms human subjects in terms of economic rationality.
Figure 1 Cumulative distributions of rationality scores
In addition to rationality, we are interested in two other economic outcomes. First, the property of downward-sloping demand, a fundamental principle in consumer behaviour, requires that the demand for a commodity decrease as its price rises. Compliance with this property serves as another measure of whether a decision-maker’s behaviour can be understood through economic theories. As with the rationality score, we find that GPT respects downward-sloping demand to a higher degree than humans in all four domains, confirming GPT’s ability to make reasonable economic decisions. Second, the underlying preferences behind decisions can be estimated through structural models. Compared to humans, GPT exhibits higher risk tolerance in risk preference, greater patience in time preference, a stronger inclination towards other-regarding and efficiency-oriented behaviour in social preference, and a lower preference for meat in food preference. Additionally, while different human subjects display varied preferences, GPT consistently delivers highly similar patterns across multiple repetitions of the same decision tasks.
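One simple way to quantify compliance with downward-sloping demand is to compare every pair of decisions and count how often the commodity with the higher price is demanded in no larger quantity. The sketch below is an illustrative measure chosen for this example and is not necessarily the statistic used in the study. The function name and the data are hypothetical.

```python
from itertools import combinations


def demand_compliance(prices, quantities):
    """Share of observation pairs in which a higher price comes with a
    quantity that is no higher. Pairs with equal prices are skipped."""
    compliant = 0
    total = 0
    for (p1, q1), (p2, q2) in combinations(zip(prices, quantities), 2):
        if p1 == p2:
            continue
        total += 1
        # Price and quantity move in opposite directions (or quantity is flat).
        if (p1 - p2) * (q1 - q2) <= 0:
            compliant += 1
    return compliant / total if total else float("nan")


# Hypothetical data: price of commodity A and units of A demanded.
prices_a = [0.5, 1.0, 2.0]
print(demand_compliance(prices_a, [80, 50, 20]))  # 1.0, fully compliant
print(demand_compliance(prices_a, [20, 50, 80]))  # 0.0, demand rises with price
```

A value of one means every pairwise comparison respects the property, and lower values indicate more frequent violations.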
How consistent is GPT’s behaviour in richer settings? To answer this question and enhance our understanding of GPT, we introduce variations of the above-mentioned baseline framework from three perspectives. First, we increase the temperature – a parameter in GPT’s setting – to increase the stochasticity and creativity of outputs. Second, we reframe the decision task. In one condition, we change the price description from ‘1 point = X units of commodity’ to ‘Y points = 1 unit of commodity’. In the other condition, we change the choice set from a continuous budget line to 11 options on the same budget line. These variations allow us to investigate whether the economic rationality of GPT is robust to framings that are less common. Third, we alter the demographic specifications in the prompt that requests GPT to act as a decision-maker, including variations in gender, age, education level, and minority group status. We are interested in whether GPT performs differently under various individual characteristics, which is related to growing concerns about algorithm bias (Obermeyer et al. 2019).
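The two reframings leave the underlying budget line unchanged, which a short sketch can make explicit. It converts a price stated as ‘1 point = X units’ into the equivalent ‘Y points = 1 unit’ form, and lists 11 options on a 100-point budget line. The even spacing of the options is an assumption made for illustration, as are the function names and the example prices.

```python
def points_per_unit(units_per_point):
    """Convert '1 point = X units' into 'Y points = 1 unit' (Y = 1 / X)."""
    return 1 / units_per_point


def discrete_options(units_per_point_a, units_per_point_b,
                     budget=100, n_options=11):
    """Evenly spaced allocations along the budget line, returned as
    (units of A, units of B) pairs. Even spacing is an assumption."""
    options = []
    for k in range(n_options):
        points_a = budget * k / (n_options - 1)
        points_b = budget - points_a
        options.append((points_a * units_per_point_a,
                        points_b * units_per_point_b))
    return options


# Hypothetical prices: 1 point buys 0.5 units of A or 2 units of B.
print(points_per_unit(0.5))  # 2.0 points per unit of A
for option in discrete_options(0.5, 2.0):
    print(option)
```

Every option spends the full 100 points, so a decision-maker with stable preferences faces the same trade-off under each framing, and only the presentation differs.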
Figure 2 shows the average rationality score in the baseline framework and the variations. We find that GPT performs equally well under different temperature settings and demographic specifications. However, the rationality level of GPT drops substantially under the alternative price description and the discrete choice setting. While we also observe a reduction in the rationality of human subjects in these two variations, the drop is much smaller than that observed in GPT. We discuss potential reasons for GPT’s sensitivity to contexts and frames, such as the reflection of existing bias, insufficient training in alternative settings, and common tendencies of LLMs under dissimilar tasks.
Figure 2 Rationality scores of GPT across different variations
Conclusion
By applying economic theory and experimental methods, we provide a systematic approach to evaluating GPT’s decision-making abilities. Our results show that GPT is able to display a high level of rationality in decisions related to risk, time, social, and food preferences. This performance is robust to increased randomness in GPT’s outputs and across different demographic characteristics. These findings highlight the great potential for GPT to serve as a general decision-support tool. More importantly, the fact that LLMs are sensitive to contextual framing calls for further research on the replication and generalisability of LLM research, enabling machines to better assist human beings to “reap their benefits and minimise their harms” (Rahwan et al. 2019).
Source: VoxEU