When the ruler is made of the thing it measures: Multi-model evidence on AI occupational exposure scores

To estimate how AI is reshaping work, it is now standard to ask AI itself to score how exposed each occupation is. The instrument is thus an instance of the phenomenon. This column replicates the most common scoring procedure using four frontier AI models and identical task data. The share of US occupations classified as ‘high direct exposure’ to AI differs widely, ranging from 2.7% to 51.5% depending on the model used. The findings imply that any analysis based on AI-generated exposure scores should report results from at least two or three different frontier AI models.

There is a circularity at the centre of the empirical AI-and-labour literature. To estimate how artificial intelligence is reshaping work, researchers ask an AI to score each occupation according to how much of its task content the AI could perform. The instrument and the phenomenon are the same kind of object. The procedure is now standard. Goldman Sachs' widely cited estimate that AI could expose 300 million jobs to automation itself rests on a version of this AI-scoring procedure. So do the IMF's cross-country exposure analysis (Cazzaniga et al. 2024), the ILO's Generative AI and Jobs Index (Gmyrek et al. 2025), the Yale Budget Lab's projections (Gimbel et al. 2025), and PwC's 2025 Global AI Jobs Barometer covering one billion job advertisements. The numbers anchored by these scores now sit in central bank communications, finance ministry briefings, board presentations, and even state-level K-12 guidance documents in the US (Arizona AI Alliance 2025).

In our paper (Yin et al. 2026), my co-authors and I ask a question the standard procedure leaves implicit. Suppose the same scoring exercise is run with a different AI. Does the result change?

The answer is yes, and by far more than the field has acknowledged. The flagship statistic from Eloundou et al. (2024) – the share of US occupations with more than half of their tasks at high direct exposure – ranges from 2.7% under Google’s Gemini 2.5 to 51.5% under Anthropic’s Claude 4.5 on identical task data. That is a nineteen-fold spread. GPT-4 (the model used in the original Eloundou study) places the figure at 3.8%; ChatGPT-5 places it at 20.3%. Same instructions, same O*NET task descriptions, same data pipeline. Only the AI rater varied.
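For concreteness, here is a minimal sketch of how that flagship statistic is computed, assuming a task-level file with one exposure label per task per model. The file name, column names, and label values (onet_task_exposure.csv, gpt4, chatgpt5, gemini25, claude45) are hypothetical stand-ins, not the paper's actual pipeline:

```python
# Sketch only: the schema and names below are assumptions, not the paper's pipeline.
import pandas as pd

# Hypothetical task-level data: one row per O*NET task, with an exposure
# label ('high' or 'not_high') assigned by each frontier-model rater.
tasks = pd.read_csv("onet_task_exposure.csv")

RATERS = ["gpt4", "chatgpt5", "gemini25", "claude45"]  # assumed column names

for rater in RATERS:
    # Share of each occupation's tasks that this model rates high direct exposure.
    high_share = (
        tasks.assign(is_high=tasks[rater].eq("high"))
             .groupby("occupation")["is_high"]
             .mean()
    )
    # An occupation counts as high direct exposure if more than half its tasks are.
    pct = (high_share > 0.5).mean() * 100
    print(f"{rater}: {pct:.1f}% of occupations at high direct exposure")
```

Everything in this loop is held fixed except the rater column; that single substitution is the comparison that produces the 2.7% to 51.5% spread.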

The pattern is sharper at the level of executive perception. Take management as an example, the kind of category any chief financial officer or workforce planner cares about. Under Claude, more than 80% of management occupations are classified as high-exposure. Under Gemini, fewer than one in five are. Under GPT-4, roughly one in four are. A workforce strategy memo that reads management roles as deeply at risk and one that reads them as moderately exposed are downstream of the same paper, the same data, and the same procedure. They differ only in which AI did the scoring.

These dispersion findings have a structural explanation, and the explanation is what makes them difficult to wave away. Each frontier model is its own instrument. Its exposure judgements are the joint product of an underlying labour market reality and the model’s training corpus, calibration choices, and reinforcement signals. None of those choices is wrong in any obvious sense, and none of them is small. The differences are also systematic rather than random: each model carries a consistent directional tilt across occupations, so the bias does not cancel with larger samples.
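One way to see why larger samples do not help is to compare each rater to the cross-model consensus. A sketch, assuming an occupation-by-rater table of exposure shares (the file name occupation_exposure_by_rater.csv is hypothetical, e.g. the high_share series above assembled with one column per model):

```python
# Sketch only: file and column names are assumptions.
import pandas as pd

# Hypothetical table: rows are occupations, columns are raters, values are
# the share of each occupation's tasks rated high exposure (0 to 1).
scores = pd.read_csv("occupation_exposure_by_rater.csv", index_col="occupation")

consensus = scores.mean(axis=1)             # cross-model mean per occupation
deviations = scores.sub(consensus, axis=0)  # each rater's gap from the consensus

# Mean-zero noise would put these averages near zero and the sign-agreement
# rates near 50%; a consistently signed tilt does not average away as the
# occupation sample grows.
print(deviations.mean())        # average directional tilt per rater
print(deviations.gt(0).mean())  # share of occupations each rater scores above consensus
```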

There is, in addition, a feedback channel that the standard measurement-error framework does not accommodate. The tasks where AI capability is advancing fastest are also the tasks generating the most training data for newer models. As the underlying technology evolves, the instrument that measures the technology’s effect on work evolves with it. The measurement and the phenomenon are not independent. This violates the mean-zero error assumption underlying classical sensitivity analyses, and it is the analytical reason the bias is irreducible from a single-model exercise.

A natural reaction is to ask whether downstream conclusions are also unstable. They are. When each model's scores are plugged into the standard difference-in-differences employment specification, the point estimate of the AI-exposure effect flips sign across raters: positive under three of the four annotators, negative under the fourth. None of the four estimates reaches conventional statistical significance, but the qualitative conclusion a reader would draw differs across the four.
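A stylised version of that exercise, with an assumed occupation-year panel and a simplified two-way fixed effects specification rather than the paper's exact one:

```python
# Sketch only: the panel, variable names, and specification are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per occupation-year, with log employment,
# a 0/1 post-rollout indicator, and one exposure score column per rater.
panel = pd.read_csv("occupation_employment_panel.csv")

for rater in ["gpt4", "chatgpt5", "gemini25", "claude45"]:
    fit = smf.ols(
        f"log_emp ~ {rater}:post + C(occupation) + C(year)",
        data=panel,
    ).fit(cov_type="cluster", cov_kwds={"groups": panel["occupation"]})
    # Coefficient of interest: differential employment change in more exposed
    # occupations after the rollout, under this rater's scores.
    print(f"{rater}: exposure x post = {fit.params[f'{rater}:post']:+.4f}")
```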

The identity of the occupations most affected also shifts sharply across raters. Two analyses with different AI raters would arrive at different conclusions about both the magnitude and the direction of the labour market effect, and target different occupational categories on the same evidence.
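Rank agreement across raters makes that shift measurable. A sketch, reusing the same hypothetical occupation-by-rater table:

```python
# Sketch only: reuses the assumed occupation-by-rater table from above.
from itertools import combinations
import pandas as pd

scores = pd.read_csv("occupation_exposure_by_rater.csv", index_col="occupation")

# Pairwise rank agreement between raters' occupational exposure orderings.
print(scores.corr(method="spearman"))

# Overlap in each pair's top decile of 'most exposed' occupations.
k = max(1, len(scores) // 10)
tops = {r: set(scores[r].nlargest(k).index) for r in scores.columns}
for a, b in combinations(scores.columns, 2):
    print(f"{a} vs {b}: top-decile overlap = {len(tops[a] & tops[b]) / k:.0%}")
```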

For applied research and policy, the implication is straightforward in design and consequential in practice. Any analysis that conditions on AI-generated exposure scores, whether it appears in a peer-reviewed journal, an international institution's working paper, or a corporate strategy memo, should report results from at least two or three different frontier AI models. Where conclusions converge across raters, the finding is likely robust (though other sources of error remain). Where they diverge, the divergence itself is informative: it tells us that the inference depends on a feature of the model rather than a feature of the labour market. The compute cost of doing this is roughly the same as running an analysis on three different cloud providers. The price of not doing it is a body of empirical work in which a portion of the cross-paper variation reflects which AI happened to be available when the analysis was run.

The point also generalises beyond AI and labour research. Wherever AI is asked to produce a number that anchors a consequential decision – in pricing, eligibility scoring, hiring, lending, programme planning – the model is not a neutral observer. It is an instrument whose calibration shifts as it is updated. Treating multi-model checks as a standard practice now, in this early period when the empirical literature is still consolidating, is far less costly than discovering the instability later.

The deeper philosophical point that economists may find worth holding onto is this: we have a long tradition of asking whether the rulers we use are well calibrated for the things we measure. The classical literature on measurement error treats the ruler as fixed. When the ruler is itself made of the thing it measures, when the technology that scores AI exposure is an instance of the technology whose effect we are studying, fixedness is an assumption, not a property. Our paper is a first quantitative reading of how much that assumption matters in the present application. Across four current frontier AI raters and on identical task data, it matters by a factor of nineteen.

Source: VoxEU
