Are AI Agents Ready to Manage Your Money? A Critical Look at the Economic Rationality of Large Language Models

"New research benchmarks the decision-making skills of LLMs, revealing surprising gaps in their ability to handle complex financial scenarios."


Imagine a future where AI agents handle your personal finances, make investment decisions, and even negotiate on your behalf. This vision is fueled by the rapid advancement of Large Language Models (LLMs), which are increasingly being touted as capable decision-makers. However, before we entrust our economic well-being to these digital entities, a crucial question arises: are LLMs truly rational enough to handle the complexities of the financial world?

Recent research has explored using LLMs as decision-making engines, configuring them either to act directly as economic agents or to serve as key components of broader systems. LLM-based agents are already showing strength in planning, complex problem-solving, tool use, and game play. Assessing their economic rationality, however, is a different ballgame.

To address this concern, a team of researchers has developed a novel benchmark called STEER (Systematic and Tuneable Evaluation of Economic Rationality) to rigorously assess the economic rationality of LLMs. This benchmark draws upon established economic principles and cognitive psychology to evaluate LLMs across a wide range of decision-making scenarios.

Introducing STEER: A Report Card for AI Rationality

STEER isn't just another AI benchmark; it's a comprehensive framework designed to evaluate LLMs against the gold standard of economic rationality. It moves beyond ad-hoc tasks by enumerating first principles describing how agents should make decisions, then evaluating an agent's degree of adherence. The normative question of how decision-makers should act has been the focus of more than a century of research in economics, cognitive psychology, computer science, and philosophy.
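
To make this concrete, consider one of the simplest candidate principles: expected-value maximization under risk. The snippet below is a toy illustration of that kind of check, assuming a risk-neutral agent; the lotteries, names, and scoring are hypothetical and not drawn from the paper.

```python
# Toy illustration of a single "element of rationality" check, in the
# spirit of STEER's first-principles approach. Everything here (lotteries,
# names, scoring) is a hypothetical example, not the paper's actual code.

lottery_a = [(0.8, 100), (0.2, 0)]  # (probability, payoff) pairs
lottery_b = [(1.0, 70)]             # a sure payoff of 70

def expected_value(lottery):
    """Probability-weighted sum of payoffs."""
    return sum(p * x for p, x in lottery)

# For a risk-neutral agent, the normatively correct choice is the
# lottery with the higher expected value: here EV(A)=80 > EV(B)=70.
rational_choice = "A" if expected_value(lottery_a) > expected_value(lottery_b) else "B"
agent_choice = "B"  # whatever the model actually answered

print(f"EV(A)={expected_value(lottery_a):.0f}, EV(B)={expected_value(lottery_b):.0f}")
print("Adheres to expected-value maximization:", agent_choice == rational_choice)
```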

The STEER benchmark encompasses 64 distinct "elements of rationality," organized into a hierarchical taxonomy. These elements span a wide range of cognitive skills, from basic arithmetic and probability to more complex concepts like game theory and social choice. Each element is then instantiated in multiple "grade levels" of difficulty and across various domains, such as finance and medicine, creating a robust and nuanced evaluation environment.

  • Foundations: Tests core mathematical and logical reasoning abilities.
  • Decisions in Single-Agent Environments: Explores preference formation and decision-making with single deterministic or probabilistic outcomes.
  • Decisions in Multi-Agent Environments: Assesses strategic thinking and game theory concepts.
  • Decisions on Behalf of Other Agents: Evaluates the ability to aggregate preferences and make socially responsible choices.

Using this framework, the researchers generated a dataset of 24,500 multiple-choice questions, meticulously validated to ensure accuracy and relevance. This dataset was then used to evaluate 14 different LLMs, ranging from smaller models like Llama 7B to the powerful GPT-4 Turbo.
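
As a rough sketch of how such a question bank might be organized in code (the schema, field names, and example content below are assumptions for illustration, not the paper's actual data format):

```python
# Hypothetical schema for a STEER-style benchmark item. Field names and
# example content are illustrative assumptions, not the paper's format.
from dataclasses import dataclass

@dataclass
class BenchmarkQuestion:
    element: str        # one of the 64 elements of rationality
    grade_level: int    # difficulty tier
    domain: str         # e.g., "finance" or "medicine"
    prompt: str
    choices: list[str]
    answer_index: int   # index of the normatively correct option

question = BenchmarkQuestion(
    element="expected_value",
    grade_level=1,
    domain="finance",
    prompt="Which investment has the higher expected return?",
    choices=["A: an 80% chance of $100", "B: $70 for sure"],
    answer_index=0,
)

def is_correct(q: BenchmarkQuestion, model_answer: int) -> bool:
    """Score a multiple-choice response against the correct option."""
    return model_answer == q.answer_index

print(is_correct(question, 0))  # True
```

Varying the element, grade level, and domain independently is what makes such an evaluation "tuneable": the same principle can be probed at different difficulties and in different settings.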

Beyond the Hype: Towards Truly Rational AI Agents

The STEER benchmark provides a valuable tool for evaluating and improving the economic rationality of LLMs. By identifying specific areas where models struggle, researchers and developers can focus their efforts on fine-tuning, curating new datasets, and developing specialized architectures. The journey toward truly rational AI agents is just beginning, but benchmarks like STEER are essential for guiding our progress and ensuring that these powerful tools are used responsibly and effectively in the economic sphere.

About this Article

This article was crafted using a collaborative human-AI approach. AI assisted our team with initial drafting, research insights, identifying key questions, and image generation. Our human editors guided topic selection, defined the angle, structured the content, ensured factual accuracy and relevance, refined the tone, and conducted thorough editing to deliver helpful, high-quality information. See our About page for more information.

This article is based on research published under:

DOI: https://doi.org/10.48550/arXiv.2402.09552

Title: STEER: Assessing the Economic Rationality of Large Language Models

Subjects: cs.CL, econ.GN, q-fin.EC

Authors: Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz

Published: 14 February 2024

Everything You Need To Know

1. What is STEER, and why is it important for evaluating LLMs?

STEER (Systematic and Tuneable Evaluation of Economic Rationality) is a benchmark designed to assess the economic rationality of Large Language Models (LLMs). It is important because it provides a structured way to evaluate LLMs against established economic principles and cognitive psychology. STEER moves beyond simple tasks by examining how LLMs make decisions based on first principles. It helps in identifying specific areas where LLMs struggle, guiding researchers and developers in improving these models for use in financial and economic applications.

2. What are the key components of the STEER benchmark?

STEER encompasses 64 elements of rationality, organized into a hierarchical taxonomy. These elements are grouped into categories: Foundations (testing basic math and logic), Decisions in Single-Agent Environments (exploring preference formation), Decisions in Multi-Agent Environments (assessing strategic thinking and game theory), and Decisions on Behalf of Other Agents (evaluating the ability to make socially responsible choices). Each element is then tested at multiple difficulty levels and across various domains, such as finance and medicine.

3. How does STEER test the economic rationality of Large Language Models?

STEER uses a dataset of 24,500 multiple-choice questions to evaluate LLMs. The questions are designed to assess an LLM's adherence to the principles of economic rationality. The benchmark covers a broad range of cognitive skills, from basic arithmetic and probability to complex concepts like game theory. By analyzing the LLM's responses to these questions, researchers can determine its strengths and weaknesses in economic decision-making.
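
As a hypothetical sketch of how graded responses might be rolled up into the kind of per-element "report card" the benchmark aims for (the record format here is assumed for illustration, not the paper's actual output):

```python
# Hypothetical sketch: tallying graded answers into per-element accuracy.
from collections import defaultdict

# (element, was_correct) pairs for a batch of graded questions
results = [
    ("expected_value", True),
    ("expected_value", False),
    ("nash_equilibrium", False),
]

totals = defaultdict(lambda: [0, 0])  # element -> [correct, attempted]
for element, correct in results:
    totals[element][0] += int(correct)
    totals[element][1] += 1

for element, (right, total) in sorted(totals.items()):
    print(f"{element}: {right}/{total} = {right / total:.0%}")
```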

4. What are the potential implications of using LLMs in financial decision-making, and how does STEER relate to this?

The potential implications of using LLMs in financial decision-making are significant, as they could be used to handle personal finances, make investments, and negotiate on behalf of individuals or organizations. STEER is directly relevant because it helps determine whether LLMs are rational enough to handle the complexities of the financial world. By identifying the limitations of LLMs through STEER, developers can work to improve the models and ensure that they are used responsibly and effectively in the economic sphere, mitigating risks associated with irrational decision-making.

5. What are the limitations of Large Language Models as revealed by the STEER benchmark?

The STEER benchmark reveals that Large Language Models (LLMs) have surprising gaps in their ability to handle complex financial scenarios. While LLMs have shown strengths in planning, solving complex problems, and playing games, their economic rationality is still under development. STEER helps to pinpoint specific areas where LLMs struggle, such as game theory and making socially responsible choices. This information can be used to improve the models and guide their development towards truly rational AI agents.
