Are AI Agents Economically Rational? New Benchmark Reveals Surprising Model Behaviors

"A deep dive into the 'STEER' framework and its implications for the future of AI-driven decision-making in economics."


The integration of Large Language Models (LLMs) into decision-making processes is evolving rapidly, presenting both unprecedented opportunities and significant challenges. LLMs are now being deployed as 'agents' in a range of capacities, from direct economic interactions to serving as components of larger systems. A central question remains: can these AI systems make sound, rational decisions?

Recent studies highlight the potential of LLM-based agents in diverse fields such as personal finance, medical diagnostics, and strategic games like chess. LLMs are also being used to provide Reinforcement Learning from AI Feedback (RLAIF), to refine chatbot behavior, and to run social science experiments, raising the prospect of AI agents undertaking tasks previously reserved for humans.

Progress toward deploying LLM agents hinges on a basic question: is a given LLM agent reliable enough to be trusted? In this article, we examine the challenge of assessing the economic rationality of LLMs and introduce STEER, a novel framework designed to evaluate and benchmark the decision-making capabilities of these AI agents.

STEER: A New Benchmark for Economic Rationality

The research paper introduces STEER (Systematic and Tuneable Evaluation of Economic Rationality), a novel benchmark distribution for quantitatively scoring an LLM's performance across fine-grained elements of decision-making. This benchmark addresses a critical need: a reliable methodology for assessing the economic rationality of LLMs acting as agents.

The STEER framework is built on a rich, hierarchical taxonomy of 64 'elements of rationality,' each representing a specific aspect of sound decision-making. These elements are derived from economic literature on rational choice and encompass a wide range of behaviors, including:

  • Foundations: Arithmetic, optimization, probability, logic, and theory of mind.
  • Decisions in Single-Agent Environments: Axioms of utility in deterministic and stochastic settings, risk preferences, and cognitive bias avoidance.
  • Decisions in Multi-Agent Environments: Strategic interactions in normal form games, extensive form games, and games with imperfect information.
  • Decisions on Behalf of Other Agents: Social choice theory and mechanism design.

STEER goes beyond simply identifying these elements. It instantiates them at multiple 'grade levels' of difficulty and across various domains (e.g., finance, medicine). For a significant portion of the elements, LLM prompts were crafted to generate multiple-choice questions, which then underwent manual validation to ensure quality and relevance. The result is a comprehensive benchmark for evaluating LLMs across a spectrum of decision-making scenarios.
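To make this concrete, here is a minimal sketch, in Python, of what a STEER-style multiple-choice item and per-element scorer might look like. The item schema, element names, and example questions below are illustrative assumptions for exposition, not the paper's actual data format:

```python
# Hypothetical sketch of STEER-style multiple-choice items and a per-element scorer.
# The field names and example questions are illustrative, not the benchmark's schema.
from dataclasses import dataclass

@dataclass
class MCQItem:
    element: str      # an 'element of rationality', e.g. "probability"
    grade_level: int  # difficulty tier
    domain: str       # e.g. "finance", "medicine"
    question: str
    choices: list[str]
    answer: int       # index of the correct choice

def score(items: list[MCQItem], model_answers: list[int]) -> dict[str, float]:
    """Return accuracy per element of rationality."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item, ans in zip(items, model_answers):
        total[item.element] = total.get(item.element, 0) + 1
        if ans == item.answer:
            correct[item.element] = correct.get(item.element, 0) + 1
    return {e: correct.get(e, 0) / n for e, n in total.items()}

items = [
    MCQItem("probability", 1, "finance",
            "A fair coin is flipped twice. What is the probability of two heads?",
            ["1/2", "1/4", "1/3", "3/4"], 1),
    MCQItem("arithmetic", 1, "finance",
            "What is 12% of 250?", ["25", "30", "32", "40"], 1),
]
print(score(items, [1, 1]))  # both correct -> {'probability': 1.0, 'arithmetic': 1.0}
```

Grading by element rather than with a single aggregate score is what makes the evaluation fine-grained: a model can be strong on arithmetic yet weak on, say, avoiding cognitive biases, and a per-element breakdown surfaces exactly that.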

The Path Forward

The development and release of STEER represent a crucial step towards ensuring that AI agents are not only powerful but also economically sound. As LLMs continue to evolve, benchmarks like STEER will be essential for guiding their development and deployment in ways that are both beneficial and aligned with human values and expectations.

About this Article

This article was crafted using a human-AI hybrid and collaborative approach. AI assisted our team with initial drafting, research insights, identifying key questions, and image generation. Our human editors guided topic selection, defined the angle, structured the content, ensured factual accuracy and relevance, refined the tone, and conducted thorough editing to deliver helpful, high-quality information. See our About page for more information.

Everything You Need To Know

1. What is the primary goal of the STEER benchmark?

The STEER (Systematic and Tuneable Evaluation of Economic Rationality) benchmark is designed to quantitatively score an LLM's performance across fine-grained elements of decision-making. It serves as a reliable methodology for assessing the economic rationality of Large Language Models when they act as agents in various scenarios. STEER enables a systematic evaluation of how well these models adhere to principles of rational choice derived from economic literature.

2. What are the key components evaluated by the STEER framework?

The STEER framework assesses 64 'elements of rationality,' categorized into Foundations (arithmetic, optimization, probability, logic, theory of mind), Decisions in Single-Agent Environments (axioms of utility, risk preferences, cognitive bias avoidance), Decisions in Multi-Agent Environments (strategic interactions in games), and Decisions on Behalf of Other Agents (social choice theory, mechanism design). These elements are tested at various difficulty levels and across diverse domains such as finance and medicine, providing a comprehensive evaluation of decision-making capabilities.

3. How does STEER address the need for reliable LLM agents?

STEER addresses the need for reliable LLM agents by providing a structured benchmark to evaluate their economic rationality. By scoring Large Language Models against a detailed taxonomy of decision-making elements, STEER helps identify areas where models excel or fall short. This targeted assessment aids in developing and deploying AI agents that are not only powerful but also economically sound and aligned with rational decision-making principles. Furthermore, the multiple-choice question format with manual validation enhances the reliability and relevance of the benchmark.

4. In what real-world applications can LLM-based agents be utilized, and what makes the assessment of their rationality crucial?

LLM-based agents have potential applications in personal finance, medical diagnostics, strategic games like chess, and enhancing Reinforcement Learning from AI Feedback (RLAIF). Assessing their rationality is crucial because these agents are increasingly involved in tasks that were once reserved for humans, where poor decision-making could have significant consequences. By ensuring their economic rationality through benchmarks like STEER, we can improve their reliability and trustworthiness in these critical roles.

5. What are the broader implications of using benchmarks like STEER for the future of AI?

Benchmarks like STEER are vital for guiding the development and deployment of Large Language Models in ways that are beneficial and aligned with human values and expectations. By ensuring that AI agents are economically rational, these benchmarks help prevent unintended consequences and promote trust in AI systems. As Large Language Models continue to evolve and take on more complex decision-making roles, benchmarks like STEER will play an increasingly important role in ensuring their responsible and effective integration into society. While STEER focuses on economic rationality, it sets a precedent for creating similar benchmarks to evaluate other critical aspects of AI behavior, such as ethical considerations and social impact.
