Large Language Models (LLMs) are rapidly becoming indispensable assistants across daily life, work, business, and research. As their power grows, so does interest in exploring new applications.

Can LLMs play video games? Can they find vulnerabilities in web applications? Can they predict future events?

To answer such questions rigorously, we need well-defined tasks, data, and evaluation metrics—an LLM benchmark. Benchmarks measure specific skills, compare models consistently, and help developers choose the right LLM for a purpose.

Demand for new benchmarks also arises from benchmark saturation: as LLMs train on more data, existing benchmarks often become too easy. If a model has seen the test data during training, it may simply memorize answers instead of generalizing. Thus, varied and dynamic benchmarks are crucial.

In a previous post, we tested whether LLMs can understand prices by simulating the TV game show The Price Is Right, specifically the “Showcase” segment, where contestants bid on the total price of a prize package and lose if they overbid. With minimal instruction and a few examples, many LLMs performed impressively, sometimes surpassing human contestants.

This simple yet strategic challenge inspired us to create:

Price-is-Right LLM Benchmark

What This Benchmark Tests

Price estimation: Can models accurately estimate consumer product costs?
Reasoning under constraints: Can models follow bidding rules to maximize accuracy while avoiding disqualification?

How It Works

The Showcase Challenge

Each LLM receives short descriptions of three products and bids on their total price, with the instruction not to overbid. Ten product-price examples are provided for reference. This task, called a Showcase, is repeated 50–100 times with randomized product sets. Models don’t compete directly; instead, their bids are compared afterward in a simulated Tournament of pairwise matchups.
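To make the setup concrete, here is a minimal Python sketch of one Showcase and one pairwise matchup. The function names (`sample_showcase`, `matchup_winner`) are hypothetical. The matchup rule is an assumption borrowed from the TV show: the closest bid without going over wins, and a matchup where both bids go over has no winner. The benchmark's actual implementation may differ.

```python
import random

# A few (description, price) pairs taken from the examples in this post.
ITEMS = [
    ("Mentos chewing gum", 1.69),
    ("Soft Scrub cleanser", 5.29),
    ("Zico coconut water", 2.49),
    ("Merzetta roasted bell peppers (16 oz)", 3.99),
    ("J&D's croutons (4.25 oz)", 2.49),
]


def sample_showcase(items, n_items=3, rng=None):
    """Pick n_items distinct items and return them with their total price."""
    rng = rng or random.Random()
    chosen = rng.sample(items, n_items)
    total = round(sum(price for _, price in chosen), 2)
    return chosen, total


def matchup_winner(bid_a, bid_b, actual):
    """Return "a", "b", or None when there is no winner.

    Assumed rule: a bid above the actual total is disqualified;
    among valid bids, the higher (closer) one wins.
    """
    a_valid = bid_a <= actual
    b_valid = bid_b <= actual
    if a_valid and not b_valid:
        return "a"
    if b_valid and not a_valid:
        return "b"
    if not a_valid and not b_valid:
        return None  # both overbid
    if bid_a == bid_b:
        return None  # identical valid bids: treated as a tie
    return "a" if bid_a > bid_b else "b"
```

For a Showcase whose true total is 9.47, `matchup_winner(7.00, 9.50, 9.47)` returns `"a"`: the second bid is closer in absolute terms but goes over by three cents, which is why a cautious bid can beat a more accurate one.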

Data

We use grocery items from the actual show, priced at West Coast retail levels. The database currently includes 820 items. Some example item descriptions and prices:

Description	Price ($)
Mentos chewing gum	1.69
Soft Scrub cleanser	5.29
Zico coconut water	2.49
Merzetta roasted bell peppers (16 oz)	3.99
J&D's croutons (4.25 oz)	2.49

The pilot version uses static prices from a single source, with no geographic variation. Future updates will expand the dataset for broader coverage.

Evaluation Metrics

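MAPE (Mean Absolute Percentage Error): The absolute difference between bid and actual price, divided by the actual price, averaged across Showcases and expressed as a percentage (lower is better).
Overbid Rate: Percentage of bids exceeding the true price; higher rates suggest weaker compliance with the no-overbid rule.
Elo Rating: Chess-inspired rating derived from the Tournament’s pairwise matchups, reflecting both accuracy and bidding strategy (higher is better).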
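The first two metrics are simple to compute from a list of bids and the matching true totals. The sketch below uses hypothetical function names and assumes every bid, including overbids, counts toward MAPE; the benchmark's own code may handle edge cases differently.

```python
def mape(bids, actuals):
    """Mean absolute percentage error of bids against true totals, in percent."""
    if not bids or len(bids) != len(actuals):
        raise ValueError("bids and actuals must be non-empty and the same length")
    errors = [abs(bid - actual) / actual for bid, actual in zip(bids, actuals)]
    return 100.0 * sum(errors) / len(errors)


def overbid_rate(bids, actuals):
    """Share of bids strictly above the true total, in percent."""
    if not bids or len(bids) != len(actuals):
        raise ValueError("bids and actuals must be non-empty and the same length")
    overbids = sum(1 for bid, actual in zip(bids, actuals) if bid > actual)
    return 100.0 * overbids / len(bids)


# Four made-up Showcases, each with a true total of 10.00.
bids = [9.0, 12.0, 8.0, 10.0]
actuals = [10.0, 10.0, 10.0, 10.0]
print(round(mape(bids, actuals), 2))  # 12.5
print(overbid_rate(bids, actuals))    # 25.0
```

Note that a bid exactly equal to the true total is not counted as an overbid here, matching the "without going over" wording of the game.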

Leaderboard example:

Rank	Model	Elo Rating	MAPE (%)	Overbid Rate (%)
1	qwen3-30b-a3b	1231.98	22.74	20.41
2	gpt-5	1210.45	16.79	42.00
3	gpt-4o	1208.76	25.90	26.00
4	claude-sonnet-4-20250514	1132.58	26.34	50.00
5	o3	1131.20	13.67	50.00
6	gemini-1.5-pro	1117.05	26.97	44.00
7	gpt-5-mini	1114.03	24.42	40.82
8	claude-3-5-sonnet-20241022	1076.95	25.08	52.00
9	glm-4p5	1075.94	23.57	46.00
10	o1	1043.51	14.60	60.00
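For readers unfamiliar with Elo, the rating column can be read through the standard Elo formulas sketched below. This is the textbook update, not necessarily the exact procedure used for the leaderboard: the K-factor of 32 and the choice to score a no-winner matchup as a draw (0.5) are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one matchup.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a draw
    (assumed here to cover matchups with no winner).
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two equally rated models: the winner gains what the loser drops.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Under this model, the gap between the top and bottom ratings in the table (1231.98 vs. 1043.51) corresponds to an expected score of about 0.75 for the higher-rated model in a single matchup.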

Models

We track leading LLMs from OpenAI, Google, and Anthropic, as well as open-weight models served through providers like Fireworks. Models must accept text input and return JSON.
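Because replies must be JSON, a harness needs to validate each reply before scoring it. The sketch below shows one way to do that; the `{"bid": <number>}` schema and the `parse_bid` helper are hypothetical, since the exact response format is not specified here.

```python
import json


def parse_bid(raw_reply):
    """Return the bid as a float, or None if the reply is unusable.

    Assumes a hypothetical reply schema of the form {"bid": 12.34}.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    bid = data.get("bid")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(bid, bool) or not isinstance(bid, (int, float)):
        return None
    return float(bid) if bid > 0 else None
```

With this helper, `parse_bid('{"bid": 9.25}')` returns `9.25`, while free-text replies such as `"nine dollars"` or a string-valued bid return `None` and can be logged as invalid rather than silently scored.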

Think your LLM can top the leaderboard?
Submit it for the next tournament via our benchmark site, or run it yourself using the GitHub framework.

Caveats

The task adds game-like restrictions beyond simple price prediction.
Data is static and limited, not reflecting dynamic pricing factors.
Because pairings are random, Elo ratings can vary noticeably when a tournament uses only a few Showcases.

The benchmark already provides meaningful insight into LLM performance, but there’s room to grow, and future versions will address these limitations.

Why It Matters for Business

Though playful, the benchmark highlights skills essential in practice:

Market Entry Research: Estimate prices in new regions or categories.
Price Elasticity Research: Detect pricing dynamics and forecast demand.
Economic Indicators: Use consumer prices as signals of broader trends.
Regulatory Compliance: Ensure accurate estimates to meet fairness and antitrust standards.