Can LLMs estimate prices?
What This Benchmark Tests
Price Knowledge
Can models accurately estimate the cost of everyday consumer products?
Strategy
Can models bid strategically to avoid disqualification while maximizing accuracy?
Why This Benchmark Matters For Business?
While the benchmark may seem just like a game, it demonstrates the AI capabilities that are useful in several real business applications
Market Entry Research
When expanding into new geographic regions or product categories, businesses need accurate price estimation capabilities.
Price Elasticity Research
AI models that understand pricing can analyze how pricing changes affect demand signals and consumer behavior.
Economic Indicator Development
Consumer price data serves as a leading indicator for broader economic trends and market conditions.
Regulatory Compliance
Accurate price estimation helps businesses maintain compliance with price discrimination and antitrust regulations.
Top Performing Models
🏆 Best Elo Rating
#1 | qwen3-30b-a3b | 1239 |
#2 | gpt-5 | 1207 |
#3 | gpt-4o | 1178 |
#4 | grok-4 | 1170 |
#5 | o3 | 1129 |
📊 Lowest MAPE
#1 | o3 | 13.78% |
#2 | o1 | 14.49% |
#3 | gpt-5 | 17.06% |
#4 | grok-4 | 19.47% |
#5 | qwen3-30b-a3b | 22.84% |
⚖️ Lowest Overbid Rate
#1 | qwen3-30b-a3b | 18.37% |
#2 | gpt-3.5-turbo | 20% |
#3 | gpt-4o | 28% |
#4 | gpt-4o-2024-08-06 | 40% |
#5 | gpt-5-mini | 40.82% |
Latest Update: September 25, 2025
The Showcase Challenge
This benchmark tests whether LLMs can understand real-world pricing and make strategic decisions under constraints, inspired by the Showcase game in "The Price is Right" TV show.
How the Game Works:
- • Two LLMs compete as contestants viewing the same showcase
- • Each model bids on the price of an everyday item (toothpaste, snacks, cleaners, etc.)
- • The closest bid without going over wins the round
- • If both models overbid, neither wins
- • Showcases typically total around $20 in value
Example Products from The Price is Right

Kraft Cool Whip
8oz dessert topping
$2.29

Mezzetta Roasted Peppers
16oz jar
$3.99

Minute Maid Orange Juice
1 gal container
$7.49
Data & Methodology
Data:
We use the set of grocery products that actually appeared on the Price Is Right show. The dataset consists of a short product description and its actual price in US dollars. All prices are the suggested retail price by the manufacturer on the US west coast. We use a fixed set of static prices in the first version of our benchmark, but we are planning to expand it and update in the future releases.
Challenge Sequence:
Every model undergoes the same sequence of 50-100 showcases with consumer goods.
Each round includes:
- Send the prompt with the rules of the game and output instructions
- Provide 10 reference price examples from similar products
- Parse bids and rationales with strict formatting
- Compare to the actual retail price, calculate the absolute percentage error and determine an overbid
Tournament Structure:
Based on the sequence of challenges and their outcomes we simulate a tournament in which each model competes with every other model and the Elo rating is computed to rank the models.

Evaluation Metrics
Elo Rating
Chess-inspired rating system where models gain points for wins and lose points for losses
MAPE (Mean Absolute Percentage Error)
Average percentage difference between bid and actual price (ignoring overbids)
Overbid Rate
Percentage of bids that exceeded the actual price (automatic disqualification)
Key Findings
- • Elo ratings vary significantly between models, reflecting different skill levels in the game
- • Strategic behavior emerges in top models that balance accuracy with overbid avoidance
- • Price knowledge correlates with model size and training quality
- • Winning requires both accurate estimation and strategic risk management
While the benchmark task description is very simple and easily understandable, it demonstrates a possibility to probe serious AI skills: using background knowledge and context to make constrained real-world decisions. By turning a game into a benchmark, we get a structured, repeatable way to measure models' real-world sensibility.
Caveats
- • Game mechanics introduce constraints beyond basic price estimation tasks
- • Dataset limitations exist with static pricing that doesn't capture market dynamics or regional variations
- • Elo ratings can vary when tournaments use few Showcases due to random pairings
We are planning to address these caveats during further refinements of the benchmark.
Ready to Test Your Model?
Think your LLM has what it takes to master "The Price is Right" showcase challenge? Submit your model and see how it performs against the current leaderboard champions.