September 17, 2025
By Maxim Laletin, Allen Downey
Large Language Models (LLMs) are rapidly becoming indispensable assistants across daily life, work, business, and research. As their power grows, so does interest in exploring new applications.
Can LLMs play video games? Can they find vulnerabilities in web applications? Can they predict future events?
To answer such questions rigorously, we need well-defined tasks, data, and evaluation metrics: in other words, an LLM benchmark. Benchmarks measure specific skills, compare models consistently, and help developers choose the right LLM for a given purpose.
Demand for new benchmarks also arises from benchmark saturation: as LLMs train on more data, existing benchmarks often become too easy. If a model has seen the test data during training, it may simply memorize answers instead of generalizing. Thus, varied and dynamic benchmarks are crucial.
In a previous post, we tested whether LLMs can understand prices by simulating the TV game The Price is Right—specifically the “Showcase” segment, where contestants bid on prize totals without overbidding. With minimal instruction and examples, many LLMs performed impressively, sometimes surpassing human contestants.
This simple yet strategic challenge inspired us to turn the game into a full benchmark. Here's how it works:
Each LLM is shown short descriptions of three products and asked to guess their total price without overbidding. Ten example product-price pairs are provided. This task, called a Showcase, is repeated 50–100 times with randomized product sets. Models don't compete against each other directly; instead, their results are compared afterward in a simulated Tournament of pairwise matchups.
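To make the Tournament scoring concrete, here is a minimal sketch of how one pairwise matchup could be resolved under the show's "closest without going over" rule. The data structure and function names are ours for illustration; the benchmark framework's actual matchup logic may differ in details such as tie handling.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShowcaseResult:
    model: str
    bid: float     # the model's guess for the total price of the three products
    actual: float  # the true total price of those products

def matchup_winner(a: ShowcaseResult, b: ShowcaseResult) -> Optional[str]:
    """Resolve one pairwise matchup under 'closest without going over' rules.

    A bid above the actual total is an overbid and cannot win. Returns the
    winning model's name, or None if neither model wins (both overbid, or
    the valid bids are exactly tied).
    """
    a_valid = a.bid <= a.actual
    b_valid = b.bid <= b.actual
    if not a_valid and not b_valid:
        return None          # double overbid: no winner
    if a_valid and not b_valid:
        return a.model
    if b_valid and not a_valid:
        return b.model
    # Both bids are valid: the one closer to its actual total wins.
    gap_a, gap_b = a.actual - a.bid, b.actual - b.bid
    if gap_a == gap_b:
        return None          # exact tie
    return a.model if gap_a < gap_b else b.model
```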
We use grocery items from the actual show, priced at West Coast retail levels. The database currently includes 820 items. Below are a few examples of the item descriptions and prices.
| Description | Price ($) |
|---|---|
| Mentos chewing gum | 1.69 |
| Soft Scrub cleanser | 5.29 |
| Zico coconut water | 2.49 |
| Merzetta roasted bell peppers (16 oz) | 3.99 |
| J&D's croutons (4.25 oz) | 2.49 |
The pilot version uses static prices from a single source, with no geographic variation. Future updates will expand the dataset for broader coverage.
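To make the task format concrete, here is a minimal sketch of how a Showcase prompt could be assembled from item records shaped like the table above. The prompt wording, function name, and parameters are illustrative assumptions, not the exact prompt used in the benchmark.

```python
import random

def build_showcase_prompt(items, n_examples=10, n_targets=3, seed=None):
    """Sample example and target items from `items` (a list of
    (description, price) tuples like the table above) and format one
    Showcase prompt. The prompt wording is an illustrative assumption.
    """
    rng = random.Random(seed)
    sampled = rng.sample(items, n_examples + n_targets)
    examples, targets = sampled[:n_examples], sampled[n_examples:]

    example_lines = "\n".join(f"- {desc}: ${price:.2f}" for desc, price in examples)
    target_lines = "\n".join(f"- {desc}" for desc, _ in targets)

    prompt = (
        "Here are some grocery items and their prices:\n"
        f"{example_lines}\n\n"
        "Estimate the total price of the following three items "
        "without going over the actual total:\n"
        f"{target_lines}"
    )
    actual_total = sum(price for _, price in targets)
    return prompt, actual_total
```

Calling this on the full item list once per Showcase, and repeating it 50–100 times, would produce one model's full run.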
Leaderboard example:
| Rank | Model | Rating | MAPE (%) | Overbid (%) |
|---|---|---|---|---|
| 1 | qwen3-30b-a3b | 1231.98 | 22.74 | 20.41 |
| 2 | gpt-5 | 1210.45 | 16.79 | 42.00 |
| 3 | gpt-4o | 1208.76 | 25.90 | 26.00 |
| 4 | claude-sonnet-4-20250514 | 1132.58 | 26.34 | 50.00 |
| 5 | o3 | 1131.20 | 13.67 | 50.00 |
| 6 | gemini-1.5-pro | 1117.05 | 26.97 | 44.00 |
| 7 | gpt-5-mini | 1114.03 | 24.42 | 40.82 |
| 8 | claude-3-5-sonnet-20241022 | 1076.95 | 25.08 | 52.00 |
| 9 | glm-4p5 | 1075.94 | 23.57 | 46.00 |
| 10 | o1 | 1043.51 | 14.60 | 60.00 |
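To interpret the columns: Rating reflects performance in the simulated Tournament of pairwise matchups, MAPE is the mean absolute percentage error of a model's bids relative to the actual Showcase totals, and Overbid (%) is the share of Showcases in which the model bid above the actual total. Here is a minimal sketch of the two error metrics, assuming bids and actual totals are stored as parallel lists:

```python
def mape(bids, actuals):
    """Mean absolute percentage error of bids against actual Showcase totals, in percent."""
    errors = [abs(bid - actual) / actual for bid, actual in zip(bids, actuals)]
    return 100 * sum(errors) / len(errors)

def overbid_rate(bids, actuals):
    """Percentage of Showcases in which the bid exceeded the actual total."""
    overbids = sum(bid > actual for bid, actual in zip(bids, actuals))
    return 100 * overbids / len(bids)
```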
We track leading LLMs from OpenAI, Google, and Anthropic, as well as open-weight models served through providers like Fireworks. To participate, a model must accept text input and return JSON.
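The exact response schema isn't spelled out here, so as a rough sketch, suppose the harness expects a single JSON object with a numeric bid field; the field name `total_price` below is a hypothetical example, not the benchmark's actual schema. Parsing and validating such a response might look like this:

```python
import json

def parse_bid(response_text: str, field: str = "total_price") -> float:
    """Parse a model's JSON response and extract a numeric bid.

    The field name is a hypothetical example; the real schema may differ.
    Raises ValueError if the response is not valid JSON or lacks a numeric bid.
    """
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Response is not valid JSON: {exc}") from exc
    bid = payload.get(field)
    if not isinstance(bid, (int, float)):
        raise ValueError(f"Missing or non-numeric '{field}' in response: {payload}")
    return float(bid)
```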
Think your LLM can top the leaderboard?
Submit it for the next tournament via our benchmark site, or run it yourself using the GitHub framework.
While our benchmark provides meaningful insights into LLM performance, there is room to grow: the current item set covers only groceries, and prices are static, drawn from a single source with no geographic variation. We see these limitations as opportunities for enhancement, and future versions will build on this foundation.
Though playful, the benchmark highlights skills essential in practice: