Can LLMs play The Price is Right?

July 23, 2025

By Allen Downey

Synthetic consumers—LLMs simulating human survey participants—are becoming a powerful tool for marketing and behavioral research. They promise faster iteration, lower costs, and broader flexibility than traditional panels. But for them to be useful, they need not only to sound realistic, but also to demonstrate some level of real-world reasoning.

A core question in this space: do LLMs “understand” prices? That is, can they recognize how much everyday items cost, and make decisions based on that understanding?

To explore this, we built a synthetic version of the Showcase game from The Price is Right, a game show where contestants try to estimate the value of consumer products. The result is a lightweight but surprisingly informative benchmark: a test of both knowledge and reasoning under constraints.


The Showcase Experiment#

In the TV version of The Price is Right, contestants bid on lavish collections of prizes—cars, appliances, and tropical vacations. In our version, the prizes are more mundane: consumer packaged goods (CPGs) like toothpaste, snack bars, and household cleaners. Instead of $30,000 showrooms, the total value is usually around $20.

The rules are simple (a short scoring sketch in code follows the list):

  • Two LLMs (contestants) see the same showcase of items.

  • They each bid on the total value of the showcase.

  • The closest bid without going over wins.

  • If both overbid, neither wins.
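
In code, the scoring rule comes down to a few lines. Here is a minimal sketch, with an illustrative function name and the arbitrary convention that ties go to the first contestant (the post does not specify how ties are broken):

```python
def determine_winner(bid_a: float, bid_b: float, actual: float):
    """Apply the showcase rules: the closest bid without going over wins."""
    a_valid = bid_a <= actual   # an overbid is disqualified
    b_valid = bid_b <= actual
    if not a_valid and not b_valid:
        return None             # both overbid: neither wins
    if not b_valid or (a_valid and bid_a >= bid_b):
        return "A"              # A is valid and at least as close to the actual price
    return "B"
```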

We use this setup to probe several model abilities:

  • Estimating real-world prices of common goods.

  • Using reference examples to calibrate those estimates.

  • Strategically adjusting estimates to avoid disqualification.


How It Works#

Each “round” of the tournament follows this format:

  1. Generate a random showcase of three CPG items from our dataset.

  2. Provide 10 example prices from similar products to calibrate the models.

  3. Send the same prompt to two models, including the showcase and example prices.

  4. Parse the bid and rationale, enforcing strict formatting and short responses.

  5. Compare bids to the actual retail price, determine the winner, and update metrics.

Here's what the prompt looks like (excerpted):

You are a contestant on The Price is Right.

GOAL: Bid as close as possible to the total retail price WITHOUT GOING OVER.
CRITICAL RULE: If your bid is over the actual price, you lose.

Showcase Items:
{showcase.description}

Example Prices:
{showcase.example_prices}

CRITICAL: You must respond with ONLY a single JSON object in this format:
{"bid": 1234.56, "rationale": "Brief explanation..."}

Including the format in the prompt discourages verbose outputs and facilitates parsing – even though not all models follow the instructions.
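
As a concrete illustration, the prompt template and a tolerant parser might look like the sketch below. This is not the project's actual code: the template simply mirrors the excerpt above (the placeholders stand in for `{showcase.description}` and `{showcase.example_prices}`), and `parse_bid` is one reasonable way to handle models that wrap the JSON in extra prose.

```python
import json
import re

PROMPT_TEMPLATE = """You are a contestant on The Price is Right.
GOAL: Bid as close as possible to the total retail price WITHOUT GOING OVER.
CRITICAL RULE: If your bid is over the actual price, you lose.

Showcase Items:
{description}

Example Prices:
{example_prices}

CRITICAL: You must respond with ONLY a single JSON object in this format:
{{"bid": 1234.56, "rationale": "Brief explanation..."}}"""


def parse_bid(response_text: str):
    """Pull the first JSON object out of a response, tolerating extra prose around it."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
        return {"bid": float(parsed["bid"]), "rationale": str(parsed.get("rationale", ""))}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
```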


Contestants and Rounds#

We started with a broad field of 90 models—everything from foundation models to smaller instruction-tuned variants. We ran two preliminary rounds to eliminate models that were clearly not in the running, and to refine our prompt to elicit better responses. During each round, every model saw the same 20 showcases, and we computed the mean absolute percentage error (MAPE) of their bids, ignoring overbidding. In the first round, most versions of Gemini Flash were eliminated, along with several versions of Llama and DeepSeek R1. In the second round, we lost more of the same, plus a few GPT minis. The top 50 models (lowest MAPE) moved on to the finals.

In the finals, we ran 50 showcases. For each showcase, we

  • Paired the contestants at random,
  • Solicited bids from each pair, and
  • Recorded the bids, whether or not the contestants overbid, and who won (sketched in code below).
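
A sketch of that loop, assuming the finalists are a list of model names and `play_round` is a callable that runs one head-to-head matchup (both names are stand-ins, not the actual tournament code):

```python
import random
from typing import Callable


def run_showcase(finalists: list[str], showcase, play_round: Callable) -> list[dict]:
    """Pair the finalists at random and run one matchup per pair for a single showcase."""
    order = finalists[:]                  # shuffle a copy, not the caller's list
    random.shuffle(order)
    pairs = zip(order[::2], order[1::2])  # adjacent entries in the shuffled order form a matchup
    return [play_round(a, b, showcase) for a, b in pairs]
```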

Metrics#

We evaluate the models along several dimensions:

  • Mean Absolute Percentage Error (MAPE)
    Measures how close each bid was to the actual price, ignoring overbids. Lower is better.
  • Overbid rate
    The percentage of bids that exceed the actual price—resulting in automatic loss. A high overbid rate suggests poor calibration or lack of strategic conservatism.
  • Win rate and Elo rating
    Tracks direct wins and losses across matchups using an Elo rating system. A model that consistently outbids (without overbidding) its peers rises in the rankings.

These metrics reflect both accuracy and adaptiveness—some models may have low MAPE but lose frequently due to risky bidding.
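
For reference, here is a minimal sketch of how these metrics might be computed. The MAPE sketch takes "ignoring overbids" to mean the error is scored for every bid with no extra penalty for going over, and the Elo update uses the standard logistic expected-score formula with an assumed K-factor of 32, since the post does not specify the exact parameters.

```python
def mape(bids: list[float], actuals: list[float]) -> float:
    """Mean absolute percentage error; overbids are scored by distance like any other bid."""
    errors = [abs(bid - actual) / actual for bid, actual in zip(bids, actuals)]
    return 100 * sum(errors) / len(errors)


def overbid_rate(bids: list[float], actuals: list[float]) -> float:
    """Percentage of bids that exceed the actual price."""
    overs = sum(bid > actual for bid, actual in zip(bids, actuals))
    return 100 * overs / len(bids)


def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one decisive matchup (k is an assumed K-factor)."""
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta
```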


Results#

Some models are surprisingly accurate, and some are much worse#

Here are the 50 finalists ranked by MAPE:

| Rank | Model | Elo rating | # wins | MAPE (%) | Overbid (%) |
|------|-------|-----------|--------|----------|-------------|
| 1 | o3 | 1025.8 | 22 | 13.5 | 42.0 |
| 2 | o1-2024-12-17 | 1220.1 | 22 | 14.6 | 44.0 |
| 3 | o3-2025-04-16 | 922.1 | 18 | 14.6 | 44.0 |
| 4 | o1 | 1092.2 | 21 | 14.9 | 42.0 |
| 5 | gpt-4.1 | 1285.2 | 29 | 17.9 | 22.0 |
| 6 | o1-preview-2024-09-12 | 889.0 | 17 | 18.7 | 46.0 |
| 7 | gpt-4.1-2025-04-14 | 1103.0 | 24 | 18.7 | 24.0 |
| 8 | o3-mini | 1070.7 | 21 | 19.4 | 38.0 |
| 9 | gpt-4.5-preview | 1072.4 | 23 | 20.3 | 28.0 |
| 10 | o1-preview | 1147.5 | 23 | 20.6 | 51.0 |
| 11 | o3-mini-2025-01-31 | 1003.9 | 23 | 20.6 | 30.0 |
| 12 | claude-3-5-sonnet-20241022 | 1058.9 | 21 | 20.7 | 42.0 |
| 13 | gpt-4o | 1303.9 | 29 | 21.1 | 22.0 |
| 14 | claude-3-7-sonnet-20250219 | 1021.6 | 20 | 21.1 | 50.0 |
| 15 | gpt-4 | 1101.6 | 24 | 21.3 | 28.0 |
| 16 | gpt-4o-2024-05-13 | 1164.7 | 28 | 22.1 | 20.0 |
| 17 | gpt-4-0613 | 1126.6 | 23 | 22.2 | 36.0 |
| 18 | gpt-4o-2024-08-06 | 992.3 | 20 | 22.3 | 30.0 |
| 19 | o4-mini | 892.4 | 16 | 22.9 | 54.0 |
| 20 | claude-opus-4-20250514 | 719.9 | 14 | 23.4 | 32.0 |
| 21 | o4-mini-2025-04-16 | 960.3 | 16 | 23.4 | 60.0 |
| 22 | gpt-4.5-preview-2025-02-27 | 1113.6 | 23 | 23.6 | 20.0 |
| 23 | claude-3-5-sonnet-20240620 | 949.8 | 19 | 24.0 | 46.0 |
| 24 | claude-sonnet-4-20250514 | 1012.2 | 21 | 24.1 | 34.0 |
| 25 | qwen2-vl-72b-instruct | 803.4 | 16 | 25.0 | 38.0 |
| 26 | gpt-4o-search-preview-2025-03-11 | 876.8 | 16 | 25.6 | 54.0 |
| 27 | chatgpt-4o-latest | 1209.6 | 31 | 25.9 | 2.0 |
| 28 | gpt-4o-search-preview | 1154.4 | 26 | 25.9 | 36.0 |
| 29 | llama-v3p3-70b-instruct | 1068.7 | 21 | 26.7 | 42.0 |
| 30 | gpt-4o-2024-11-20 | 1151.3 | 27 | 26.8 | 12.0 |
| 31 | llama-3.3-70b-versatile | 1071.6 | 21 | 27.1 | 40.0 |
| 32 | gpt-3.5-turbo-1106 | 1156.3 | 26 | 27.3 | 28.0 |
| 33 | llama3-70b-8192 | 949.6 | 20 | 27.4 | 38.0 |
| 34 | gemini-1.5-pro | 1072.8 | 21 | 27.6 | 34.0 |
| 35 | llama-v3p1-405b-instruct | 860.0 | 16 | 27.7 | 44.0 |
| 36 | claude-3-opus-20240229 | 930.0 | 20 | 28.5 | 16.0 |
| 37 | gemini-1.5-pro-002 | 1025.1 | 21 | 29.1 | 32.0 |
| 38 | gpt-4.1-mini-2025-04-14 | 985.1 | 21 | 30.9 | 22.0 |
| 39 | compound-beta | 874.4 | 16 | 31.0 | 43.8 |
| 40 | gemini-1.5-flash-8b-001 | 962.4 | 19 | 31.2 | 42.0 |
| 41 | gemini-1.5-flash-8b | 935.0 | 15 | 31.7 | 48.0 |
| 42 | qwen3-30b-a3b | 844.8 | 17 | 31.9 | 20.0 |
| 43 | gpt-4-turbo | 858.4 | 14 | 32.2 | 68.0 |
| 44 | gpt-4-turbo-2024-04-09 | 696.6 | 11 | 32.7 | 72.0 |
| 45 | gpt-3.5-turbo | 884.3 | 19 | 34.7 | 12.0 |
| 46 | gpt-3.5-turbo-16k | 993.9 | 23 | 34.9 | 14.0 |
| 47 | gpt-3.5-turbo-0125 | 897.2 | 20 | 35.9 | 10.0 |
| 48 | o1-mini | 981.3 | 15 | 39.6 | 56.0 |
| 49 | qwen-qwq-32b | 815.4 | 16 | 41.1 | 38.0 |
| 50 | llama4-scout-instruct-basic | 692.0 | 10 | 53.1 | 70.0 |



The lowest mean absolute percentage error (MAPE) is 14%, which is better than the performance of human contestants on the show, about 18% (see Chapter 9 of Think Bayes). The highest MAPE is 53%, barely better than guessing without looking at the showcase.

The following scatterplot shows the actual values of the showcases and the bids from the best and worst models.

The bids from OpenAI o3 track the actual values closely, with only one notable miss on the highest-valued showcase. The correlation of bids and actual values is 0.89. In contrast, the bids from Llama 4 Scout are unrelated to the actual values – the correlation is effectively 0.

Some models are more strategic than others#

In the previous plot, it looks like o3 is bidding strategically — more bids are below the actual value than above (58%). But some models are substantially better at avoiding overbidding. The most conservative model overbids only 2% of the time. The most aggressive model overbids 72% of the time. For comparison, contestants on the show overbid about 25% of the time.

The following scatterplot shows the bid patterns for the models with the highest and lowest overbid percentages:

For both models, the bids are well correlated with the actual values (0.69 and 0.90), but one bids consistently high and the other consistently low.

Winning is a combination of accuracy and strategy#

Here are the finalists sorted by Elo rating:

| Rank | Model | Elo rating | # wins | MAPE (%) | Overbid (%) |
|------|-------|-----------|--------|----------|-------------|
| 1 | gpt-4o | 1303.9 | 29 | 21.1 | 22.0 |
| 2 | gpt-4.1 | 1285.2 | 29 | 17.9 | 22.0 |
| 3 | o1-2024-12-17 | 1220.1 | 22 | 14.6 | 44.0 |
| 4 | chatgpt-4o-latest | 1209.6 | 31 | 25.9 | 2.0 |
| 5 | gpt-4o-2024-05-13 | 1164.7 | 28 | 22.1 | 20.0 |
| 6 | gpt-3.5-turbo-1106 | 1156.3 | 26 | 27.3 | 28.0 |
| 7 | gpt-4o-search-preview | 1154.4 | 26 | 25.9 | 36.0 |
| 8 | gpt-4o-2024-11-20 | 1151.3 | 27 | 26.8 | 12.0 |
| 9 | o1-preview | 1147.5 | 23 | 20.6 | 51.0 |
| 10 | gpt-4-0613 | 1126.6 | 23 | 22.2 | 36.0 |
| 11 | gpt-4.5-preview-2025-02-27 | 1113.6 | 23 | 23.6 | 20.0 |
| 12 | gpt-4.1-2025-04-14 | 1103.0 | 24 | 18.7 | 24.0 |
| 13 | gpt-4 | 1101.6 | 24 | 21.3 | 28.0 |
| 14 | o1 | 1092.2 | 21 | 14.9 | 42.0 |
| 15 | gemini-1.5-pro | 1072.8 | 21 | 27.6 | 34.0 |
| 16 | gpt-4.5-preview | 1072.4 | 23 | 20.3 | 28.0 |
| 17 | llama-3.3-70b-versatile | 1071.6 | 21 | 27.1 | 40.0 |
| 18 | o3-mini | 1070.7 | 21 | 19.4 | 38.0 |
| 19 | llama-v3p3-70b-instruct | 1068.7 | 21 | 26.7 | 42.0 |
| 20 | claude-3-5-sonnet-20241022 | 1058.9 | 21 | 20.7 | 42.0 |
| 21 | o3 | 1025.8 | 22 | 13.5 | 42.0 |
| 22 | gemini-1.5-pro-002 | 1025.1 | 21 | 29.1 | 32.0 |
| 23 | claude-3-7-sonnet-20250219 | 1021.6 | 20 | 21.1 | 50.0 |
| 24 | claude-sonnet-4-20250514 | 1012.2 | 21 | 24.1 | 34.0 |
| 25 | o3-mini-2025-01-31 | 1003.9 | 23 | 20.6 | 30.0 |
| 26 | gpt-3.5-turbo-16k | 993.9 | 23 | 34.9 | 14.0 |
| 27 | gpt-4o-2024-08-06 | 992.3 | 20 | 22.3 | 30.0 |
| 28 | gpt-4.1-mini-2025-04-14 | 985.1 | 21 | 30.9 | 22.0 |
| 29 | o1-mini | 981.3 | 15 | 39.6 | 56.0 |
| 30 | gemini-1.5-flash-8b-001 | 962.4 | 19 | 31.2 | 42.0 |
| 31 | o4-mini-2025-04-16 | 960.3 | 16 | 23.4 | 60.0 |
| 32 | claude-3-5-sonnet-20240620 | 949.8 | 19 | 24.0 | 46.0 |
| 33 | llama3-70b-8192 | 949.6 | 20 | 27.4 | 38.0 |
| 34 | gemini-1.5-flash-8b | 935.0 | 15 | 31.7 | 48.0 |
| 35 | claude-3-opus-20240229 | 930.0 | 20 | 28.5 | 16.0 |
| 36 | o3-2025-04-16 | 922.1 | 18 | 14.6 | 44.0 |
| 37 | gpt-3.5-turbo-0125 | 897.2 | 20 | 35.9 | 10.0 |
| 38 | o4-mini | 892.4 | 16 | 22.9 | 54.0 |
| 39 | o1-preview-2024-09-12 | 889.0 | 17 | 18.7 | 46.0 |
| 40 | gpt-3.5-turbo | 884.3 | 19 | 34.7 | 12.0 |
| 41 | gpt-4o-search-preview-2025-03-11 | 876.8 | 16 | 25.6 | 54.0 |
| 42 | compound-beta | 874.4 | 16 | 31.0 | 43.8 |
| 43 | llama-v3p1-405b-instruct | 860.0 | 16 | 27.7 | 44.0 |
| 44 | gpt-4-turbo | 858.4 | 14 | 32.2 | 68.0 |
| 45 | qwen3-30b-a3b | 844.8 | 17 | 31.9 | 20.0 |
| 46 | qwen-qwq-32b | 815.4 | 16 | 41.1 | 38.0 |
| 47 | qwen2-vl-72b-instruct | 803.4 | 16 | 25.0 | 38.0 |
| 48 | claude-opus-4-20250514 | 719.9 | 14 | 23.4 | 32.0 |
| 49 | gpt-4-turbo-2024-04-09 | 696.6 | 11 | 32.7 | 72.0 |
| 50 | llama4-scout-instruct-basic | 692.0 | 10 | 53.1 | 70.0 |



The models with the most wins and the highest rankings are the ones that balance accuracy and strategy. OpenAI o3, which has the lowest MAPE, is ranked only #21 out of 50 because it overbids too often. Most top models have MAPE near 20% and overbid percentages less than 30% – although there are a few top performers that deviate from this strategy.

The top models are substantially better than the worst. In the Elo model, we expect the best-rated finalist (gpt-4o, at about 1304) to beat the lowest-rated one (llama4-scout-instruct-basic, at about 692) roughly 97% of the time.
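
That expectation comes from the standard Elo expected-score formula; a quick check with the ratings from the table, assuming the usual logistic curve with a 400-point scale:

```python
# Expected score of the top-rated finalist (gpt-4o, 1303.9) against the
# lowest-rated one (llama4-scout-instruct-basic, 692.0).
p_win = 1 / (1 + 10 ** ((692.0 - 1303.9) / 400))
print(round(p_win, 2))  # 0.97
```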

OpenAI dominates the top of the leaderboard#

OpenAI models took the top 14 spots, and 16 of the top 20. Granted, part of this success is that they started with the most models (42 out of 90). And a few of them ended up near the bottom as well. But at least part of their success is earned – out of 42 models, 31 made it through the preliminary rounds.

The scatterplot below shows MAPE and overbid percentages for all 50 models, with each point colored by the backend used to access the model. OpenAI provides GPT-series models and related variants. Anthropic develops and hosts the Claude family of models. The Gemini backend provides Google’s Gemini models. Groq hosts high-speed inference for open or licensed models from other developers, including Meta’s LLaMA, Mistral, and Qwen. Similarly, Fireworks provides open-source or community-developed models including some from Meta, Mistral, and DeepSeek.

The best models are in the lower-left corner, with low MAPE and low overbid percentages, and this corner is populated entirely by OpenAI models. However, some of the weakest finalists are also OpenAI models.


Why This Matters#

Although this benchmark is light-hearted, it reflects a serious capability: using background knowledge and context to make constrained real-world decisions. This kind of reasoning underpins tasks like:

  • Estimating costs in consumer surveys
  • Recommending products within budgets
  • Simulating realistic user behavior in test environments

By turning a game into a benchmark, we get a structured, repeatable way to measure models' real-world sensibility—not just their ability to talk about it.


Work with PyMC Labs#

If you are interested in seeing what we at PyMC Labs can do for you, then please email info@pymc-labs.com. We work with companies at a variety of scales and with varying levels of existing modeling capacity. We also run corporate workshop training events and can provide sessions ranging from an introduction to Bayes to more advanced topics.