Your AI models are failing in production—Here’s how to fix model selection

Enterprises need to know whether the models that power their applications and agents work in real-life scenarios. That kind of evaluation is often difficult because it is hard to anticipate the specific scenarios a model will face in production. A revamped version of the RewardBench benchmark aims to give organizations a better picture of a model's real-life performance.
The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it claims provides a more holistic view of model performance and assesses how well models align with an enterprise's goals and standards.
Ai2 built RewardBench around classification tasks that measure how reward-model accuracy correlates with inference-time compute and downstream training. RewardBench mainly deals with reward models (RMs), which act as judges that evaluate LLM outputs. RMs assign a score, or "reward," that guides reinforcement learning from human feedback (RLHF).
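As a rough illustration of that scoring step, a reward model can often be loaded and queried like a sequence-classification model: it takes a prompt and a candidate response and returns a single scalar. The checkpoint name and single-score head in this sketch are assumptions for illustration, not a specific RewardBench entry.

```python
# Minimal sketch of how a reward model scores a response, assuming a
# sequence-classification-style RM with a single scalar output head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/your-reward-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
reward_model.eval()

def score(prompt: str, response: str) -> float:
    """Return a scalar reward for a prompt/response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits.squeeze().item()

# Higher scores indicate responses the RM judges as better aligned with
# human preferences; RLHF uses these scores as its training signal.
print(score("What is the capital of France?", "Paris is the capital of France."))
```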
RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
— Ai2 (@allen_ai) June 2, 2025
Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it was launched. Still, the model landscape evolved rapidly, and its benchmarks had to evolve with it.
“As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn’t fully capture the complexity of real-world human preferences,” he said.
Lambert added that with RewardBench 2, “we set out to improve both the breadth and depth of evaluation—incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice.” He said the second version uses unseen human prompts, has a more challenging scoring setup and new domains.
While reward models test how well models work, it's also important that RMs align with company values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior such as hallucination, reduce generalization and score harmful responses too highly.
RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.
“Enterprises should use RewardBench 2 in two different ways depending on their application. If they’re performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines because reward models need on-policy training recipes (i.e. reward models that mirror the model they’re trying to train with RL). For inference time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance,” Lambert said.
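For the inference-time scaling use case Lambert describes, a common pattern is best-of-N (BoN) sampling: generate several candidate responses, score each with the reward model and keep the top-scoring one. The sketch below assumes you already have a generator and an RM scoring function; both are placeholders, not part of RewardBench itself.

```python
# Hedged sketch of best-of-N (BoN) sampling as a form of inference-time
# scaling: sample N candidates, score each with a reward model, keep the best.
# `generate_response` and `score` are placeholders for your own generator
# and reward-model scorer (e.g. the score() helper sketched earlier).
from typing import Callable, List

def best_of_n(prompt: str,
              generate_response: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Return the candidate response the reward model rates highest."""
    candidates: List[str] = [generate_response(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: score(prompt, resp))
```

The same loop doubles as a data-filtering recipe: instead of returning only the top candidate, keep every response above a score threshold as training data.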
Lambert noted that benchmarks like RewardBench offer users a way to evaluate the models they’re choosing based on the “dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score.” He said the idea of performance, which many evaluation methods claim to assess, is very subjective because a good response from a model highly depends on the context and goals of the user. At the same time, human preferences get very nuanced.
Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta's FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter, scalable RMs.
Super excited that our second reward model evaluation is out. It's substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling. Happy hillclimbing!
Huge congrats to @saumyamalik44 who lead the project with a total commitment to excellence. https://t.co/c0b6rHTXY5