Crowdsourced AI benchmarks should be dynamic rather than static datasets, and tailored specifically to distinct use cases

April 23, 2025 //  by Finnovate

Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate upcoming models’ capabilities. When a model scores favorably, the lab behind it will often tout that score as evidence of a meaningful improvement. It’s a flawed approach, however, according to Emily Bender, a University of Washington linguistics professor and co-author of the book “The AI Con.” Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and selecting the response they prefer. “To be valid, a benchmark needs to measure something specific, and it needs to have construct validity — that is, there has to be evidence that the construct of interest is well-defined and that the measurements actually relate to the construct,” Bender said. “Chatbot Arena hasn’t shown that voting for one output over another actually correlates with preferences, however they may be defined.”

Asmelash Teka Hadgu, the co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, said he thinks benchmarks like Chatbot Arena are being “co-opted” by AI labs to “promote exaggerated claims.” “Benchmarks should be dynamic rather than static datasets,” Hadgu said, “distributed across multiple independent entities, such as organizations or universities, and tailored specifically to distinct use cases, like education, healthcare, and other fields done by practicing professionals who use these [models] for work.”

Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LMArena, which maintains Chatbot Arena, said that incidents such as the Maverick benchmark discrepancy aren’t the result of a flaw in Chatbot Arena’s design, but rather labs misinterpreting its policy.
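For context on the mechanism being critiqued: arena-style leaderboards turn thousands of pairwise votes into a single ranking, typically with an Elo- or Bradley-Terry-style rating model. The sketch below is a minimal, hypothetical illustration of that aggregation step in Python; it is not LMArena’s actual code, and the model names and votes are invented.

```python
# Minimal sketch of Elo-style rating from pairwise preference votes, the kind
# of aggregation an arena-style leaderboard performs. Hypothetical data only.
from collections import defaultdict

K = 32        # update step size per vote
BASE = 400.0  # logistic scale
INIT = 1000.0 # starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / BASE))

def update_ratings(votes):
    """votes: iterable of (winner, loser) pairs from anonymous A/B prompts."""
    ratings = defaultdict(lambda: INIT)
    for winner, loser in votes:
        e_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e_win)
        ratings[loser] -= K * (1.0 - e_win)
    return dict(ratings)

if __name__ == "__main__":
    # Hypothetical head-to-head outcomes from volunteer votes.
    votes = [("model-a", "model-b"), ("model-a", "model-c"),
             ("model-b", "model-c"), ("model-a", "model-b")]
    for name, score in sorted(update_ratings(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.1f}")
```

The critiques quoted above are aimed at exactly this reduction: a single preference-derived score says nothing about which construct was measured or which use case the votes reflect.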

Read Article

Category: Members, Additional Reading


