HealthBench: OpenAI's New Standard for Evaluating AI in Healthcare

OpenAI has launched HealthBench, a benchmark created with 262 physicians to evaluate how artificial intelligence (AI) systems perform in conversations about health, setting a new standard for measuring the safety and effectiveness of AI in medical settings.

HealthBench Details
  • The benchmark tests models across a range of topics (such as emergency referrals and global health) and behaviors (accuracy, communication quality, etc.).
  • Recent models performed much better on the benchmark, with OpenAI's o3 scoring 60% compared to GPT-3.5 Turbo's 16%.
  • The results also revealed that smaller models have become far more capable, with GPT-4.1 nano outperforming older options while being 25 times cheaper.
  • OpenAI has open-sourced both the evaluations and the test dataset of 5,000 realistic, multi-turn health conversations between models and users; a minimal scoring sketch follows this list.
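
Since the evaluations and dataset are open-sourced, the sketch below gives a rough idea of how physician-written, rubric-style grading of these multi-turn conversations could be aggregated into a score. It is a minimal illustration under assumed field names ("rubric", "points") and an assumed JSONL file name, not OpenAI's published evaluation code.

```python
# Minimal sketch, not OpenAI's official harness: aggregates rubric-style
# grades for multi-turn health conversations into a single score.
# The file name and field names are assumptions made for illustration;
# check the open-sourced dataset for the actual schema.
import json

def score_example(example, criterion_met):
    """Fraction of available rubric points earned for one conversation."""
    rubric = example["rubric"]
    earned = sum(c["points"] for c in rubric if criterion_met(example, c))
    possible = sum(c["points"] for c in rubric if c["points"] > 0)
    return max(0.0, earned / possible) if possible else 0.0

def load_examples(path="healthbench_eval.jsonl"):  # hypothetical file name
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

if __name__ == "__main__":
    examples = load_examples()
    # Placeholder grader: a real run would ask a grader model whether the
    # candidate response satisfies each physician-written criterion.
    always_met = lambda example, criterion: True
    scores = [score_example(ex, always_met) for ex in examples]
    mean = sum(scores) / len(scores) if scores else 0.0
    print(f"Mean score over {len(scores)} conversations: {mean:.3f}")
```
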
Why it matters

There is an overwhelming body of evidence that AI can deliver significant improvements across the board in healthcare settings, and having clinician-validated benchmarks is an important step both in measuring how well each model performs in clinical settings and in deciding when and how to deploy them.
