Artificial intelligence (AI) has made remarkable strides, achieving near-perfect scores on many technical performance benchmarks. However, these impressive numbers don’t always translate into AI tools that are truly useful or aligned with human expectations. Vanessa Parli, Associate Director of Research Programs at the Stanford Institute for Human-Centered AI (HAI), highlights this critical gap, drawing insights from the latest AI Index.
Parli points to the widespread excitement around tools like ChatGPT as a prime example. “There’s been a lot of excitement, and it meets some of these benchmarks quite well,” she acknowledges. “But when you actually use the tool, it gives incorrect answers, says things we don’t want it to say, and is still difficult to interact with.” This sentiment is echoed in the findings of the newest AI Index, published on April 3rd. A comprehensive analysis of over 50 benchmarks across various domains like vision, language, and speech revealed a surprising trend: AI performance is plateauing on many established metrics.
“Most of the benchmarks are hitting a point where we cannot do much better, 80-90% accuracy,” Parli explains. This saturation raises a fundamental question: Are these benchmarks truly measuring what matters as AI becomes more deeply integrated into our lives? Parli argues for a shift in perspective, emphasizing the need to develop new machine learning benchmarks that reflect how humans and society want to interact with AI.
In this discussion, we delve deeper into Parli’s insights on the trends in AI benchmarking and explore the crucial need for more nuanced and comprehensive evaluation methods.
Understanding AI Benchmarks: Defining the Goals
What exactly is an AI benchmark? In essence, a benchmark serves as a defined objective for an AI system to achieve: a way to quantify a system's capabilities and guide development efforts toward specific goals. A classic example, mentioned by Parli, is ImageNet. Initiated by HAI Co-Director Fei-Fei Li, ImageNet is a massive dataset containing over 14 million labeled images. Researchers use ImageNet to train and test image classification models, aiming to maximize the accuracy with which images are identified. That accuracy score then becomes a key metric for evaluating and comparing machine learning models.
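For readers unfamiliar with how such a score is produced, the sketch below shows the basic arithmetic behind a top-1 accuracy metric of the kind ImageNet-style evaluations report. It is purely illustrative: the class labels and predictions are made up, and this is not the official ImageNet evaluation harness.

```python
# Minimal sketch of a top-1 accuracy metric: compare each predicted class
# with the labeled ground truth and report the fraction that match.
# Illustrative only; the labels and predictions below are hypothetical.

def top1_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of examples whose predicted class matches the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

if __name__ == "__main__":
    preds = ["tabby cat", "golden retriever", "pickup truck", "tabby cat", "toaster"]
    truth = ["tabby cat", "golden retriever", "pickup truck", "siamese cat", "toaster"]
    print(f"top-1 accuracy: {top1_accuracy(preds, truth):.1%}")  # 80.0%
```

Benchmark leaderboards report exactly this kind of single number, which is what makes them easy to compare and, as the AI Index found, easy to saturate.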
The AI Index Findings: Benchmarking Progress and Saturation
The AI Index study undertook a comprehensive review of numerous technical benchmarks developed over the last decade, spanning areas such as computer vision and natural language processing. The research team evaluated year-over-year progress on these benchmarks, assessing whether AI systems continued to improve on previous state-of-the-art performance. They analyzed approximately 50 benchmarks, including ImageNet, the SuperGLUE language benchmark, and the MLPerf hardware benchmark, with more than 20 featured in the final report.
The findings revealed a significant shift in the trajectory of AI progress. In earlier years, the field witnessed substantial year-over-year improvements across various benchmarks. However, the latest data indicates a trend of diminishing returns. “This year across the majority of the benchmarks, we saw minimal progress to the point we decided not to include some in the report,” Parli notes. Illustrating this point, she highlights ImageNet: while the top image classification system achieved 91% accuracy in 2021, the improvement in 2022 was a mere 0.1 percentage point.
This near-saturation across many benchmarks suggests that current metrics may be reaching their limits in driving meaningful advancements in artificial intelligence.
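To make the notion of diminishing year-over-year gains concrete, here is a rough sketch of the comparison described above, using the ImageNet figures quoted in the article (91% top-1 accuracy in 2021, a 0.1-percentage-point gain in 2022). The 1-percentage-point saturation cutoff is an arbitrary assumption for illustration, not the AI Index's actual methodology.

```python
# Rough sketch: track state-of-the-art (SOTA) scores per year and flag
# benchmarks whose annual gain has shrunk below a chosen threshold.
# Figures mirror those quoted in the article; the cutoff is assumed.

SOTA_BY_YEAR = {
    "ImageNet top-1 accuracy (%)": {2021: 91.0, 2022: 91.1},
}

SATURATION_THRESHOLD_PP = 1.0  # assumed cutoff, in percentage points

for benchmark, scores in SOTA_BY_YEAR.items():
    years = sorted(scores)
    for prev, curr in zip(years, years[1:]):
        gain = scores[curr] - scores[prev]
        status = "near saturation" if gain < SATURATION_THRESHOLD_PP else "still improving"
        print(f"{benchmark}: {prev} -> {curr}: {gain:+.1f} pp ({status})")
```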
Interestingly, the AI Index also revealed instances where AI performance surpasses human baselines, even when benchmarks haven’t reached ceiling accuracy. The Visual Question Answering Challenge, for example, tests AI systems’ ability to answer open-ended questions about images. In the latest evaluations, the top-performing AI model achieved 84.3% accuracy, exceeding the human baseline of approximately 80%. This further emphasizes the complexity of evaluating AI – outperforming humans on narrow tasks doesn’t necessarily equate to overall human-level intelligence or usability.
The Critical Need for Next-Generation Benchmarks
The stagnation in benchmark progress and the limitations of current metrics raise crucial questions for AI researchers and developers. As Parli points out, “Our AI tools right now are not exactly as we would want them to be – they give wrong information, they create sexist imagery.” If benchmarks are intended to guide AI development towards desirable outcomes, it becomes essential to re-evaluate what those outcomes should be. How do we envision interacting with AI, and how should AI interact with us?
Current benchmarks often optimize for a single objective, typically accuracy. However, as AI systems become more sophisticated and more deeply integrated, spanning vision, language, and other modalities, a more holistic evaluation approach is needed. Parli suggests benchmarks that assess trade-offs between factors such as accuracy and bias, or toxicity and fairness. Incorporating social factors is equally important: many critical aspects of AI behavior, such as ethical considerations and societal impact, are not easily quantified by traditional benchmarks. This calls for a fundamental re-evaluation of what we expect and demand from AI tools, and of how we measure their progress.
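As one way to picture such a trade-off-aware benchmark, the sketch below reports accuracy alongside toxicity and bias for two models and folds them into a single composite score. Everything here is hypothetical, including the model names, scores, metric definitions, and weights; a real multi-objective evaluation would define these dimensions far more carefully.

```python
# Illustrative multi-objective evaluation: report several dimensions side by
# side so trade-offs (e.g., accuracy vs. toxicity) stay visible.
# All model names, metric values, and weights are hypothetical.

from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float       # higher is better
    toxicity_rate: float  # lower is better
    bias_gap: float       # lower is better (score gap across demographic groups)

results = {
    "model_a": EvalResult(accuracy=0.91, toxicity_rate=0.08, bias_gap=0.12),
    "model_b": EvalResult(accuracy=0.88, toxicity_rate=0.02, bias_gap=0.04),
}

def composite_score(r: EvalResult, w_acc=0.5, w_tox=0.25, w_bias=0.25) -> float:
    """One of many possible ways to fold the trade-off into a single number."""
    return w_acc * r.accuracy + w_tox * (1 - r.toxicity_rate) + w_bias * (1 - r.bias_gap)

for name, r in results.items():
    print(f"{name}: acc={r.accuracy:.2f}  toxicity={r.toxicity_rate:.2f}  "
          f"bias_gap={r.bias_gap:.2f}  composite={composite_score(r):.3f}")
# Here model_b edges out model_a on the composite despite lower raw accuracy.
```

The particular weighting matters less than the fact that the trade-off remains visible rather than being collapsed into accuracy alone.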
HELM: A Step Towards Comprehensive AI Evaluation
Recognizing the limitations of existing benchmarks, researchers are beginning to explore more comprehensive evaluation frameworks. Parli highlights HELM (Holistic Evaluation of Language Models), developed at Stanford HAI’s Center for Research on Foundation Models (CRFM), as a promising example. HELM is designed to assess AI models across a wider range of scenarios and tasks, going beyond simple accuracy metrics. It considers factors such as fairness, toxicity, efficiency, and robustness, providing a more multi-faceted view of AI performance.
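To illustrate the shape of such a report, the toy sketch below arranges hypothetical scores in a scenario-by-metric grid and averages each metric across scenarios. It is not HELM's actual interface or data; it only mimics the multi-scenario, multi-metric structure described above.

```python
# Toy illustration of a holistic, HELM-style report: one model evaluated on
# several scenarios, each scored along several dimensions. Not HELM's real
# API or results; scenario names and scores are invented for illustration.

from collections import defaultdict

# results[scenario][metric] = score (all values hypothetical)
results = {
    "question_answering": {"accuracy": 0.78, "robustness": 0.71, "toxicity": 0.03},
    "summarization":      {"accuracy": 0.66, "robustness": 0.62, "toxicity": 0.05},
    "sentiment_analysis": {"accuracy": 0.91, "robustness": 0.85, "toxicity": 0.01},
}

# Average each metric across scenarios so that no single scenario, and no
# single metric, hides weaknesses elsewhere.
per_metric = defaultdict(list)
for scenario, metrics in results.items():
    for metric, score in metrics.items():
        per_metric[metric].append(score)

for metric, scores in per_metric.items():
    print(f"{metric}: mean {sum(scores) / len(scores):.3f} across {len(scores)} scenarios")
```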
HELM represents a crucial step towards aligning AI benchmarks with broader human and societal values. As benchmarks significantly influence the direction of AI development, adopting more comprehensive and nuanced evaluation methods is essential to ensure that AI evolves in a way that is beneficial and responsible. The future of AI benchmarking lies in moving beyond narrow technical metrics and embracing a more holistic approach that reflects the complex interplay between AI, humans, and society.
The AI Index, an independent initiative at Stanford HAI, serves as a leading source of data and insights on AI, informing policymakers, researchers, and the public. Stanford HAI’s mission is to advance AI research, education, policy, and practice to improve the human condition. Learn more at https://hai.stanford.edu/.