- AI companies have long used benchmarks to promote their products and services as the best in the industry and to claim superiority over competitors.
- AI benchmarks provide a measure of the technical capabilities of large language models, but are they reliable differentiators for the models that underpin generative AI tools?
The advent of the generative AI era raises pertinent questions. Which large language model (LLM) is the best? And, more importantly, how do you measure it?
AI benchmarking is difficult because LLM tools must be tested for accuracy, veracity, relevance, context, and other subjective parameters, unlike hardware, where computational speed is the defining criterion.
Over the years, several AI benchmarks have been created as technical tests designed to evaluate specific capabilities such as question answering, reasoning, coding, text generation, and image generation.
AI benchmarks also enable objective comparisons, evaluation of features such as summarization and inference, assessment of generalization and robustness in handling complex language structures, and tracking of progress.
AI companies have used these tests to promote their products and services as the best in the industry and to claim superiority over competitors. Recently released LLMs have already surpassed humans on several benchmarks; in other areas, they still do not match us.
For example, Gemini Ultra topped the Massive Multitask Language Understanding (MMLU) benchmark with a score of 90%, followed by Claude 3 Opus (88.2%), Leeroo (86.64%), and GPT-4 (86.4%). MMLU is a 57-subject knowledge test covering elementary mathematics, U.S. history, computer science, law, and more.
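At their core, benchmarks such as MMLU reduce to a simple calculation: the share of multiple-choice questions the model answers correctly. Here is a minimal sketch, using made-up questions and a stubbed-out model rather than real MMLU data or a real LLM call:

```python
# Minimal sketch of how an MMLU-style multiple-choice benchmark is scored:
# accuracy = correct answers / total questions. The questions and the
# model_answer() stub below are hypothetical stand-ins, not real MMLU items.

questions = [
    {"q": "What is 7 * 8?", "choices": ["54", "56", "58", "64"], "answer": "B"},
    {"q": "In which year did the US declare independence?",
     "choices": ["1774", "1775", "1776", "1781"], "answer": "C"},
]

def model_answer(item):
    """Stand-in for querying an LLM; this stub always picks choice 'B'."""
    return "B"

def score(questions):
    # Count items where the model's letter matches the reference answer.
    correct = sum(1 for item in questions if model_answer(item) == item["answer"])
    return correct / len(questions)

print(f"Accuracy: {score(questions):.0%}")  # the stub gets 1 of 2 right: 50%
```

Reported MMLU figures are simply this ratio, averaged over the benchmark's 57 subjects.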
Claude 3 Opus, on the other hand, scored just over 50% on scientific reasoning under the graduate-level GPQA benchmark. GPT-4 Turbo (with an April 2024 knowledge cutoff) scored 46.5%, and GPT-4 Turbo (January 2024) scored over 43%.
So there is some truth to the claim that AI tools are approaching what we imagined. However, because AI benchmarks provide task-specific evaluations, their usefulness for comparing domain-independent, general-purpose applications is limited. So are LLMs really on par with what we expected?
Spiceworks News & Insights examines why AI benchmarks are evaluated inconsistently and why they are ill-suited for direct comparisons.
AI benchmark limitations
AI benchmarking presents multiple challenges related to making general comparisons for LLMs. These include:
1. Lack of standardization
Ralph Meyer, Manager of Engines and Algorithms at Hyland, told Spiceworks that AI benchmarking lacks proper standardization because of the diversity of applications, requirements, and evaluation criteria, a lack of consensus on responsible AI capabilities (transparency, explainability, and data privacy), and resource constraints.
“AI systems are being applied to a wide range of domains and tasks, each with unique requirements and nuances. Developing standardized benchmarks that can accurately capture the performance and limitations of AI models across all these diverse applications is a big challenge,” Meyer said.
“Evaluating state-of-the-art AI models can be prohibitively expensive and time-consuming, especially for independent researchers and small organizations. There is also the added risk of contaminating the training dataset with information used for, or related to, a particular benchmark.”
Rakesh Yadav, founder and CEO of Aidaptive, hopes to see standardization of AI benchmarks in some areas. “I predict that over the next few years, we will see AI benchmarks established for at least a limited set of use cases, and eventually a standard process for continually adapting benchmarks as innovation continues.”
See more: Top 3 LLM comparison: GPT-4 Turbo vs. Claude 3 Opus vs. Gemini 1.5 Pro
2. Most AI benchmarks are outdated
The breakneck speed of LLM development over the past few years has made it difficult for benchmarks to keep up with the latest advances and features. By the time a benchmark is developed and adopted, new models have already outgrown its scope. “This can lead to discrepancies in evaluations,” Meyer added.
For example, a report co-authored by the state-run China Institute of Science and Technology Information notes that U.S. organizations released 11 LLMs in 2020, 30 in 2021, and 37 in 2022. Over the same period, Chinese organizations released 2, 30, and 28 LLMs in 2020, 2021, and 2022, respectively.
By May 2023, US companies had deployed 18 LLMs, and Chinese companies had also launched 18 LLMs.
“We need modern benchmarks that can assess the end-to-end performance of AI systems in real-world applications, including pre-processing, post-processing, and interactions with other systems and humans. That would help bridge the gap between current benchmarks and the broader requirements of deploying AI solutions in complex and dynamic environments,” said Meyer.
“Overall, existing benchmarks have played an important role in advancing AI research and development, but rapid advances in the field, particularly in generative AI, have created the need for new, transparent, and more comprehensive benchmarks that can better assess the capabilities and limitations of modern AI models.”
3. Vested interests
Yadav reiterated that current AI benchmarks are created by organizations with specific commercial objectives. Most prominent technology companies have invested billions of dollars in AI research and companies that build AI tools and services. “Currently, these benchmarks are built by companies with profit-based motives and are inherently biased towards their own business needs (and rightfully so),” Yadav said.
“Ideally, government-funded benchmarks and standards would be established by an unbiased consortium of large companies, with ongoing research to ensure these standards are updated in line with new developments. However, this is a developing field with intense innovation.”
4. Benchmark-specific problems
The picture that AI benchmarks paint is often distorted, given that certain prompt engineering techniques can manipulate the results. An LLM's measured performance depends heavily on how the prompts are constructed.
Google was criticized for claiming that Gemini Ultra outperformed OpenAI's GPT-4. The criticism (and some ridicule) stemmed from the fact that the company used the chain-of-thought CoT@32 prompt engineering technique to obtain a higher MMLU benchmark score, rather than the standard 5-shot setting used for GPT-4.
“This is pretty weird … usually when we benchmark … we compare results from the exact same tests … I noticed other people mentioning this,” Brian Kyritz (@kyritzb) posted on X on December 6, 2023.
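For context, the two evaluation setups at issue differ in how the prompt is built and how many completions are sampled. The sketch below is a hypothetical illustration, not Google's actual evaluation harness: 5-shot prepends five worked examples and scores a single answer, while CoT@32 asks the model to reason step by step, samples 32 completions, and takes the majority answer.

```python
from collections import Counter

def five_shot_prompt(examples, question):
    # 5-shot: the prompt contains five worked examples before the real question.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:5])
    return f"{shots}\n\nQ: {question}\nA:"

def cot_majority_vote(sample_fn, question, n=32):
    # CoT@32: sample n chain-of-thought completions, return the majority answer.
    prompt = f"Q: {question}\nLet's think step by step."
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for an LLM: 32 canned answers (24 "B", 8 "C").
stream = iter(["B", "C", "B", "B"] * 8)
stub_model = lambda prompt: next(stream)

print(cot_majority_vote(stub_model, "Which option is correct?"))  # prints "B"
```

Majority voting over many sampled completions can lift a noisy model's score, which is why comparing a CoT@32 number against a 5-shot number drew criticism.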
AI benchmarks can also have biases and limitations, Meyer said. “Benchmarks often have inherent biases and limitations that can skew the results. For example, multiple-choice tests can be vulnerable: even small changes, like reordering the answer choices, can have a big impact on an LLM's score.”
An LLM, on the other hand, is only as good as the data used to train it. Test data often differs from real-world data, which can cause major problems after the model launches. “Many LLMs are trained on huge datasets that can overlap with the data used in benchmarks. Models may simply memorize and regurgitate test examples, so high benchmark scores may not accurately reflect real-world performance.”
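The overlap Meyer describes can be probed with a crude contamination check: flag a benchmark item if any of its word n-grams also appears in the training corpus. The following is a simplified, hypothetical sketch with made-up strings; real contamination audits use far larger corpora and more robust matching:

```python
# Rough contamination check: a benchmark item is suspect if it shares any
# word n-gram (here, 8 consecutive words) with the training corpus.

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def possibly_contaminated(benchmark_item, training_corpus, n=8):
    # Non-empty intersection of n-gram sets means a verbatim overlap exists.
    return bool(ngrams(benchmark_item, n) & ngrams(training_corpus, n))

corpus = "the quick brown fox jumps over the lazy dog and runs far away"
leaked = "we saw the quick brown fox jumps over the lazy dog yesterday"
clean = "an entirely different sentence about benchmark contamination risks here"

print(possibly_contaminated(leaked, corpus))  # True  (shares an 8-gram)
print(possibly_contaminated(clean, corpus))   # False
```

A model that scores well only because benchmark items leaked into its training set would pass the public test while failing on genuinely unseen data.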
Yadav agrees. He cites a lack of training data on real-world use cases as a key reason AI benchmarks can fail to evaluate LLMs. “The only time we find flaws is when the model doesn't work in the real world after launch. There is a lot of media coverage when big companies' models don't work, but not much information available to help prevent that,” Yadav said.
See more: Exploring the Growth of AI: KubeCon + CloudNativeCon Europe 2024
5. Benchmark scope is narrow
AI benchmarks can also be too narrow in scope when used individually, since they are often designed to evaluate specific tasks. Yadav and Meyer cite this as a limiting factor: AI benchmarks cannot broadly measure the overall behavior of an LLM.
“Many current benchmarks are narrowly focused on specific tasks, such as image classification or natural language processing, and fail to capture the broader capabilities of AI systems in real-world applications. They often overlook important aspects such as handling ambiguity and adversarial input, and interaction with humans and other systems,” Meyer said.
He added that AI benchmarks need to go beyond measuring inference performance and evaluate the end-to-end performance of AI systems in real-world applications. Otherwise, the ability to weigh the applicability of different LLMs across multiple use cases may be severely limited.
“Depending on the model or use case, there may not be an accurate benchmark to evaluate its effectiveness,” Yadav said. “Unless real problems and challenges arise, it is impossible to assess whether a particular model will perform well enough for a new use case.”
The future of AI benchmarking
AI benchmarks are needed to evaluate an LLM's ability to process open-ended queries and generate logical responses that stay within the context of the interaction. AI benchmarks also need to evolve to assess how models interact with humans in a natural and engaging way, which calls for composite assessments.
“As AI systems are increasingly expected to process and integrate information from multiple modalities such as text, images, audio, and video, benchmarks need to evaluate the ability of AI models to perform cross-modal inference, understanding, and generation tasks. These tasks are critical for applications such as virtual assistants, content creation, and multimedia analysis, and such capabilities are critical for seamless multimodal AI/human interactions,” Meyer said.
To achieve this, AI benchmarks must be developed independently, in consultation with the industry. “The larger focus, in my opinion, is that AI benchmarking needs a dedicated organization focused on driving innovation to establish and maintain benchmarks for specific use cases,” Yadav continued.
“This requires a combination of people who are technically adept at large-scale data processing (a small list), have a deep understanding of machine learning (again, a very small group), and are also experts in the domain in which the model is being applied. That makes it easier said than done.”
“There are several ideas that could address this, including fostering active learning by maintaining a continuous feedback loop from domain experts and incorporating that knowledge into the next training run. Testing LLMs in combination with other machine learning techniques, such as reinforcement learning, can also be very helpful.”
What problems do you see with current AI benchmarks? Share with us on LinkedIn, X, or Facebook. We look forward to hearing from you!
Image source: Shutterstock