The Department of Defense's Chief Digital and Artificial Intelligence Office (CDAO) has tapped Scale AI to produce a reliable means of testing and evaluating large language models that can support, and potentially disrupt, military planning and decision-making.
According to a statement the San Francisco-based company shared exclusively with DefenseScoop, the results of this new one-year contract will give CDAO a framework for deploying AI securely by measuring model performance, providing real-time feedback to warfighters, and creating specialized public sector evaluation sets to test AI models for military support applications, such as organizing the findings of after-action reports.
Large language models and the broader field of generative AI comprise emerging technologies that can generate persuasive, but not necessarily accurate, text, software code, images, and other media based on prompts from humans.
This rapidly evolving area holds great promise for the Department of Defense, but it also presents unknown and potentially serious challenges. Last year, Pentagon leadership launched Task Force Lima within CDAO's Algorithmic Warfare Directorate to accelerate the understanding, evaluation, and deployment of the building blocks of generative artificial intelligence.
The department has long relied on extensive test and evaluation (T&E) processes to assess and ensure that its systems, platforms, and technologies operate safely and reliably before they are fully fielded. However, AI safety standards and policies have not yet been universally set, and the complexity and uncertainty associated with large language models make T&E even more complicated when it comes to generative AI.
Broadly speaking, T&E allows experts to determine the baseline performance of a particular model.
For example, to test and evaluate a computer vision algorithm that distinguishes between images of dogs, images of cats, and images of things that are neither, an official might first train the algorithm on millions of different pictures of those animals and of objects that are not dogs or cats. In doing so, the experts also hold back a diverse subset of the data that they can present to the algorithm later.
They can then assess the model's output on that evaluation dataset against the test set, or “ground truth,” and ultimately determine the failure rate: how often the model cannot tell whether something is or is not one of the classes it is trying to identify.
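As a rough illustration of that held-out comparison, the sketch below computes an overall and a per-class failure rate in Python. The labels, the sample predictions, and the function names are invented for illustration and do not come from CDAO or Scale AI.

```python
from collections import Counter


def failure_rate(predictions, ground_truth):
    """Fraction of held-out examples the model labeled incorrectly."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be the same length")
    if not ground_truth:
        raise ValueError("the held-out set is empty")
    failures = sum(p != t for p, t in zip(predictions, ground_truth))
    return failures / len(ground_truth)


def failures_by_class(predictions, ground_truth):
    """Failure rate broken down by the true class of each example."""
    totals = Counter(ground_truth)
    misses = Counter(t for p, t in zip(predictions, ground_truth) if p != t)
    return {label: misses[label] / totals[label] for label in totals}


# Hypothetical held-out labels ("ground truth") and model outputs.
held_out_labels = ["dog", "cat", "neither", "dog", "cat", "neither", "dog", "cat"]
model_outputs = ["dog", "cat", "dog", "dog", "neither", "neither", "cat", "cat"]

overall = failure_rate(model_outputs, held_out_labels)
per_class = failures_by_class(model_outputs, held_out_labels)
print(f"overall failure rate: {overall:.3f}")
print(per_class)
```

Breaking the rate down by class shows where a model struggles, which a single overall number can hide.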
Scale AI experts plan to take a similar approach to T&E with large language models, but because the models are generative in nature and the English language is difficult to assess, these complex systems do not have the same level of “ground truth.” For example, an LLM asked to supply five different responses might be mostly factually accurate in all five, yet contrasting sentence structures could change the meaning of each output.
As such, part of the effort to develop frameworks, methodologies, and technologies that CDAO can use to test and evaluate large language models involves creating “holdout datasets.” For these datasets, DoD insiders write prompt-response pairs and adjudicate them through multiple layers of review, all to ensure that each response is as good as would be expected from a human in the military.
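One way to picture such a holdout dataset is as a list of prompt-response pairs, each carrying the verdicts of its reviewers. The Python sketch below is purely illustrative: the `PromptResponsePair` class, the rule that every review layer must approve, and the sample entries are assumptions, not a description of how Scale AI or CDAO actually adjudicates pairs.

```python
from dataclasses import dataclass, field


@dataclass
class PromptResponsePair:
    prompt: str
    response: str
    # One verdict per review layer; True means that reviewer accepted the response.
    verdicts: list[bool] = field(default_factory=list)

    def accepted(self, required_layers: int) -> bool:
        """Accept only if enough layers reviewed the pair and all approved (assumed rule)."""
        return len(self.verdicts) >= required_layers and all(self.verdicts)


def build_holdout(pairs, required_layers=2):
    """Keep only the pairs that passed every required layer of review."""
    return [pair for pair in pairs if pair.accepted(required_layers)]


# Hypothetical candidate pairs with placeholder text.
candidates = [
    PromptResponsePair("Summarize report A.", "Reference summary A.", [True, True]),
    PromptResponsePair("Summarize report B.", "Reference summary B.", [True, False]),
    PromptResponsePair("Summarize report C.", "Reference summary C.", [True]),
]

holdout = build_holdout(candidates)
print([pair.prompt for pair in holdout])
```

In this sketch, a pair that any reviewer rejects, or that has not yet been through every layer, stays out of the holdout set.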
The entire process will be iterative in nature.
Once DoD-relevant datasets on world knowledge, truthfulness, and other topics are created and refined, experts will be able to evaluate existing large language models against them.
Ultimately, once these holdout datasets are available, experts will be able to run evaluations and establish model cards: short documents that describe the contexts in which different machine learning models are best used and provide information for measuring their performance.
Officials plan to automate as much of this development as possible, so that as new models emerge, there can be a baseline understanding of how they will perform, where they will perform best, and where they are likely to start failing.
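A model card can be thought of as a small structured record rendered into a short document. The Python sketch below shows one possible shape. The field names, the example model, and the scores are all hypothetical, and real model cards typically carry considerably more detail.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    model_name: str
    intended_contexts: list[str]
    known_weaknesses: list[str]
    # Scores from holdout evaluations, keyed by topic (hypothetical values below).
    scores: dict[str, float] = field(default_factory=dict)

    def render(self) -> str:
        """Render the card as a short plain-text document."""
        lines = [f"Model card: {self.model_name}", "Best used for:"]
        lines += [f"  - {context}" for context in self.intended_contexts]
        lines.append("Known weaknesses:")
        lines += [f"  - {weakness}" for weakness in self.known_weaknesses]
        lines.append("Holdout evaluation scores:")
        lines += [f"  - {name}: {value:.2f}" for name, value in sorted(self.scores.items())]
        return "\n".join(lines)


card = ModelCard(
    model_name="example-model",
    intended_contexts=["summarizing reports"],
    known_weaknesses=["unfamiliar terminology"],
    scores={"truthfulness": 0.8, "world_knowledge": 0.75},
)
card_text = card.render()
print(card_text)
```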
Further along in the process, the ultimate goal is for models to essentially send a signal to the CDAO officials involved if they begin to stray from the domains they have been tested against.
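How such a signal would work has not been described. As a deliberately simple stand-in, the Python sketch below flags a prompt when too few of its words appear in the prompts a model was tested on. The `DomainMonitor` class, the word-overlap heuristic, and the 0.5 threshold are all assumptions made for illustration. A production system would likely rely on far richer signals.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DomainMonitor:
    """Flags prompts whose vocabulary overlaps little with the tested prompts."""

    def __init__(self, tested_prompts, min_overlap=0.5):
        self.vocabulary = set()
        for prompt in tested_prompts:
            self.vocabulary |= tokenize(prompt)
        self.min_overlap = min_overlap  # assumed threshold, not a known value

    def overlap(self, prompt):
        tokens = tokenize(prompt)
        if not tokens:
            return 0.0
        return len(tokens & self.vocabulary) / len(tokens)

    def out_of_domain(self, prompt):
        """Return True when the prompt should raise a signal for reviewers."""
        return self.overlap(prompt) < self.min_overlap


monitor = DomainMonitor([
    "summarize the after action report",
    "list the findings in the report",
])
familiar_flag = monitor.out_of_domain("summarize the findings in the report")
unfamiliar_flag = monitor.out_of_domain("write a poem about autumn leaves")
print(familiar_flag, unfamiliar_flag)
```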
According to a statement from Scale AI, the effort will help the Department of Defense mature its T&E policies to address generative AI by measuring and evaluating quantitative data through benchmarking and by evaluating qualitative feedback from users. It will also help identify generative AI models that are ready to support military applications with accurate and relevant results, using Department of Defense terminology and knowledge bases. The rigorous T&E process aims to strengthen the robustness and resilience of AI systems in sensitive environments and enable the deployment of LLM technology in secure environments, the statement says.
In addition to CDAO, the company has partnerships with Meta, Microsoft, the U.S. Army, Defense Innovation Unit, OpenAI, General Motors, Toyota Research Institute, Nvidia, and more.
“Testing and evaluating generative AI will help the Department of Defense understand the strengths and limitations of the technology and ensure it can be deployed responsibly,” Alexandr Wang, founder and CEO of Scale AI, said in a statement.