- Status: Planned Program
Background
Generative AI applications such as ChatGPT or Midjourney are currently attracting a great deal of attention. Because they generate complex, multimodal outputs (e.g. text, image, audio, video) from free-form inputs (prompts), these models can be used in a wide range of application areas without prior technical knowledge. Given their great application potential, increasing adoption of generative AI models in the domains of internal and external security is foreseeable. The foundation models powering generative AI applications are trained at great expense, mostly by private-sector companies in the USA and China. Their underlying data sets, training mechanisms and model architectures are usually not (or no longer) published. The high application potential of foundation models is thus offset by a currently high degree of technological dependency and by risks to cyber security and application security.
Evaluations and comparisons in the form of benchmarks help to better assess the properties of externally trained models. Due to the high versatility and unstructured outputs of these models, however, such evaluations pose a complex problem that takes on additional urgency in the security context. Holistic benchmarking in particular remains an open and increasingly relevant research question, given the recent strong growth in the capabilities of large AI models.
Aim
As part of a competition, comprehensive benchmark sets, consisting of tasks, metrics and suitable test data sets, are to be developed that enable a holistic evaluation of pre-trained generative AI foundation models (e.g. text-image models) for a given use case. In addition, foundation models are to be adapted to this use case (via fine-tuning or in-context learning), evaluated against the various benchmarks developed, and implemented in the form of an application demonstrator. Beyond this, conceptual insights are to be gained into the fundamental problem of evaluating universally applicable AI systems.
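To make the intended structure more concrete, the following sketch shows one possible way to represent such a benchmark set as tasks, metrics and test data, and to score a pre-trained or adapted generative model against it. The names and the scoring scheme are illustrative assumptions, not part of the programme specification.

```python
# Illustrative sketch only: a benchmark set as tasks + metrics + test data,
# and a simple evaluation loop over a generative model.
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class BenchmarkTask:
    name: str
    test_data: Sequence[tuple[Any, Any]]   # (prompt, reference) pairs for this task
    metric: Callable[[Any, Any], float]    # maps (model output, reference) to a score in [0, 1]


@dataclass
class BenchmarkSet:
    name: str
    tasks: list[BenchmarkTask]


def evaluate(model: Callable[[Any], Any], benchmark: BenchmarkSet) -> dict[str, float]:
    """Run the model over every task and report the mean metric score per task."""
    results: dict[str, float] = {}
    for task in benchmark.tasks:
        scores = [task.metric(model(prompt), reference) for prompt, reference in task.test_data]
        results[task.name] = sum(scores) / len(scores) if scores else 0.0
    return results
```

A use-case-specific benchmark would then supply concrete prompts, references and metrics (for a text-image model, for instance, an embedding-based similarity score), while adaptation via fine-tuning or in-context learning only changes the model callable passed to `evaluate()`.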
Disruptive Risk Research
The development of benchmarks, the adaptation of foundation models and their realization as demonstrators take place in a unique competitive constellation in which each participant is compared directly with all other participants in terms of both benchmark and model development. Each model is evaluated and ranked against all benchmarks developed, and all benchmarks are in turn evaluated separately with respect to their characteristics. As a high-risk research challenge, the competition carries the possibility that no sufficiently suitable evaluation mechanisms will be found for certain AI systems under certain (holistic) requirements, since each benchmark is by definition specific, finite and contextual.
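The cross-evaluation underlying this constellation can be pictured as a score matrix of models against benchmarks. The sketch below illustrates one possible ranking scheme (mean rank across benchmarks); it is an assumption for illustration, not the competition's actual procedure.

```python
# Illustrative sketch only: score every model on every benchmark and derive an overall order.
from typing import Callable, Mapping


def cross_evaluate(models: Mapping[str, object],
                   benchmarks: Mapping[str, Callable[[object], float]]) -> dict[str, dict[str, float]]:
    """Score matrix scores[benchmark][model]; each benchmark is modelled as a scoring callable."""
    return {bench_name: {model_name: bench(model) for model_name, model in models.items()}
            for bench_name, bench in benchmarks.items()}


def overall_ranking(scores: dict[str, dict[str, float]]) -> list[str]:
    """Order models by their mean rank across all benchmarks (rank 1 = best on a benchmark)."""
    ranks: dict[str, list[int]] = {}
    for per_benchmark in scores.values():
        ordered = sorted(per_benchmark, key=per_benchmark.get, reverse=True)
        for position, model_name in enumerate(ordered, start=1):
            ranks.setdefault(model_name, []).append(position)
    return sorted(ranks, key=lambda m: sum(ranks[m]) / len(ranks[m]))
```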