Quick answer

Article 55(1)(a) of the EU AI Act requires providers of general-purpose AI models with systemic risk to perform model evaluation in accordance with standardised protocols and tools reflecting the state of the art, including conducting and documenting adversarial testing to identify and mitigate systemic risks. The duty has applied since August 2, 2025.

Updated June 2026 · MmowW AI Compliance

EU AI Act Model Evaluation and Adversarial Testing: Article 55 in Practice

Where the Evaluation Duty Comes From

Regulation (EU) 2024/1689 reserves its most demanding model-level obligations for general-purpose AI models classified as presenting systemic risk under Article 51 — in practice, models whose training compute exceeds 10^25 FLOPs or which the Commission designates. For those models, Article 55(1)(a) requires providers to perform model evaluation in accordance with standardised protocols and tools reflecting the state of the art, including conducting and documenting adversarial testing of the model with a view to identifying and mitigating systemic risks.
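
As a rough illustration of how a provider might screen a planned training run against the compute presumption, the sketch below uses the common "6 x parameters x tokens" rule of thumb for dense transformer training. That heuristic is an assumption for illustration only and does not come from the Regulation; the function names and model figures are hypothetical.

```python
# Compute presumption referenced in Article 51: training compute greater than 10^25 FLOPs.
SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25


def estimate_training_flops(n_parameters: float, n_training_tokens: float) -> float:
    """Rough estimate via the 6 * N * D heuristic (an assumption, not a legal method)."""
    return 6.0 * n_parameters * n_training_tokens


def exceeds_compute_presumption(training_flops: float) -> bool:
    """True only when the estimate is strictly greater than the threshold."""
    return training_flops > SYSTEMIC_RISK_FLOP_THRESHOLD


if __name__ == "__main__":
    # Hypothetical run: 400 billion parameters trained on 15 trillion tokens.
    flops = estimate_training_flops(400e9, 15e12)
    print(f"estimated FLOPs: {flops:.2e}, above threshold: {exceeds_compute_presumption(flops)}")
```

An estimate close to the threshold would call for a more careful accounting of actual compute rather than reliance on a heuristic like this one.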

This single sentence carries three distinct engineering programmes: capability evaluation (what can the model do), risk evaluation (what harm could those capabilities enable), and adversarial testing or red-teaming (what happens when capable, motivated testers try to make the model misbehave). All three must be documented, because the results feed the Annex XI technical documentation — Section 2 of which expressly requires a detailed description of evaluation strategies, evaluation results, and a description of adversarial testing measures.

What Counts as Systemic Risk

Article 3(65) defines systemic risk as a risk specific to the high-impact capabilities of GPAI models, having a significant impact on the Union market due to their reach or due to actual or reasonably foreseeable negative effects on public health, safety, public security, fundamental rights, or society as a whole, that can be propagated at scale across the value chain. Recital 110 gives examples of the risk areas evaluations typically cover: facilitation of chemical, biological, radiological and nuclear weapons development; offensive cyber capabilities; harmful loss-of-control scenarios; large-scale discrimination, disinformation and manipulation. Evaluation programmes are expected to map model capabilities against these categories.
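
One way to make that mapping checkable is to record which risk areas each evaluation suite covers and flag the gaps. The sketch below is a minimal illustration: the four enum members mirror the risk areas listed above, while the member names, suite names and the helper function are hypothetical.

```python
from enum import Enum


class RiskArea(Enum):
    # Labels are illustrative groupings of the risk areas named in the text.
    CBRN = "chemical, biological, radiological and nuclear weapons development"
    CYBER_OFFENCE = "offensive cyber capabilities"
    LOSS_OF_CONTROL = "harmful loss-of-control scenarios"
    SOCIETAL_HARM = "large-scale discrimination, disinformation and manipulation"


def uncovered_risk_areas(suite_coverage: dict[str, set[RiskArea]]) -> set[RiskArea]:
    """Return the risk areas that no evaluation suite claims to cover."""
    covered = set().union(*suite_coverage.values())
    return set(RiskArea) - covered


if __name__ == "__main__":
    # Hypothetical suites and their declared coverage.
    coverage = {
        "bio_uplift_suite": {RiskArea.CBRN},
        "ctf_cyber_suite": {RiskArea.CYBER_OFFENCE},
    }
    for area in sorted(uncovered_risk_areas(coverage), key=lambda a: a.name):
        print("no suite covers:", area.value)
```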

How the Code of Practice Operationalises Testing

The Safety and Security chapter of the GPAI Code of Practice, published on July 10, 2025, is the most concrete public articulation of what compliant evaluation looks like. Signatories commit to adopt a safety and security framework that defines risk acceptance criteria; to run systemic risk assessments at appropriate points across the model lifecycle, including before placing the model on the market; to use rigorous methodologies — benchmark suites, capability elicitation with fine-tuning and tool access, human uplift studies where relevant; to involve independent external evaluators where appropriate, particularly for the most capable models; to monitor the model after release and reassess when significant changes occur; and to document all of it in safety reports.

Two practical principles from this framework deserve emphasis. First, elicitation matters: an evaluation that tests only the unmodified chat interface understates capability, so serious programmes test with jailbreaks, fine-tuning, scaffolding and tool access to approximate what a determined misuser could achieve. Second, evaluation is a lifecycle activity: a model that is updated, given new tools, or deployed in new modalities needs reassessment, not a one-time pre-launch gate.
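
Both principles can be encoded in simple tooling. The sketch below, under stated assumptions, reports a capability score as the highest value observed across elicitation levels rather than the chat-interface value alone, and flags when a change event should reopen the assessment. The level names, event names and score scale are all hypothetical and are not taken from the Code of Practice.

```python
# Hypothetical change events that reopen an assessment, following the text:
# a model update, new tools, or a new modality.
REASSESSMENT_TRIGGERS = frozenset({"weights_updated", "new_tool_access", "new_modality"})


def elicited_capability(scores_by_level: dict[str, float]) -> float:
    """Report the highest score seen across all elicitation levels tested."""
    if not scores_by_level:
        raise ValueError("at least one elicitation level must be evaluated")
    return max(scores_by_level.values())


def needs_reassessment(change_events: set[str]) -> bool:
    """True if any recorded change matches a reassessment trigger."""
    return bool(REASSESSMENT_TRIGGERS & change_events)


if __name__ == "__main__":
    # Hypothetical scores on a 0-1 scale for one risk category.
    scores = {"chat_interface": 0.12, "jailbreak": 0.31, "fine_tuning": 0.58}
    print("reported capability:", elicited_capability(scores))
    print("reassess after adding browsing tools:", needs_reassessment({"new_tool_access"}))
```

Taking the maximum reflects the point that the unmodified chat interface understates what a determined misuser could achieve.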

Adversarial Testing in Practice

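Adversarial testing, often called red-teaming, complements benchmark evaluation with open-ended, attacker-minded probing. A defensible programme typically combines internal red-team exercises, external testing by contracted specialists, and automated adversarial pipelines that replay a growing attack library.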

Who Must Comply and Who Verifies

The duty binds providers of GPAI models with systemic risk, wherever established, once the model is placed on the EU market. Supervision is centralised: the AI Office, within the European Commission, is the exclusive supervisor for GPAI model obligations. Under Article 92 the Commission can conduct evaluations of a model itself, including through independent experts and the scientific panel, to assess compliance or investigate Union-level risks, which means a provider's own evaluation record may be checked against an external one. From August 2, 2026 the Commission can impose fines under Article 101 of up to 3 percent of total worldwide annual turnover in the preceding financial year or 15 million euros, whichever is higher.
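
The "whichever is higher" rule means the 15 million euro figure acts as a floor on the ceiling for providers with turnover below 500 million euros. A minimal arithmetic sketch, with a hypothetical function name and turnover figures:

```python
def max_gpai_fine_eur(worldwide_annual_turnover_eur: float) -> float:
    """Upper bound on the fine: the higher of 3 percent of turnover or EUR 15 million."""
    three_percent = worldwide_annual_turnover_eur * 3 / 100
    return max(three_percent, 15_000_000)


if __name__ == "__main__":
    # Hypothetical turnover figures in euros.
    for turnover in (100_000_000, 500_000_000, 2_000_000_000):
        print(f"turnover {turnover:>13,} -> ceiling {max_gpai_fine_eur(turnover):,.0f}")
```

This computes only the statutory ceiling; the amount actually imposed in a given case is a matter for the Commission.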

Providers below the systemic-risk threshold carry no Article 55 duties, but many adopt scaled-down evaluation programmes anyway: downstream enterprise customers ask for evaluation evidence, and Article 53(1)(a) requires GPAI model technical documentation, which must contain at least the Annex XI information, to cover the testing process and the results of evaluation.

Practical Steps to Build the Programme

Common Pitfalls

Evaluation programmes fail in predictable ways. Benchmark theatre is the first: running public benchmark suites whose questions have leaked into training data, and reporting reassuring scores that measure memorisation rather than capability. Under-elicitation is the second: testing the safety-tuned chat endpoint while ignoring what the same weights can do after light fine-tuning — a gap that matters legally, because systemic risk attaches to the model, not to one deployment surface. The third is organisational: findings that never reach decision-makers, or reach them after the launch date is committed. A framework is only state of the art if a red flag can actually stop or delay a release. The fourth is documentation debt: evaluations run informally in notebooks, with results that cannot be reconstructed six months later when the AI Office asks. Each of these gaps is cheaper to fix at design time than during an Article 92 evaluation by the Commission.
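
Two of these pitfalls lend themselves to lightweight tooling. The sketch below shows a crude word n-gram overlap check as a first-pass signal of benchmark leakage, and a deterministic fingerprint of an evaluation run record so results can be tied to an exact configuration later. Both are illustrative assumptions rather than a prescribed method: real contamination analysis is considerably more involved, and the record fields are hypothetical.

```python
import hashlib
import json


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-cased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(benchmark_item: str, training_sample: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_sample, n)) / len(item_grams)


def run_fingerprint(run_record: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(run_record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    question = "which reagent is required to complete the synthesis described in step four"
    leaked = "forum post: " + question + " answer below"
    print("overlap:", contamination_ratio(question, leaked, n=5))
    # Hypothetical run record fields.
    record = {"model": "model-x-rc2", "suite": "cyber_v3", "elicitation": "fine_tuning", "seed": 7}
    print("fingerprint:", run_fingerprint(record))
```

Storing the fingerprint alongside results gives a cheap way to confirm, months later, which configuration produced a reported score.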

Providers should also resist treating evaluation as adversarial to product teams. The same capability-elicitation infrastructure that supports compliance — scaffolds, benchmark harnesses, regression suites — doubles as engineering tooling for model quality, and the providers with the most credible safety reports tend to be those who built one shared pipeline rather than a parallel compliance bureaucracy.

A Concrete Example

A frontier developer preparing a new model above the compute threshold runs a staged programme. Twelve weeks before launch, internal evaluations benchmark the model on biology, cyber and autonomy suites, with elicitation via fine-tuning. Findings in the cyber category exceed the framework's pre-defined attention threshold, triggering deeper external testing by a contracted security group. Mitigations — refusal training and tool-use restrictions — are applied and retested. The final safety report records methods, elicitation levels, results, mitigations and residual risk, and the Annex XI documentation is updated. After launch, automated adversarial pipelines run continuously, and a mid-cycle model update with new browsing tools triggers a scoped reassessment.
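
The escalation step in this example can be expressed as a simple gate that compares each category's finding with a pre-defined attention threshold. The sketch below is a hypothetical illustration: the class, the decision labels, the scores and the thresholds are assumptions and do not represent any real framework's criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryFinding:
    category: str
    score: float                 # result of internal evaluation, hypothetical 0-1 scale
    attention_threshold: float   # pre-defined in the provider's framework


def release_gate(findings: list[CategoryFinding]) -> tuple[str, list[str]]:
    """Escalate when any category's score exceeds its attention threshold."""
    flagged = [f.category for f in findings if f.score > f.attention_threshold]
    if flagged:
        return "escalate", flagged
    return "proceed", []


if __name__ == "__main__":
    findings = [
        CategoryFinding("biology", score=0.22, attention_threshold=0.40),
        CategoryFinding("cyber", score=0.55, attention_threshold=0.40),
        CategoryFinding("autonomy", score=0.10, attention_threshold=0.30),
    ]
    decision, categories = release_gate(findings)
    print(decision, categories)
```

Fixing the thresholds before the evaluation runs is what makes the gate credible: the decision follows from the recorded numbers rather than from launch pressure.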

Action Plan

If your model is, or may soon be, classified as presenting systemic risk, stand up the evaluation framework before the training run finishes — the two-week notification clock under Article 52 and the realities of pre-launch testing leave no slack. Use the Code of Practice Safety and Security chapter as the blueprint, document everything in Annex XI structure, and treat adversarial testing as a continuous discipline rather than a launch ritual.
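
For planning purposes, the two-week window can be tracked with a simple calendar calculation. The sketch below assumes the clock starts on the day the provider knows the threshold is or will be met, and it adds fourteen calendar days; how the period is formally computed is an assumption here, and the function name and date are hypothetical.

```python
from datetime import date, timedelta


def notification_deadline(threshold_known_on: date) -> date:
    """Planning deadline: two calendar weeks after the trigger date (an assumption)."""
    return threshold_known_on + timedelta(weeks=2)


if __name__ == "__main__":
    # Hypothetical date on which projected training compute is known to cross the threshold.
    known_on = date(2026, 3, 2)
    print("notify the Commission by:", notification_deadline(known_on).isoformat())
```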

For organisations starting from zero, a realistic build order is: adopt the framework document and risk taxonomy in the first quarter; assemble internal benchmark and red-team capability in the second; contract external evaluators and run a full pre-release cycle on the next model in the third. Budget honestly — credible frontier evaluation programmes consume meaningful compute and senior engineering time — and remember that the obligation is continuous: every significant model update, new tool integration or new modality reopens the question of whether the last assessment still describes the model your users actually have.

The discipline also compounds. Each documented evaluation cycle builds the baseline against which the next model's risks are measured, each red-team exercise enriches the attack library that automated pipelines replay, and each safety report sharpens the framework's acceptance criteria. Providers who started early report that the second full cycle costs roughly half the first — which is the strongest practical argument for beginning before classification forces the issue. The earliest movers also report a hiring insight worth borrowing: evaluation engineers with adversarial instincts are scarcer than compliance generalists, and recruiting them ahead of classification is what separates a credible first safety report from a rushed one.


This article is for informational purposes only and does not constitute legal advice. Regulatory requirements change frequently — verify current rules with official sources. Built by Sawai Gyoseishoshi Office, Hiroshima, Japan.