Quick answer

Article 60 of the EU AI Act allows providers and prospective providers of Annex III high-risk AI systems to test them in real-world conditions outside regulatory sandboxes, before placing them on the market. Testing requires an approved real-world testing plan, registration, informed consent of participants under Article 61, a maximum duration of six months extendable once by six months, and full reversibility of outcomes.

Updated June 2026 · MmowW AI Compliance

EU AI Act Article 60: Real-World Testing of High-Risk AI Outside Sandboxes

Overview: Field Evidence Before Market Placement

Laboratory metrics rarely survive first contact with operational reality, and the EU AI Act acknowledges this. Article 60 creates a legal pathway for providers — and prospective providers — of high-risk AI systems listed in Annex III to conduct testing in real-world conditions outside AI regulatory sandboxes, before the system is placed on the market or put into service. The purpose is evidence: data on how the system performs with real users, real workloads and real edge cases, gathered under safeguards strict enough that the testing itself does not become the harm the regulation exists to prevent. For providers preparing conformity assessment ahead of August 2, 2026, Article 60 is one of the most practically useful provisions in the entire regulation — and one of the most procedurally demanding.

The Core Mechanics

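Testing in real-world conditions runs on five structural elements: an approved real-world testing plan, registration of the testing, informed consent of the subjects under Article 61, a maximum duration of six months (extendable once by a further six months), and full reversibility of the system's outputs.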

Informed Consent: Article 61

The human safeguards are where Article 60 shows its debt to clinical research ethics. Subjects of the testing must give informed consent under Article 61 before participating: they must receive clear, concise information about the nature and objectives of the testing, the conditions of participation, their rights, the possible effects on them, and the arrangements for requesting the reversal or disregarding of the system's outputs. Consent must be dated and documented, and a copy given to the subject or their legal representative. A narrow carve-out in Article 60(4)(i) applies to law enforcement testing where seeking consent would prevent the system from being tested: there, the testing and its outcome must have no negative effect on the subjects, and their personal data must be deleted after the test.

Beyond consent, the regulation builds in protective conditions: subjects who are vulnerable due to age or disability receive appropriate additional protection; participation can be withdrawn at any time without justification and without detriment; subjects may request the immediate and permanent deletion of their personal data; and — a provision with real operational bite — the predictions, recommendations or decisions of the AI system under test must be capable of being effectively reversed and disregarded. A recruitment system under test cannot quietly reject real candidates; its outputs must remain advisory and reversible throughout the test.

Oversight, Incidents and Stopping Rules

The provider must designate and train those overseeing the testing, monitor it effectively, and remain ready to suspend or terminate. Market surveillance authorities hold inspection powers: they may request information, conduct unannounced checks, and require modification, suspension or termination of testing where conditions are breached or where risks emerge. Any serious incident identified during testing must be reported to the national market surveillance authority in accordance with Article 73, and the provider must adopt immediate mitigation measures or, failing that, suspend the testing until mitigation takes place or terminate it. Article 60 also requires that testing not begin before approval and registration, and that subjects not be selected in ways that undermine the protective purpose: testing on people who are unaware, or recruiting only those least likely to complain, defeats the design and exposes the provider to enforcement.

Who Should Use Article 60 — and Who Should Not

Article 60 fits providers with a near-final Annex III system that needs operational evidence: accuracy under realistic load, human-AI interaction patterns, failure modes invisible in curated datasets. It suits employment screening tools tested with consenting applicant cohorts, triage systems shadowing real emergency workflows with reversible outputs, and educational assessment tools piloted in consenting institutions. It is the wrong instrument for early-stage development — that belongs in a regulatory sandbox under Article 57, with its guidance and data-processing basis — and unnecessary for systems that can be fully validated on historical or synthetic data. It is also distinct from deployer-led piloting after market placement: Article 60 governs the pre-market window, which is exactly what makes it valuable for conformity evidence.

Practical Steps

  1. Decide the instrument: sandbox for unresolved design and legal questions, real-world testing for operational evidence on a mature system — or sequence the two
  2. Draft the testing plan early against the Commission's implementing act template, with explicit stopping rules, reversal mechanisms and subject protection measures
  3. Build the consent pipeline: information sheets in plain language, documented and dated consent, withdrawal and deletion workflows that actually function
  4. Engineer reversibility before the test starts: every output of the system under test must be tagged, traceable and capable of being disregarded without residue in downstream systems
  5. Register the testing, calendar the six-month clock, and pre-draft the extension notification in case it is needed
  6. Capture results in the structure of Annex IV technical documentation so the evidence flows directly into conformity assessment
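
Step 4 can be made concrete with a small sketch. The Python below is one possible way to tag every advisory output so that it can later be traced and disregarded; it is an illustration under stated assumptions, not a prescribed design. The names AdvisoryOutput and OutputLedger, the field layout and the in-memory storage are all hypothetical, and a real test would need durable, audited storage and propagation of the disregard flag into downstream systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class AdvisoryOutput:
    """One advisory output of the system under test (hypothetical schema)."""
    subject_id: str
    recommendation: str
    output_id: str = field(default_factory=lambda: uuid4().hex)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    disregarded: bool = False


class OutputLedger:
    """In-memory ledger of tagged outputs; illustrative only."""

    def __init__(self) -> None:
        self._outputs: dict[str, AdvisoryOutput] = {}

    def record(self, subject_id: str, recommendation: str) -> AdvisoryOutput:
        """Tag a new output with a unique ID so it stays traceable."""
        out = AdvisoryOutput(subject_id=subject_id, recommendation=recommendation)
        self._outputs[out.output_id] = out
        return out

    def disregard_subject(self, subject_id: str) -> list[str]:
        """Mark all outputs for a subject as disregarded; return the affected IDs."""
        affected = []
        for out in self._outputs.values():
            if out.subject_id == subject_id and not out.disregarded:
                out.disregarded = True
                affected.append(out.output_id)
        return affected

    def active_outputs(self) -> list[AdvisoryOutput]:
        """Outputs that downstream consumers may still use."""
        return [o for o in self._outputs.values() if not o.disregarded]
```

The returned list of affected IDs is the useful part in practice: it gives the team a checklist of outputs to chase through downstream systems when a subject asks for reversal.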

Concrete Example

A provider has built an AI system that prioritises emergency calls for a regional dispatch centre — Annex III point 5(d). Bench testing on historical call data is complete, but the conformity case needs evidence of live performance and operator interaction. Under Article 60, the provider files a testing plan with the national market surveillance authority: six months of shadow operation in two dispatch centres, with the AI's prioritisation displayed to operators as advisory only, every recommendation reversible, dispatch decisions remaining fully human, consent obtained from the participating operators, and arrangements approved for the handling of caller data. The system's recommendations are logged against actual outcomes, producing exactly the accuracy and human-oversight evidence Articles 14 and 15 demand. Two serious mismatches between AI prioritisation and clinical outcome are detected, reported, analysed and fixed — before the system ever holds real authority.
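
The logging of recommendations against actual outcomes in this example could be summarised along the following lines. This is a minimal sketch, assuming each shadowed call yields three comparable priority values; the ShadowRecord fields and the summarise function are hypothetical, and the example does not describe how any real dispatch system stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    """One shadowed emergency call (hypothetical fields)."""
    call_id: str
    ai_priority: int        # priority suggested by the system under test
    operator_priority: int  # priority the human operator actually assigned
    outcome_priority: int   # priority judged correct in hindsight


def summarise(records: list[ShadowRecord]) -> dict:
    """Compare AI suggestions with hindsight outcomes and operator choices."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarise")
    ai_correct = sum(r.ai_priority == r.outcome_priority for r in records)
    operator_agreed = sum(r.operator_priority == r.ai_priority for r in records)
    mismatched = [r.call_id for r in records if r.ai_priority != r.outcome_priority]
    return {
        "ai_accuracy": ai_correct / n,
        "operator_agreement": operator_agreed / n,
        "mismatched_calls": mismatched,
    }
```

The mismatched call IDs feed incident analysis, while the operator agreement rate gives a rough signal of how heavily operators lean on the advisory output.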

Action Before August 2, 2026

Providers intending to place Annex III systems on the market in late 2026 or 2027 should count backwards: six months of testing, preceded by plan approval cycles and consent infrastructure, preceded by reversibility engineering. That arithmetic puts plan drafting in the immediate present for many roadmaps. Watch for the Commission implementing acts specifying the testing plan elements, monitor how your national market surveillance authority handles approval in practice, and treat the registration and consent records as permanent compliance assets — they will be examined whenever the system's market history is reviewed. Real-world testing done properly is slower than informal piloting; it is also the difference between evidence a regulator accepts and anecdotes a regulator investigates.
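
The backwards count can be sketched in a few lines of Python. Only the six-month testing window comes from the regulation as described above; the approval and engineering lead times are hypothetical planning inputs, and the sketch assumes, for simplicity, that testing ends on the target market date with no separate allowance for conformity assessment.

```python
import calendar
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def backplan(
    target_market_date: date,
    testing_months: int = 6,   # maximum initial testing window
    approval_months: int = 2,  # assumption: time allowed for plan approval
    build_months: int = 3,     # assumption: consent and reversibility engineering
) -> dict[str, date]:
    """Work backwards from the market date to the latest start of each phase."""
    testing_start = shift_months(target_market_date, -testing_months)
    plan_submission = shift_months(testing_start, -approval_months)
    engineering_start = shift_months(plan_submission, -build_months)
    return {
        "engineering_start": engineering_start,
        "plan_submission": plan_submission,
        "testing_start": testing_start,
    }
```

With these illustrative lead times, a target date of 15 January 2027 puts the start of engineering work in mid-February 2026, eleven months earlier.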

Common Pitfalls Observed in Early Testing Programmes

Early adopters of structured pre-market testing report recurring failure patterns worth designing against. The first is consent decay: participants consent at the start of a six-month test, but staff turnover, shift changes and organisational drift mean that by month four, people are interacting with the system who never signed anything — consent management must be continuous, not ceremonial. The second is reversibility theatre: outputs are formally advisory, but interface design nudges operators into treating recommendations as decisions, which both contaminates the evidence and undermines the protective premise; testing plans should include measurement of actual reliance, not just a policy statement. The third is scope creep: a system under test acquires new features mid-test, quietly invalidating the approved plan — change control during the testing window must be as strict as in any clinical protocol, with material changes notified to the authority. The fourth is data residue: subjects exercise their deletion rights, but copies persist in analytics pipelines and backups; deletion workflows need to be engineered and tested before the first subject enrols. None of these pitfalls is exotic, and all of them are visible to an inspecting authority reviewing logs. Providers who instrument their testing programme to detect these patterns internally — before the regulator does — convert Article 60 from a procedural hurdle into what it was designed to be: the cheapest available source of truth about how a high-risk system actually behaves among real people.
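
Consent decay in particular lends itself to an automated check. The sketch below cross-references an interaction log with a consent register and flags any interaction not covered by valid consent. The data shapes, the Consent tuple and the function name are assumptions made for illustration, and the choice to flag interactions on the withdrawal date itself is a deliberately conservative one.

```python
from datetime import date
from typing import NamedTuple, Optional


class Consent(NamedTuple):
    """Consent register entry (hypothetical shape)."""
    given_on: date
    withdrawn_on: Optional[date] = None


def unconsented_interactions(
    consents: dict[str, Consent],
    interactions: list[tuple[str, date]],
) -> list[tuple[str, date, str]]:
    """Return (person, day, reason) for each interaction lacking valid consent."""
    flagged = []
    for person_id, day in interactions:
        consent = consents.get(person_id)
        if consent is None:
            flagged.append((person_id, day, "no consent on record"))
        elif day < consent.given_on:
            flagged.append((person_id, day, "interaction predates consent"))
        elif consent.withdrawn_on is not None and day >= consent.withdrawn_on:
            flagged.append((person_id, day, "consent withdrawn"))
    return flagged
```

Run daily against the system's access logs, a check like this surfaces new staff who have started using the system without signing, before an inspecting authority finds them in the same logs.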

This article is for informational purposes only and does not constitute legal advice. Regulatory requirements change frequently — verify current rules with official sources. Built by Sawai Gyoseishoshi Office, Hiroshima, Japan.