How do GPAI providers complete the training data summary required by the EU AI Act?

Providers must use the template the AI Office published on July 24, 2025. It has three parts: general information about the provider, model and data modalities; a list of data sources by category, including main domain names for scraped data; and a description of data processing aspects such as opt-out compliance and removal of illegal content.

Does the training data summary obligation apply to open-source models?

Yes. The open-source exemption in Article 53(2) covers only the technical documentation and downstream information duties in Article 53(1)(a) and (b). The public training content summary under Article 53(1)(d) and the copyright policy under Article 53(1)(c) apply to all GPAI providers without exception.

Must providers disclose every dataset used for training?

No. The summary must be generally comprehensive about categories and main sources of training content, but it does not require listing every dataset or individual work, and it is designed to avoid forcing disclosure of trade secrets such as exact data mixtures.

What is the deadline for models placed on the market before August 2, 2025?

Under Article 111(3), providers of GPAI models placed on the EU market before August 2, 2025 must comply with Chapter V obligations, including the training data summary, by August 2, 2027. Where historical corpus information is no longer retrievable, providers should disclose the best available information and state the limitation.

Where should the training data summary be published?

The regulation requires the summary to be made publicly available. In practice, providers publish it on the model's product page, developer documentation portal or repository listing, keeping it versioned and updated when training content changes materially.

Quick answer

Article 53(1)(d) of the EU AI Act requires every provider of a general-purpose AI model to publish a sufficiently detailed summary of the content used to train the model, following the mandatory template the AI Office published in July 2025. The obligation applies to all GPAI providers, including open-source ones.

Updated June 2026 · MmowW AI Compliance

EU AI Act Training Data Summary: How to Complete the AI Office Template

The Obligation in Brief

Among the four baseline duties that Article 53 of Regulation (EU) 2024/1689 places on providers of general-purpose AI models, the training data summary is the most visible. Article 53(1)(d) requires providers to draw up and make publicly available a sufficiently detailed summary of the content used for training, according to a template provided by the AI Office. The AI Office published that template on July 24, 2025, shortly before the GPAI obligations became applicable on August 2, 2025.

The purpose, explained in Recital 107, is to help parties with legitimate interests — most prominently copyright holders — to exercise and enforce their rights. The summary is therefore not a technical artefact for regulators; it is a public transparency document intended to be generally comprehensive in its description of data sources rather than technically exhaustive at the level of individual works.

What the Template Requires

The AI Office template organises the summary into three parts:

General information: identification of the provider and the model (including versions covered), the modalities of the training data such as text, images, audio or video, and overall data size per modality expressed in broad ranges.
List of data sources: the categories of sources used — publicly available datasets, commercial datasets licensed from third parties, data crawled and scraped from the internet, user data, synthetic data, and other sources. For scraped data, the template requires identification of the main domain names from which content was collected, covering the most significant share of the scraped corpus, with a lighter requirement for small and medium-sized enterprises.
Relevant data processing aspects: a description of measures taken to respect reservations of rights under the text and data mining regime before and during data collection, and measures to remove illegal content such as child sexual abuse material from training data.

The template is deliberately calibrated. It demands more than vague statements that the model was trained on publicly available data, but it does not require providers to list every dataset, disclose trade secrets, or reveal precise data mixtures that constitute competitive know-how.
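The three parts described above can be mirrored in a small internal data model. The following Python sketch is one possible way to do that and is not the AI Office's own schema: every class name, field name and category identifier is hypothetical, chosen to follow the wording of this article. The gaps method flags parts that were left unaddressed, on the assumption that an unused source category should be marked as unused rather than silently omitted.

```python
from dataclasses import dataclass, field

# Source categories as named in this article; the identifiers are hypothetical.
SOURCE_CATEGORIES = (
    "public_datasets",
    "licensed_datasets",
    "web_scraped",
    "user_data",
    "synthetic_data",
    "other",
)


@dataclass
class GeneralInformation:
    provider: str
    model_versions: list[str]
    modalities: list[str]                    # e.g. ["text", "image"]
    size_range_by_modality: dict[str, str]   # broad range labels, not exact counts


@dataclass
class SourceEntry:
    used: bool                               # False records "not applicable" explicitly
    description: str = ""
    main_domains: list[str] = field(default_factory=list)  # scraped data only


@dataclass
class ProcessingAspects:
    tdm_opt_out_measures: str
    illegal_content_measures: str


@dataclass
class TrainingSummary:
    general: GeneralInformation
    sources: dict[str, SourceEntry]
    processing: ProcessingAspects

    def gaps(self) -> list[str]:
        """Return human-readable problems; an empty list means none were found."""
        problems = []
        for modality in self.general.modalities:
            if modality not in self.general.size_range_by_modality:
                problems.append(f"no size range for modality '{modality}'")
        for category in SOURCE_CATEGORIES:
            entry = self.sources.get(category)
            if entry is None:
                problems.append(f"category '{category}' is not addressed")
            elif entry.used and not entry.description:
                problems.append(f"category '{category}' lacks a description")
        scraped = self.sources.get("web_scraped")
        if scraped and scraped.used and not scraped.main_domains:
            problems.append("web_scraped is used but lists no main domains")
        return problems


# Illustrative draft with deliberately missing pieces (all values are made up).
draft = TrainingSummary(
    general=GeneralInformation(
        provider="Example Provider",
        model_versions=["example-model-1"],
        modalities=["text", "image"],
        size_range_by_modality={"text": "range label taken from the template"},
    ),
    sources={
        "web_scraped": SourceEntry(used=True, description="Proprietary crawl"),
        "user_data": SourceEntry(used=False),
    },
    processing=ProcessingAspects(
        tdm_opt_out_measures="Crawler honours machine-readable opt-outs",
        illegal_content_measures="Filtering applied before training",
    ),
)

for problem in draft.gaps():
    print(problem)
```

A check like this only tests whether each part has been addressed. Whether the content is detailed enough is still a judgement to make against the template itself.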

Who Must Publish a Summary

Every provider of a GPAI model placed on the EU market must publish the summary. Unlike the technical documentation duties, there is no open-source exemption: Article 53(2) only relieves qualifying open-source providers from the obligations in points (a) and (b) of Article 53(1), never from the training summary in point (d) or the copyright policy in point (c). Providers of models with systemic risk must publish it as well.

For models placed on the market before August 2, 2025, the transition rule in Article 111(3) gives providers until August 2, 2027. The AI Office has acknowledged that for some older models, complete information about historical training corpora may no longer be retrievable; in such cases providers are expected to provide the best information available and state the limitation transparently in the summary, rather than omit the model.
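For a portfolio with several models, the transition rule can be encoded as a small date check. The Python sketch below rests on one simplifying assumption that goes beyond the text above: a model placed on the market on or after August 2, 2025 is treated as needing its summary from the day of placement. The function and constant names are hypothetical.

```python
from datetime import date

GPAI_APPLICATION_DATE = date(2025, 8, 2)  # GPAI obligations became applicable
LEGACY_DEADLINE = date(2027, 8, 2)        # Article 111(3) transition deadline


def summary_due_date(placed_on_market: date) -> date:
    """Date by which the public training summary should exist for a model."""
    if placed_on_market < GPAI_APPLICATION_DATE:
        # Legacy model: the transition period in Article 111(3) applies.
        return LEGACY_DEADLINE
    # Assumption of this sketch: no transition period, due at placement.
    return placed_on_market


print(summary_due_date(date(2024, 11, 15)))  # legacy model
print(summary_due_date(date(2026, 3, 1)))    # model placed after application date
```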

Common Drafting Questions

How detailed is sufficiently detailed?

The benchmark is the template itself. A summary that completes every applicable field of the template at the level of granularity the template asks for is the safest position. Where a field genuinely does not apply — for example, no user data was used — the summary should say so explicitly.

Does one summary cover multiple models?

A single summary may cover several model versions where the training content is materially the same. A significant change in training data — a new pre-training corpus, a major new data modality — calls for an updated summary. Providers should version and date their summaries.

What about fine-tuned models?

An entity that modifies an existing GPAI model and thereby becomes a provider in its own right must publish a summary covering the content used for the modification, such as the fine-tuning dataset, not the original provider's pre-training corpus. The obligations of a downstream modifier are limited to the modification it performs.

Practical Compliance Steps

Assign ownership early. The summary sits at the intersection of data engineering, intellectual property and communications, and someone must own its accuracy.
Build the inventory from pipeline records: dataset manifests, crawl logs and licensing agreements are the raw inputs for the source list.
Express volumes in the ranges the template uses rather than inventing your own precision.
Describe opt-out compliance concretely: which protocols your crawlers honour, since when, and how licensed and public-domain sources are handled.
Publish the summary where the model is distributed — on the model page, developer portal or repository listing — and keep prior versions accessible.
Synchronise the summary with your Annex XI documentation and your copyright policy so that the three documents tell one consistent story.
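Two of the steps above, building the source list from crawl logs and expressing volumes in ranges, lend themselves to a short script. The Python sketch below assumes crawl logs can be reduced to (url, byte_count) pairs. The top_fraction parameter and the bucket labels are hypothetical placeholders: the share of domains to list and the size ranges should be taken from the template, not from this example.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def bytes_per_domain(crawl_records):
    """Sum collected bytes per hostname from (url, byte_count) records."""
    totals = Counter()
    for url, byte_count in crawl_records:
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host:
            totals[host] += byte_count
    return totals


def main_domains(totals, top_fraction):
    """Return the largest domains by collected bytes, biggest first.

    top_fraction is a hypothetical parameter; read the required share
    from the template rather than from this sketch.
    """
    ranked = [domain for domain, _ in totals.most_common()]
    if not ranked:
        return []
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[:keep]


def size_bucket(value, buckets, overflow_label):
    """Map an exact count to a broad range label.

    buckets is a list of (exclusive_upper_bound, label) pairs in ascending
    order; overflow_label covers anything at or above the last bound.
    """
    for upper_bound, label in buckets:
        if value < upper_bound:
            return label
    return overflow_label


# Made-up crawl records for illustration.
records = [
    ("https://news.example.com/a", 500),
    ("https://news.example.com/b", 300),
    ("https://blog.example.org/x", 200),
    ("https://tiny.example.net/", 10),
]
totals = bytes_per_domain(records)
print(main_domains(totals, top_fraction=0.5))

# Placeholder labels, not the template's actual ranges.
example_buckets = [(1_000, "small"), (1_000_000, "medium")]
print(size_bucket(sum(totals.values()), example_buckets, "large"))
```

Ranking by collected bytes is one possible measure of a domain's share of the corpus. A provider might rank by documents or tokens instead, provided the published summary follows the measure the template specifies.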

Why the Summary Matters Beyond Compliance

The training data summary is rapidly becoming the reference document in disputes between rightsholders and model providers. Collecting societies, publishers and image libraries read published summaries to decide whether their catalogues were likely used and whether their opt-outs were honoured. A summary that is materially inaccurate exposes the provider on two fronts at once: under the AI Act, because the AI Office can request the underlying Annex XI documentation and compare it with the public summary, and under copyright law, because the summary can be cited in infringement proceedings. Conversely, a careful summary that documents opt-out compliance and licensing arrangements is a useful exhibit for the defence.

The summary also has commercial weight. Enterprise customers performing AI due diligence increasingly ask for it before signing, because their own downstream obligations — and their indemnity negotiations — depend on the provenance of the model's training content. Several procurement frameworks in regulated sectors now list a template-conformant training summary among standard supplier evidence.

Enforcement and Supervision

Supervision of GPAI obligations is centralised at the European Commission, acting through the AI Office, rather than with national authorities. From August 2, 2026, Article 101 allows the Commission to fine GPAI providers up to 3 percent of total worldwide annual turnover in the preceding financial year or 15 million euros, whichever is higher, for infringements of Chapter V, including failure to publish a training summary or publishing one that does not meet the template's requirements. The AI Office can also request information under Article 91 and require corrective measures under Article 93. During the first months of application the AI Office signalled a cooperative approach for providers making good-faith efforts, but the formal powers are in place and complaints from rightsholders are an obvious trigger for scrutiny.

A Concrete Example

A provider releases a text-and-image model trained on a mixture of licensed news archives, public-domain books, two well-known public research datasets, and a proprietary web crawl. Its summary names the public datasets, describes the licensed archives by category and counterparty type, lists the main domains represented in the crawl as the template requires, states the approximate size range of each modality, and explains that its crawler has honoured machine-readable opt-out signals since a stated date. It also describes the filtering applied to remove illegal content. Nothing in this disclosure reveals the model's data mixture ratios or training recipe.
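The statement that a crawler honours machine-readable opt-out signals is easier to defend when the check itself is simple and testable. The Python sketch below assumes robots.txt is one of the signals the crawler respects; the article does not say which protocols the example provider uses, and other opt-out mechanisms would need their own checks. The user agent name ExampleTrainingBot and the publisher.example host are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a publisher blocks one training crawler.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://publisher.example/articles/1"
print(may_collect(ROBOTS_TXT, "ExampleTrainingBot", article))
print(may_collect(ROBOTS_TXT, "OtherBot", article))
```

Logging the outcome of each check with a timestamp would give the provider evidence for the "since a stated date" claim in its summary.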

Common Pitfalls

Early reviews of published summaries reveal recurring weaknesses. Some providers publish narrative data statements that ignore the template structure; these do not satisfy Article 53(1)(d), which makes the template mandatory. Others complete the general information part but leave the data source list at the level of broad categories without naming the public datasets or main scraped domains the template calls for. A third group forgets versioning: when a model family is refreshed with new training data, the old summary silently becomes inaccurate. Finally, inconsistency between the public summary and the confidential Annex XI documentation is dangerous, because the AI Office can request the latter and compare. The discipline is straightforward — one source of truth about training content, rendered at two levels of detail.
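The "one source of truth, two levels of detail" discipline can be enforced mechanically. The Python sketch below is an illustration under stated assumptions, not a prescribed method: the record fields category, name and name_is_public are hypothetical, as is the idea that the internal inventory marks which source names are meant to appear in the public summary.

```python
def public_view(inventory):
    """Project the detailed internal inventory onto the public level of detail."""
    view = {}
    for record in inventory:
        names = view.setdefault(record["category"], set())
        if record["name_is_public"]:
            names.add(record["name"])
    return view


def inconsistencies(inventory, published):
    """Compare the published summary with what the inventory implies."""
    expected = public_view(inventory)
    issues = []
    for category in sorted(expected.keys() | published.keys()):
        want = expected.get(category)
        have = published.get(category)
        if want is None:
            issues.append(f"{category}: published but absent from inventory")
        elif have is None:
            issues.append(f"{category}: in inventory but not published")
        elif set(have) != want:
            missing = sorted(want - set(have))
            unexpected = sorted(set(have) - want)
            issues.append(
                f"{category}: named sources differ "
                f"(missing {missing}, unexpected {unexpected})"
            )
    return issues


# Made-up inventory and published summary for illustration.
inventory = [
    {"category": "public_datasets", "name": "Example Open Corpus", "name_is_public": True},
    {"category": "licensed_datasets", "name": "Licensed archive A", "name_is_public": False},
    {"category": "web_scraped", "name": "news.example.com", "name_is_public": True},
]
published = {
    "public_datasets": ["Example Open Corpus"],
    "web_scraped": ["news.example.com", "blog.example.org"],
    "synthetic_data": [],
}

for issue in inconsistencies(inventory, published):
    print(issue)
```

Running a comparison like this whenever the inventory changes would also catch the versioning problem, because a refreshed corpus shows up as a difference against the last published summary.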

Action Plan

Download the AI Office template, map each field to an internal data owner, and draft the summary in parallel with, not after, your next model release. For models already on the market, schedule the work well before the applicable deadline. A published, template-conformant summary is one of the most visible signals of EU AI Act compliance a model provider can give, and its absence is just as visible to regulators and rightsholders.

Finally, coordinate the timing of disclosures across your portfolio. If you provide several models, releasing summaries of uneven quality invites questions about the weaker ones, and updating one model's summary while leaving a sibling's untouched signals neglect. A quarterly review cycle covering all published summaries, owned by the same team that maintains Annex XI documentation, keeps the public record consistent with engineering reality.

This article is for informational purposes only and does not constitute legal advice. Regulatory requirements change frequently — verify current rules with official sources. Built by Sawai Gyoseishoshi Office, Hiroshima, Japan.