Quick answer

Article 10 of the EU AI Act requires that training, validation, and testing datasets for high-risk AI systems meet specific quality criteria including relevance, representativeness, and completeness. Providers must examine data for biases, identify data gaps, and implement appropriate data governance and management practices throughout the AI system lifecycle.

Updated June 2026 · MmowW AI Compliance

EU AI Act Data Governance: Article 10 Training Data Requirements (2026) | MmowW

The Role of Data Governance in the EU AI Act

Data quality is the foundation upon which every AI system is built. Regulation (EU) 2024/1689 recognises this through Article 10, which establishes comprehensive data governance requirements for high-risk AI systems. The provision addresses the entire data lifecycle, from initial collection through training, validation, and testing, ensuring that the datasets underpinning high-risk AI meet standards of quality, relevance, and fairness.

Article 10 reflects a fundamental principle: an AI system is only as reliable as the data it learns from. If training data is biased, incomplete, or unrepresentative, the resulting AI system will reproduce and potentially amplify those deficiencies. The consequences are particularly serious for high-risk AI systems, which by definition operate in domains affecting health, safety, or fundamental rights.

Data Governance and Management Practices

Article 10(1) establishes the overarching requirement: high-risk AI systems that use techniques involving the training of AI models with data must be developed on the basis of training, validation, and testing datasets that meet the quality criteria referred to in paragraphs 2 to 5, whenever such datasets are used. This is not optional guidance but a binding obligation for providers.

Article 10(2) specifies that training, validation, and testing datasets must be subject to data governance and management practices appropriate for the intended purpose of the high-risk AI system. These practices must concern, in particular: (a) the relevant design choices; (b) data collection processes and the origin of data, and, for personal data, the original purpose of the data collection; (c) relevant data-preparation processing operations such as annotation, labelling, cleaning, updating, enrichment, and aggregation; (d) the formulation of assumptions, in particular about the information that the data are supposed to measure and represent; (e) an assessment of the availability, quantity, and suitability of the datasets that are needed; (f) examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law; (g) appropriate measures to detect, prevent, and mitigate the biases identified under point (f); and (h) the identification of relevant data gaps or shortcomings that prevent compliance with the Regulation, and how they can be addressed.

These data governance practices are not merely procedural checkboxes. They require substantive engagement with the data at every stage of the pipeline. Providers must understand what their data represents, how it was collected, what assumptions underlie its use, and what biases it may contain.

Statistical Properties and Representativeness

Article 10(3) addresses the statistical characteristics of datasets. Training, validation, and testing datasets must be relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose. These criteria operate together rather than independently. A dataset may be statistically large yet unrepresentative of the population the AI system will serve. A dataset may be representative yet contain systematic errors that undermine reliability.

Article 10(3) further requires datasets to have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used. These characteristics may be met at the level of individual datasets or of a combination of them. In practice, a dataset suitable for one deployment context may be entirely inappropriate for another. A system trained on data from one jurisdiction may not perform adequately when deployed in a different legal, cultural, or demographic environment.

Article 10(4) requires that datasets take into account, to the extent required by the intended purpose, the characteristics or elements that are particular to the specific geographical, contextual, behavioural, or functional setting within which the high-risk AI system is intended to be used. This provision acknowledges that AI systems do not operate in a vacuum. Their performance is shaped by the environments in which they are deployed, and training data must reflect those environments.
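One way to make the representativeness check concrete is to compare each group's share of the dataset with its share of the population the system is intended to serve. The Python sketch below is illustrative only: the function name, the `region` field, the reference shares, and the 5-percentage-point tolerance are all assumptions chosen for the example, not values drawn from the Act.

```python
from collections import Counter


def representation_gaps(records, group_key, reference_shares, tolerance=0.05):
    """Return groups whose dataset share deviates from the reference share
    by more than `tolerance` (an assumed, illustrative threshold)."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = {"observed": round(observed, 3), "expected": expected}
    return gaps


# Hypothetical training records and hypothetical population shares
# for the intended deployment setting.
records = (
    [{"region": "north"}] * 70
    + [{"region": "south"}] * 28
    + [{"region": "islands"}] * 2
)
reference = {"north": 0.50, "south": 0.30, "islands": 0.20}

gaps = representation_gaps(records, "region", reference)
for group, detail in gaps.items():
    print(group, detail)
```

In this made-up example `north` is over-represented and `islands` is under-represented, while `south` falls within the tolerance. Which reference distribution and tolerance are appropriate depends on the intended purpose and deployment setting of the system.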

Bias Examination and Mitigation

Article 10(2)(f) specifically requires examination for possible biases that are likely to affect health and safety, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law. This obligation extends beyond simply checking for obvious demographic imbalances in datasets. It requires a thoughtful analysis of how data characteristics might translate into discriminatory outcomes.

Bias in AI training data can arise from multiple sources. Historical data may reflect past discriminatory practices. Collection methodologies may systematically over-represent or under-represent certain populations. Labelling practices may encode subjective judgements that embed bias into the training signal. Proxy variables may correlate with protected characteristics in ways that are not immediately apparent.

Article 10(2)(g) requires appropriate measures to detect, prevent, and mitigate the biases identified under point (f), but the Act does not prescribe specific technical methods for doing so. Instead, it requires providers to exercise due diligence in identifying and addressing potential biases in a way appropriate to their specific system and context. This flexibility recognises that bias detection and mitigation is an evolving field where best practices continue to develop.
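As a hedged illustration of what a quantitative bias check on training labels might look like, the sketch below computes the rate of positive labels per group and the ratio between the lowest and highest rates. This is one common fairness indicator among many and is not a method prescribed by the Act; the field names `group` and `label` and the sample data are hypothetical.

```python
from collections import defaultdict


def selection_rates(rows, group_key, outcome_key):
    """Share of positive outcomes (truthy labels) per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[outcome_key]:
            positives[row[group_key]] += 1
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Hypothetical labelled training rows.
rows = (
    [{"group": "A", "label": 1}] * 40
    + [{"group": "A", "label": 0}] * 60
    + [{"group": "B", "label": 1}] * 20
    + [{"group": "B", "label": 0}] * 80
)

rates = selection_rates(rows, "group", "label")
ratio = disparity_ratio(rates)
print(rates, ratio)
```

A low ratio does not by itself establish unlawful discrimination, and a high ratio does not rule it out. A metric like this is a prompt for the qualitative analysis of sources such as historical practice, collection method, labelling, and proxy variables.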

Data Gap Identification

Article 10(2)(h) addresses the challenge of data gaps. Providers must identify relevant data gaps or shortcomings that prevent compliance with the Regulation and determine how those gaps and shortcomings can be addressed. This provision acknowledges that perfect datasets may not always be achievable and establishes a pragmatic approach to data quality.

Identifying data gaps requires providers to compare their available data against the requirements of the intended use case. If a high-risk AI system is intended to serve a diverse population, the training data must reflect that diversity. If it will operate across multiple jurisdictions, the data must capture relevant jurisdictional variations. Where gaps are identified, providers must take concrete steps to address them, whether through additional data collection, synthetic data generation, or other appropriate measures.
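The comparison between available data and the intended use case can be sketched as a coverage check: list every combination of attributes the system is expected to serve, then flag combinations with too few records. The strata (`jurisdiction`, `age_band`), the sample records, and the minimum of 50 samples below are assumptions for illustration only.

```python
from collections import Counter
from itertools import product


def coverage_gaps(records, strata, min_samples):
    """List every combination of stratum values that has fewer than
    `min_samples` records, including combinations with none at all."""
    keys = list(strata)
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    gaps = []
    for combo in product(*(strata[k] for k in keys)):
        count = counts.get(combo, 0)
        if count < min_samples:
            gaps.append({"stratum": dict(zip(keys, combo)), "count": count})
    return gaps


# Hypothetical dataset and hypothetical strata the system must cover.
records = (
    [{"jurisdiction": "DE", "age_band": "18-40"}] * 120
    + [{"jurisdiction": "DE", "age_band": "41+"}] * 90
    + [{"jurisdiction": "FR", "age_band": "18-40"}] * 60
)
strata = {"jurisdiction": ["DE", "FR"], "age_band": ["18-40", "41+"]}

gaps = coverage_gaps(records, strata, min_samples=50)
for gap in gaps:
    print(gap)
```

Enumerating the expected combinations up front matters because a gap that has zero records never appears in a simple count of the data that was actually collected.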

The data gap analysis is not a one-time exercise. As AI systems are deployed and post-market monitoring data becomes available under Article 72, new data gaps may be identified that were not apparent during initial development. The continuous iterative nature of the risk management system under Article 9 applies equally to data governance.

Special Categories of Personal Data

Article 10(5) addresses the processing of special categories of personal data, meaning the categories referred to in Article 9(1) of Regulation (EU) 2016/679 (the GDPR), Article 10 of Directive (EU) 2016/680, and Article 10(1) of Regulation (EU) 2018/1725. Providers may exceptionally process such data to the extent that it is strictly necessary for ensuring bias detection and correction in relation to high-risk AI systems under Article 10(2)(f) and (g). This is a carefully circumscribed exception that allows providers to process sensitive data such as racial or ethnic origin, political opinions, religious beliefs, or health data specifically for the purpose of identifying and correcting biases.

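This permission is subject to appropriate safeguards for the fundamental rights and freedoms of natural persons, and applies in addition to the requirements of the data protection legislation itself. Among the conditions in Article 10(5), bias detection and correction must not be effectively achievable by processing other data, including synthetic or anonymised data; the data must be subject to technical limitations on re-use and to state-of-the-art security and privacy-preserving measures, including pseudonymisation; access must be strictly controlled and documented; and the data must be deleted once the bias has been corrected or the retention period has ended, whichever comes first.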
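Pseudonymisation is the one privacy-preserving measure named explicitly here, and a keyed hash is one common way to implement it. The sketch below uses Python's standard `hmac` and `hashlib` modules; the function name, the identifier format, and the inline key are assumptions for the example. In a real deployment the key would be held separately from the data under restricted access.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed HMAC-SHA256 token.

    The same identifier and key always yield the same token, so records
    can still be linked for bias analysis without exposing the identifier.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key for illustration only; never hard-code a real key.
key = b"example-key-kept-outside-the-dataset"

token = pseudonymise("patient-0001", key)
print(token)
```

Pseudonymised data generally remains personal data, because whoever holds the key can re-link the tokens. A measure like this therefore complements the other Article 10(5) safeguards rather than replacing them.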

Practical Steps for Compliance

Organisations developing high-risk AI systems should approach Article 10 compliance systematically. Begin by documenting data provenance, establishing where each dataset originated and how it was collected. Map the data preparation pipeline, recording each transformation step from raw data to training-ready input. Assess statistical properties against the specific requirements of the intended deployment context. Conduct structured bias audits using both quantitative metrics and qualitative assessment. Identify and document data gaps with specific plans for remediation.
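The documentation steps above could be captured in a simple structured record kept alongside each dataset. The following is a minimal sketch rather than a prescribed format: the class names, fields, and example values are all hypothetical, and a real record would need to reflect whatever the provider's technical documentation requires.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PreparationStep:
    operation: str      # e.g. "cleaning", "labelling", "aggregation"
    description: str    # what was done and why


@dataclass
class DatasetRecord:
    name: str
    origin: str
    collection_method: str
    intended_purpose: str
    steps: list[PreparationStep] = field(default_factory=list)
    known_gaps: list[str] = field(default_factory=list)

    def add_step(self, operation: str, description: str) -> None:
        """Append one transformation to the preparation pipeline log."""
        self.steps.append(PreparationStep(operation, description))

    def to_json(self) -> str:
        """Serialise the full record for storage or review."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical dataset used purely as an example.
record = DatasetRecord(
    name="loan-applications-2024",
    origin="internal application forms",
    collection_method="export from case management system",
    intended_purpose="creditworthiness scoring",
)
record.add_step("cleaning", "removed duplicate applications")
record.add_step("labelling", "outcome labels assigned from repayment history")
record.known_gaps.append("few applicants aged under 21")

print(record.to_json())
```

Keeping provenance, preparation steps, and known gaps in one record makes it easier to show later how a dataset reached its training-ready form.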

Article 10(6) covers high-risk AI systems that are developed without techniques involving the training of AI models. For these systems, paragraphs 2 to 5 apply only to the testing datasets. Data governance obligations therefore do not disappear when no model is trained; they attach to the data used to test the system.

Maintaining ongoing data governance requires organisational discipline. Daily operational practices that track data quality, monitor for drift, and flag potential bias issues are more effective than periodic audits alone. The MmowW Trust OS at mmoww.net/ai/app/ supports this kind of operational data governance discipline, helping organisations build systematic daily habits around AI compliance rather than treating it as an episodic exercise.

Data governance under the EU AI Act is ultimately about accountability. Providers must be able to demonstrate that they have taken reasonable, documented steps to ensure their AI systems are built on data that is relevant, representative, and free from harmful biases. This is both a regulatory requirement and a foundation for building AI systems that perform reliably and fairly in the real world.

Content verified against current regulations by Sawai Gyoseishoshi Office.

This article is for informational purposes only and does not constitute legal advice. Regulatory requirements change frequently — verify current rules with official sources. Built by Sawai Gyoseishoshi Office, Hiroshima, Japan.