Key Definitions
| Term | Definition |
|---|---|
| Data Governance | The organizational framework of policies, processes, standards, and metrics that ensures data is managed as a strategic asset, with defined ownership, quality standards, and usage controls throughout its lifecycle. |
| Training Data | The dataset used to train a machine learning model, from which the model learns patterns, relationships, and decision rules that it will later apply to new data. |
| Validation Data | A dataset held separate from training data, used during model development to tune hyperparameters and evaluate model performance without contaminating the training process. |
| Testing Data | A dataset held entirely separate from training and validation data, used only after model development is complete to provide an unbiased estimate of model performance. |
| Data Lineage | The documented history of data as it moves through an organization's systems — its origins, transformations, movements, and downstream uses — enabling traceability and impact analysis. |
| Data Quality | The degree to which data is accurate, complete, consistent, timely, valid, and fit for its intended use in AI system training, validation, testing, or operation. |
| Data Bias | Systematic distortions in data that can cause AI systems to produce unfair, discriminatory, or inaccurate outputs, arising from collection methods, historical patterns, measurement errors, or representation gaps. |
| Data Minimization | A GDPR principle (Article 5(1)(c)) requiring that personal data be adequate, relevant, and limited to what is necessary for the purposes for which it is processed. |
| Data Protection Impact Assessment (DPIA) | A structured assessment required by GDPR Article 35 when data processing is likely to result in a high risk to the rights and freedoms of individuals. |
| Synthetic Data | Artificially generated data that mimics the statistical properties of real data, used as an alternative to real-world data for AI training when actual data is unavailable, insufficient, or raises privacy concerns. |
| Data Subject Rights | The rights granted to individuals under GDPR regarding their personal data, including rights of access, rectification, erasure, restriction, portability, and objection. |
| Anonymization | The irreversible processing of personal data such that the individual can no longer be identified, directly or indirectly, by any means reasonably likely to be used. Data anonymized to this standard falls outside the scope of GDPR. |
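
The separation between training, validation, and testing data defined in the table above can be illustrated with a minimal Python sketch. The function name `three_way_split`, the 70/15/15 proportions, and the fixed seed are assumptions chosen for illustration, not values taken from any regulation; real projects may need stratified or time-based splits instead.

```python
import random


def three_way_split(records, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then cut into disjoint training, validation, and testing sets."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible for later audits.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                  # touched only after development ends
    val = shuffled[n_test:n_test + n_val]     # used for tuning during development
    train = shuffled[n_test + n_val:]         # used to fit the model
    return train, val, test


# Hypothetical record identifiers standing in for real rows.
records = list(range(100))
train, val, test = three_way_split(records)
```

Because each record lands in exactly one of the three slices, no testing record can leak into training or tuning, which is the property the definitions above depend on.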
Chapter 1: Why Data Governance Is Central to AI Compliance
Data is the foundation of every AI system: the quality, representativeness, and governance of data directly determine whether an AI system is accurate, fair, and compliant. The EU AI Act recognizes this by dedicating Article 10 entirely to data and data governance requirements for high-risk AI systems. Organizations that fail to govern AI data effectively cannot achieve AI compliance, regardless of how well they manage other aspects of governance. Data governance is not a supporting function for AI compliance; it is the core requirement.
1-1. The Data-AI Compliance Connection
The EU AI Act establishes a direct legal link between data governance and AI system compliance:
- Article 10(1): High-risk AI systems that use techniques involving the training of AI models with data shall be developed on the basis of training, validation, and testing data sets that meet the quality criteria set out in paragraphs 2 to 5.
- Article 10(2): Training, validation, and testing data sets shall be subject to data governance and management practices appropriate for the intended purpose of the AI system.
- Article 10(3): Training, validation, and testing data sets shall be relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose.
- Article 10(5): To the extent that it is strictly necessary for ensuring bias detection and correction, the providers of high-risk AI systems may exceptionally process special categories of personal data (GDPR Article 9) subject to appropriate safeguards.
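These provisions mean that organizations cannot simply purchase an AI tool and assume that data governance is solely the provider's problem. Deployers who supply their own data or fine-tune AI systems take on direct data governance obligations. For example, Article 26(4) requires a deployer that exercises control over input data to ensure that the data is relevant and sufficiently representative in view of the system's intended purpose.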
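
Two of the Article 10(3) criteria, completeness and representativeness, lend themselves to simple automated checks. The sketch below is one possible approach, not a method prescribed by the Act, which sets no numeric thresholds. The function names, the `age_band` field, the reference shares, and the 5-point tolerance are all hypothetical.

```python
from collections import Counter


def completeness(rows, required_fields):
    """Share of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    ok = sum(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )
    return ok / len(rows)


def representation_gaps(rows, group_field, reference_shares, tolerance=0.05):
    """Groups whose share in the data deviates from a reference share by more than tolerance."""
    counts = Counter(row.get(group_field) for row in rows)
    total = len(rows)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps


# Hypothetical sample: ten rows, one with a missing income value.
rows = (
    [{"age_band": "18-34", "income": 40000}] * 5
    + [{"age_band": "18-34", "income": None}]
    + [{"age_band": "35-64", "income": 55000}] * 3
    + [{"age_band": "65+", "income": 28000}]
)
# Assumed population shares the dataset is meant to reflect.
reference = {"18-34": 0.4, "35-64": 0.5, "65+": 0.1}

score = completeness(rows, ["age_band", "income"])
gaps = representation_gaps(rows, "age_band", reference)
```

In this sample, nine of ten rows are complete, and the check flags the 18-34 band as over-represented and the 35-64 band as under-represented. What counts as an acceptable gap is a governance decision that depends on the system's intended purpose.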
1-2. The Cost of Poor Data Governance
Poor data governance in AI contexts creates cascading problems:
| Data Issue | AI Impact | Business Impact | Regulatory Impact |
|---|---|---|---|
| Inaccurate data | Incorrect model predictions | Wrong decisions; customer harm | Non-compliance with Art. 10(3) |
| Biased data | Discriminatory outputs | Discrimination claims; reputational damage | Fundamental rights violations |
| Incomplete data | Model underperformance for underrepresented groups | Unequal service quality | Art. 10(3) representativeness failure |
| Outdated data | Model drift; degraded accuracy | Increasingly wrong decisions over time | Art. 10 ongoing data governance failure |
| Poorly documented data | Inability to explain model behavior | Audit failures; regulatory inquiries | Art. 10(2) documentation requirement |
| Uncontrolled data access | Privacy violations; data leakage | GDPR fines; trust erosion | GDPR and Art. 10 safeguard failures |
1-3. Data Governance vs. Data Management
Data governance and data management are related but distinct:
Data Governance (strategic) defines:
- Who is responsible for data
- What standards data must meet
- Which rules govern data use
- How compliance is verified
Data Management (operational) implements:
- How data is collected, stored, and processed
- How data quality is measured and maintained
- How data access is controlled
- How data is backed up and protected
Both are necessary. Governance without management is theoretical; management without governance is directionless.
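
One way to picture the split is to treat governance as a declared policy and management as the code that checks concrete data use against it. The sketch below assumes a hypothetical `DatasetPolicy` record and `enforce` function; the owner name, threshold, and purposes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetPolicy:
    """Governance: who is responsible and which standards and rules apply."""
    owner: str
    min_completeness: float
    allowed_purposes: frozenset


def enforce(policy, measured_completeness, purpose):
    """Management: check one concrete use of a dataset against the policy."""
    violations = []
    if measured_completeness < policy.min_completeness:
        violations.append(
            f"completeness {measured_completeness:.2f} "
            f"below required {policy.min_completeness:.2f}"
        )
    if purpose not in policy.allowed_purposes:
        violations.append(f"purpose '{purpose}' not permitted")
    return violations


# Hypothetical policy values set by the governance function.
policy = DatasetPolicy(
    owner="credit-risk-data-steward",
    min_completeness=0.98,
    allowed_purposes=frozenset({"model_training", "model_validation"}),
)

# A request that breaks both rules yields two violations.
violations = enforce(policy, measured_completeness=0.95, purpose="marketing")
```

Without the policy, `enforce` has nothing to check against; without `enforce`, the policy is never applied. That mirrors the point that governance and management each depend on the other.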