What is Data Governance?
Data Governance is the set of policies, roles, and processes an organization uses to manage its data. Data exists in different formats and across different layers. The main purposes of Data Governance are to protect data (by restricting access), maintain data quality, and maintain a golden copy of the data, that is, a single trusted version of each record. From raw data to reports, data moves through stages such as staging, integration, data warehousing, and the semantic layer. Throughout this flow, it is essential for any organization to protect the personal information of its users.
Why is Data Governance important?
Golden copy
In any organization, source data comes from many source systems. For example, in banking, user data may come from the online banking application, ATM transactions, teller transactions, checks, etc. Each application may store user data differently, so when integrating data from all these applications, it is essential to maintain a golden copy by removing duplicates, formatting, and standardizing the data. That golden copy then serves as the source of truth for analytics.
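The standardize-then-deduplicate idea above can be sketched in a few lines of Python. This is a minimal illustration, not a full master data solution: the field names, the sample records, and the choice of a normalized email address as the matching key are all assumptions made for the example.

```python
import re

def standardize(record):
    """Normalize one raw customer record into a common shape."""
    return {
        "name": " ".join(record["name"].split()).title(),  # collapse spaces, fix casing
        "email": record["email"].strip().lower(),
        "phone": re.sub(r"\D", "", record["phone"]),        # keep digits only
        "source": record["source"],
    }

def build_golden_copy(records):
    """Merge records that share an email address into one golden record."""
    golden = {}
    for raw in records:
        rec = standardize(raw)
        source = rec.pop("source")
        key = rec["email"]  # assumed matching key for this sketch
        if key not in golden:
            golden[key] = {**rec, "sources": [source]}
            continue
        merged = golden[key]
        # Keep the first non-empty value seen for each field.
        for field in ("name", "phone"):
            if not merged[field] and rec[field]:
                merged[field] = rec[field]
        if source not in merged["sources"]:
            merged["sources"].append(source)
    return list(golden.values())

# Hypothetical records from three source systems.
raw_records = [
    {"name": "  jane  DOE ", "email": "Jane.Doe@Example.com ",
     "phone": "(555) 010-1234", "source": "online_banking"},
    {"name": "Jane Doe", "email": "jane.doe@example.com",
     "phone": "", "source": "atm"},
    {"name": "RAVI KUMAR", "email": "ravi@example.com",
     "phone": "555.010.9876", "source": "teller"},
]

golden_records = build_golden_copy(raw_records)
for record in golden_records:
    print(record)
```

Real systems usually match on more than one attribute and apply survivorship rules to decide which source wins; matching on email alone is only the simplest possible choice.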
Data protection
Data can be stolen in different ways, such as data breaches, phishing, insecure networks, and malware. Protecting users’ personal information is a top priority for organizations. To support this, data must be secured by restricting access, for example based on each employee’s role. Data Governance is how organizations achieve this.
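Role-based restriction of the kind described above might look like the following sketch. The roles, field names, and the in-code permission table are hypothetical; in practice the policy would normally live in a database, an identity provider, or the data platform's own access controls.

```python
# Hypothetical mapping from role to the fields that role may read.
ROLE_PERMISSIONS = {
    "teller": {"customer_id", "name", "account_balance"},
    "analyst": {"customer_id", "account_balance"},
    "admin": {"customer_id", "name", "account_balance", "ssn"},
}

def visible_fields(role, record):
    """Return only the fields of `record` that `role` is allowed to read."""
    allowed = ROLE_PERMISSIONS.get(role, set())  # unknown roles see nothing
    return {field: value for field, value in record.items() if field in allowed}

customer = {
    "customer_id": 101,
    "name": "Jane Doe",
    "account_balance": 2500.0,
    "ssn": "000-00-0000",  # placeholder value, not real data
}

analyst_view = visible_fields("analyst", customer)
guest_view = visible_fields("guest", customer)
print(analyst_view)
print(guest_view)
```

Defaulting unknown roles to an empty permission set means a missing policy entry denies access rather than granting it.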
Different Stakeholders
· Data owners - They define business rules, data quality rules, policies, standards, and regulations for the organization.
· Data stewards - They validate daily data quality checks and collaborate with stakeholders so that data flows run smoothly.
· Data users - People who work with data daily, for example developers who implement data quality rules, ETL developers, QA engineers, and data analysts.
· Data custodians - Typically IT admins and DevOps teams who implement policies and manage security and access.
Pillars of Data Governance
Data Quality
The name is self-explanatory: it’s about maintaining the quality of data. Data quality involves several processes, including data cleaning, removing duplicates, standardizing and formatting data, and integrating data from different sources to maintain a golden copy of data. Business decisions are only as reliable as the data behind them.
Below are a few data quality dimensions:
Accuracy - How correctly data reflects the real-world values it describes.
Completeness - Ensuring all required data elements are present.
Consistency - Keeping data consistent across the organization’s systems.
Timeliness - Having data available when it’s needed.
Validity - Ensuring data conforms to business rules, formats, and standards.
Uniqueness - Maintaining a golden record by removing duplicates.
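Some of these dimensions can be measured directly. The sketch below scores three of them (completeness, uniqueness, and validity) as simple ratios over a list of rows. The sample rows, the field names, and the deliberately simple email pattern are assumptions for illustration; accuracy, consistency, and timeliness usually need reference data or timestamps and are left out.

```python
import re

# Deliberately simple pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def completeness(rows, required):
    """Fraction of rows in which every required field is non-empty."""
    if not rows:
        return 1.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required)
        for row in rows
    )
    return complete / len(rows)

def uniqueness(rows, key):
    """Ratio of distinct key values to total rows (1.0 means no duplicates)."""
    if not rows:
        return 1.0
    return len({row.get(key) for row in rows}) / len(rows)

def validity(rows, field, pattern):
    """Fraction of rows whose field value matches the given pattern."""
    if not rows:
        return 1.0
    valid = sum(bool(pattern.match(str(row.get(field) or ""))) for row in rows)
    return valid / len(rows)

# Hypothetical sample with one bad email, one duplicate id, one missing email.
rows = [
    {"id": 1, "email": "jane@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 2, "email": "ravi@example.com"},
    {"id": 4, "email": ""},
]

report = {
    "completeness": completeness(rows, ["id", "email"]),
    "uniqueness": uniqueness(rows, "id"),
    "validity": validity(rows, "email", EMAIL_PATTERN),
}
print(report)  # {'completeness': 0.75, 'uniqueness': 0.75, 'validity': 0.5}
```

Scores like these are typically tracked per dataset over time and compared against thresholds agreed with the data owners.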
Data Stewardship
Data stewardship means assigning responsibility to a person or group for managing and maintaining datasets. Data stewards create data quality rules and ensure those rules are applied to the data. They also coordinate with data owners, stakeholders, and users to establish data standards and policies.
Below are a few data steward responsibilities:
Defining data quality rules - The data steward keeps data quality checks in place so that data flows meet the rules.
Data protection - Protecting data by defining levels of access across the organization.
Collaboration - Working with teams that deal with data every day (e.g. ETL developers, stakeholders, data analysts, and data owners).
Documentation - Documenting data quality rules, metadata, and data lineage.
Compliance - Ensuring that company data policies are in place and followed.
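Documented data lineage becomes more useful when it can be queried. As a rough sketch, lineage can be stored as a mapping from each dataset to the datasets it is built from, and then walked to find everything a report depends on. The dataset names below are hypothetical and loosely follow the staging, warehouse, semantic layer, and report stages mentioned earlier.

```python
# Hypothetical lineage: each dataset maps to the datasets it is built from.
LINEAGE = {
    "report.monthly_balances": ["semantic.customer_balances"],
    "semantic.customer_balances": ["warehouse.accounts", "warehouse.customers"],
    "warehouse.accounts": ["staging.core_banking_accounts"],
    "warehouse.customers": [
        "staging.online_banking_users",
        "staging.teller_customers",
    ],
}

def upstream_sources(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:  # guards against revisiting shared parents
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

sources = upstream_sources("report.monthly_balances", LINEAGE)
for name in sources:
    print(name)
```

With a structure like this, a steward can answer which reports are affected when a staging table changes by walking the same mapping in the opposite direction.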
Data Protection and Compliance
Data protection is about protecting data from unauthorized access, and it is a critical task for any organization. In today’s digital world, organizations earn customer trust by securing customers’ personal information. Data theft and breaches can occur anywhere, so this pillar keeps data secure through various protection techniques.
Below are a few data protection tasks:
Policies - Apply data classification policies to data.
Restrict access - Restrict access to data based on each employee’s role.
Regulations - Apply relevant regulations and standards to data, for example GDPR, HIPAA, and requirements for handling personally identifiable information (PII).
Risk management - Conduct regular audits to detect and address breaches.
Data lifecycle management - Manage retention, archival, and disposal of data.
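Data lifecycle management in particular lends itself to a simple rule: compare a record's age with the retention policy for its classification and decide whether to retain, archive, or delete it. The sketch below assumes a hypothetical policy table; the classifications and day counts are invented for the example and are not taken from any regulation.

```python
from datetime import date

# Hypothetical retention periods in days, keyed by data classification.
RETENTION_POLICY = {
    "transaction": {"archive_after": 365, "delete_after": 365 * 7},
    "web_log": {"archive_after": 30, "delete_after": 180},
}

def lifecycle_action(classification, created_on, today):
    """Return 'retain', 'archive', or 'delete' based on the record's age."""
    policy = RETENTION_POLICY[classification]
    age_days = (today - created_on).days
    if age_days >= policy["delete_after"]:
        return "delete"
    if age_days >= policy["archive_after"]:
        return "archive"
    return "retain"

today = date(2024, 6, 30)  # fixed date so the example is reproducible
actions = [
    lifecycle_action("web_log", date(2024, 6, 15), today),      # 15 days old
    lifecycle_action("web_log", date(2024, 3, 1), today),       # 121 days old
    lifecycle_action("transaction", date(2016, 1, 1), today),   # over 7 years old
]
print(actions)  # ['retain', 'archive', 'delete']
```

Passing `today` as a parameter instead of calling `date.today()` inside the function keeps the rule easy to test and to audit.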
Data Management
Data Management is the heart of Data Governance and often its most challenging pillar, because data resides in many forms, such as metadata, data lineage, actual data, and reports. It is essential to keep all of these forms of data safe and governed.
Below are a few important tasks:
Data integration - Integrating data from different sources.
Data archiving - Keeping data archival policies in place to move or delete data after a certain period.
Data modeling - Applying data modeling techniques to data.
Data architecture - Applying data architecture methodologies and industry best practices.
Data storage - Setting up databases and cloud storage, for example Amazon S3 or Azure Blob Storage.
Data Governance Use Cases
Master Data Management - In some organizations, data comes from many sources. In such cases, it is important to maintain a golden copy, since different sources follow different formats.
Data warehousing projects - When implementing a data warehouse, data integration is required because data comes from multiple sources.
Personal data - Some organizations, for example in banking, healthcare, and insurance, handle highly sensitive personal information. For them, protecting users’ data is a top priority.
Data migration - When organizations merge, their data systems must be merged as well. Data Governance helps ensure that these mergers go smoothly.