Databricks announces new offering for Unity Catalog
Databricks has significantly expanded data governance capabilities on the lakehouse by unveiling data lineage for Unity Catalog.
Data lineage describes how data flows throughout an organisation, and the data and AI company’s newest feature will allow customers to gain visibility into areas such as where data in their lakehouse came from, who created it and when, how it has been modified over time, and how it is being used.
Databricks notes that businesses deal with large amounts of data from a range of sources. Understanding these different areas can be extremely difficult, but that understanding is crucial to ensuring trust and assessing risk.
Data lineage for Unity Catalog allows data teams to view every downstream consumer affected by a data change, so they can gauge how severe the impact is and quickly notify the relevant stakeholders.
This includes applications, dashboards, machine learning models and data sets.
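To make the idea of downstream impact concrete, the following is a minimal sketch of how a lineage graph can be walked to list every consumer of a changed data set. It does not use any Databricks API; the asset names and the `downstream_consumers` helper are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "raw.orders": ["silver.orders_clean"],
    "silver.orders_clean": ["gold.daily_revenue", "ml.churn_model"],
    "gold.daily_revenue": ["dashboard.revenue_overview"],
}


def downstream_consumers(asset, lineage):
    """Return every asset reachable downstream of `asset`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that could be affected by a change to the raw orders table.
affected = downstream_consumers("raw.orders", LINEAGE)
```

In this toy graph, a change to `raw.orders` reaches a cleaned table, an aggregate, a model and a dashboard, which is the kind of list a data team would use to decide whom to notify.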
In addition, the offering allows data consumers such as data scientists, data engineers and data analysts to be context-aware as they carry out their work, resulting in stronger outcomes.
Data stewards will also be able to see which data sets are no longer accessed or have become obsolete so they can retire unnecessary data, reducing risk and ensuring end users only use high-quality data.
These new capabilities in Unity Catalog offer organisations a complete view of the entire data lifecycle, so data leaders can understand how data is being collected, whether it has been updated, and which processes were used.
“Governance capabilities such as data lineage are critical as we work to build the industry’s most robust lakehouse platform,” Databricks co-founder and chief technologist Matei Zaharia says.
“Without good data lineage, it is challenging to track the business and verification processes that data-driven organisations need to be successful.
“Our goal is to ensure our customers can focus on insights, and move toward proactive data management practices through a unified, transparent view of their entire data ecosystem.”
One of the key features of Unity Catalog is automated run-time lineage, which captures all lineage generated in Databricks and enables greater accuracy and efficiency than tagging it manually.
This information is captured for tables, views, and columns to give a granular picture of upstream and downstream data flows.
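As an illustration of what column-level granularity makes possible, the sketch below traces a single column back to every upstream column that feeds it. It is an assumption-laden toy model rather than the Unity Catalog implementation: the edge list, table names and `upstream_columns` helper are all hypothetical.

```python
# Hypothetical column-level edges:
# (source table, source column) -> (target table, target column)
COLUMN_EDGES = [
    (("raw.orders", "amount"), ("silver.orders_clean", "amount_usd")),
    (("raw.fx_rates", "rate"), ("silver.orders_clean", "amount_usd")),
    (("silver.orders_clean", "amount_usd"), ("gold.daily_revenue", "revenue")),
    (("raw.orders", "order_date"), ("gold.daily_revenue", "day")),
]


def upstream_columns(target, edges):
    """Return the set of columns that directly or indirectly feed `target`."""
    # Index the edges by destination so each column's parents are easy to look up.
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, []).append(source)

    found = set()
    stack = [target]
    while stack:
        column = stack.pop()
        for source in parents.get(column, []):
            if source not in found:
                found.add(source)
                stack.append(source)
    return found


# Which columns contribute to the revenue figure?
sources = upstream_columns(("gold.daily_revenue", "revenue"), COLUMN_EDGES)
```

Here the `revenue` column depends on the order amount and the exchange rate but not on `order_date`, a distinction that table-level lineage alone could not show.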
Lineage also works across all workloads supported by Databricks, including SQL, Python, R and Scala. This means all data personas can build on their tools with data intelligence and more substantial insights, with lineage also captured for entities such as notebooks, workflows and dashboards.
Data lineage also assists businesses in meeting compliance standards, making it easier to track data flows that are subject to regulations such as the General Data Protection Regulation, the California Consumer Privacy Act and the Health Insurance Portability and Accountability Act.
Databricks says this aspect of data traceability is an important component of a modern data architecture that allows customers to meet their legal requirements.
Data lineage for Unity Catalog is now available for preview on AWS and Microsoft Azure.