Data lineage: Why knowing the past is key for unlocking the future

Data Operations

Shawn Fergus

Let's officially make a bid for 2023's understatement of the year: Information technology is advancing fast. But, blue ribbon understatement or not, here's the thing:

As we're increasingly swept up in that evolution, it's becoming easier to get lost in what might happen tomorrow. And, as we're starting to see, looking back at what data has already done may prove more inspiring than guessing at what it's yet to do.

Enter: data lineage.


What is data lineage, and why does it matter?

Data lineage (also called data provenance) refers to the intentional tracing and documentation of data as it’s sourced, transformed, and transported for use in organizations. By tracking data lineage, it’s possible to understand and visualize important aspects of the data lifecycle as data flows through a multitude of processes and systems.
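
To make that definition concrete, here's a minimal sketch of what a single "hop" of lineage metadata might capture. The field names are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """One hop in a dataset's journey: where it came from,
    what was done to it, and where it landed."""
    source: str            # e.g., "postgres://crm/orders"
    destination: str       # e.g., "s3://warehouse/orders_clean"
    transformation: str    # e.g., "dedupe + currency normalization"
    executed_by: str       # pipeline or job identifier
    executed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# A full lineage trace is simply an ordered chain of such hops.
trace = [
    LineageRecord("postgres://crm/orders", "s3://staging/orders_raw",
                  "nightly extract", "job:extract_orders"),
    LineageRecord("s3://staging/orders_raw", "s3://warehouse/orders_clean",
                  "dedupe + currency normalization", "job:clean_orders"),
]
```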

Understanding data movement in this way helps data engineers, data scientists, analysts, and other stakeholders build and modify data pipelines—vital components of the digital infrastructure big data depends on.

As part of normal business operations, data sources are changed or modified often. Tracing data lineage allows internal teams to perform regular impact analysis, identifying the complications those changes may cause in downstream systems and processes. Troubleshooting in general is also easier, as teams can more readily identify the root causes of data processing or reporting issues when they occur.
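
Under the hood, impact analysis is a graph problem: treat each dataset as a node and each lineage hop as a directed edge, then walk downstream from whatever changed. A rough sketch, using a hypothetical dependency graph:

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets in its list.
downstream = {
    "crm.orders": ["staging.orders_raw"],
    "staging.orders_raw": ["warehouse.orders_clean"],
    "warehouse.orders_clean": ["dashboard.revenue", "ml.churn_features"],
}

def impacted_assets(changed: str) -> set[str]:
    """Breadth-first walk to find everything downstream of a change."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(impacted_assets("crm.orders"))
# {'staging.orders_raw', 'warehouse.orders_clean',
#  'dashboard.revenue', 'ml.churn_features'}
```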

Organizations connected to highly regulated industries (e.g., finance, healthcare) are especially reliant on the transparency that data lineage and data lineage tools provide. Without it, ensuring compliance through audits would be impractical, if not impossible. However, this transparency positively impacts the data governance of any organization, as do the gains in overall data quality that data lineage provides, especially as digital transformation initiatives become increasingly common in modern businesses.

What data lineage offers that digital transformation needs

Organizations that take their data lineage seriously clearly stand to benefit from better—not just bigger—data. However, the role data lineage plays in enabling digital transformation is both significant and multifaceted.

First, and perhaps most foundational, is the fact that data lineage builds trust and enables better decision-making. Digital transformation often involves the complex, nuanced task of integrating a variety of data sources (e.g., data warehouses, data lakes), platforms, and tools. By its nature, tracing data lineage provides an invaluable map of data flow within an organization. This map helps ensure stakeholders can trust their data sources both during the transformation initiative and after it's complete.

The complexity of digital transformation is often compounded by the need to integrate an organization’s legacy systems with newer technologies. But data lineage can help visualize complications during these integrations, making it easier to simplify, manage, and optimize data flows.

Transformation initiatives often require an organization’s teams to integrate more than just its systems. Unfortunately, a clear, accessible understanding of data lineage won't solve who’s suddenly stealing whose lunch from the break room fridge. It does, however, foster natural and productive collaboration between analysts, data engineers, and other business users, all of whom need to be on the same page regarding data flow.

This closer collaboration also improves operational efficiency: as the transformation process unfolds, the data lineage reveals redundancies and areas ripe for optimization. Furthermore, the risk of pursuing these opportunities can be mitigated through impact analysis of the data changes they require.

Finally, lineage is crucial when migrating data to new systems from old platforms. As part of maintaining robust data governance, tracing data lineage ensures data quality and accuracy are maintained while data is transferred and transformed.

15 best practices for tracing your organization’s data lineage

If no data lineage is being traced within your own organization, you now have a solid head start in making a case to begin. To that, let’s add 15 common (yet critical) best practices that will help you effectively trace and map data flow through any type of business.

  1. Start with a clear objective: Yes, we may agree that tracing the source, transformation, and destination of all data is important. But clearly defining what you need data lineage to do for you provides mission-critical context for how it’s mapped and how it will be used.
  2. Automate wherever possible: While big data certainly is big, these days even “medium data” is comparatively monstrous. This reality underscores the need for automated tools and platforms that help you discover and map data lineage across systems, regardless of scale (a sketch of what automated capture can look like follows this list).
  3. Select and maintain the right granularity: Based on the objective (or objectives) you’ve defined, you need to decide: Do you need to see the forest or the trees? High-level lineage will provide you with a broad overview of your data lifecycle. Detailed lineage can offer insights into specific dependencies, transformations, and data flows.
  4. Integrate with metadata: Bolster your data lineage with metadata management to add additional context to data flows. This can make it easier to understand data attributes, definitions, and quality.
  5. Update and validate consistently: Every data environment is dynamic. Therefore, good data lineage depends on regular updates and validation. Due diligence here ensures that what’s recorded and mapped will always reflect the actual state of your organization’s data ecosystem.
  6. Encourage collaboration: Involving key stakeholders from different departments ensures that the data lineage you trace will be comprehensive, not just accurate. These stakeholders should include IT teams, data stewards, and business analysts, in addition to data engineers.
  7. Prioritize critical data elements: Push mission-critical elements to their rightful place at the front of the line, so your tracing efforts more immediately relate to key business decisions, regulatory reporting, areas with known data quality issues, etc.
  8. Document transformations: Core to the operating principle of tracing data lineage, clearly document all transformations, calculations, and business rules applied to data as it moves through systems. And do so at the level of granularity you identified in step three.
  9. Implement version control: Like an extended family through generations, your data lineage will naturally evolve over time. By maintaining versions of your data lineage, it will become easier to understand both changes to data flows and the context in which they were made.
  10. Champion visualizations: Visual representations—even basic flowcharts and graphs—tend to be much more intuitive and engaging than textual descriptions alone.
  11. Centralize a repository: Store all traced lineage information in a central repository. Not only does this ensure consistency, but a single source of truth is also more accessible to relevant (and often key) stakeholders.
  12. Operationalize feedback loops: Take advantage of the convenience your centralized repository provides by ensuring users and stakeholders weigh in on your data lineage maps. The feedback must flow. As it does, you’ll find it easier to spot gaps, errors, and areas for potential improvement.
  13. Promote training and awareness: Insist that all relevant personnel are trained on the importance of data lineage, how to interpret what's traced and mapped, and how to play a direct role in its accuracy and completeness.
  14. Integrate with your data quality tools: By linking your data lineage initiative with data quality tools, you can drill down to the source of data quality issues faster.
  15. Review, audit, review again: Normalize data lineage audits and reviews to continuously reinforce your lineage's completeness, accuracy, and relevance to your organization.
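
To illustrate practices 2 and 8, here's one hedged sketch of automated lineage capture: a decorator that records each pipeline step's source, destination, and declared transformation as the step runs. The log file and field names are our own stand-ins for a proper lineage store, not any particular tool's API:

```python
import functools
import json
from datetime import datetime, timezone

LINEAGE_LOG = "lineage_log.jsonl"  # stand-in for a central repository

def traced(source: str, destination: str, transformation: str):
    """Record a lineage entry every time the decorated step runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            entry = {
                "step": fn.__name__,
                "source": source,
                "destination": destination,
                "transformation": transformation,
                "executed_at": datetime.now(timezone.utc).isoformat(),
            }
            with open(LINEAGE_LOG, "a") as log:
                log.write(json.dumps(entry) + "\n")
            return result
        return wrapper
    return decorator

@traced(source="s3://staging/orders_raw",
        destination="s3://warehouse/orders_clean",
        transformation="drop duplicates; normalize currency to USD")
def clean_orders(rows):
    # The transformation itself stays ordinary; lineage is a side effect.
    return [r for r in rows if r.get("currency") == "USD"]
```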

What tools can be used for data lineage monitoring?

There are many kinds of tools worth wielding to help make tracing data lineage a productive, ongoing part of business operations. But while any business can benefit from documenting the history of their data, no two have the same advantages, challenges, and resources.

Therefore, we’ve broken the options (and some examples) down into three broad groups, reflecting small, medium-sized, and large organizations:

Small organizations

For small businesses with limited resources, the focus should be on cost-effective and easy-to-implement solutions to trace and map their data’s origin, transformation, and use.

A savvy approach here may include open-source tools since they’re free to use and can be customized based on the specific needs of the business. These may include Apache Atlas, Apache NiFi, and Apache Kafka.

While primarily designed for data governance and metadata management, Apache Atlas provides a scalable framework suitable for data lineage. Moreover, Atlas can visually represent data lineage, an important aspect of demonstrating how data moves and transforms across various processes and systems.

Apache NiFi, a data integration and flow automation tool, allows users to design, schedule, and monitor data flows across multiple sources and destinations. But NiFi also offers built-in lineage tracking, meaning every piece of data that flows through it will be automatically tracked. Once tracked, data can be visualized in detail, including the data’s origin, transformation, and eventual destination.

The third potential tool from Apache, Kafka, is a popular open-source stream-processing software platform. Again, while not a data-lineage-specific tool, Apache Kafka’s features and capabilities (e.g., source-to-destination mapping, transformation tracking, auditing, and logging) can make it a valuable tool for tracing data lineage via data integration.
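
Kafka has no first-class lineage API, but record headers offer a natural place to carry provenance alongside each message. A minimal sketch, assuming the kafka-python client and a broker at localhost:9092 (the lineage.* header keys are our own convention, not a Kafka standard):

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Kafka headers must be (str, bytes) pairs.
lineage_headers = [
    ("lineage.source", b"postgres://crm/orders"),
    ("lineage.transformation", b"pii-scrub-v2"),
    ("lineage.producer", b"job:orders_stream"),
]

# Every record now carries its own provenance, which downstream
# consumers can read and log without parsing the payload.
producer.send(
    "orders.clean",
    value={"order_id": 1138, "amount_usd": 42.0},
    headers=lineage_headers,
)
producer.flush()
```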

Alternatively, many cloud providers bake basic data lineage and integration tools into their respective suites, such as AWS Glue Data Catalog or Google Cloud Data Catalog.

Medium-sized organizations

Here, we assume that medium-sized organizations will have more resources available and more complexity to trace and map, but still plenty of motivation to be cost-conscious. With less need to lean on open-source platforms, medium-sized organizations may focus on a data integration platform like Azure Data Factory instead of open-source solutions like those from Apache.

Microsoft's Azure Data Factory is a cloud-based data integration service focused on creating, scheduling, and managing data-driven workflows. Organizations that invest in Microsoft's Azure ecosystem can also tap into related compute services, like using Azure Machine Learning to handle their data transformations.

Because these services are fully integrated, any Azure service that helps source, transform, and deliver data can also be monitored through Data Factory, centralizing the data lineage process.

However, some medium-sized organizations may still prefer the benefits open-source platforms provide. In these cases, Talend's data integration solution is open-source and offers a variety of comparable tools for integration, data quality, and governance. Talend also provides built-in data lineage and impact analysis tools.

The increase in data complexity that medium-sized organizations also deal with may necessitate data cataloging and more robust data governance capabilities.

As one example of the former, Alation provides a data catalog that leverages machine learning to automatically inventory and document data. Doing so keeps larger, more complicated datasets easier to search and understand. As a medium-sized organization scales, this automated cataloging ensures those involved in data lineage can continue to map and understand relationships and usage patterns among various aspects of data flow.

As data usage scales up, medium-sized organizations may also need to use data quality tools like Trifacta or Informatica. Both tools, in their own way, ensure all data is high quality as it enters the org and, by extension, the data lineage process. In this way, maintaining data quality at scale helps data lineage outputs remain useful and reliable at scale as well.

Large/multinational organizations

Large organizations, we continue to assume, will have more complicated data to deal with and more resources to marshal against those complications. And for tools that can help with data lineage, we see an equal bump in scale.

Solutions here may include enterprise data governance platforms, like Informatica Enterprise Data Catalog or IBM's InfoSphere Information Governance Catalog. These are comprehensive platforms built to support trusted views of immense data flows across an organization. Those views need to be quite literal, too: no matter how much data there is, lineage efforts require the traced and mapped data to clearly show where it's coming from, where it's going, and how it's getting there.

Organizations of this size may also leverage advanced data pipelines in addition to their data integration tools, as doing so can maintain the clarity and efficiency needed for data lineage in real-time data scenarios. As with their medium-sized counterparts, database management systems may be in play here, but augmented by dedicated audit and logging tools like Splunk or the ELK stack (i.e., Elasticsearch, Logstash, Kibana).

At the enterprise scale, these tools add valuable firepower to the auditing and logging process. But these specialized solutions also offer their own ways of visualizing data as it flows through the organization and additional tools for searching, analyzing, and mapping data in real time.
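
As one illustration, a pipeline could push structured lineage audit events into Elasticsearch so Kibana can search and chart them. A minimal sketch, assuming the official elasticsearch Python client (8.x) and a local node; the index name and event fields are our own convention:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one audit event per lineage hop; Kibana can then filter,
# search, and visualize these documents in near real time.
es.index(
    index="lineage-audit",
    document={
        "event": "dataset_transformed",
        "source": "s3://staging/orders_raw",
        "destination": "s3://warehouse/orders_clean",
        "transformation": "dedupe + currency normalization",
        "pipeline": "job:clean_orders",
        "@timestamp": datetime.now(timezone.utc).isoformat(),
    },
)
```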

Data documentation may also get assistance from specific tools at this scale, like Redgate SQL Data Catalog or Dataedo. The added insights into database structures, relationships, and large-scale metadata can prove invaluable to enterprise-scale data lineage efforts.

It's essential to note that the categorizations above are meant to be helpful guidelines. Any choice of tools should be based on the specific needs, existing infrastructure, and future growth plans of a specific organization.

This is especially true since many tools offer tiered pricing or editions built to scale, allowing organizations to start small and scale up as needed. Low-code, serverless workflow platforms like Shipyard are a great example and can be helpful for tracking and recording data lineage at any scale, especially when integrated correctly within an organization's data ecosystem.

Getting back to the future, remember: Flexibility is key

“The farther backward you can look, the farther forward you are likely to see.”

- Winston Churchill

Shipyard can be a good fit for small organizations with limited resources due to its free tier and emphasis on no-code/low-code solutions. Small organizations can start with the free tier and scale as they grow.

Medium-sized organizations with moderate resources may have some technical expertise but might not want to invest heavily in custom solutions. Shipyard's scalable pricing and ease of use can be attractive here, letting these teams adopt more advanced features without being overwhelmed.

While large organizations have the resources to build custom solutions if they so desire, they might still opt for platforms like Shipyard for specific teams or projects that require rapid deployment without the overhead of developing from scratch. The unlimited tasks option in Shipyard's highest tier can be beneficial for large-scale operations.

Making history—data history, that is—is the first step in guaranteeing an innovative, data-driven operation. The second? Joining our Substack to get insights, POVs, and inside knowledge piped directly into your inbox!