What is Data Extraction?
Captain's Compass

Steven Johnson

Welcome to the seventh episode of Captain's Compass, your definitive resource for all things related to data management. In today's post, we're diving deep into the concepts of data extraction and data orchestration with your host, Steven Johnson.


What is Data Extraction?

Data extraction, in its simplest form, involves pulling data from a service and depositing it into a database, a Google Sheet, or even an Excel spreadsheet. In other words, it moves raw data out of one system and into a more accessible, usable format.
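
To make that concrete, here is a minimal sketch of the "deposit it into a database" half, using Python's built-in sqlite3 module. The records, the table name raw_deals, and the load_records helper are all hypothetical stand-ins for illustration; a real extraction would receive its records from a service's API or export.

```python
import sqlite3

# Hypothetical records standing in for what a service's API might return.
RAW_RECORDS = [
    {"id": 1, "name": "Acme Corp", "amount": "1200.50"},
    {"id": 2, "name": "Globex", "amount": "310.00"},
]


def load_records(conn, records):
    """Deposit raw records into a table unchanged and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_deals "
        "(id INTEGER PRIMARY KEY, name TEXT, amount TEXT)"
    )
    # INSERT OR REPLACE keeps the load safe to re-run for the same ids.
    conn.executemany(
        "INSERT OR REPLACE INTO raw_deals (id, name, amount) "
        "VALUES (:id, :name, :amount)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM raw_deals").fetchone()[0]


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
row_count = load_records(conn, RAW_RECORDS)
print(row_count)
```

Note that the amount column is stored as text on purpose: at this stage the data is still raw, and cleaning it up is a later step.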

However, as your business scales, data extraction becomes a more complex endeavor. You might start with a few data sources, but as your business grows, you may have to juggle 10, 20, or even 40 different data sources, each requiring its own unique extraction method.


The Role of Data Extraction Vendors

Given the complexity of data extraction, businesses are presented with two main choices: writing custom scripts for each data source or using vendors with pre-built connectors.

There are several reputable vendors in the extraction space that can ease the process. For instance, Fivetran, Portable, Airbyte, and Matillion offer pre-built connectors that can automatically pull data from systems like Salesforce into your preferred database or spreadsheet.

On the other hand, if you're looking for more control over the extraction process, you can write custom scripts in Python, Node.js, or any other language to manually pull data from the APIs of your chosen services. Although this offers greater control, it requires significant upfront work and may not be the most efficient route for all businesses.
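
As a rough illustration of what one of those custom scripts might look like, the sketch below pages through an API and collects every record. It assumes a hypothetical endpoint that accepts a page query parameter and returns JSON with a results list and a next_page field; real services each have their own pagination and authentication rules, which is a large part of the upfront work.

```python
import json
from urllib.request import urlopen


def fetch_page(url):
    """Fetch one page of JSON from an HTTP endpoint."""
    with urlopen(url) as response:
        return json.load(response)


def extract_all(base_url, fetch=fetch_page, max_pages=100):
    """Collect records page by page until the API reports no next page.

    The 'results' and 'next_page' field names are assumptions for this
    sketch, not the contract of any particular service.
    """
    records = []
    page = 1
    while page is not None and page <= max_pages:
        payload = fetch(f"{base_url}?page={page}")
        records.extend(payload["results"])
        page = payload.get("next_page")
    return records


# Stand-in responses so the sketch runs without network access.
FAKE_PAGES = {
    "https://api.example.com/contacts?page=1": {
        "results": [{"id": 1}, {"id": 2}],
        "next_page": 2,
    },
    "https://api.example.com/contacts?page=2": {
        "results": [{"id": 3}],
        "next_page": None,
    },
}

contacts = extract_all(
    "https://api.example.com/contacts",
    fetch=lambda url: FAKE_PAGES[url],
)
print(len(contacts))
```

Passing the fetch function in as a parameter makes the script easy to test offline, and the max_pages cap guards against an API that never stops returning a next page.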


Data Extraction Vs. Data Orchestration: What's the Difference?

While data extraction tools pull your data from a system into a database, this is merely a single step in your data pipeline. Most businesses need to perform additional transformations because the extracted data is raw. They will need to clean up this data and then move it into a BI tool, email it to stakeholders, or conduct reverse ETL for marketing or other business purposes.
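
A small sketch of what that clean-up step can involve is shown below. The field names and the specific rules (trimming whitespace, lowercasing emails, parsing amounts) are assumptions chosen for illustration; the right transformations depend entirely on your data.

```python
def clean_record(raw):
    """Trim text, normalize email casing, and convert strings to numbers."""
    return {
        "id": int(raw["id"]),
        "email": raw["email"].strip().lower(),
        # Remove thousands separators before converting to a float.
        "amount": float(raw["amount"].replace(",", "")),
    }


# Hypothetical raw rows, as they might arrive from an extraction.
raw_rows = [
    {"id": "1", "email": "  Ana@Example.com ", "amount": "1,200.50"},
    {"id": "2", "email": "bo@example.com", "amount": "310"},
]

clean_rows = [clean_record(row) for row in raw_rows]
print(clean_rows)
```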

So, what bridges the gap between all these processes? Enter data orchestration.

Orchestration tools allow you to connect your extraction, transformation, BI, and reverse ETL processes, establishing a seamless workflow between all your tools. Some of the most popular orchestration tools include Shipyard, Airflow, Prefect, and Dagster. These tools not only connect your processes but can also host your scripts, so a single platform can serve as both your orchestration and extraction tool.
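
The core idea behind orchestration, running each step only after the steps it depends on, can be sketched with Python's standard library alone (graphlib requires Python 3.9 or later). This is a toy illustration and not how any of the tools named above are implemented; the task names and the run_pipeline helper are hypothetical, and real orchestrators add scheduling, retries, alerting, and logging on top.

```python
from graphlib import TopologicalSorter

run_log = []  # Records the order in which tasks actually ran.


def extract():
    run_log.append("extract")


def transform():
    run_log.append("transform")


def update_bi():
    run_log.append("update_bi")


def reverse_etl():
    run_log.append("reverse_etl")


TASKS = {
    "extract": extract,
    "transform": transform,
    "update_bi": update_bi,
    "reverse_etl": reverse_etl,
}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDS_ON = {
    "extract": set(),
    "transform": {"extract"},
    "update_bi": {"transform"},
    "reverse_etl": {"transform"},
}


def run_pipeline(tasks, depends_on):
    """Run every task once, in an order that respects its dependencies."""
    for name in TopologicalSorter(depends_on).static_order():
        tasks[name]()


run_pipeline(TASKS, DEPENDS_ON)
print(run_log)
```

Extraction always runs first and transformation second. The BI update and reverse ETL both depend only on the transformation, so their relative order is not fixed.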


Conclusion

Data extraction and data orchestration are both crucial components of the overall data pipeline, and understanding them becomes more important as your business scales. Writing your own scripts and using vendors with pre-built connectors each have their advantages, so the right choice ultimately comes down to your specific business needs and capabilities.

We hope that this guide provides you with a clearer understanding of data extraction and orchestration, helping you to make more informed decisions regarding your data management strategies.

Be sure to check out our Substack, where our internal team curates articles weekly from all across the data space. Ready to try Shipyard? Get started with our free Developer Plan now.