Great Expectations is one of our favorite new data packages on the market and it's steadily gaining popularity every month (1.6k stargazers on GitHub at the time of writing). Abe Gong and the incredibly smart team at Superconductive have done a great job of nailing a problem that most data teams have: pipeline debt.
We had the pleasure of getting to hear Abe speak at the Data Day Texas conference back in January. To paraphrase his presentation:
If you told a software engineer that they had to work on a system that was untested, undocumented, and unstable, they would look at you like you're insane.
Now, change the audience to someone working with data. They would laugh uncomfortably and nod in understanding.
Why are we ok building data systems in a way that's absolutely crazy?
Great Expectations relies on data teams to choose from a set of rules that they expect their data to continuously meet. These expectations define a series of tests for the data while simultaneously building out documentation for each dataset. It brings the idea of software unit testing into the realm of data, and we're excited that it's arrived.
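To make the idea concrete, here is a minimal, library-free sketch of what an expectation amounts to: a named rule that acts as a test and also reads as documentation. This is not the Great Expectations API. The Expectation class, the validate function, the column names, and the sample rows are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Expectation:
    """Hypothetical stand-in for one rule; not the Great Expectations API."""
    description: str               # doubles as human-readable documentation
    check: Callable[[Row], bool]   # returns True when a row meets the rule


def validate(rows: List[Row], expectations: List[Expectation]) -> Dict[str, bool]:
    """Return, for each expectation, whether every row satisfied it."""
    return {e.description: all(e.check(r) for r in rows) for e in expectations}


# Illustrative rules for a made-up reviews dataset.
suite = [
    Expectation(
        "star_rating is between 1 and 5",
        lambda r: isinstance(r["star_rating"], int) and 1 <= r["star_rating"] <= 5,
    ),
    Expectation(
        "review_id is never null",
        lambda r: r["review_id"] is not None,
    ),
]

rows = [
    {"review_id": "R1", "star_rating": 5},
    {"review_id": None, "star_rating": 3},
]

results = validate(rows, suite)
for description, passed in results.items():
    print("PASS" if passed else "FAIL", description)
```

The description strings are what a reader would browse as documentation, while the check functions are what run as tests. The real library ships a large catalog of prebuilt rules so you rarely write the checks yourself.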
If you're already familiar with the ins and outs of Great Expectations, feel free to jump to our example of executing an expectation suite locally.
Why you should use Great Expectations
As your organization grows, you're continuously chaining together more and more data sources. A typical pipeline starts to look like the example below:
Most data pipelines get convoluted because they rely on a variety of steps, including:
- Extracting data from external or internal sources.
- Combining two or more data sources together.
- Cleaning data values or structures.
- Loading data into a data warehouse or alternative data storage (S3, GCS, SFTP, etc.).
- Taking an automated action off of a data set.
- Running scheduled queries for dashboards and reports.
- Applying an ad hoc model against data for predictive scores and forecasts.
Most of the scripts that comprise a pipeline simply assume the data flowing through them will always look the same. If you've dealt with data for any stretch of time, you know that such an assumption is not only bad, but potentially catastrophic. If one unintended change slips through the cracks, reports become inaccurate, models get deployed with incorrect figures, data has to be wiped and reloaded, and organizational trust erodes.
The unfortunate truth is that these types of issues can slip by unnoticed for long stretches of time. When the team finally becomes aware of the problem, it can take days to identify the root cause, all while the affected pieces come to a standstill. It usually looks a bit like this:
Great Expectations serves as an intermediary step in your pipeline, verifying before (and after) every stage that the data looks the way your script expects it to.
While adding tests takes a lot of dedicated time and effort, the upside is that if the data changes, or a script that's altering the data changes (one of our existing 20 steps), we can:
- Know immediately that the data isn't normal.
- Identify exactly what changed in the data.
- Determine the specific step where it went wrong.
- Prevent downstream steps from executing, limiting the amount of bad data exposure and future cleanup efforts.
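The benefits listed above can be sketched as a validation gate between pipeline steps. The example below is a simplified illustration in plain Python, not Great Expectations itself: the check_amounts rule, the step functions, and the ValidationError class are all hypothetical. It shows how a failed check can name the step that introduced the problem, describe what changed, and stop everything downstream.

```python
from typing import Callable, List, Tuple

Rows = List[dict]
Step = Tuple[str, Callable[[Rows], Rows]]


class ValidationError(Exception):
    """Raised when data fails a check, which halts all later steps."""


def check_amounts(rows: Rows) -> List[str]:
    """Hypothetical check: every row needs a non-negative numeric 'amount'."""
    problems = []
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"row {i}: bad amount {amount!r}")
    return problems


def run_pipeline(rows: Rows, steps: List[Step]) -> Rows:
    """Run each step, validating its output before the next step starts."""
    for name, step in steps:
        rows = step(rows)
        problems = check_amounts(rows)
        if problems:
            # Report which step produced the bad data and what was wrong.
            raise ValidationError(f"after step '{name}': {'; '.join(problems)}")
    return rows


def clean(rows: Rows) -> Rows:
    """Drop rows that have no amount at all."""
    return [r for r in rows if r.get("amount") is not None]


def buggy_convert(rows: Rows) -> Rows:
    """Simulate an unintended change: amounts silently become strings."""
    return [{**r, "amount": str(r["amount"])} for r in rows]


data = [{"amount": 10}, {"amount": None}, {"amount": 2.5}]
try:
    run_pipeline(data, [("clean", clean), ("convert", buggy_convert)])
except ValidationError as err:
    print(err)
```

Because the exception is raised before the next step runs, no later stage ever sees the malformed rows.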
With an implementation of Great Expectations in your pipeline, your Data Team will be able to build greater organizational trust in your data while increasing pipeline resiliency and data transparency.
The Lifecycle of an Expectation Suite
Great Expectations takes a bit of upfront effort to get started, and a lot of the work is done locally. To get started building your own expectation suite locally, you can follow this guide. At a high level, the steps usually look like this:
1. Create a new directory for your expectations to run in, select the file or datasource to use, and generate a set of sample expectations for your data.
2. Browse the generated docs for the sample set of expectations. Once you find rules that are missing or incorrect, run great_expectations suite edit.
3. Add and test new expectations directly in a Jupyter notebook.
4. Once finished, run the bottom statement in the notebook to generate expectations as a JSON file.
5. Repeat steps 2-4 until you're satisfied with all of your expectations.
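For orientation, the JSON file that the notebook generates is essentially a list of expectation types and their arguments. The sketch below builds and reads back a document of roughly the shape written by 0.x releases of Great Expectations. Treat the exact field names as an assumption and compare them against the file your own version produces. The two expectations and their column names are illustrative and are not taken from the example suite.

```python
import json

# Assumed shape of a suite file (modeled on 0.x releases; verify locally).
suite = {
    "expectation_suite_name": "amazon-product-reviews",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "review_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "star_rating", "min_value": 1, "max_value": 5},
        },
    ],
}

# Serialize the suite the way it would sit on disk.
text = json.dumps(suite, indent=2)

# Reading it back gives a quick summary of which rules apply to which columns.
loaded = json.loads(text)
summary = [
    f'{e["expectation_type"]}({e["kwargs"]["column"]})'
    for e in loaded["expectations"]
]
print("\n".join(summary))
```

Because the suite is plain JSON, it can be checked into version control and reviewed like any other code change.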
The final step of the process is launching your expectation suite in a workflow automation platform so that your data is continuously run against your expectations on a scheduled basis.
To help you get started implementing expectations into your pipelines, we have designed a tutorial that's primarily focused on this step. While having tests for your data is half the battle, the other half is ensuring that those tests are running continuously, that all data is appropriately logged, and that the results are integrated into your pipeline execution.
Executing an Expectation Suite Locally
To get a general feel for what the final product of an expectation suite will look like, we created an example expectation suite that looks at Amazon's Product Review data. This example is available for download at the following link and includes:
- The minimum required directories and scripts needed to run Great Expectations.
- An Expectation Suite called amazon-product-reviews that houses some very basic expectations.
- A script called run_great_expectations.py that:
  - Pulls down the external data of Amazon Product Reviews using a public S3 URL.
  - Decompresses the file and converts it to a CSV. This is only a requirement for this specific set of data.
  - Runs the data through your Expectation Suite.
  - Stores the JSON output externally in an S3 bucket, if desired, with dynamically generated IDs.
  - Conditionally triggers exit codes based on the JSON output.
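A few of the script's steps can be sketched with the standard library alone. This is not the code from the downloadable example: the function names, the object key layout, and the assumption that the validation output carries a top-level boolean "success" field are all ours, so check them against the actual script and the JSON your version emits.

```python
import csv
import gzip
import io
import uuid
from datetime import datetime, timezone


def tsv_gz_to_csv(compressed: bytes) -> str:
    """Decompress gzip bytes holding TSV text and return the same data as CSV text."""
    tsv_text = gzip.decompress(compressed).decode("utf-8")
    # QUOTE_NONE: treat quote characters inside review text as ordinary data.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()


def result_key(suite_name: str) -> str:
    """Build a unique storage key for one validation run (hypothetical layout)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"validations/{suite_name}/{stamp}-{uuid.uuid4().hex}.json"


def exit_code_for(validation: dict) -> int:
    """Map a validation result to an exit code: 0 on success, 1 otherwise."""
    return 0 if validation.get("success") is True else 1


sample = gzip.compress(b"review_id\treview_body\nR1\tGreat, would buy again\n")
print(tsv_gz_to_csv(sample))
print(result_key("amazon-product-reviews"))
print(exit_code_for({"success": False}))
```

In a real script the last value would be passed to sys.exit, so that a workflow tool can treat a failed validation as a failed job and skip downstream steps.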
We always recommend digging into the code to get a better understanding of how everything is working. Before running anything locally, you should be aware of a few caveats:
- For your expectation suites to work, they always need to live in a directory called great_expectations and follow other required directory naming conventions.
- We created a datasource in great_expectations.yml that will recognize any files that exist in the root directory. This makes it easier to download data externally and run it against your expectations immediately.
- Great Expectations only has built-in support for file storage on S3 right now. If you'd like to store files elsewhere (Google Cloud Storage, Azure Blob Storage, Dropbox, Google Drive, Box, etc.), you can replace the upload_to_s3 function with the appropriate function.
- For our example, the input URL can be any of the URLs listed here. Note that these files are 1GB+ in size and the sample_us.tsv file is the only one guaranteed to pass all of the expectations successfully.
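As a rough illustration of swapping out the storage step, the sketch below writes a validation result to a local directory instead of S3. The signature of the example's upload_to_s3 function isn't shown here, so the function name, parameters, and key format below are assumptions. For another provider, you would keep the same shape and replace the body with that provider's SDK call.

```python
import json
import tempfile
from pathlib import Path


def upload_to_local_dir(payload: dict, target_dir: str, key: str) -> Path:
    """Write a validation result as JSON under target_dir/key and return the path.

    Hypothetical replacement: these parameters are assumptions, not the
    example's real upload_to_s3 signature.
    """
    path = Path(target_dir) / key
    path.parent.mkdir(parents=True, exist_ok=True)  # create nested "folders"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = upload_to_local_dir({"success": True}, tmp, "runs/result.json")
    stored = json.loads(saved.read_text(encoding="utf-8"))
    print(saved.name, stored)
```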
To run your expectation suite locally, you'll need to go through the following steps. Before starting, make sure to download the example and ensure that your computer is running Python 3.7 or above.
1. Create a virtual environment and install the following 4 packages:
Note: We deliberately didn't include a requirements.txt file, to keep the package requirements visible within the Shipyard interface.
2. Create a new S3 bucket, or use an existing S3 bucket, to store generated validation files. Using the credentials for the IAM role that accesses this bucket, add environment variables to your computer for
Note: This step is not necessary if you don't include the --output_bucket_name argument below. However, your validation results will then only be stored locally.
3. Unzip the downloaded file and navigate to the corresponding directory in your terminal.
4. Run the following command.
python3 run_great_expectations.py \
  --input_url https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv \
  --output_bucket_name <your-bucket-name> \
  --expectation_suite amazon-product-reviews
5. Success! You can now run great_expectations docs build to view all of the expectations visually, alongside the most recent validation run.
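If you want to see how the three flags in the command above might be wired up, here is a minimal argparse sketch. The flag names come from the command itself. Which flags are required, the help text, and the build_parser function are our assumptions and may differ from the real script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three flags used in the run command (a sketch)."""
    parser = argparse.ArgumentParser(
        description="Run an expectation suite against a data file."
    )
    parser.add_argument("--input_url", required=True,
                        help="URL of the data file to validate")
    parser.add_argument("--expectation_suite", required=True,
                        help="name of the expectation suite to run")
    # Optional: when omitted, validation results stay on the local machine.
    parser.add_argument("--output_bucket_name", default=None,
                        help="S3 bucket for storing validation results")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--input_url", "https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv",
    "--expectation_suite", "amazon-product-reviews",
])
print(args.expectation_suite, args.output_bucket_name)
```

Leaving the bucket flag optional mirrors the note in step 2: without it, the run still works and results are only kept locally.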
Executing an Expectation Suite in the Cloud with Shipyard
Once you're ready to shift your expectations into production, we've put together a step-by-step guide that walks you through how to set up your expectation suite in the cloud using Shipyard's platform.
While Great Expectations can run on any workflow system, Shipyard has been designed to work well with the combination of multiple third-party data packages. Some benefits of using Shipyard for execution include:
- Get started with no DevOps setup or infrastructure maintenance. Once you have an account, you can start launching and scheduling your expectations in less than 5 minutes.
- Integrate your expectations into larger, more complex data pipelines.
- Get built-in logging and notifications for every expectation test that your data runs through, and quickly share and reference these results with other team members.
- Build Blueprints for your pipeline tests to make it easier for team members to consistently set up tests.
By using Shipyard to run your expectations on a daily basis, you can easily integrate your expectations with all of your existing pipeline components.
Additional Use Cases with Great Expectations
While Great Expectations was designed for testing data within existing pipelines, we believe it opens up new opportunities for building better data applications, especially within Shipyard. Here are just a few use cases we're excited to try out.
Billing File Management
Want account leads to upload billing files? Create a Blueprint that allows teams to upload billing files directly to an S3 bucket. Immediately check the uploaded file data using an expectation suite. If expectations fail, send the file back to the user with notes on what wasn't compliant and delete the file from S3. If all expectations succeed, do nothing.
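A minimal sketch of that flow, using a local file in place of S3 and hand-written checks in place of an expectation suite. The required column names, the rules, and both function names are hypothetical.

```python
import csv
import os
import tempfile
from typing import List

REQUIRED_COLUMNS = ["account_id", "invoice_date", "amount"]  # hypothetical schema


def billing_problems(path: str) -> List[str]:
    """Return human-readable notes on every rule the file breaks."""
    notes = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            try:
                if float(row["amount"]) < 0:
                    notes.append(f"line {line_no}: amount is negative")
            except (TypeError, ValueError):
                notes.append(f"line {line_no}: amount is not a number")
    return notes


def handle_upload(path: str) -> List[str]:
    """Delete a non-compliant file and return notes for the uploader."""
    notes = billing_problems(path)
    if notes:
        os.remove(path)  # stand-in for deleting the object from S3
    return notes  # an empty list means the file is kept and nothing else happens


with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("account_id,invoice_date,amount\nA1,2020-03-01,120.50\nA2,2020-03-01,n/a\n")

notes = handle_upload(f.name)
print(notes)
```

The returned notes are what you would send back to the account lead so they know exactly which lines to fix before uploading again.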
Advertising Bidding Algorithms
Frequently changing advertising bids using your own custom logic? Run an hourly script that dumps data from an advertising channel directly to S3. Run this data against your advertising expectations. If expectations succeed, run a bidding algorithm that updates your bids. If expectations fail, alert the marketing team.
Internal HR Surveys
Need to verify that all employees filled out a survey? Create a daily script that pulls in data directly from Google Sheets. Run the sheet data against a set of expectations that require every field to be non-null. For all rows that fail, extract the email address field and send those addresses a reminder to fill out the survey. If all expectations succeed, stop the process.
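The core of this use case, finding rows with empty fields and collecting their email addresses, can be sketched in a few lines. The field names and sample responses below are made up, and fetching the sheet and sending the reminders are left out.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def incomplete_emails(rows: List[Row], required: List[str],
                      email_field: str = "email") -> List[str]:
    """Return the email of every row with a blank or missing required field."""
    emails = []
    for row in rows:
        # A field counts as unanswered if it is absent, None, or only whitespace.
        if any(not (row.get(field) or "").strip() for field in required):
            email = row.get(email_field)
            if email:
                emails.append(email)
    return emails


responses = [
    {"email": "ana@example.com", "team": "Data", "feedback": "All good"},
    {"email": "raj@example.com", "team": "", "feedback": "More tooling"},
    {"email": "lee@example.com", "team": "Ops", "feedback": None},
]

to_remind = incomplete_emails(responses, required=["team", "feedback"])
print(to_remind)
```

An empty list means every response is complete and the process can stop without sending anything.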
We hope you've found this guide to be helpful as you begin your journey with Great Expectations. Integrating a new technology into your business can always be daunting, but we believe that with the right setup, it can be made easier. If you have any questions about how to use and scale Great Expectations effectively in the cloud, feel free to reach out to us at email@example.com.