This is another post in our ongoing series regarding workflows and the challenges involved in effectively constructing them.
In this edition, we’ll take a look at a common pattern in orchestration called scheduling. While this is a fairly simple pattern to work with, there are a number of significant drawbacks that will be discussed in the following sections.
Workflows involve coordinating a number of individual processes to perform a larger business function. As an example, borrowed from our previous post, let’s consider the simplified workflow below which consists of three tasks:
- Fetching data from a given source
- Transforming the retrieved data
- Sending the formatted data to a recipient
In the scheduling pattern, which has no overarching orchestration architecture, each task is triggered on a preconfigured timer. The timers are staggered so that the tasks run in the proper order; for example, the fetch task runs on the hour, while the downstream tasks run at 10- and 20-minute offsets, respectively.
This presents a fairly simple and straightforward pattern to get up and running with a common service such as cron.
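To make the staggered schedule concrete, here is a minimal sketch that generates the three crontab lines for our example. The script paths are hypothetical placeholders; only the hourly trigger and the 10-minute offsets come from the description above.

```python
# Hypothetical script paths, used only for illustration.
TASKS = ["/opt/jobs/fetch.py", "/opt/jobs/transform.py", "/opt/jobs/send.py"]


def staggered_crontab(commands, offset_minutes=10):
    """Return one hourly crontab line per command, each offset from the previous."""
    lines = []
    for position, command in enumerate(commands):
        minute = position * offset_minutes
        if minute >= 60:
            raise ValueError("offsets no longer fit inside one hour")
        # Crontab fields: minute hour day-of-month month day-of-week command
        lines.append(f"{minute} * * * * {command}")
    return lines


if __name__ == "__main__":
    print("\n".join(staggered_crontab(TASKS)))
```

Note that nothing in these lines links one task to the next. The ordering exists only in the minute values.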
Despite the apparent ease of configuration, this design has a number of failings.
One of the biggest problems with this approach is that you must assume how much time each task needs in order to offset the schedule for the downstream tasks. In theory this is straightforward, particularly after testing how long each task takes, but it is still an assumption.
Individual processes may take longer due to unexpected volumes of data or limited CPU capacity. If this happens, the assumptions underpinning the staggered schedule could be significantly off and the overall workflow could be disrupted.
Looking at our example, suppose the fetch takes longer than 10 minutes, perhaps because a third-party API had a connection issue or the data was larger than expected. The subsequent tasks would then start without the newly fetched data and either fail or, worse, run against whatever was left over from the previous execution. If the downstream message task isn’t aware that the new data was never fetched, outdated data could be sent to the recipient.
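The timing assumption can be checked with a little arithmetic. The sketch below compares each task's scheduled start with the moment its upstream task actually finishes; the function name and the sample runtimes are illustrative assumptions, not measurements.

```python
def tasks_started_too_early(schedule, durations):
    """Return names of tasks whose scheduled start precedes the upstream finish.

    schedule: list of (name, start_minute) pairs in pipeline order.
    durations: dict mapping name -> actual runtime in minutes.
    """
    early = []
    for (upstream, upstream_start), (name, start) in zip(schedule, schedule[1:]):
        upstream_end = upstream_start + durations[upstream]
        if start < upstream_end:
            early.append(name)
    return early


SCHEDULE = [("fetch", 0), ("transform", 10), ("send", 20)]

if __name__ == "__main__":
    # Made-up runtimes: the fetch overruns its 10-minute window.
    print(tasks_started_too_early(SCHEDULE, {"fetch": 14, "transform": 3, "send": 1}))
```

In a real cron setup nothing performs this check at runtime, which is exactly the problem: the transform task simply starts at minute 10 whether or not the fetch has finished.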
Without a dedicated orchestration system in place, it can be difficult to surface information from the logs. In the most basic setup, agnostic of any scheduling program, each task could write its own log files. However, this gets complicated because it requires coordinating where log files are written and ensuring that each task has permission to write to that location.
While options like cron provide both default logging output (to a single, predefined location) as well as per-task configurations (any location the user chooses), grepping through the files is still required to find relevant information.
Additionally, it may be difficult to identify which logs from the individual processes were associated with the overall workflow unless an execution ID was explicitly passed between the tasks. In our example, even applying a datetime timestamp for each execution wouldn’t be completely helpful since all were offset and other tasks may be occurring during the same timeframe.
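Because independently scheduled tasks have no channel for passing an execution ID, one workaround is for each task to derive the same ID on its own. The sketch below assumes the hourly schedule from our example and truncates the current time to the hour; `run_id_for` and `get_task_logger` are hypothetical helper names, and the approach breaks down if a task overruns into the next hour.

```python
import logging
from datetime import datetime, timezone


def run_id_for(now):
    """Truncate a timestamp to the hour so every task in one run derives the same ID."""
    return now.strftime("%Y%m%dT%H00Z")


def get_task_logger(task_name, now=None):
    """Return a logger that attaches the shared run ID to every record."""
    now = now or datetime.now(timezone.utc)
    logger = logging.getLogger(task_name)
    return logging.LoggerAdapter(logger, {"run_id": run_id_for(now)})


if __name__ == "__main__":
    # The format string expects run_id, so log only through the adapters here.
    logging.basicConfig(
        format="%(asctime)s run=%(run_id)s task=%(name)s %(message)s",
        level=logging.INFO,
    )
    get_task_logger("fetch").info("fetched source data")
```

With this in place, filtering the logs on a single `run=` value collects the lines from all three tasks, even though they started at different minutes.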
Being informed when a workflow, or a task within it, encounters an error or completes its work is often required. You wouldn’t want your business-critical processes left without supervision.
Because individual task exit codes aren’t monitored by an external workflow orchestration system, each task must be responsible for emitting alerts based on its execution status. Within the task code, success and error messages must be configured depending on what the user wishes to be notified on.
Additionally, if the user wants to be notified when the entire workflow completes, the final task must have appropriate messaging written into its code. However, since this final task is itself on a predefined trigger schedule, comprehensive logic must be included to be able to choose what type of status to emit.
While cron provides the built-in MAILTO setting, it relies on a working mail transfer agent on the local machine, and by default it sends mail for every run that produces output, which is likely to lead to notification fatigue.
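In practice, this means every task ends up wrapped in something like the sketch below. The `notify` callable is a hypothetical stand-in for whatever channel you use (email, chat webhook, and so on); it is not part of cron or any particular library.

```python
def run_with_alerts(task_name, task, notify, notify_on_success=False):
    """Run task() and report its outcome through notify(message).

    Returns a process-style exit code: 0 on success, 1 on failure.
    """
    try:
        task()
    except Exception as exc:
        notify(f"{task_name} failed: {exc}")
        return 1
    if notify_on_success:
        notify(f"{task_name} succeeded")
    return 0


if __name__ == "__main__":
    def fetch():
        raise RuntimeError("source API timed out")  # simulated failure

    raise SystemExit(run_with_alerts("fetch", fetch, notify=print))
```

Only the final task would typically set `notify_on_success=True`, and even then it can only vouch for its own step, not for the workflow as a whole.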
Under the scheduling paradigm, there is no direct way for downstream tasks to observe the exit status or output data of their predecessors.
Since each task runs on an automated schedule regardless of the upstream events, each task must include its own error handling and data processing logic. If the upstream task completes in time but encountered an error, for example in a case where a file was expected but was not generated, the downstream task must be able to accommodate that scenario. This adds a lot of additional code to each task and gets particularly difficult with branching and converging patterns within the workflow.
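As an illustration of that extra defensive code, the sketch below shows a check a downstream task might run before touching its input. It assumes the upstream task hands over its output as a file, and the function name is hypothetical.

```python
import os


def upstream_output_is_fresh(path, run_started_at):
    """True if path exists and was modified at or after run_started_at.

    run_started_at is a Unix timestamp (seconds) for the top of the current run.
    """
    try:
        return os.path.getmtime(path) >= run_started_at
    except OSError:
        # Missing or unreadable file: treat as "upstream did not deliver".
        return False
```

A transform task would call this first and exit with a non-zero code when it returns `False`, rather than silently reprocessing the previous hour's file. Every downstream task needs its own version of this guard.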
Another area where there is limited visibility is the “bird’s eye view” of the whole workflow. Given that the scheduling pattern requires triggers to be configured on a per-task basis, there is no single point from which to observe the complete workflow graph.
Again using cron as an example, the crontab provides only the task commands and their trigger schedules. This means that the user would need to sort the tasks by trigger time and then examine the code of each task in order to understand both the branching and converging logic as well as what error handling assumptions are incorporated into each task.
This is not an intuitive approach and isn’t scalable for larger workflows.
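To show how little the crontab itself reveals, here is a rough sketch that recovers the intended order by sorting entries on their minute field. It assumes simple hourly entries like the ones in our example and ignores everything else; the function name is hypothetical.

```python
def order_hourly_jobs(crontab_text):
    """Return (minute, command) pairs sorted by minute for simple hourly entries.

    Only lines whose minute field is a plain integer are considered. Comments,
    blank lines, environment settings, and more complex schedules are skipped.
    """
    jobs = []
    for line in crontab_text.splitlines():
        fields = line.split(None, 5)  # five schedule fields, then the command
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        jobs.append((int(fields[0]), fields[5]))
    return sorted(jobs)


if __name__ == "__main__":
    sample = "20 * * * * send\n# nightly cleanup is elsewhere\n0 * * * * fetch\n10 * * * * transform\n"
    for minute, command in order_hourly_jobs(sample):
        print(minute, command)
```

Even with the order recovered, the output says nothing about which tasks depend on which, or how each one behaves when its predecessor fails. That information lives only in the task code.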
With these notable drawbacks, there are some alternatives that can be considered.
The first alternative is polling. This entails running a separate process that repeatedly polls each task, waiting for a status code. However, polling by design wastes CPU cycles on checks that return nothing. It also still makes an assumption, albeit likely a smaller one, about how long to wait between polls.
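A minimal polling loop might look like the sketch below. The `get_status` callable is a hypothetical hook that returns `None` while the task is still running and an exit code once it has finished; the default interval and timeout values are arbitrary.

```python
import time


def poll_until_done(get_status, interval_seconds=30, timeout_seconds=600, sleep=time.sleep):
    """Call get_status() until it returns an exit code or the timeout is reached.

    Returns the exit code, or None if the task never reported one in time.
    """
    waited = 0
    while True:
        status = get_status()
        if status is not None:
            return status
        if waited >= timeout_seconds:
            return None
        sleep(interval_seconds)
        waited += interval_seconds
```

The `interval_seconds` parameter is where the timing assumption reappears: too short and the poller burns cycles, too long and downstream tasks start later than they need to.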
A more comprehensive approach is the monitoring orchestration pattern. This involves providing a parallel orchestration program that listens to each task and coordinates triggers throughout the graph accordingly. We discussed this approach at length in a prior post.
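The core idea can be reduced to a few lines: downstream tasks are triggered by upstream completion rather than by the clock. The sketch below is a deliberately simplified, in-process stand-in for a linear workflow, not the design from the prior post or any product's implementation, and all names in it are hypothetical.

```python
def run_pipeline(tasks):
    """Run (name, callable) pairs in order, passing each result downstream.

    Each callable receives the previous task's return value (None for the first).
    Stops at the first failure and returns (completed_names, failed_name).
    """
    completed = []
    data = None
    for name, task in tasks:
        try:
            data = task(data)
        except Exception:
            # Downstream tasks are never triggered after a failure.
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    pipeline = [
        ("fetch", lambda _: [1, 2, 3]),
        ("transform", lambda rows: [row * 10 for row in rows]),
        ("send", lambda rows: print("sending", rows)),
    ]
    print(run_pipeline(pipeline))
```

Here the transform step starts the moment the fetch returns, however long that takes, and the send step never runs on stale data because it never runs at all if an upstream step fails.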
The major benefit to a schedule-driven orchestration approach is the ease of setting it up. It can be as simple as configuring a series of cron jobs to periodically kick off tasks in a staggered pattern. However, limitations around capturing and responding to exit codes, inflexibility in execution time assumptions, and lack of visibility into the workflow as a whole present notable drawbacks.
Outside of relatively simple and linear workflows, using the scheduling approach is not recommended. Production-grade services would best serve customers by providing a monitoring approach where users are afforded considerably more freedom to design complex workflows. Shipyard supports scheduling features but enhances them by supporting a monitoring-based solution for increased flexibility and ease of use.
Stay tuned for upcoming posts in this series on workflows including breakdowns of scheduling patterns, distributed systems, and more.
Shipyard is a modern data orchestration platform for data engineers to easily connect tools, automate workflows, and build a solid data infrastructure from day one.
Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build data workflows while enabling data engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.
The Shipyard team has built data products for some of the largest brands in business and deeply understands the problems that come with scale. Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams.
With a high level of concurrency and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them.