Why Workflows are Hard: Monitoring Orchestration - Part 1


John Forstmeier

This is the first post in a series regarding working with workflows and orchestrating individual business programs.

Different options exist for orchestrating a series of tasks in software. Some are simple but make broad assumptions about how the individual tasks will execute while others are more complex but allow for a more flexible execution. There are several things to consider when rolling your own orchestration code or evaluating existing orchestration services.

Example Scenario

Orchestration code becomes necessary when a series of independent tasks needs to be strung together to complete a larger operation. An example of this can be seen in our Hacker News dogfooding post, where we use several independent tasks stitched together with orchestration to scrape Hacker News and report on relevant posts.

For the sake of simplicity, let’s say that we have three independent tasks:

  • Fetching data from a given source
  • Transforming the retrieved data
  • Sending the formatted data to a recipient

Unless these three tasks invoke one another directly, in which case they effectively become a single program, something needs to trigger each one in the proper sequence. Keeping the tasks as independent programs is also useful because the individual pieces can be reused in other, larger programs.

Monitoring Pattern

There are different patterns that can be applied to solve orchestration problems like the one outlined in the prior section. In this piece, we’re focusing on the monitoring strategy.

The main feature of this approach is that each individual task is contingent on the upstream task(s) completing prior to its own execution. Each task is made aware of the exit codes and outputs from the preceding executions.

Using the example provided in the previous section, the monitoring pattern would require an orchestration program alongside the three tasks. It would be responsible for listening for the exit code from each task and then triggering the appropriate downstream task. More complex trigger options and branching/converging flows are possible, but with the simple linear example presented, each downstream task would be triggered on a success exit status from its upstream partner, or else the full workflow would exit in failure.
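To make the linear case concrete, here is a minimal sketch of such an orchestration program in Python. It is an illustration under stated assumptions, not how any particular service implements it: the names `run_linear_workflow` and `fake_task` are hypothetical, and the three tasks are stand-in child processes that simply exit with a chosen code.

```python
import subprocess
import sys


def run_linear_workflow(tasks):
    """Run tasks in order and stop at the first non-zero exit code.

    `tasks` is a list of (name, argv) pairs. Returns a list of
    (name, exit_code) pairs for every task that was actually started.
    """
    results = []
    for name, argv in tasks:
        completed = subprocess.run(argv)  # blocks until the task exits
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            break  # upstream failed, so nothing downstream is triggered
    return results


def fake_task(exit_code):
    """Hypothetical stand-in for a real task: a process that just exits."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


if __name__ == "__main__":
    workflow = [
        ("fetch", fake_task(0)),
        ("transform", fake_task(0)),
        ("send", fake_task(0)),
    ]
    print(run_linear_workflow(workflow))
```

Because each task is just an argument list handed to a child process, the same task definitions could be reused in other workflows without modification.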

Because each task waits on its upstream neighbor(s), the overall orchestration system is more intuitive: it follows the logical pattern of “if this task completes successfully, trigger the subsequent tasks.”

Monitoring Pros

There are fairly strong advantages to using monitoring in orchestration. They are highlighted by the commonly encountered situations below.

Variable Upstream Execution Time

While the run time of processes within the larger orchestration may be relatively stable from execution to execution, it is not safe to assume that will remain the case indefinitely. This is especially true when those tasks interact with an external source like a database or third-party API, where response time is outside of the developer’s control. Conversely, when operating on variable quantities of data, processes can occasionally finish more quickly than anticipated.

The orchestration system waits for an exit code, no matter how long or how short the task takes, avoiding any fixed execution time assumptions when determining when to start downstream tasks. This setup makes workflows more resilient to variability in execution time and ensures each downstream task starts as soon as its upstream task finishes.
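A small back-of-the-envelope calculation shows the difference. The sketch below compares when a downstream task would start under monitoring versus under a fixed time offset (for example, two cron entries spaced a set interval apart). The function name and the numbers are hypothetical and chosen only for illustration.

```python
def compare_start_times(upstream_seconds, fixed_offset_seconds):
    """Compare downstream start times under monitoring and a fixed offset.

    `upstream_seconds` is how long the upstream task actually took.
    `fixed_offset_seconds` is the delay a time-based schedule assumed.
    """
    return {
        # Monitoring: downstream starts the moment upstream exits.
        "monitored_start": upstream_seconds,
        # Fixed offset: downstream starts at the assumed time regardless.
        "scheduled_start": fixed_offset_seconds,
        # Upstream was still running when the schedule fired.
        "scheduled_ran_too_early": fixed_offset_seconds < upstream_seconds,
        # Upstream finished early and the schedule sat idle.
        "scheduled_idle_seconds": max(0, fixed_offset_seconds - upstream_seconds),
    }


if __name__ == "__main__":
    # A slow API response pushes the upstream task past the assumed 60 seconds.
    print(compare_start_times(upstream_seconds=90, fixed_offset_seconds=60))
    # A small batch of data lets the upstream task finish early.
    print(compare_start_times(upstream_seconds=20, fixed_offset_seconds=60))
```

In the first case the fixed schedule fires before the upstream task has finished; in the second it wastes 40 seconds. Monitoring avoids both outcomes because it never assumes a duration.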

Incomplete Upstream Execution

As an extension of the prior scenario, an upstream task may never actually complete its execution. Downstream tasks need to be able to handle this correctly. With the monitoring approach, given the orchestration system’s overall awareness of the state of each task, this would likely entail canceling the downstream processes, since their trigger condition can never be met.
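One common way to decide that a task will "never" complete is a deadline. The following sketch assumes that policy, which is our assumption rather than something the pattern itself requires. It uses the `timeout` argument of Python's `subprocess.run`, which kills the child process and raises `subprocess.TimeoutExpired` when the deadline passes. The function name and status labels are hypothetical.

```python
import subprocess
import sys


def run_with_deadline(tasks, timeout_seconds):
    """Run (name, argv) tasks in order, canceling downstream work on trouble.

    Returns a dict mapping each task name to one of:
    "success", "failed", "timed_out", or "canceled".
    """
    statuses = {name: "pending" for name, _ in tasks}
    for index, (name, argv) in enumerate(tasks):
        try:
            code = subprocess.run(argv, timeout=timeout_seconds).returncode
        except subprocess.TimeoutExpired:
            statuses[name] = "timed_out"  # the child was killed at the deadline
        else:
            statuses[name] = "success" if code == 0 else "failed"
        if statuses[name] != "success":
            # Nothing downstream can be triggered, so mark it canceled.
            for later_name, _ in tasks[index + 1:]:
                statuses[later_name] = "canceled"
            break
    return statuses


if __name__ == "__main__":
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(
        [("fetch", quick), ("transform", stuck), ("send", quick)],
        timeout_seconds=3,
    ))
```

Recording an explicit "canceled" status, rather than leaving downstream tasks pending forever, gives the user an unambiguous final state for the whole workflow.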

Variable Upstream Status

Another common situation is that individual tasks may finish with a non-zero exit status. Since the orchestration system is listening to the outputs from each task as it runs, the user is able to apply conditional rules and complex patterns to the workflow graph.

For example, if a task ends in an error, the user could opt to ignore it and trigger the downstream task anyway. Additionally, with the use of conditionals between tasks, the user can create branching and converging logic so that specific paths through the orchestration are followed according to each task’s exit status.
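Conditional triggers like these can be sketched as labeled edges in a graph. In the hypothetical example below, each edge carries a condition ("success", "failure", or "always"), and tasks are plain Python callables that return an exit code, standing in for real programs. As a simplification, a task reached by more than one path runs once, as soon as the first path reaches it. A fuller implementation would likely wait for all of its upstream tasks.

```python
def run_graph(tasks, edges, start):
    """Walk a workflow graph, following edges whose condition matches.

    `tasks` maps a task name to a zero-argument callable returning an exit code.
    `edges` is a list of (upstream, downstream, condition) tuples, where
    condition is "success", "failure", or "always".
    Returns a dict of the exit codes of every task that ran.
    """
    exit_codes = {}
    queue = [start]
    while queue:
        name = queue.pop(0)
        if name in exit_codes:
            continue  # converging paths: run each task at most once
        exit_codes[name] = tasks[name]()
        outcome = "success" if exit_codes[name] == 0 else "failure"
        for upstream, downstream, condition in edges:
            if upstream == name and condition in (outcome, "always"):
                queue.append(downstream)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical workflow: if fetching fails, alert someone, then send anyway.
    tasks = {
        "fetch": lambda: 1,      # simulate a failing fetch
        "transform": lambda: 0,
        "alert": lambda: 0,
        "send": lambda: 0,
    }
    edges = [
        ("fetch", "transform", "success"),
        ("fetch", "alert", "failure"),
        ("transform", "send", "success"),
        ("alert", "send", "always"),
    ]
    print(run_graph(tasks, edges, "fetch"))
```

Here the "always" edge plays the role of ignoring an upstream error, while the "success" and "failure" edges out of `fetch` form a branch that converges again at `send`.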

Monitoring Cons

Although the monitoring approach has significant benefits, there are still some things to consider when building out your orchestration code.

Dedicated Orchestration Logic

Because simple scheduling via a cron job is insufficient to implement the monitoring pattern, a complete suite of orchestration logic must be provided to guarantee the correct execution order. As mentioned above, it must, at a minimum, account for the exit codes and outputs generated by each task’s execution, while a more robust solution would also support branching and converging logic between tasks.

Track and Store Execution Data

Resources will need to be configured to both collect and store the data and metadata generated by each task, particularly exit codes. Collecting and managing these codes is essential to executing workflow tasks in the order set by the user’s graph. The storage of this metadata can balloon over time, particularly if you’re storing additional information related to start and end times, code changes, outputs, and more.
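As a rough illustration of what collecting and storing this metadata involves, the sketch below records each task run in a SQLite table using Python's standard library. The table name, columns, and function names are all hypothetical. A production system would likely track far more (outputs, code versions, resource usage) and would need a retention policy to keep storage in check.

```python
import sqlite3
import subprocess
import sys
import time


def open_store(path=":memory:"):
    """Open (or create) a store for task run metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS task_runs (
               id INTEGER PRIMARY KEY,
               task TEXT NOT NULL,
               exit_code INTEGER NOT NULL,
               started_at REAL NOT NULL,
               finished_at REAL NOT NULL
           )"""
    )
    return conn


def run_and_record(conn, task, argv):
    """Run one task and persist its exit code and wall-clock timestamps."""
    started_at = time.time()
    exit_code = subprocess.run(argv).returncode
    finished_at = time.time()
    with conn:  # commits the insert, or rolls back on error
        conn.execute(
            "INSERT INTO task_runs (task, exit_code, started_at, finished_at) "
            "VALUES (?, ?, ?, ?)",
            (task, exit_code, started_at, finished_at),
        )
    return exit_code


def last_exit_code(conn, task):
    """Return the most recent exit code for a task, or None if it never ran."""
    row = conn.execute(
        "SELECT exit_code FROM task_runs WHERE task = ? ORDER BY id DESC LIMIT 1",
        (task,),
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    store = open_store()
    run_and_record(store, "fetch", [sys.executable, "-c", "import sys; sys.exit(2)"])
    print(last_exit_code(store, "fetch"))
```

Keeping the exit codes in a durable store, rather than only in memory, is what lets the orchestration logic decide what to trigger next even after a restart.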

It’s also worth noting that with the monitoring system in place, it’s possible to collect a range of metadata regarding the workflow executions such as memory consumed and data uploaded. This provides the option to conduct performance analysis of the business logic and the orchestration system.

Additional Computational Requirements

The overall orchestration platform will need to be relatively non-intrusive and lightweight in construction if it coexists on the same infrastructure as the business logic. This is necessary so that the maximum amount of computing resources can be dedicated to the business logic being executed within each of the tasks. However, this setup could result in situations where computationally intensive tasks take down the entire orchestration platform, causing a loss of orchestration progress.

Some designs may opt to have the orchestration logic run on an entirely separate system from the business logic contained in the workflow tasks. The benefit here is that your orchestration logic will be better insulated from computationally expensive tasks. However, this introduces another trade-off: adopting and managing a distributed architecture, which can greatly increase complexity.

Our Approach

The ideal solution addresses all of the points listed above.

To preserve the pros, a system should accommodate unexpected execution scenarios, such as those listed in the prior sections, as well as complex branching and converging pathways. An effective answer would also need to eliminate the cons through an execution-aware and unobtrusive design.

Combining these two high-level perspectives allows the developer to build and expose a highly flexible orchestration service to the end user. This is the approach Shipyard has taken when building our orchestration service, enabling the creation of resilient workflows with minimal setup logic.

Conclusion

A monitoring-based orchestration approach provides advantages in flexibility and usability.

This allows the users of the orchestration logic to focus more on implementing business logic and actually solving problems. However, building one of these services by hand presents a significant development investment as it requires building out data collection, scheduling, and decision logic in order to be able to network individual tasks together effectively.

Options like Shipyard provide low-touch, language-agnostic, highly configurable orchestration tools without the overhead of building them yourself. Building versus buying is a balancing act, and one that should be carefully considered when working with orchestration systems.

Stay tuned for upcoming posts in this series on workflows including breakdowns of scheduling patterns, distributed systems, and more.


About Shipyard:
Shipyard is a modern data orchestration platform for data engineers to easily connect tools, automate workflows, and build a solid data infrastructure from day one.

Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build data workflows while enabling data engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.

The Shipyard team has built data products for some of the largest brands in business and deeply understands the problems that come with scale. Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams.

With a high level of concurrency and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them.

For more information, visit www.shipyardapp.com or get started for free.