Pitfalls of AWS Lambda and Serverless Data Pipelines

There are a number of considerations to account for when incorporating serverless architecture into your team's platform, particularly when the slice of the architecture in question comprises your internal data pipelines or workflow automation tools. Data is an important facet of this decision, if not the most important one.

In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful […] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

~ Linus Torvalds ~

As an example, if you have a team that is responsible for processing large PDF documents and extracting information from them programmatically, understanding the strengths and weaknesses of serverless offerings is crucial. Creating a logical representation of your data flow while maintaining a robust architecture without introducing significant DevOps overhead is possible but tricky.

AWS Lambda

Going with an AWS Lambda-first approach, for example, has its pros and cons. Setting up and running a Lambda for internal processing is relatively simple, but there are a number of drawbacks. Writing a Lambda, in any of the provided runtimes, requires adhering to a specific handler pattern in order for the function to execute (e.g. {% c-line %}def handler_name(event, context): return value{% c-line-end %} in a Python Lambda handler). This means additional infrastructure must be spun up in order to test or run the function locally, which is possible but not trivial.
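As a concrete illustration, here is a minimal sketch of that handler pattern. The payload field is made up for the example; {% c-line %}lambda_handler{% c-line-end %} matches the AWS default handler name for Python runtimes.

```python
import json


def lambda_handler(event, context):
    # `event` carries the invocation payload as a dict; `context` exposes
    # runtime metadata (remaining time, memory limit, request ID).
    # The "name" field is illustrative, not an AWS convention.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

Invoking this locally is just a function call, but replicating the `event` and `context` objects the Lambda runtime constructs is where the extra tooling comes in.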

Depending on the Lambda's runtime, third-party packages can add to setup overhead. As an example, running a Python runtime Lambda that includes the popular {% c-line %}requests{% c-line-end %} package requires pre-installing the package locally and zipping it into the deployment package. If you're not developing on a Linux environment, one option would be to set up an EC2 instance, SSH into it, install the package, zip it, and then download it to be included in the Lambda deployment package.
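Assuming the dependencies have already been installed into a local build directory (e.g. with {% c-line %}pip install requests -t build/{% c-line-end %} on a Linux-compatible host), the zipping step itself can be sketched as follows; the function name and paths are illustrative:

```python
import os
import zipfile


def make_deployment_package(build_dir: str, zip_path: str) -> None:
    """Zip a build directory (handler module plus pip-installed
    packages) into a Lambda deployment package."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(build_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Lambda expects modules at the archive root, so strip
                # the build directory prefix from each archive name.
                zf.write(full_path, os.path.relpath(full_path, build_dir))
```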

Lambda functions also have limited execution sizes, with the maximum memory configuration at 3008 MB. There is an additional 512 MB of ephemeral storage provided in the {% c-line %}/tmp{% c-line-end %} directory, but this is, as its name implies, temporary and is not guaranteed to persist between invocations. Typical workarounds to this limitation include splitting the files on some other service, such as AWS EC2, storing the pieces in S3, and fetching them directly from the Lambda. Creating chains of Lambdas via AWS Step Functions is another way to handle this.
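A hypothetical helper like the one below shows how a handler might check whether a file can be staged in {% c-line %}/tmp{% c-line-end %} at all before falling back to one of those workarounds; the helper name and sizes are assumptions for the sketch:

```python
# 512 MB of ephemeral /tmp storage per Lambda execution environment.
TMP_LIMIT_BYTES = 512 * 1024 * 1024


def fits_in_tmp(file_size_bytes: int, bytes_already_used: int = 0) -> bool:
    """Return True if a file of the given size can be staged in /tmp,
    given how much of the ephemeral storage is already in use."""
    return bytes_already_used + file_size_bytes <= TMP_LIMIT_BYTES
```

If the check fails, the file would need to be pre-split upstream (on EC2, say) and the pieces staged in S3 as described above.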

Another factor is the timeout limitation placed on a Lambda execution, with a current maximum of 15 minutes. Some scripts or processes take longer, and one workaround is to split the process into chunks that execute in Step Functions Parallel States or in fan-out patterns with subsequently invoked Lambdas. This limitation, coupled with cold start delays when invoking a Lambda less frequently, can raise reliability concerns.
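The chunking half of that fan-out workaround can be sketched in plain Python. The payload shape ({% c-line %}pages{% c-line-end %}) is hypothetical; each payload would be handed to a Parallel State branch or to a downstream Lambda invocation:

```python
from typing import Any, Dict, List


def build_fanout_payloads(pages: List[int], chunk_size: int) -> List[Dict[str, Any]]:
    """Group page numbers (e.g. of a large PDF) into per-invocation
    payloads so each chunk finishes well inside the 15-minute limit."""
    return [
        {"pages": pages[i:i + chunk_size]}
        for i in range(0, len(pages), chunk_size)
    ]
```

Choosing `chunk_size` means estimating per-page processing time against the timeout, with headroom for cold starts and retries.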

Increasing the memory configuration and timeout can alleviate these issues somewhat, but the flip side is that doing so may noticeably increase architecture costs. Since usage costs are calculated as a function of memory size, invocation count, and duration, this can become prohibitive quickly.

Shipyard

Shipyard, in many ways, is a heavy-duty version of the serverless platform offered by Lambda.

Whatever script you write and run locally can be run on the platform without any code changes or proprietary configuration. Additionally, the requirements configuration option allows for third-party packages to be included similar to how a {% c-line %}requirements.txt{% c-line-end %} file would handle them. Both available memory and execution limits are significantly higher given that Vessels were designed for heavy data loads and run in dedicated containers for each Voyage.

For processing PDFs, Tesseract could be installed directly on a Vessel, text could be pulled from large, multi-page files, and results could be posted to an external endpoint using {% c-line %}requests{% c-line-end %} or uploaded to S3 using {% c-line %}boto3{% c-line-end %}, with either package installed via the requirements configuration page in the Shipyard application.

Shipyard bridges the gap between lighter-weight serverless function services such as AWS Lambda and the non-serverless but heavier-weight options such as AWS EC2 or AWS ECS that require additional DevOps work to get up and running. In a lot of ways, Shipyard can be seen as a robust, heavy-duty platform for internal workflow automation. This design makes it easier for teams to focus on building data pipelines without having to create some of the workarounds mentioned to fit the system.
