Sunday, July 3, 2022

AWS EFS, and why it is a bad solution for AWS Batch

 



In this post we will review a disk space issue that I've encountered, and the findings I've discovered in the process.


TL;DR
AWS EFS seems like a great solution for disk space issues, but it is actually suitable only for long-lived persistence that is accessed mostly through read operations.


In the previous post, I reviewed the usage of AWS Batch, and its automation using Python's boto3 library. As the project continued, we used AWS Batch to analyze portions of big data stored in AWS S3. While testing the analysis process as a docker image running on my local development machine, everything worked great. But when running it on AWS Batch based on Fargate, we ran into out-of-disk-space issues. So, I went and looked for an AWS Batch configuration to increase the disk space allocated to the container. It seems that AWS Fargate does support this, up to 200GB, as specified here.
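For comparison, on plain ECS Fargate (outside of Batch) the extra disk is requested through the task definition's ephemeralStorage parameter. The following is a minimal boto3 sketch of that knob, with placeholder names, image, and ARN:

import boto3

ecs = boto3.client('ecs')

# Register a Fargate task definition with 200GB of ephemeral storage.
ecs.register_task_definition(
    family='analyzer-task',  # placeholder
    requiresCompatibilities=['FARGATE'],
    networkMode='awsvpc',
    cpu='1024',
    memory='2048',
    executionRoleArn='arn:aws:iam::123456789012:role/ecs-execution-role',  # placeholder
    containerDefinitions=[{
        'name': 'analyzer',
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/analyzer:latest',  # placeholder
        'essential': True,
    }],
    ephemeralStorage={'sizeInGiB': 200},  # the option missing from Batch job definitions
)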

However, the AWS development team has not (so far) added this option to the AWS Batch job definition. So I went to look for an alternative, and found AWS EFS, which seems like a great solution: unlimited storage, automatically allocated per your needs, and you pay only for what you use.

So, I created a new EFS, updated the AWS Batch job definition to use the EFS, and restarted the AWS Batch jobs.
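This setup can be automated with boto3, in the spirit of the previous post. The following is only a minimal sketch of the idea, not the exact code from the project; the job definition name, image, role ARN, and mount path are placeholder values.

import boto3

efs = boto3.client('efs')
batch = boto3.client('batch')

# Create the EFS file system; the default bursting throughput mode applies.
file_system = efs.create_file_system(
    CreationToken='batch-cache-efs',  # placeholder token
    PerformanceMode='generalPurpose',
    Encrypted=True,
)
file_system_id = file_system['FileSystemId']

# Register a Fargate-based job definition that mounts the EFS into the container.
# EFS volumes on Fargate require platform version 1.4.0 or later.
batch.register_job_definition(
    jobDefinitionName='analyze-job',  # placeholder
    type='container',
    platformCapabilities=['FARGATE'],
    containerProperties={
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/analyzer:latest',  # placeholder
        'executionRoleArn': 'arn:aws:iam::123456789012:role/batch-execution-role',  # placeholder
        'fargatePlatformConfiguration': {'platformVersion': '1.4.0'},
        'resourceRequirements': [
            {'type': 'VCPU', 'value': '1'},
            {'type': 'MEMORY', 'value': '2048'},
        ],
        'volumes': [{
            'name': 'efs-cache',
            'efsVolumeConfiguration': {
                'fileSystemId': file_system_id,
                'rootDirectory': '/',
                'transitEncryption': 'ENABLED',
            },
        }],
        'mountPoints': [{
            'sourceVolume': 'efs-cache',
            'containerPath': '/cache',  # placeholder mount path
            'readOnly': False,
        }],
    },
)

Note that before any container can actually mount the file system, you also need EFS mount targets in the relevant subnets (efs.create_mount_target), with a security group that allows NFS traffic from the Batch tasks.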


... but ... wait ... why are the jobs stuck for so long??


Digging into the AWS EFS documentation, I located the EFS quotas guide. These quotas limit the throughput of read and write operations on the EFS according to the size of the used space on the EFS. Once you hit the quota, you're stuck until the next timeframe, when you get an additional quota, use it up, and get stuck again.
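You can actually watch this throttling happen: a file system in the default bursting mode reports a BurstCreditBalance metric to CloudWatch, and once it nears zero, throughput drops to the baseline derived from the stored data size. A small sketch of checking it with boto3 (the file system ID is a placeholder):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Fetch the minimum burst credit balance (in bytes) over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/EFS',
    MetricName='BurstCreditBalance',
    Dimensions=[{'Name': 'FileSystemId', 'Value': 'fs-12345678'}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Minimum'],
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Minimum'])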

In my case, I used the EFS as temporary storage for cache data, and each AWS Batch job read and wrote about 200GB multiple times. This means that the jobs got stuck very quickly.

A possible solution would be to provision the required read/write throughput quota in advance. But the price for this is extremely high: roughly the same as if you had actually stored the amount of data that would have granted you this quota in the first place.
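For completeness, switching to provisioned throughput is a single API call. A sketch, assuming a placeholder file system ID and throughput value:

import boto3

efs = boto3.client('efs')

# Switch the file system to provisioned throughput mode.
# You pay per provisioned MiB/s per month, whether you use it or not.
efs.update_file_system(
    FileSystemId='fs-12345678',          # placeholder
    ThroughputMode='provisioned',
    ProvisionedThroughputInMibps=256.0,  # placeholder value
)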

Eventually, I gave up on the disk space caching mechanism, and changed the algorithm to use much less storage, at the cost of slower processing.


Final Note

First, I hope that the AWS team will add support for specifying disk space requirements for AWS Batch jobs. This would surely simplify the whole mess described here.

Second, AWS EFS seems to be unfit for short-lived storage used for many read and write operations. It is better suited for long-lived persistence that is mostly read, with a low rate of read/write operations.

