In this post we will review a disk space issue I encountered, and the findings I discovered along the way.
TL;DR
AWS EFS seems like a great solution for disk space issues, but it is really suitable only for long-lived storage that is accessed mostly with read operations.
In the previous post, I reviewed the use of AWS Batch and its automation with Python's boto3 library. As the project continued, we used AWS Batch to analyze portions of a large dataset stored in AWS S3. While testing the analysis process as a Docker image on my local development machine, everything worked great. But when running it on AWS Batch based on Fargate, we ran out of disk space. So I went looking for an AWS Batch configuration to increase the disk space allocated to the container. AWS Fargate does support this, up to 200 GB, as specified here.
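For reference, on plain ECS this limit is exposed as the task definition's ephemeralStorage parameter. Here is a minimal boto3 sketch; the family name, sizing values, and image are hypothetical placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# On plain ECS (outside of AWS Batch), a Fargate task definition can
# request up to 200 GiB of ephemeral storage directly.
# The family, cpu/memory values, and image are placeholders.
ecs.register_task_definition(
    family="analysis-task",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="8192",
    ephemeralStorage={"sizeInGiB": 200},
    containerDefinitions=[
        {
            "name": "analysis",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/analysis:latest",
        }
    ],
)
```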
However, the AWS development team has not (so far) exposed this option in the AWS Batch job definition. So I went looking for an alternative and found AWS EFS, which seems like a great solution: unlimited storage, automatically allocated as you need it, and you pay only for what you use.
So I created a new EFS file system, updated the AWS Batch job definition to mount it, and restarted the jobs.
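For illustration, mounting an EFS file system in a Fargate-based Batch job definition looks roughly like the following boto3 sketch; the job definition name, image, role ARN, file system ID, and mount path are hypothetical placeholders:

```python
import boto3

batch = boto3.client("batch")

# Sketch of a Fargate-based Batch job definition that mounts an EFS
# file system at /mnt/cache. All names, ARNs, and IDs are placeholders.
batch.register_job_definition(
    jobDefinitionName="analysis-job",
    type="container",
    platformCapabilities=["FARGATE"],
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/analysis:latest",
        "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "8192"},
        ],
        "volumes": [
            {
                "name": "efs-cache",
                "efsVolumeConfiguration": {"fileSystemId": "fs-0123456789abcdef0"},
            }
        ],
        "mountPoints": [
            {"sourceVolume": "efs-cache", "containerPath": "/mnt/cache"}
        ],
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
    },
)
```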
... but... wait... why are the jobs stuck for so long??
Digging into the AWS EFS documentation, I located the EFS quotas guide. EFS limits read and write throughput according to the amount of space used on the file system: the baseline rate scales with the stored data, with a limited burst allowance on top of it. Once you exhaust that allowance, you're throttled until the next time window grants additional burst credits, which you consume, and then you're throttled again.
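A quick way to see whether a file system has run out of burst credits is to watch its BurstCreditBalance metric in CloudWatch. A minimal boto3 sketch, assuming a hypothetical file system ID:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Fetch the EFS burst credit balance over the last hour; a balance near
# zero means reads and writes are throttled down to the baseline rate.
# The file system ID is a placeholder.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```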
In my case, I used the EFS as temporary storage for cache data, and each AWS Batch job wrote and read about 200 GB multiple times. This means the jobs got throttled very quickly.
The solution would be to provision the required read and write throughput in advance. But the price for this is extremely high, roughly as if you were paying for the amount of disk space that would have earned you that throughput in the first place.
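For reference, switching a file system to provisioned throughput is a single boto3 call; a sketch, with a hypothetical file system ID and throughput value:

```python
import boto3

efs = boto3.client("efs")

# Switch to provisioned throughput mode; you pay for the provisioned
# MiB/s whether you use it or not. The ID and value are placeholders.
efs.update_file_system(
    FileSystemId="fs-0123456789abcdef0",
    ThroughputMode="provisioned",
    ProvisionedThroughputInMibps=256.0,
)
```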
Eventually, I gave up on the disk-based caching mechanism and changed the algorithm to use much less storage, at the cost of slower processing.