
Saturday, June 14, 2025

AWS Batch for GPU Workloads




The Requirement

Recently I had a requirement to run workloads on GPUs.
I wanted to use AWS Batch jobs rather than a static EC2 instance, for two reasons:

Reason 1

The workloads are sporadic: for example, we might need to run them 10 times in one day, and then not at all for a week. I did not want to manage starting and stopping a GPU-based EC2 instance, and I did not want to leave an expensive EC2 instance running idle.

Reason 2

In some cases, I want to run 100 workloads in parallel. Using EC2 directly would mean either running them sequentially or managing multiple EC2 instances.



Some History


So I started checking AWS Batch support for GPU. I have already used AWS Batch in several projects (see my earlier posts), hence I assumed it would be quick and easy.



Partial & Unfriendly AWS Support

Surprisingly, I found that AWS Batch support for GPU is partial and problematic.

Lack of documentation

The documentation for AWS Batch GPU support is scarce. This is pretty weird, since GPU is THE hottest topic today, and I would expect it to be more detailed. I ended up collecting information from ChatGPT as well as from some external sites.

Compute Environment

The simplest mode of AWS Batch is based on Fargate, which includes automatic management of the compute environment. This mode is not supported for GPU-based containers; the only supported modes are EC2-based and Kubernetes (EKS)-based. So I went with the EC2-based mode, which does use EC2 instances, but the instances are automatically started and stopped on demand (see the CloudFormation stack below).

Multi Containers

I wanted to have two containers in each AWS Batch job: one management container that I wanted running on CPU, and a second container running the LLM models on GPU. Unfortunately, while I was able to set up this configuration, the job got stuck in its initial statuses and never reached the RUNNING status. It looks like something in AWS Batch cannot handle multiple containers mixing CPU and GPU. I ended up combining the two containers into a single container running two processes, which is a bad practice, but the best I could do (see the sketch below).
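For reference, here is a minimal sketch of the kind of supervisor that can run both processes inside the single combined container. The binary paths (/app/llm_server, /app/manager) are placeholders, not the actual ones from my project.

package main

import (
	"log"
	"os"
	"os/exec"
)

// startProcess launches a child process that inherits stdout/stderr,
// so both processes' logs end up in the container (CloudWatch) logs.
func startProcess(path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start %s: %v", path, err)
	}
	return cmd
}

func main() {
	// Placeholder paths: the GPU-based LLM server and the CPU-based manager.
	llm := startProcess("/app/llm_server")
	manager := startProcess("/app/manager")

	// When the manager exits, stop the LLM process so the job terminates
	// and AWS Batch can report the result.
	if err := manager.Wait(); err != nil {
		log.Printf("manager exited with error: %v", err)
	}
	_ = llm.Process.Kill()
	_ = llm.Wait()
}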

AWS Console Issues

I made all the configurations in the AWS console GUI and started the AWS Batch job, but then I got an error saying the ECS configuration cannot be used.
What?? I am using an EC2-based configuration, why do I get an error about ECS??
Eventually I configured the AWS Batch resources using AWS CloudFormation, and it worked fine. I think this is some kind of bug in the AWS console GUI.

Guessing your problems

I finally managed to start an AWS Batch job, but then the job got stuck in the RUNNABLE status and did not proceed to the RUNNING status. AWS provides zero assistance for such an issue; I really had to dig into it by guessing the possible problems. Eventually I found that the EC2 instance automatically started by AWS Batch was unable to register itself into the compute environment's underlying ECS cluster, since it did not have a public IPv4 address allocated. I updated the configuration to enable public IPv4 allocation (AssociatePublicIpAddress in the launch template below), and it worked.
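As an illustration of the kind of digging this required, here is a minimal diagnostic sketch (my own addition, not from the original setup) that lists the ECS clusters and their registered container instances. An EC2-based Batch compute environment creates such a cluster behind the scenes, and a job stuck in RUNNABLE with zero registered instances usually points at an instance registration problem like the one above.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecs.NewFromConfig(cfg)

	// List all ECS clusters, including the one AWS Batch manages
	// for the EC2 compute environment.
	clusters, err := client.ListClusters(ctx, &ecs.ListClustersInput{})
	if err != nil {
		log.Fatal(err)
	}

	for _, clusterArn := range clusters.ClusterArns {
		// Count the container instances that actually registered into the cluster.
		instances, err := client.ListContainerInstances(ctx, &ecs.ListContainerInstancesInput{
			Cluster: &clusterArn,
		})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d registered container instances\n",
			clusterArn, len(instances.ContainerInstanceArns))
	}
}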



Final Results

Eventually I got it working. The solution includes the following:

CloudFormation Stack

AWSTemplateFormatVersion: '2010-09-09'

Resources:

  ECSBatchJobRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ECSBatchJobRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: BatchInstanceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: BatchInstanceProfile
      Roles:
        - Ref: BatchInstanceRole

  BatchLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: batch-launch-template
      LaunchTemplateData:
        KeyName: alon
        NetworkInterfaces:
          - DeviceIndex: 0
            AssociatePublicIpAddress: true
            SubnetId: subnet-12345678912345678
            Groups:
              - sg-123456789aaaaaaaaa
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 200
              VolumeType: gp3
              DeleteOnTermination: true
        IamInstanceProfile:
          Name: !Ref BatchInstanceProfile

  GpuComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeEnvironmentName: GpuEC2ComputeEnvironment
      ComputeResources:
        Type: EC2
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        MinvCpus: 0
        MaxvCpus: 16
        DesiredvCpus: 0
        InstanceTypes:
          - g4dn.2xlarge
        Subnets:
          - subnet-12345678912345678
        InstanceRole: !Ref BatchInstanceProfile
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplate
          Version: !GetAtt BatchLaunchTemplate.LatestVersionNumber
      ServiceRole: !GetAtt BatchServiceRole.Arn
      State: ENABLED

  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AWSBatchServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: batch.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

  GpuJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: GpuJobQueue
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref GpuComputeEnvironment

  GpuJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: GpuJobDefinition
      Type: container
      PlatformCapabilities:
        - EC2
      ContainerProperties:
        Image: XXX.dkr.ecr.us-east-1.amazonaws.com/YYY:latest
        Vcpus: 1
        Memory: 4096
        ResourceRequirements:
          - Type: GPU
            Value: "1"
        Command:
          - nvidia-smi
        JobRoleArn: !GetAtt ECSBatchJobRole.Arn
and deploy it using the AWS CLI:
aws cloudformation deploy \
--template-file batch_cloud_formation.yaml \
--stack-name gpu-batch-setup \
--capabilities CAPABILITY_NAMED_IAM

Start the AWS Batch Job from Go

// Create a Batch client from the shared AWS session (project helper).
batchClient := batch.NewFromConfig(*awsSession.GetSession())

// Submit a job to the GPU job queue, overriding the container command
// and environment per workload.
input := batch.SubmitJobInput{
	JobName:       aws.String(jobName),
	JobQueue:      aws.String(jobQueue),
	JobDefinition: aws.String(jobDefinition),
	ContainerOverrides: &types.ContainerOverrides{
		Command:     containerConfig.ContainerCommandLine,
		Environment: b.createContainerEnvironment(containerConfig.ContainerEnvironment),
	},
}

_, err := batchClient.SubmitJob(context.Background(), &input)
kiterr.RaiseIfError(err)
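If you also need to wait for the job to finish, here is a minimal polling sketch of my own (not part of the original code). It assumes you capture the SubmitJob output instead of discarding it, so that jobID comes from out.JobId, and that the batch, types, context and time packages are available along with the same kiterr helper.

// waitForJob polls the Batch job status until it reaches a terminal state.
func waitForJob(batchClient *batch.Client, jobID string) {
	for {
		out, err := batchClient.DescribeJobs(context.Background(), &batch.DescribeJobsInput{
			Jobs: []string{jobID},
		})
		kiterr.RaiseIfError(err)

		// SUCCEEDED and FAILED are the two terminal job statuses.
		status := out.Jobs[0].Status
		if status == types.JobStatusSucceeded || status == types.JobStatusFailed {
			return
		}
		time.Sleep(30 * time.Second)
	}
}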



Final Note

Running workloads using AWS Batch is great, but GPU-based Batch is still a bit immature. Even after going through these issues, it is still one of the best methods to accomplish this.
I believe AWS will provide better support for this in the future.
