The Requirement
Recently I've had a requirement to run workloads on GPUs.
I wanted to use AWS Batch jobs rather than a static EC2 instance, for two reasons:
Reason 1
The workloads are sporadic: for example, we might need to run them 10 times one day, and then not at all for a week. I did not want to manage starting and stopping a GPU-based EC2 instance, and I did not want to leave an expensive EC2 instance running idle.
Reason 2
In some cases, I want to run 100 workloads in parallel. Using EC2 would mean either running them sequentially or managing multiple EC2 instances.
Some History
So I started checking AWS Batch support for GPUs. I've already used AWS Batch in several projects, hence I assumed it would be quick and easy.
Partial and Unfriendly AWS Support
Surprisingly, I found that the AWS Batch support for GPUs is partial and problematic.
Lack of documentation
The documentation for AWS Batch GPU support is scarce. This is pretty weird since GPUs are THE hottest topic today, and I would expect it to be more detailed. I ended up collecting information from ChatGPT as well as some external sites.
Compute Environment
The simplest mode of AWS Batch is based on Fargate, which includes automatic management of the compute environment. This mode is not supported for GPU-based containers. The only supported modes are EC2-based and Kubernetes-based. So I decided to go with the EC2-based mode, which does use EC2 instances, but the instances are automatically started and stopped on demand.
Multi Containers
I wanted to have two containers in each AWS Batch job: one management container that I wanted running on a CPU, and one LLM models container running on a GPU. Unfortunately, while I was able to set up this configuration, when starting the AWS Batch job it failed to progress past the STARTING status. It looks like something in AWS Batch cannot handle multiple containers with mixed CPU/GPU requirements. I ended up combining the two containers into a single container running two processes, which is a bad practice, but the best I could do.
AWS Console Issues
I made all the configurations in the GUI of the AWS console and started the AWS Batch job, but then I got an error that the ECS configuration cannot be used.
What?? I was using an EC2-based configuration, why do I get an error about ECS??
Eventually I configured the AWS Batch resources using AWS CloudFormation, and it worked fine. I think this is some kind of bug in the AWS console GUI.
Guessing your problems
I finally made it to starting an AWS Batch job, but then the job got stuck in the RUNNABLE status and did not proceed to the RUNNING status. AWS provides zero assistance for such an issue; I really had to dig into this by guessing the possible problems. Eventually I found that the EC2 instance automatically started by AWS Batch was unable to register itself into the AWS Batch compute environment, since it did not have a public IPv4 address allocated. I updated the configuration to enable public IPv4 allocation, and it worked.
Final Results
Eventually it was working. The solution includes the following:
CloudFormation Stack
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ECSBatchJobRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ECSBatchJobRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: BatchInstanceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: BatchInstanceProfile
      Roles:
        - Ref: BatchInstanceRole
  BatchLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: batch-launch-template
      LaunchTemplateData:
        KeyName: alon
        NetworkInterfaces:
          - DeviceIndex: 0
            AssociatePublicIpAddress: true
            SubnetId: subnet-12345678912345678
            Groups:
              - sg-123456789aaaaaaaaa
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 200
              VolumeType: gp3
              DeleteOnTermination: true
        IamInstanceProfile:
          Name: !Ref BatchInstanceProfile
  GpuComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeEnvironmentName: GpuEC2ComputeEnvironment
      ComputeResources:
        Type: EC2
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        MinvCpus: 0
        MaxvCpus: 16
        DesiredvCpus: 0
        InstanceTypes:
          - g4dn.2xlarge
        Subnets:
          - subnet-12345678912345678
        InstanceRole: !Ref BatchInstanceProfile
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplate
          Version: !GetAtt BatchLaunchTemplate.LatestVersionNumber
      ServiceRole: !GetAtt BatchServiceRole.Arn
      State: ENABLED
  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AWSBatchServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: batch.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
  GpuJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: GpuJobQueue
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref GpuComputeEnvironment
  GpuJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: GpuJobDefinition
      Type: container
      PlatformCapabilities:
        - EC2
      ContainerProperties:
        Image: XXX.dkr.ecr.us-east-1.amazonaws.com/YYY:latest
        Vcpus: 1
        Memory: 4096
        ResourceRequirements:
          - Type: GPU
            Value: "1"
        Command:
          - nvidia-smi
        JobRoleArn: !GetAtt ECSBatchJobRole.Arn
and deploy it using the AWS CLI:
aws cloudformation deploy \
--template-file batch_cloud_formation.yaml \
--stack-name gpu-batch-setup \
--capabilities CAPABILITY_NAMED_IAM
Start the AWS Batch Job from Go
batchClient := batch.NewFromConfig(*awsSession.GetSession())
input := batch.SubmitJobInput{
    JobName:       aws.String(jobName),
    JobQueue:      aws.String(jobQueue),
    JobDefinition: aws.String(jobDefinition),
    ContainerOverrides: &types.ContainerOverrides{
        Command:     containerConfig.ContainerCommandLine,
        Environment: b.createContainerEnvironment(containerConfig.ContainerEnvironment),
    },
}
_, err := batchClient.SubmitJob(context.Background(), &input)
kiterr.RaiseIfError(err)
Final Note
Running workloads using AWS Batch is great, but GPU-based Batch is still a bit immature. Even after going through all these issues, though, it is still one of the best methods to accomplish this.
I believe AWS will provide better support for this in the future.