
Saturday, June 14, 2025

AWS Batch for GPU Workloads




The Requirement

Recently I had a requirement to run workloads on GPUs.
I wanted to use AWS Batch jobs rather than a static EC2 instance, for two reasons:

Reason 1

The workloads are sporadic: for example, we might need to run them 10 times in one day, and then not at all for a week. I did not want to manage starting and stopping a GPU-based EC2 instance, and I did not want to leave an expensive EC2 instance running idle.

Reason 2

In some cases, I want to run 100 workloads in parallel. Using EC2 directly would mean either running them sequentially or managing multiple EC2 instances.



Some History


So I started checking AWS Batch support for GPU. I have already used AWS Batch in several projects (see my earlier posts), hence I assumed it would be quick and easy.



Partial & Unfriendly AWS Support

Surprisingly, I found that AWS Batch support for GPU is partial and problematic.

Lack of documentation

The documentation for AWS Batch GPU support is scarce. This is pretty weird, since GPU is THE hottest topic today, and I would expect it to be more detailed. I ended up collecting information from ChatGPT as well as from some external sites.

Compute Environment

The simplest mode of AWS Batch is based on Fargate, which includes automatic management of the compute environment. This mode is not supported for GPU-based containers; the only supported modes are EC2-based and Kubernetes (EKS)-based. So I went with the EC2-based mode, which does use EC2 instances, but the instances are automatically started and stopped on demand (see the CloudFormation stack below).

Multi Containers

I wanted to have two containers in each AWS Batch job: one management container that I wanted running on CPU, and a second container running the LLM models on GPU. Unfortunately, while I was able to set up this configuration, the job got stuck in its initial statuses and never reached the RUNNING status. It looks like something in AWS Batch cannot handle multiple containers mixing CPU and GPU. I ended up combining the two containers into a single container running two processes, which is a bad practice, but the best I could do (see the sketch below).
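For reference, here is a minimal sketch of the kind of supervisor that can run both processes inside the single combined container. The binary paths (/app/llm_server, /app/manager) are placeholders, not the actual ones from my project.

package main

import (
	"log"
	"os"
	"os/exec"
)

// startProcess launches a child process that inherits stdout/stderr,
// so both processes' logs end up in the container (CloudWatch) logs.
func startProcess(path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start %s: %v", path, err)
	}
	return cmd
}

func main() {
	// Placeholder paths: the GPU-based LLM server and the CPU-based manager.
	llm := startProcess("/app/llm_server")
	manager := startProcess("/app/manager")

	// When the manager exits, stop the LLM process so the job terminates
	// and AWS Batch can report the result.
	if err := manager.Wait(); err != nil {
		log.Printf("manager exited with error: %v", err)
	}
	_ = llm.Process.Kill()
	_ = llm.Wait()
}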

AWS Console Issues

I made all the configurations in the AWS console GUI and started the AWS Batch job, but then I got an error saying the ECS configuration cannot be used.
What?? I am using an EC2-based configuration, why do I get an error about ECS??
Eventually I configured the AWS Batch resources using AWS CloudFormation, and it worked fine. I think this is some kind of bug in the AWS console GUI.

Guessing your problems

I finally managed to start an AWS Batch job, but then the job got stuck in the RUNNABLE status and did not proceed to the RUNNING status. AWS provides zero assistance for such an issue; I really had to dig into it by guessing the possible problems. Eventually I found that the EC2 instance automatically started by AWS Batch was unable to register itself into the compute environment's underlying ECS cluster, since it did not have a public IPv4 address allocated. I updated the configuration to enable public IPv4 allocation (AssociatePublicIpAddress in the launch template below), and it worked.
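As an illustration of the kind of digging this required, here is a minimal diagnostic sketch (my own addition, not from the original setup) that lists the ECS clusters and their registered container instances. An EC2-based Batch compute environment creates such a cluster behind the scenes, and a job stuck in RUNNABLE with zero registered instances usually points at an instance registration problem like the one above.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecs.NewFromConfig(cfg)

	// List all ECS clusters, including the one AWS Batch manages
	// for the EC2 compute environment.
	clusters, err := client.ListClusters(ctx, &ecs.ListClustersInput{})
	if err != nil {
		log.Fatal(err)
	}

	for _, clusterArn := range clusters.ClusterArns {
		// Count the container instances that actually registered into the cluster.
		instances, err := client.ListContainerInstances(ctx, &ecs.ListContainerInstancesInput{
			Cluster: &clusterArn,
		})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d registered container instances\n",
			clusterArn, len(instances.ContainerInstanceArns))
	}
}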



Final Results

Eventually I got it working. The solution includes the following:

CloudFormation Stack

AWSTemplateFormatVersion: '2010-09-09'

Resources:

  ECSBatchJobRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ECSBatchJobRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: BatchInstanceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: BatchInstanceProfile
      Roles:
        - Ref: BatchInstanceRole

  BatchLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: batch-launch-template
      LaunchTemplateData:
        KeyName: alon
        NetworkInterfaces:
          - DeviceIndex: 0
            AssociatePublicIpAddress: true
            SubnetId: subnet-12345678912345678
            Groups:
              - sg-123456789aaaaaaaaa
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 200
              VolumeType: gp3
              DeleteOnTermination: true
        IamInstanceProfile:
          Name: !Ref BatchInstanceProfile

  GpuComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeEnvironmentName: GpuEC2ComputeEnvironment
      ComputeResources:
        Type: EC2
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        MinvCpus: 0
        MaxvCpus: 16
        DesiredvCpus: 0
        InstanceTypes:
          - g4dn.2xlarge
        Subnets:
          - subnet-12345678912345678
        InstanceRole: !Ref BatchInstanceProfile
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplate
          Version: !GetAtt BatchLaunchTemplate.LatestVersionNumber
      ServiceRole: !GetAtt BatchServiceRole.Arn
      State: ENABLED

  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AWSBatchServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: batch.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

  GpuJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: GpuJobQueue
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref GpuComputeEnvironment

  GpuJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: GpuJobDefinition
      Type: container
      PlatformCapabilities:
        - EC2
      ContainerProperties:
        Image: XXX.dkr.ecr.us-east-1.amazonaws.com/YYY:latest
        Vcpus: 1
        Memory: 4096
        ResourceRequirements:
          - Type: GPU
            Value: "1"
        Command:
          - nvidia-smi
        JobRoleArn: !GetAtt ECSBatchJobRole.Arn
and deploy it using the AWS CLI:
aws cloudformation deploy \
--template-file batch_cloud_formation.yaml \
--stack-name gpu-batch-setup \
--capabilities CAPABILITY_NAMED_IAM

Start the AWS Batch Job from Go

// Create a Batch client from the shared AWS session (project helper).
batchClient := batch.NewFromConfig(*awsSession.GetSession())

// Submit a job to the GPU job queue, overriding the container command
// and environment per workload.
input := batch.SubmitJobInput{
	JobName:       aws.String(jobName),
	JobQueue:      aws.String(jobQueue),
	JobDefinition: aws.String(jobDefinition),
	ContainerOverrides: &types.ContainerOverrides{
		Command:     containerConfig.ContainerCommandLine,
		Environment: b.createContainerEnvironment(containerConfig.ContainerEnvironment),
	},
}

_, err := batchClient.SubmitJob(context.Background(), &input)
kiterr.RaiseIfError(err)
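If you also need to wait for the job to finish, here is a minimal polling sketch of my own (not part of the original code). It assumes you capture the SubmitJob output instead of discarding it, so that jobID comes from out.JobId, and that the batch, types, context and time packages are available along with the same kiterr helper.

// waitForJob polls the Batch job status until it reaches a terminal state.
func waitForJob(batchClient *batch.Client, jobID string) {
	for {
		out, err := batchClient.DescribeJobs(context.Background(), &batch.DescribeJobsInput{
			Jobs: []string{jobID},
		})
		kiterr.RaiseIfError(err)

		// SUCCEEDED and FAILED are the two terminal job statuses.
		status := out.Jobs[0].Status
		if status == types.JobStatusSucceeded || status == types.JobStatusFailed {
			return
		}
		time.Sleep(30 * time.Second)
	}
}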



Final Note

Running workloads using AWS Batch is great, but GPU-based Batch is still a bit immature. Even after going through these issues, it is still one of the best methods to accomplish this.
I believe AWS will provide better support for this in the future.
