




How To Fix - Batch Job Stuck in Runnable Status in AWS?



In this post, we will explore - How To Fix - Batch Job Stuck in Runnable Status in AWS. Sometimes your AWS Batch job seems to be stuck in the "RUNNABLE" state forever. This means your job is not able to move beyond the RUNNABLE status, possibly for various reasons. Let's see how we can fix such an issue. Before you proceed, know the below facts about the RUNNABLE job state -

  • Such a job resides in the job queue.
  • It has no outstanding dependencies and is ready to be scheduled to a host.
  • As soon as sufficient resources become available, it starts.
  • But when sufficient resources are unavailable, the job remains in this state indefinitely.
  Let's explore various aspects which could cause this issue.  

  • Security Group
Are there any restrictive outbound rules in the security group attached to the Batch compute environment? If yes, try allowing all outbound traffic (0.0.0.0/0) in the security group and see if that helps.
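As a sketch of the check above, you can inspect and temporarily open up the egress rules with the AWS CLI. The group ID below is a placeholder, and these are command fragments that need live AWS credentials:

```shell
# Placeholder security group ID - substitute the one attached to your compute environment
SG_ID="sg-0123456789abcdef0"

# List the current egress (outbound) rules on the group
aws ec2 describe-security-groups --group-ids "$SG_ID" \
  --query 'SecurityGroups[0].IpPermissionsEgress'

# Temporarily allow all outbound traffic to rule out the security group as the cause
aws ec2 authorize-security-group-egress --group-id "$SG_ID" \
  --protocol -1 --cidr 0.0.0.0/0
```

If the job starts after this, tighten the rules back to only the endpoints your containers actually need.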

  • Resource Unavailability or Insufficiency
While creating the job, are you specifying more CPU, RAM etc. than is allocated? e.g. if your Batch job tries to get 8 GB RAM whereas the allocated amount is 4 GB, then the job will not run. Also check whether the compute environment is showing "INVALID": AWS Batch -> Compute Environments -> Status
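The 8 GB vs 4 GB mismatch above can be made concrete with a quick shell check. The numbers here are hypothetical; in practice you would read them from `aws batch describe-job-definitions` and from the instance types of your compute environment:

```shell
# Hypothetical figures: the job definition asks for 8 GB, but an instance
# in the compute environment can only offer 4 GB to a container.
REQUESTED_MIB=8192
AVAILABLE_MIB=4096

if [ "$REQUESTED_MIB" -gt "$AVAILABLE_MIB" ]; then
  # This is exactly the situation that parks a job in RUNNABLE
  echo "stuck in RUNNABLE: requested ${REQUESTED_MIB} MiB > available ${AVAILABLE_MIB} MiB"
else
  echo "resource request fits"
fi
```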

  • Docker Images
Are you including the tag in the image reference if using a Docker image? Are you using EcsInstanceRole or EcsInstanceProfile as the InstanceRole?
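A quick way to spot a missing tag, using a hypothetical ECR image reference - if no tag is given, Docker silently assumes `:latest`, which may not be the image you built for this job:

```shell
# Hypothetical image reference from a Batch job definition
IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-batch-app:v1.2"

# Strip the registry/repository path first (a registry host may contain a port
# colon), then look for an explicit tag after the last colon.
case "${IMAGE##*/}" in
  *:*) echo "tag present: ${IMAGE##*:}" ;;
  *)   echo "no tag - :latest will be assumed" ;;
esac
```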

  • Correct Roles
The roles must be defined with, at the minimum, the required policies and trust relationships; otherwise they do not get enough privileges to run the job. If you are using a Launch Template, it must include IamInstanceProfile and SecurityGroupIds for the job to start.
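As a minimal sketch, this is the kind of trust relationship the ECS instance role needs so that EC2 instances can assume it. The role and file names are illustrative, and the `aws iam` calls are shown as comments since they need credentials:

```shell
# Write a trust policy allowing EC2 to assume the instance role (sketch)
cat > ecs-instance-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Then create the role and attach the managed ECS policy (not run here):
# aws iam create-role --role-name ecsInstanceRole \
#     --assume-role-policy-document file://ecs-instance-trust.json
# aws iam attach-role-policy --role-name ecsInstanceRole \
#     --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
```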

  • Log Not Configured
Have you configured the AWS compute resources to use the awslogs log driver? Verify that the awslogs log driver is listed in the ECS_AVAILABLE_LOGGING_DRIVERS environment variable. This is required because AWS Batch jobs send their logs to CloudWatch. Logs for RUNNING jobs are available in CloudWatch Logs: the log group is /aws/batch/job, and the log stream name format is



first200CharsOfJobDefinitionName/default/ecs_task_id
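A sketch of the ECS agent configuration that makes awslogs available. On the instance this would live in /etc/ecs/ecs.config (typically written from EC2 user data); here it is written to a local file purely for illustration:

```shell
# ECS agent config enabling the awslogs driver alongside the default json-file
# (on a real instance, the target path is /etc/ecs/ecs.config)
cat > ecs.config <<'EOF'
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
EOF
```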


 

  • Launch Template
The default AWS AMI snapshots need at least 30 GB of storage. If you don't have a launch template, CloudFormation uses the correct storage size. But if you are using your own launch template, increase the storage to at least 30 GB and see if that works.
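A hedged sketch of launch template data that gives the root volume the 30 GB mentioned above. The template name is hypothetical, and the device name /dev/xvda is typical for Amazon Linux 2 ECS-optimized AMIs - adjust it for your AMI:

```shell
# Block device mapping sizing the root volume to 30 GiB (sketch)
cat > launch-template-data.json <<'EOF'
{
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvda",
      "Ebs": { "VolumeSize": 30, "VolumeType": "gp3", "DeleteOnTermination": true }
    }
  ]
}
EOF

# Then (needs credentials, not run here):
# aws ec2 create-launch-template --launch-template-name batch-30g \
#     --launch-template-data file://launch-template-data.json
```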

  • Internet access
Check if the AWS resources have Internet access, either through a VPC endpoint, a public IP, or network address translation (NAT). Also check the "Enable auto-assign public IPv4 address" setting on your compute environment's subnet.
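Both checks can be done from the CLI. The subnet ID is a placeholder and these fragments need live AWS credentials:

```shell
# Placeholder subnet ID - use the subnet configured on the compute environment
SUBNET_ID="subnet-0123456789abcdef0"

# Does the subnet auto-assign public IPv4 addresses?
aws ec2 describe-subnets --subnet-ids "$SUBNET_ID" \
  --query 'Subnets[0].MapPublicIpOnLaunch'

# Does the subnet's route table have a route to an internet or NAT gateway?
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'RouteTables[0].Routes'
```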

  • EC2 Instance Limit & Auto-Scale
Check if you have crossed the limit of EC2 instances. Check the Auto Scaling group to find the actual error that is preventing instances from being started.
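The scaling activity history usually contains the real error message (for example, an exceeded EC2 instance limit). The group name below is a placeholder - AWS Batch creates the Auto Scaling group for the compute environment, so look its name up first:

```shell
# Placeholder name - find the real one in the EC2 console or via
# `aws autoscaling describe-auto-scaling-groups`
ASG_NAME="my-batch-compute-env-asg"

# StatusMessage shows why an instance launch failed
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name "$ASG_NAME" \
  --query 'Activities[].[StatusCode,StatusMessage]'
```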

  • ECS Container Agent
Which Linux AMI are you using? Is the AWS ECS container agent installed, and is it working? It must be installed and running for Batch jobs to execute.

Hope this helps to fix the issue.
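To verify the agent as discussed above, run these on the compute environment's EC2 instance (e.g. via SSH or Session Manager); the service name applies to the Amazon Linux 2 ECS-optimized AMI:

```shell
# Is the ECS agent service running?
sudo systemctl status ecs

# Query the agent's local introspection endpoint - a healthy agent responds
# with the cluster the instance has joined
curl -s http://localhost:51678/v1/metadata
```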



