




How To Fix - Batch Job Stuck in Runnable Status in AWS?



In this post, we will explore - How To Fix - Batch Job Stuck in Runnable Status in AWS. Sometimes your AWS Batch job seems to be stuck in the "RUNNABLE" state forever. This means your job is not able to move beyond the RUNNABLE status, possibly for various reasons. Let's see how we can fix such an issue. Before you proceed, know the below facts about the RUNNABLE job state -

  • Such a job resides in the job queue.
  • It has no outstanding dependencies and is ready to be scheduled to a host.
  • As soon as sufficient resources become available, it starts.
  • But when sufficient resources are unavailable, the job remains in this state indefinitely.
  Let's explore various aspects which could cause this issue.  

  • Security Group
Are there any restrictive outbound rules in the security group attached to the Batch compute environment? If yes, try allowing all outbound traffic (0.0.0.0/0) in the security group and see if that helps.
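As a sketch of the check above, you can inspect and temporarily open up the egress rules with the AWS CLI. The group ID below is a placeholder, and these are command fragments that need live AWS credentials:

```shell
# Placeholder security group ID - substitute the one attached to your compute environment
SG_ID="sg-0123456789abcdef0"

# List the current egress (outbound) rules on the group
aws ec2 describe-security-groups --group-ids "$SG_ID" \
  --query 'SecurityGroups[0].IpPermissionsEgress'

# Temporarily allow all outbound traffic to rule out the security group as the cause
aws ec2 authorize-security-group-egress --group-id "$SG_ID" \
  --protocol -1 --cidr 0.0.0.0/0
```

If the job starts after this, tighten the rules back to only the endpoints your containers actually need.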

  • Resource Unavailability or Insufficiency
While creating the job, are you specifying more CPU, RAM etc. than is allocated? e.g. if your Batch job tries to get 8 GB RAM whereas the allocated amount is 4 GB, then the job will not run. Also check whether the compute environment is showing "INVALID": AWS Batch -> Compute Environments -> Status
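The 8 GB vs 4 GB mismatch above can be made concrete with a quick shell check. The numbers here are hypothetical; in practice you would read them from `aws batch describe-job-definitions` and from the instance types of your compute environment:

```shell
# Hypothetical figures: the job definition asks for 8 GB, but an instance
# in the compute environment can only offer 4 GB to a container.
REQUESTED_MIB=8192
AVAILABLE_MIB=4096

if [ "$REQUESTED_MIB" -gt "$AVAILABLE_MIB" ]; then
  # This is exactly the situation that parks a job in RUNNABLE
  echo "stuck in RUNNABLE: requested ${REQUESTED_MIB} MiB > available ${AVAILABLE_MIB} MiB"
else
  echo "resource request fits"
fi
```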

  • Docker Images
Are you including the tag in the image reference if using a Docker image? Are you using EcsInstanceRole or EcsInstanceProfile as the InstanceRole?
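A quick way to spot a missing tag, using a hypothetical ECR image reference - if no tag is given, Docker silently assumes `:latest`, which may not be the image you built for this job:

```shell
# Hypothetical image reference from a Batch job definition
IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-batch-app:v1.2"

# Strip the registry/repository path first (a registry host may contain a port
# colon), then look for an explicit tag after the last colon.
case "${IMAGE##*/}" in
  *:*) echo "tag present: ${IMAGE##*:}" ;;
  *)   echo "no tag - :latest will be assumed" ;;
esac
```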

  • Correct Roles
The roles must be defined with, at the minimum, the required policies and trust relationships; otherwise they do not get enough privileges to run the job. If you are using a Launch Template, it must include IamInstanceProfile and SecurityGroupIds for the job to start.
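As a minimal sketch, this is the kind of trust relationship the ECS instance role needs so that EC2 instances can assume it. The role and file names are illustrative, and the `aws iam` calls are shown as comments since they need credentials:

```shell
# Write a trust policy allowing EC2 to assume the instance role (sketch)
cat > ecs-instance-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Then create the role and attach the managed ECS policy (not run here):
# aws iam create-role --role-name ecsInstanceRole \
#     --assume-role-policy-document file://ecs-instance-trust.json
# aws iam attach-role-policy --role-name ecsInstanceRole \
#     --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
```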

  • Log Not Configured
Have you configured the AWS compute resources to use the awslogs log driver? Verify that the awslogs log driver is listed in the ECS_AVAILABLE_LOGGING_DRIVERS environment variable. This is required because AWS Batch jobs send their logs to CloudWatch. Logs for RUNNING jobs are available in CloudWatch Logs: the log group is /aws/batch/job, and the log stream name format is



first200CharsOfJobDefinitionName/default/ecs_task_id
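A sketch of the ECS agent configuration that makes awslogs available. On the instance this would live in /etc/ecs/ecs.config (typically written from EC2 user data); here it is written to a local file purely for illustration:

```shell
# ECS agent config enabling the awslogs driver alongside the default json-file
# (on a real instance, the target path is /etc/ecs/ecs.config)
cat > ecs.config <<'EOF'
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
EOF
```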


 

  • Launch Template
The default AWS AMI snapshots need at least 30 GB of storage. If you don't have a launch template, CloudFormation uses the correct storage size. But if you are using your own launch template, increase the storage to at least 30 GB and see if that works.
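A hedged sketch of launch template data that gives the root volume the 30 GB mentioned above. The template name is hypothetical, and the device name /dev/xvda is typical for Amazon Linux 2 ECS-optimized AMIs - adjust it for your AMI:

```shell
# Block device mapping sizing the root volume to 30 GiB (sketch)
cat > launch-template-data.json <<'EOF'
{
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvda",
      "Ebs": { "VolumeSize": 30, "VolumeType": "gp3", "DeleteOnTermination": true }
    }
  ]
}
EOF

# Then (needs credentials, not run here):
# aws ec2 create-launch-template --launch-template-name batch-30g \
#     --launch-template-data file://launch-template-data.json
```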

  • Internet access
Check if the AWS resources have Internet access, either through a VPC endpoint, a public IP, or network address translation (NAT). Also check the "Enable auto-assign public IPv4 address" setting on your compute environment's subnet.
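Both checks can be done from the CLI. The subnet ID is a placeholder and these fragments need live AWS credentials:

```shell
# Placeholder subnet ID - use the subnet configured on the compute environment
SUBNET_ID="subnet-0123456789abcdef0"

# Does the subnet auto-assign public IPv4 addresses?
aws ec2 describe-subnets --subnet-ids "$SUBNET_ID" \
  --query 'Subnets[0].MapPublicIpOnLaunch'

# Does the subnet's route table have a route to an internet or NAT gateway?
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'RouteTables[0].Routes'
```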

  • EC2 Instance Limit & Auto-Scale
Check if you have crossed the limit of EC2 instances. Check the Auto Scaling group to find the actual error that is preventing instances from being started.
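The scaling activity history usually contains the real error message (for example, an exceeded EC2 instance limit). The group name below is a placeholder - AWS Batch creates the Auto Scaling group for the compute environment, so look its name up first:

```shell
# Placeholder name - find the real one in the EC2 console or via
# `aws autoscaling describe-auto-scaling-groups`
ASG_NAME="my-batch-compute-env-asg"

# StatusMessage shows why an instance launch failed
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name "$ASG_NAME" \
  --query 'Activities[].[StatusCode,StatusMessage]'
```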

  • ECS Container Agent
Which Linux AMI are you using? Is the AWS ECS container agent installed, and is it working? It must be installed and running for Batch jobs to execute.

Hope this helps to fix the issue.
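To verify the agent as discussed above, run these on the compute environment's EC2 instance (e.g. via SSH or Session Manager); the service name applies to the Amazon Linux 2 ECS-optimized AMI:

```shell
# Is the ECS agent service running?
sudo systemctl status ecs

# Query the agent's local introspection endpoint - a healthy agent responds
# with the cluster the instance has joined
curl -s http://localhost:51678/v1/metadata
```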



