Batch Jobs
A batch job is a compute resource designed to run a containerized task until it completes. The execution is triggered by an event, such as an HTTP request, a message in a queue, or an object uploaded to a bucket.
A key feature of batch jobs is the ability to use spot instances, which can reduce compute costs by up to 90%.
Like other Stacktape compute resources, batch jobs are serverless, meaning you don't need to manage the underlying infrastructure. Stacktape handles server provisioning, scaling, and security for you. You can also equip your batch job's environment with a GPU in addition to CPU and RAM.
Under the hood
Stacktape uses a combination of AWS services to provide a seamless experience for running containerized jobs:
- AWS Batch: Provisions the virtual machines where your job runs and manages the execution.
- AWS Step Functions: Manages the job's lifecycle, including retries and timeouts, using a serverless state machine.
- AWS Lambda: A trigger function that connects the event source to the batch job and starts its execution.
The execution flow is as follows:
- An event from an integration (like an API Gateway) invokes the trigger function.
- The trigger function starts the batch job state machine.
- The state machine queues the job in AWS Batch.
- AWS Batch provisions the necessary resources (like a VM) and runs your containerized job.
When to use
Batch jobs are ideal for long-running, resource-intensive tasks like data processing, ETL pipelines, or machine learning model training.
If you're unsure which compute resource to use, this table provides a comparison of container-based resources in Stacktape:
| Resource type | Description | Use-cases |
| --- | --- | --- |
| web-service | continuously running container with public endpoint and URL | public APIs, websites |
| private-service | continuously running container with private endpoint | private APIs, services |
| worker-service | continuously running container not accessible from outside | continuous processing |
| multi-container-workload | custom multi-container workload - you can customize accessibility for each container | more complex use-cases requiring customization |
| batch-job | simple container job - container is destroyed after the job is done | one-off/scheduled processing jobs |
Advantages
- Pay-per-use: You only pay for the compute time your job consumes.
- Resource flexibility: The environment automatically scales to provide the CPU, memory, and GPU your job needs.
- Time flexibility: Batch jobs can run for as long as needed.
- Secure by default: The underlying environment is securely managed by AWS.
- Easy integration: Can be triggered by a wide variety of event sources.
Disadvantages
- Slow start time: After a job is triggered, it's placed in a queue and can take anywhere from a few seconds to a few minutes to start.
Basic usage
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: schedule
          properties:
            scheduleRate: cron(0 14 * * ? *) # every day at 14:00 UTC
```
```ts
(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);
  // process the event
})();
```
Container
Your code for a batch job runs inside a Docker container. You can configure its properties:
Image
A Docker container is a running instance of a Docker image. You can provide an image in four ways:
- Images built using stacktape-image-buildpack
- Images built using external-buildpack
- Images built from a custom-dockerfile
- prebuilt-images
Environment variables
Most commonly used types of environment variables:
- Static - string, number or boolean (will be stringified).
- Result of a custom directive.
- Referenced property of another resource (using the $ResourceParam directive). To learn more, refer to the referencing parameters guide. If you are using environment variables to inject information about resources into your script, see also the connectTo property, which simplifies this process.
- Value of a secret (using $Secret directive).
```yml
environment:
  - name: STATIC_ENV_VAR
    value: my-env-var
  - name: DYNAMICALLY_SET_ENV_VAR
    value: $MyCustomDirective('input-for-my-directive')
  - name: DB_HOST
    value: $ResourceParam('myDatabase', 'host')
  - name: DB_PASSWORD
    value: $Secret('dbSecret.password')
```
Pre-set environment variables
Stacktape pre-sets the following environment variables for your job:
| Name | Value |
| --- | --- |
| `STP_TRIGGER_EVENT_DATA` | The JSON-stringified event from the integration that triggered this batch job. |
| `STP_MAXIMUM_ATTEMPTS` | The total number of attempts for this job before it is marked as failed. |
| `STP_CURRENT_ATTEMPT` | The current attempt number. |
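For example, the retry-related variables let your code detect whether the current run is the final attempt. A minimal sketch (the processing logic is a placeholder):

```ts
const currentAttempt = Number(process.env.STP_CURRENT_ATTEMPT);
const maximumAttempts = Number(process.env.STP_MAXIMUM_ATTEMPTS);

(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);
  try {
    console.log('Processing event', event);
    // ... process the event (placeholder) ...
  } catch (error) {
    if (currentAttempt >= maximumAttempts) {
      // this was the last attempt - report the failure before exiting
      console.error('Final attempt failed', error);
    }
    process.exit(1); // non-zero exit code marks this attempt as failed
  }
})();
```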
Logging
Any output from your code to `stdout` or `stderr` is captured and stored in an AWS CloudWatch log group.
You can view logs in two ways:
- AWS CloudWatch Console: Get a direct link from the Stacktape Console or by using the `stacktape stack-info` command.
- Stacktape CLI: Use the `stacktape logs` command to stream logs directly in your terminal.
Log storage can incur costs, so you can configure `retentionDays` to automatically delete old logs.
Forwarding logs
You can forward logs to third-party services. See Log Forwarding for more details.
Computing resources
You can specify the amount of CPU, memory, and GPU for your batch job. AWS Batch selects the most cost-effective instance type that fits your job's requirements. To learn more about GPU instances, refer to the AWS Docs.
If you define the memory for your batch job in multiples of 1024, be aware that the compute environment might spin up instances that are much bigger than expected. This happens because the instances also need memory for the management processes (managed by AWS) associated with running the batch job.
Example: If you request 8192 MB of memory for your batch job, you might expect the compute environment to use an instance from the supported families with 8 GiB (8192 MB) of memory. However, since such an instance would not be sufficient for both the batch job and the management processes, a bigger instance is spun up instead. To learn more about this issue, refer to the AWS Docs.
Due to this behaviour, we advise specifying memory for your batch jobs accordingly: instead of 8192, consider a slightly lower value such as 7680. This way the compute environment can use instances with 8 GiB (8192 MB) of memory, which can lead to cost savings.
If you specify GPUs, instances are chosen from the GPU-accelerated instance families according to your requirements:
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: batch-jobs/js-batch-job.js
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: schedule
          properties:
            scheduleRate: 'cron(0 14 * * ? *)' # every day at 14:00 UTC
```
Spot instances
- Batch jobs can be configured to use spot instances.
- Spot instances leverage AWS's spare computing capacity and can cost up to 90% less than "onDemand" (normal) instances.
- However, your batch job can be interrupted at any time if AWS needs the capacity back. When this happens, your batch job receives a SIGTERM signal and then has 120 seconds to save its progress or clean up (a handling sketch follows the configuration example below).
- Interruptions are usually infrequent, as can be seen in the AWS Spot Instance advisor.
- To learn more about spot instances, refer to AWS Docs.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      useSpotInstances: true
```
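To take advantage of the 120-second warning, handle the SIGTERM signal in your job. A minimal sketch, assuming the event payload carries a hypothetical `items` array (replace the checkpointing and processing logic with your own):

```ts
let interrupted = false;

// AWS sends SIGTERM roughly 120 seconds before reclaiming the spot capacity
process.on('SIGTERM', () => {
  interrupted = true;
});

(async () => {
  const items: string[] = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA).items ?? [];

  for (let i = 0; i < items.length; i += 1) {
    if (interrupted) {
      // persist progress (e.g. to a bucket or database) so a retried job can resume from item i
      console.log(`Interrupted - last processed item index: ${i - 1}`);
      process.exit(1); // exit with a non-zero code so the job can be retried
    }
    // ... process items[i] (placeholder) ...
  }
})();
```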
Retries
- If the batch job exits with a non-zero exit code (due to an internal failure, timeout, spot instance interruption by AWS, etc.) and its attempts are not exhausted, it is retried.
Timeout
- When the timeout is reached, the batch job will be stopped.
- If the batch job fails and maximum attempts are not exhausted, it will be retried.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      timeout: 1200
```
Storage
Each batch job has its own ephemeral storage with a fixed size of 20GB. This storage is temporary and is deleted after the job completes or fails. To store data permanently, use Buckets.
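For example, a job that produces an output file can copy it to a bucket before exiting. A minimal sketch using the AWS SDK; the bucket-name environment variable and the file path are illustrative (access to the bucket can be granted via connectTo, described below):

```ts
import { readFile } from 'node:fs/promises';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

(async () => {
  // the result file lives on the ephemeral storage and disappears after the job finishes
  const body = await readFile('/tmp/result.json');

  await s3.send(
    new PutObjectCommand({
      Bucket: process.env.STP_RESULTS_BUCKET_NAME, // illustrative - injected by connectTo for a "resultsBucket" resource
      Key: `results/${Date.now()}.json`,
      Body: body
    })
  );
})();
```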
Trigger events
Batch jobs are invoked in response to events from various integrations. A single job can have multiple triggers. The data payload from the trigger is available in the `STP_TRIGGER_EVENT_DATA` environment variable as a JSON string.
Be cautious when configuring event integrations. A high volume of events can trigger a large number of batch jobs, leading to unexpected costs. For example, 1000 HTTP requests to a connected API Gateway will result in 1000 job invocations.
HTTP Api event
Triggers the job in response to a request to a specified HTTP API Gateway. Routes are matched based on the most specific path. For more details, see the AWS Docs.
```yml
resources:
  myHttpApi:
    type: http-api-gateway

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: http-api-gateway
          properties:
            httpApiGatewayName: myHttpApi
            path: /hello
            method: GET
```
Batch job connected to an HTTP API Gateway "myHttpApi"
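Inside the job, the HTTP request details can be read from the trigger event data. A minimal sketch; the field names below assume the standard HTTP API Gateway (v2) payload format:

```ts
(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);

  // assumed HTTP API Gateway (v2) payload fields
  const method = event.requestContext?.http?.method;
  const path = event.rawPath;
  const body = event.body ? JSON.parse(event.body) : undefined;

  console.log(`Handling ${method} ${path}`, body);
})();
```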
Cognito authorizer
Restricts access to users authenticated with a User Pool. The request must include an access token. If authorized, the job receives user claims in its payload.
```yml
resources:
  myGateway:
    type: http-api-gateway

  myUserPool:
    type: user-auth-pool
    properties:
      userVerificationType: email-code

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: http-api-gateway
          properties:
            httpApiGatewayName: myGateway
            path: /some-path
            method: '*'
            authorizer:
              type: cognito
              properties:
                userPoolName: myUserPool
```
Example cognito authorizer
```ts
import { CognitoIdentityProvider } from '@aws-sdk/client-cognito-identity-provider';

const cognito = new CognitoIdentityProvider({});

(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);
  const userData = await cognito.getUser({ AccessToken: event.headers.authorization });
  // do something with your user data
})();
```
Example batch job that fetches user data from Cognito
Lambda authorizer
Uses a dedicated Lambda function to decide if a request is authorized. The authorizer function returns a policy document or a simple boolean response. You can configure `identitySources` to specify which parts of the request are used for authorization. To learn more, see the AWS Docs.
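A minimal sketch of such an authorizer function using the simple boolean response (this assumes simple responses are enabled for the authorizer; the header name and token check are placeholders, and the export style depends on your packaging setup):

```ts
// Simple-response Lambda authorizer: allow the request only if the expected token is present
export default async (event: { headers?: Record<string, string> }) => {
  const token = event.headers?.authorization;
  return {
    isAuthorized: token === process.env.EXPECTED_TOKEN, // placeholder check
    context: { invokedBy: 'lambda-authorizer' } // optional data forwarded to the batch job payload
  };
};
```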
Schedule event
Triggers the job on a defined schedule using either a fixed rate (e.g., every 5 minutes) or a cron expression.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        # trigger the batch job every two hours
        - type: schedule
          properties:
            scheduleRate: rate(2 hours)
        # trigger the batch job at 10:00 UTC every day
        - type: schedule
          properties:
            scheduleRate: cron(0 10 * * ? *)
```
Event Bus event
Triggers the job when a matching event is received by a specified event bus. You can use the default AWS event bus or a custom event bus.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: event-bus
          properties:
            useDefaultBus: true
            eventPattern:
              source:
                - 'aws.autoscaling'
              region:
                - 'us-west-2'
```
Batch job connected to the default event bus
```yml
resources:
  myEventBus:
    type: event-bus

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: event-bus
          properties:
            eventBusName: myEventBus
            eventPattern:
              source:
                - 'mycustomsource'
```
Batch job connected to a custom event bus
SNS event
Triggers the job when a message is published to an SNS topic.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: sns
          properties:
            topicName: mySnsTopic

  mySnsTopic:
    type: sns-topic
```
SQS event
Triggers the job when messages are available in an SQS queue. Messages are processed in batches. If the job fails to start, messages return to the queue after the visibility timeout. If the job starts but then fails, the messages are considered processed.
A single queue should be consumed by a single compute resource. If you need a fan-out pattern, consider using an SNS or EventBus integration.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: sqs
          properties:
            sqsQueueName: mySqsQueue

  mySqsQueue:
    type: sqs-queue
```
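Inside the job, the batch of messages is available in the trigger event data. A minimal sketch; the record shape assumes the standard SQS event format:

```ts
(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);

  // standard SQS event format: each record carries the message body as a string
  for (const record of event.Records ?? []) {
    const message = JSON.parse(record.body);
    console.log('Processing message', message);
    // ... process the message (placeholder) ...
  }
})();
```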
Kinesis event
Triggers the job when records are available in a Kinesis Data Stream. It's similar to SQS but designed for real-time data streaming. You can add a Kinesis stream using CloudFormation resources.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: kinesis-stream
          properties:
            autoCreateConsumer: true
            maxBatchWindowSeconds: 30
            batchSize: 200
            streamArn: $CfResourceParam('myKinesisStream', 'Arn')
            onFailure:
              arn: $CfResourceParam('myOnFailureSqsQueue', 'Arn')
              type: sqs

cloudformationResources:
  myKinesisStream:
    Type: AWS::Kinesis::Stream
    Properties:
      ShardCount: 1
  myOnFailureSqsQueue:
    Type: AWS::SQS::Queue
```
DynamoDB event
Triggers the job in response to item-level changes in a DynamoDB table. You must enable DynamoDB Streams on your table.
```yml
resources:
  myDynamoDbTable:
    type: dynamo-db-table
    properties:
      primaryKey:
        partitionKey:
          name: id
          type: string
      streamType: NEW_AND_OLD_IMAGES

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: dynamo-db-stream
          properties:
            streamArn: $ResourceParam('myDynamoDbTable', 'streamArn')
            batchSize: 200
```
S3 event
Triggers the job when a specific event (like `object created`) occurs in an S3 bucket.
```yml
resources:
  myBucket:
    type: bucket

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: s3
          properties:
            bucketArn: $ResourceParam('myBucket', 'arn')
            s3EventType: 's3:ObjectCreated:*'
            filterRule:
              prefix: order-
              suffix: .jpg
```
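Inside the job, you can read which objects triggered the invocation. A minimal sketch; the record shape assumes the standard S3 event notification format:

```ts
(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);

  // standard S3 notification format: each record identifies the bucket and the object key
  for (const record of event.Records ?? []) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    console.log(`New object: s3://${bucket}/${key}`);
  }
})();
```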
CloudWatch Log event
Triggers the job when a log record is added to a specified CloudWatch log group. The event payload is BASE64 encoded and GZIP compressed.
```yml
resources:
  myLogProducingLambda:
    type: function
    properties:
      packaging:
        type: stacktape-lambda-buildpack
        properties:
          entryfilePath: lambdas/log-producer.ts

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: cloudwatch-log
          properties:
            logGroupArn: $ResourceParam('myLogProducingLambda', 'arn')
```
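Inside the job, decode the payload before using it. A minimal sketch; the `awslogs.data` field name assumes the standard CloudWatch Logs subscription payload format:

```ts
import { gunzipSync } from 'node:zlib';

(async () => {
  const event = JSON.parse(process.env.STP_TRIGGER_EVENT_DATA);

  // CloudWatch Logs subscription payload: base64-encoded, gzip-compressed JSON
  const compressed = Buffer.from(event.awslogs.data, 'base64');
  const decoded = JSON.parse(gunzipSync(compressed).toString('utf8'));

  for (const logEvent of decoded.logEvents ?? []) {
    console.log(logEvent.message);
  }
})();
```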
Application Load Balancer event
Triggers the job when an Application Load Balancer receives an HTTP request matching specified conditions (e.g., path, headers, method).
```yml
resources:
  # load balancer which routes traffic to the batch job
  myLoadBalancer:
    type: application-load-balancer

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      events:
        - type: application-load-balancer
          properties:
            # referencing the load balancer defined above
            loadBalancerName: myLoadBalancer
            priority: 1
            paths:
              - /invoke-my-job
              - /another-path
```
Accessing other resources
By default, AWS resources cannot communicate with each other. Access must be granted explicitly using IAM permissions. Stacktape handles most of this automatically, but for resource-to-resource communication, you need to configure permissions.
Relational Databases are an exception, as they use their own connection-string-based access control.
There are two ways to grant permissions:
Using connectTo
The `connectTo` property is a simplified way to grant basic access to other Stacktape-managed resources. It automatically configures the necessary IAM permissions and injects environment variables with connection details into your batch job.
```yml
resources:
  photosBucket:
    type: bucket

  myBatchJob:
    type: batch-job
    properties:
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      resources:
        cpu: 2
        memory: 1800
      connectTo:
        # access to the bucket
        - photosBucket
        # access to AWS SES
        - aws:ses
```
By referencing resources (or services) in the `connectTo` list, Stacktape automatically:
- configures the compute resource's IAM role permissions, if needed
- sets up the security group rules required to allow access, if needed
- injects relevant environment variables with information about the connected resource into the compute resource's runtime
  - environment variable names use upper-snake-case and have the form `STP_[RESOURCE_NAME]_[VARIABLE_NAME]`,
  - examples: `STP_MY_DATABASE_CONNECTION_STRING` or `STP_MY_EVENT_BUS_ARN`,
  - the list of injected variables for each resource type can be seen below.
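For example, with the `photosBucket` connection from the configuration above, the job can use the injected variables directly. A minimal sketch; the variable name follows the convention described above:

```ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

(async () => {
  // STP_PHOTOS_BUCKET_NAME is injected by connectTo for the photosBucket resource
  const { Contents } = await s3.send(
    new ListObjectsV2Command({ Bucket: process.env.STP_PHOTOS_BUCKET_NAME })
  );
  console.log(`photosBucket contains ${Contents?.length ?? 0} objects`);
})();
```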
Granted permissions and injected environment variables are different depending on resource type:
Bucket
- Permissions:
  - list objects in a bucket
  - create / get / delete / tag object in a bucket
- Injected env variables: `NAME`, `ARN`

DynamoDB table
- Permissions:
  - get / put / update / delete item in a table
  - scan / query a table
  - describe table stream
- Injected env variables: `NAME`, `ARN`, `STREAM_ARN`
MongoDB Atlas cluster
- Permissions:
  - Allows connection to a cluster with `accessibilityMode` set to `scoping-workloads-in-vpc`. To learn more about MongoDB Atlas clusters accessibility modes, refer to MongoDB Atlas cluster docs.
  - Creates an access "user" associated with the compute resource's role to allow secure, credential-less access to the cluster.
- Injected env variables: `CONNECTION_STRING`
Relational (SQL) database
- Permissions:
  - Allows connection to a relational database with `accessibilityMode` set to `scoping-workloads-in-vpc`. To learn more about relational database accessibility modes, refer to Relational databases docs.
- Injected env variables: `CONNECTION_STRING`, `JDBC_CONNECTION_STRING`, `HOST`, `PORT` (in case of an Aurora multi-instance cluster additionally: `READER_CONNECTION_STRING`, `READER_JDBC_CONNECTION_STRING`, `READER_HOST`)

Redis cluster
- Permissions:
  - Allows connection to a Redis cluster with `accessibilityMode` set to `scoping-workloads-in-vpc`. To learn more about Redis cluster accessibility modes, refer to Redis clusters docs.
- Injected env variables: `HOST`, `READER_HOST`, `PORT`
Event bus
- Permissions:
  - publish events to the specified Event bus
- Injected env variables: `ARN`

Function
- Permissions:
  - invoke the specified function
  - invoke the specified function via URL (if the lambda has URL enabled)
- Injected env variables: `ARN`

Batch job
- Permissions:
  - submit batch-job instance into the batch-job queue
  - list submitted job instances in the batch-job queue
  - describe / terminate a batch-job instance
  - list executions of the state machine which executes the batch job according to its strategy
  - start / terminate execution of the state machine which executes the batch job according to its strategy
- Injected env variables: `JOB_DEFINITION_ARN`, `STATE_MACHINE_ARN`
User auth pool
- Permissions:
  - full control over the user pool (`cognito-idp:*`); for more information about allowed methods refer to AWS docs
- Injected env variables: `ID`, `CLIENT_ID`, `ARN`

SNS Topic
- Permissions:
  - confirm / list subscriptions of the topic
  - publish / subscribe to the topic
  - unsubscribe from the topic
- Injected env variables: `ARN`, `NAME`

SQS Queue
- Permissions:
  - send / receive / delete message
  - change visibility of message
  - purge queue
- Injected env variables: `ARN`, `NAME`, `URL`
Upstash Kafka topic
- Injected env variables: `TOPIC_NAME`, `TOPIC_ID`, `USERNAME`, `PASSWORD`, `TCP_ENDPOINT`, `REST_URL`

Upstash Redis
- Injected env variables: `HOST`, `PORT`, `PASSWORD`, `REST_TOKEN`, `REST_URL`, `REDIS_URL`

Private service
- Injected env variables: `ADDRESS`
aws:ses (Macro)
- Permissions:
  - gives full permissions to AWS SES (`ses:*`); for more information about allowed methods refer to AWS docs
Using iamRoleStatements
For fine-grained control, you can provide raw IAM role statements. This allows you to define custom permissions to any AWS resource.
```yml
resources:
  myBatchJob:
    type: batch-job
    properties:
      resources:
        cpu: 2
        memory: 1800
      container:
        packaging:
          type: stacktape-image-buildpack
          properties:
            entryfilePath: path/to/my/batch-job.ts
      iamRoleStatements:
        - Resource:
            - $CfResourceParam('NotificationTopic', 'Arn')
          Effect: Allow
          Action:
            - 'sns:Publish'

cloudformationResources:
  NotificationTopic:
    Type: AWS::SNS::Topic
```
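With these statements in place, the job can publish to the topic. A minimal sketch, assuming the topic ARN is passed to the container as an environment variable (e.g. set in the `environment` section using `$CfResourceParam('NotificationTopic', 'Arn')`):

```ts
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

const sns = new SNSClient({});

(async () => {
  await sns.send(
    new PublishCommand({
      TopicArn: process.env.NOTIFICATION_TOPIC_ARN, // assumed to be set in the job's environment
      Message: JSON.stringify({ status: 'job-finished' })
    })
  );
})();
```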
Default VPC connection
Certain resources, like Relational Databases, must be placed within a VPC. If your stack contains such resources, Stacktape automatically creates a default VPC and connects them to it.
Batch jobs are connected to this VPC by default, allowing them to communicate with other VPC-enabled resources without extra configuration. To learn more, see our guide on VPCs.
Referenceable parameters
The following parameters can be easily referenced using the `$ResourceParam` directive.
To learn more about referencing parameters, refer to referencing parameters.
- Arn of the job definition resource
  - Usage: `$ResourceParam('<<resource-name>>', 'jobDefinitionArn')`
- Arn of the state machine controlling the execution flow of the batch job
  - Usage: `$ResourceParam('<<resource-name>>', 'stateMachineArn')`
- Arn of the log group aggregating logs from the batch job
  - Usage: `$ResourceParam('<<resource-name>>', 'logGroupArn')`
Pricing
You are charged for:
- The compute instances running your batch jobs.
- A negligible amount for the Lambda functions and Step Functions that manage the job's execution.
Pricing depends on the instance type and region. You can significantly reduce costs (by up to 90%) by using spot instances.