Alarms
Alarms offer a straightforward way to monitor your infrastructure and receive prompt notifications when your resources experience issues or become overloaded.
You can configure alarms to monitor specific metrics of a resource type. When a metric crosses a predefined threshold, the alarm triggers a configured action, such as sending a notification.
Under the hood, Stacktape alarms are implemented using CloudWatch Alarms.
How to create alarms
There are two ways to create alarms:
- Global alarms are created in the Stacktape Console and apply to all resources of a specific type managed by Stacktape.
- In-config alarms are defined directly within a resource's configuration in your stacktape.yml file.
Global alarms
Global alarms are templates created in the Stacktape Console. When you deploy a stack that matches the alarm's serviceName and stage criteria, Stacktape creates a concrete alarm for each eligible resource in that stack.
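The fan-out from one template to many concrete alarms can be pictured with a small Python sketch. It is illustrative only: the GlobalAlarm and Stack types, the exact-match rule, the alarm naming, and treating an unset criterion as "match any" are all assumptions, not Stacktape's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalAlarm:
    # Hypothetical model of a global alarm template.
    name: str
    resource_type: str
    service_name: Optional[str] = None  # None = any service (assumption)
    stage: Optional[str] = None  # None = any stage (assumption)


@dataclass
class Stack:
    # Hypothetical model of a deployed stack: resource name -> resource type.
    service_name: str
    stage: str
    resources: dict[str, str]


def concrete_alarms(alarm: GlobalAlarm, stack: Stack) -> list[str]:
    """Return one concrete alarm name per eligible resource in a matching stack."""
    if alarm.service_name is not None and alarm.service_name != stack.service_name:
        return []
    if alarm.stage is not None and alarm.stage != stack.stage:
        return []
    return [
        f"{alarm.name}-{resource_name}"
        for resource_name, resource_type in stack.resources.items()
        if resource_type == alarm.resource_type
    ]


template = GlobalAlarm(name="lambda-errors", resource_type="function", stage="prod")
prod = Stack("shop", "prod", {"checkout": "function", "db": "relational-database"})
dev = Stack("shop", "dev", {"checkout": "function"})
print(concrete_alarms(template, prod))
print(concrete_alarms(template, dev))
```

In this sketch only the prod stack receives an alarm, and only for its function resource, which mirrors the "each eligible resource in a matching stack" behaviour described above.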
When configuring a global alarm, you can specify:
- The resource type and metric to monitor.
- The threshold for the metric.
- Automatic Slack or email notifications.
- The stacks the alarm applies to, based on serviceName and stage.
Global alarms only apply to stacks that are created or updated after the alarm is created. If you create a global alarm, you must run the stacktape deploy command to apply it to your existing stacks.
Creating a global alarm
- Navigate to the Alarms page in the Stacktape Console and click Create new alarm.
- Configure the alarm. For example, an alarm can monitor the error rate of Lambda functions and be limited to the prod stage.
- Deploy your stack using the stacktape deploy command to create the alarms.
In-config alarms
You can define alarms directly in your stacktape.yml file as a property of the resource you want to monitor.
    resources:
      myLambdaFunction:
        type: function
        properties:
          packaging:
            type: stacktape-lambda-buildpack
            properties:
              entryfilePath: 'lambdas/js-lambda.js'
          alarms:
            - trigger:
                type: lambda-error-rate
                properties:
                  thresholdPercent: 5
              notificationTargets:
                - type: slack
                  properties:
                    conversationId: C038XXXXXX
                    accessToken: $Secret('slack-access-token')
An in-config alarm consists of three parts:
- Trigger: Specifies the metric to monitor.
- Notifications (optional): Defines where to send notifications when the alarm is triggered.
- Evaluation (optional): Configures the evaluation period for the monitored metric.
Trigger
The trigger specifies the metric to be monitored.
    resources:
      myLambdaFunction:
        type: function
        properties:
          # ...
          alarms:
            - trigger:
                type: lambda-error-rate
                properties:
                  thresholdPercent: 5
See the Trigger Types section for a list of all available alarms.
Notifications
You can configure notifications to be sent to Slack, MS Teams, or by email.
Slack:
    resources:
      myResource:
        type: ...
        properties:
          # ...
          alarms:
            - trigger:
                # ...
              notificationTargets:
                - type: slack
                  properties:
                    conversationId: C038XXXXXX
                    accessToken: $Secret('slack-access-token')
MS Teams:
    resources:
      myResource:
        type: ...
        properties:
          # ...
          alarms:
            - trigger:
                # ...
              notificationTargets:
                - type: ms-teams
                  properties:
                    webhookUrl: MY_WEBHOOK_URL
Email:
    resources:
      myResource:
        type: ...
        properties:
          # ...
          alarms:
            - trigger:
                # ...
              notificationTargets:
                - type: email
                  properties:
                    sender: alarm@company.com
                    recipient: support@company.com
Evaluation
The evaluation section configures the evaluation period for the monitored metric.
    resources:
      myResource:
        type: ...
        properties:
          # ...
          alarms:
            - trigger:
                # ...
              evaluation:
                period: 200
Trigger a Lambda function on alarm
You can trigger a Lambda function when an alarm is fired. For more information, see the alarm event documentation.
Trigger types
New trigger types are added regularly. If you have a request for a new trigger, please open a GitHub issue.
Lambda Error Rate
- The error rate is calculated as a percentage (invocations that ended with an error / total invocation count) over the evaluation period (1 minute by default).
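As a rough illustration of the ratio above, the following Python sketch computes an error rate for one evaluation period and compares it with a threshold. The function names, the handling of a period with zero invocations, and the strict "greater than" comparison are assumptions made for this example.

```python
def error_rate_percent(failed_invocations: int, total_invocations: int) -> float:
    """Percentage of invocations that ended with an error in one evaluation period."""
    if total_invocations == 0:
        return 0.0  # Assumption: a period without invocations counts as 0 % errors.
    return 100.0 * failed_invocations / total_invocations


def breaches(rate_percent: float, threshold_percent: float) -> bool:
    # Assumption: the alarm fires when the rate is strictly greater than the threshold.
    return rate_percent > threshold_percent


rate = error_rate_percent(failed_invocations=3, total_invocations=40)
print(rate, breaches(rate, threshold_percent=5))
```

With 3 failures out of 40 invocations the rate is 7.5 %, which would exceed a thresholdPercent of 5.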
Lambda Duration
- By default, the trigger fires when the average (avg) execution duration during the evaluation period (1 minute by default) is greater than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
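The interplay of statistic and comparisonOperator can be sketched in Python as follows. This is a simplified model, not Stacktape or CloudWatch code: the statistic names "avg" and "min" appear in this document, while "max" and "sum" and the operator names (borrowed from CloudWatch's ComparisonOperator values) are assumed to be accepted.

```python
import operator
from statistics import mean

# Statistic names: "avg" and "min" appear in this document; "max" and "sum" are assumed.
STATISTICS = {"avg": mean, "min": min, "max": max, "sum": sum}

# Operator names follow CloudWatch's ComparisonOperator values (assumed to be accepted).
COMPARISONS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def trigger_fires(datapoints, threshold, statistic="avg",
                  comparison_operator="GreaterThanThreshold"):
    """Aggregate one evaluation period's datapoints and compare with the threshold."""
    value = STATISTICS[statistic](datapoints)
    return COMPARISONS[comparison_operator](value, threshold)


durations_ms = [120, 180, 900]  # hypothetical durations within one period
print(trigger_fires(durations_ms, threshold=500))
print(trigger_fires(durations_ms, threshold=500, statistic="max"))
```

The average of the sample durations is 400 ms, so the default trigger stays quiet at a 500 ms threshold, while switching the statistic to max would fire on the single 900 ms outlier. The same pattern applies to the other triggers in this section that expose statistic and comparisonOperator.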
Database Read Latency
- By default, the trigger fires when the average (avg) read latency during the evaluation period (1 minute by default) is greater than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
Database Write Latency
- By default, the trigger fires when the average (avg) write latency during the evaluation period (1 minute by default) is greater than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
Database CPU Utilization
- By default, the trigger fires when the average (avg) CPU utilization during the evaluation period (1 minute by default) is greater than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
Database Free Storage
- By default, the trigger fires when the minimum (min) free storage space during the evaluation period (1 minute by default) is lower than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
Database Free Memory
- By default, the trigger fires when the average (avg) free memory during the evaluation period (1 minute by default) is lower than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
Database Connection Count
- By default, the trigger fires when the average (avg) number of connections during the evaluation period (1 minute by default) is greater than the threshold.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
HTTP API Gateway Error Rate
- The error rate is calculated as a percentage (4xx and 5xx response count / total response count) over the evaluation period (1 minute by default).
HTTP API Gateway Latency
- By default, the trigger fires when the average (avg) latency during the evaluation period (1 minute by default) is greater than the threshold.
- Latency denotes the time between when API Gateway receives a request from a client and when it returns a response to the client.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
Application Load Balancer Error Rate
- The error rate is calculated as a percentage (4xx and 5xx response count / total response count) over the evaluation period (1 minute by default).
Application Load Balancer Custom
- You can pick any metric from the list below.
- The threshold is compared with a value calculated using the formula statistic(CHOSEN_METRIC), where:
  - statistic is a statistic function applied to the metric values collected during the evaluation period (avg (average) by default).
  - CHOSEN_METRIC is the metric you choose.
Available metrics are:
- ActiveConnectionCount - The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
- AnomalousHostCount - The number of hosts detected with anomalies.
- ClientTLSNegotiationErrorCount - The number of TLS connections initiated by the client that did not establish a session with the load balancer due to a TLS error.
- ConsumedLCUs - The number of load balancer capacity units (LCU) used by your load balancer.
- DesyncMitigationMode_NonCompliant_Request_Count - The number of requests that do not comply with RFC 7230.
- DroppedInvalidHeaderRequestCount - The number of requests where the load balancer removed HTTP headers with header fields that are not valid before routing the request.
- MitigatedHostCount - The number of targets under mitigation.
- ForwardedInvalidHeaderRequestCount - The number of requests routed by the load balancer that had HTTP headers with header fields that are not valid.
- GrpcRequestCount - The number of gRPC requests processed over IPv4 and IPv6.
- HTTP_Fixed_Response_Count - The number of fixed-response actions that were successful.
- HTTP_Redirect_Count - The number of redirect actions that were successful.
- HTTP_Redirect_Url_Limit_Exceeded_Count - The number of redirect actions that couldn't be completed because the URL in the response location header is larger than 8K.
- HTTPCode_ELB_3XX_Count - The number of HTTP 3XX redirection codes that originate from the load balancer.
- HTTPCode_ELB_4XX_Count - The number of HTTP 4XX client error codes that originate from the load balancer.
- HTTPCode_ELB_5XX_Count - The number of HTTP 5XX server error codes that originate from the load balancer.
- HTTPCode_ELB_500_Count - The number of HTTP 500 error codes that originate from the load balancer.
- HTTPCode_ELB_502_Count - The number of HTTP 502 error codes that originate from the load balancer.
- HTTPCode_ELB_503_Count - The number of HTTP 503 error codes that originate from the load balancer.
- HTTPCode_ELB_504_Count - The number of HTTP 504 error codes that originate from the load balancer.
- IPv6ProcessedBytes - The total number of bytes processed by the load balancer over IPv6.
- IPv6RequestCount - The number of IPv6 requests received by the load balancer.
- NewConnectionCount - The total number of new TCP connections established from clients to the load balancer and from the load balancer to targets.
- NonStickyRequestCount - The number of requests where the load balancer chose a new target because it couldn't use an existing sticky session.
- ProcessedBytes - The total number of bytes processed by the load balancer over IPv4 and IPv6.
- RejectedConnectionCount - The number of connections that were rejected because the load balancer had reached its maximum number of connections.
- RequestCount - The number of requests processed over IPv4 and IPv6.
- RuleEvaluations - The number of rules processed by the load balancer given a request rate averaged over an hour.
- HealthyHostCount - The number of targets that are considered healthy.
- HTTPCode_Target_2XX_Count - The number of HTTP 2XX response codes generated by the targets.
- HTTPCode_Target_3XX_Count - The number of HTTP 3XX response codes generated by the targets.
- HTTPCode_Target_4XX_Count - The number of HTTP 4XX response codes generated by the targets.
- HTTPCode_Target_5XX_Count - The number of HTTP 5XX response codes generated by the targets.
- RequestCountPerTarget - The average request count per target, in a target group.
- TargetConnectionErrorCount - The number of connections that were not successfully established between the load balancer and target.
- TargetResponseTime - The time elapsed, in seconds, after the request leaves the load balancer until the target starts to send the response headers.
- TargetTLSNegotiationErrorCount - The number of TLS connections initiated by the load balancer that did not establish a session with the target.
- UnHealthyHostCount - The number of targets that are considered unhealthy.
- HealthyStateDNS - The number of zones that meet the DNS healthy state requirements.
- HealthyStateRouting - The number of zones that meet the routing healthy state requirements.
- UnhealthyRoutingRequestCount - The number of requests that are routed using the routing failover action (fail open).
- UnhealthyStateDNS - The number of zones that do not meet the DNS healthy state requirements.
- UnhealthyStateRouting - The number of zones that do not meet the routing healthy state requirements.
- LambdaInternalError - The number of requests to a Lambda function that failed because of an issue internal to the load balancer or AWS Lambda.
- LambdaTargetProcessedBytes - The total number of bytes processed by the load balancer for requests to and responses from a Lambda function.
- LambdaUserError - The number of requests to a Lambda function that failed because of an issue with the Lambda function.
- ELBAuthError - The number of user authentications that could not be completed due to an internal error.
- ELBAuthFailure - The number of user authentications that could not be completed because the IdP denied access to the user.
- ELBAuthLatency - The time elapsed, in milliseconds, to query the IdP for the ID token and user info.
- ELBAuthRefreshTokenSuccess - The number of times the load balancer successfully refreshed user claims using a refresh token provided by the IdP.
- ELBAuthSuccess - The number of authenticate actions that were successful.
- ELBAuthUserClaimsSizeExceeded - The number of times that a configured IdP returned user claims that exceeded 11K bytes in size.
SQS Queue Received Messages
- By default, the trigger fires when the average (avg) number of messages received by consumers during the evaluation period (1 minute by default) is greater than the threshold.
- The average is calculated from the NumberOfMessagesReceived metric exposed by SQS.
- You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.
SQS Queue Not Empty
- By default, the trigger fires when the queue is not empty.
- For the queue to be considered empty, all of these metrics exposed by the SQS queue must be 0 during the evaluation period:
  - ApproximateNumberOfMessagesVisible
  - ApproximateNumberOfMessagesNotVisible
  - NumberOfMessagesReceived
  - NumberOfMessagesSent
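The emptiness rule above can be expressed as a short Python sketch. The data shape (a mapping from metric name to the datapoints collected in one evaluation period) and the function names are assumptions made for illustration; only the four metric names come from this page.

```python
QUEUE_METRICS = (
    "ApproximateNumberOfMessagesVisible",
    "ApproximateNumberOfMessagesNotVisible",
    "NumberOfMessagesReceived",
    "NumberOfMessagesSent",
)


def queue_is_empty(datapoints: dict[str, list[float]]) -> bool:
    """True only if every listed metric stayed at 0 for the whole evaluation period."""
    return all(
        value == 0
        for metric in QUEUE_METRICS
        for value in datapoints[metric]
    )


def alarm_fires(datapoints: dict[str, list[float]]) -> bool:
    # The trigger fires when the queue is NOT empty.
    return not queue_is_empty(datapoints)


idle = {metric: [0, 0] for metric in QUEUE_METRICS}
busy = {**idle, "ApproximateNumberOfMessagesVisible": [0, 3]}
print(alarm_fires(idle), alarm_fires(busy))
```

A single non-zero datapoint in any of the four metrics is enough for the queue to count as not empty in this model.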