Alarms

Overview

Alarms are an easy way to keep your infrastructure in check and to be notified quickly when your resources become overloaded or start failing.

An alarm always monitors a specific metric of a specific resource type. When the metric crosses its threshold, the configured action is taken.

Under the hood

Under the hood, alarms are implemented as AWS CloudWatch alarms.

Usage

Alarms can be created in two ways:

Global alarms - created in the Stacktape console. These alarms are applied to all resources of a given type managed by Stacktape.
In-config alarms - specified directly on a resource in your Stacktape config file.

Global Alarms

Global alarms are created in the Stacktape console. Creating an alarm in the console creates a template (blueprint) of the alarm. The actual alarms are created for the eligible resources of your stack only when you deploy (create or update) the stack after the global alarm was created, and only if the alarm applies to your stack's serviceName and stage.

When configuring a global alarm, you can specify:

the type of resource and the metric you wish to monitor
thresholds for the metric
automatic Slack or email notifications
which stacks the alarm applies to (serviceName, stage)

Global alarms you create in the console are applied only to resources in stacks that are created or updated after the global alarm was created. In other words, after creating a global alarm, you need to run the deploy command to apply it to the resources in your stacks.

Creating Global Alarm

Go to the alarms page in the Stacktape console and click the Create new alarm button.

Alarm page

Configure the alarm according to your needs. In this example, we create an alarm that monitors the error rate of Lambda functions and limit it to the prod stage (used for production environments).

Creating alarm

After creating the alarm, run the deploy command on your stack to create the actual alarms in the stack.

In-config Alarms

You can specify an alarm directly in the Stacktape config file, as a property of the resource that should be monitored.


resources:
  myLambdaFunction:
    type: function
    properties:
      packaging:
        type: stacktape-lambda-buildpack
        properties:
          entryfilePath: 'lambdas/js-lambda.js'
      alarms:
        - trigger:
            type: lambda-error-rate
            properties:
              thresholdPercent: 5
          notificationTargets:
            - type: slack
              properties:
                conversationId: C038XXXXXX
                accessToken: $Secret('slack-access-token')

Schema of alarm

LambdaAlarm API reference

trigger (Required)
evaluation
notificationTargets

An alarm consists of 3 parts:

Trigger - The trigger type specifies which metric is monitored.
Notifications (optional) - Specifies where to send a notification when the alarm is triggered.
Evaluation (optional) - Configures the evaluation period (interval) for the monitored metric.
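
As a rough illustration, the three parts could be combined on a single Lambda function as sketched below. The resource name, threshold, period, and email addresses are placeholder values, and the packaging block is copied from the in-config example earlier on this page. The period is assumed to be given in seconds.

```yaml
resources:
  myLambdaFunction:
    type: function
    properties:
      packaging:
        type: stacktape-lambda-buildpack
        properties:
          entryfilePath: 'lambdas/js-lambda.js'
      alarms:
        # 1. Trigger: which metric is monitored and at what threshold
        - trigger:
            type: lambda-error-rate
            properties:
              thresholdPercent: 5
          # 2. Evaluation (optional): period length, assumed to be in seconds
          evaluation:
            period: 300
          # 3. Notifications (optional): where the notification is sent
          notificationTargets:
            - type: email
              properties:
                sender: alarm@company.com
                recipient: support@company.com
```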

Trigger

The trigger type specifies which metric is monitored. Depending on the type, further trigger properties can be specified.

LambdaErrorRateTrigger API reference

type (Required)
properties.thresholdPercent (Required)
properties.comparisonOperator


resources:
  myLambdaFunction:
    type: function
    properties:
      ...
      alarms:
        - trigger:
            type: lambda-error-rate
            properties:
              thresholdPercent: 5

See the Trigger types section for all available triggers.

Notifications

Specifies where to send a notification when the alarm is triggered. There are 3 integrations available:

Slack integration:


resources:
  myResource:
    type: ...
    properties:
      ...
      alarms:
        - trigger:
            ...
          notificationTargets:
            - type: slack
              properties:
                conversationId: C038XXXXXX
                accessToken: $Secret('slack-access-token')

MS Teams integration:


resources:
  myResource:
    type: ...
    properties:
      ...
      alarms:
        - trigger:
            ...
          notificationTargets:
            - type: ms-teams
              properties:
                webhookUrl: MY_WEBHOOK_URL

Email integration:


resources:
  myResource:
    type: ...
    properties:
      ...
      alarms:
        - trigger:
            ...
          notificationTargets:
            - type: email
              properties:
                sender: alarm@company.com
                recipient: support@company.com
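
Because notificationTargets is a list, it seems reasonable to assume that a single alarm can notify several targets at once. The sketch below rests on that assumption and reuses the placeholder values from the examples above, together with the Lambda error rate trigger shown earlier.

```yaml
resources:
  myLambdaFunction:
    type: function
    properties:
      # ...packaging and other function properties omitted
      alarms:
        - trigger:
            type: lambda-error-rate
            properties:
              thresholdPercent: 5
          notificationTargets:
            # Assumption: more than one target may be listed for one alarm
            - type: slack
              properties:
                conversationId: C038XXXXXX
                accessToken: $Secret('slack-access-token')
            - type: email
              properties:
                sender: alarm@company.com
                recipient: support@company.com
```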

Evaluation

Configures the evaluation period (interval) for the monitored metric.

AlarmEvaluation API reference

period (Default: 60)
evaluationPeriods (Default: 1)
breachedPeriods (Default: 1)


resources:
  myResource:
    type: ...
    properties:
      ...
      alarms:
        - trigger:
            ...
          evaluation:
            period: 200
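
The three evaluation properties can also be combined. The sketch below assumes that period is given in seconds (consistent with the default of 60 and the 1-minute default mentioned on this page), and that evaluationPeriods and breachedPeriods behave like the "evaluation periods" and "datapoints to alarm" settings of CloudWatch alarms. The exact semantics should be checked against the AlarmEvaluation API reference.

```yaml
resources:
  myLambdaFunction:
    type: function
    properties:
      # ...packaging and other function properties omitted
      alarms:
        - trigger:
            type: lambda-error-rate
            properties:
              thresholdPercent: 5
          evaluation:
            # Length of one evaluation period (assumed to be in seconds)
            period: 60
            # Look at the 5 most recent periods...
            evaluationPeriods: 5
            # ...and (assumed) fire only if at least 3 of them breach the threshold
            breachedPeriods: 3
```

Under these assumptions, a single one-minute error spike would not trigger the alarm, while an error rate that stays above 5% for three of the last five minutes would.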

Lambda function with error rate alarm

Trigger Lambda on Alarm

See the alarm event on the function page.

Trigger types

New trigger types are added continuously. If you need a specific trigger, do not hesitate to open a GitHub issue.

Lambda Error Rate

The error rate is calculated as the percentage of invocations that ended with an error (failed invocations / total invocation count) during the evaluation period (1 minute by default).

LambdaErrorRateTrigger API reference

type (Required)
properties.thresholdPercent (Required)
properties.comparisonOperator

Lambda Duration

By default, the trigger fires when the average (avg) execution duration during the evaluation period (1 minute by default) is greater than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

LambdaDurationTrigger API reference

type (Required)
properties.thresholdMilliseconds (Required)
properties.comparisonOperator
properties.statistic (Default: avg)
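
A minimal sketch of a duration alarm follows. The trigger type name lambda-duration is an assumption inferred from the naming of lambda-error-rate and should be verified against the LambdaDurationTrigger API reference. The threshold value is a placeholder.

```yaml
resources:
  myLambdaFunction:
    type: function
    properties:
      # ...packaging and other function properties omitted
      alarms:
        - trigger:
            type: lambda-duration # assumed type name
            properties:
              # Fire when the average duration exceeds 2 seconds
              thresholdMilliseconds: 2000
              # avg is the documented default, spelled out here for clarity
              statistic: avg
```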

Database Read Latency

By default, the trigger fires when the average (avg) read latency during the evaluation period (1 minute by default) is greater than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

RelationalDatabaseReadLatencyTrigger API reference

type (Required)
properties.thresholdSeconds (Required)
properties.comparisonOperator
properties.statistic (Default: avg)

Database Write Latency

By default, the trigger fires when the average (avg) write latency during the evaluation period (1 minute by default) is greater than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

RelationalDatabaseWriteLatencyTrigger API reference

type (Required)
properties.thresholdSeconds (Required)
properties.comparisonOperator
properties.statistic (Default: avg)

Database CPU Utilization

By default, the trigger fires when the average (avg) CPU utilization during the evaluation period (1 minute by default) is greater than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

RelationalDatabaseCPUUtilizationTrigger API reference

type (Required)
properties.thresholdPercent (Required)
properties.comparisonOperator
properties.statistic (Default: avg)
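
For illustration, a database CPU alarm might look like the sketch below. Both the resource type relational-database and the trigger type database-cpu-utilization are assumed names that do not appear on this page and should be checked against the API reference. The resource name and threshold are placeholders.

```yaml
resources:
  myDatabase:
    type: relational-database # assumed resource type name
    properties:
      # ...engine, credentials, and other database properties omitted
      alarms:
        - trigger:
            type: database-cpu-utilization # assumed type name
            properties:
              # Fire when average CPU utilization is above 80 %
              thresholdPercent: 80
```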

Database Free Storage

By default, the trigger fires when the minimum (min) free storage space during the evaluation period (1 minute by default) is lower than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

RelationalDatabaseFreeStorageTrigger API reference

type (Required)
properties.thresholdMB (Required)
properties.comparisonOperator
properties.statistic (Default: avg)

Database Free Memory

By default, the trigger fires when the average (avg) free memory during the evaluation period (1 minute by default) is lower than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

RelationalDatabaseFreeMemoryTrigger API reference

type (Required)
properties.thresholdMB (Required)
properties.comparisonOperator
properties.statistic (Default: avg)

Database Connection Count

By default, the trigger fires when the average (avg) number of connections during the evaluation period (1 minute by default) is greater than the threshold.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

RelationalDatabaseConnectionCountTrigger API reference

type (Required)
properties.thresholdCount (Required)
properties.comparisonOperator
properties.statistic (Default: avg)

HTTP API Gateway Error Rate

The error rate is calculated as the percentage of responses with a 4xx or 5xx status code (4xx and 5xx response count / total response count) during the evaluation period (1 minute by default).

HttpApiGatewayErrorRateTrigger API reference

type (Required)
properties.thresholdPercent (Required)
properties.comparisonOperator

HTTP API Gateway Latency

By default, the trigger fires when the average (avg) latency during the evaluation period (1 minute by default) is greater than the threshold.
Latency is the time between API Gateway receiving a request from a client and returning a response to that client.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

HttpApiGatewayLatencyTrigger API reference

type (Required)
properties.thresholdMilliseconds (Required)
properties.comparisonOperator
properties.statistic (Default: avg)
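
A hedged sketch of a latency alarm on an HTTP API Gateway follows. The resource type http-api-gateway and the trigger type http-api-gateway-latency are assumed names that are not confirmed on this page. The resource name and the threshold are placeholders.

```yaml
resources:
  myHttpApi:
    type: http-api-gateway # assumed resource type name
    properties:
      # ...other gateway properties omitted
      alarms:
        - trigger:
            type: http-api-gateway-latency # assumed type name
            properties:
              # Fire when average latency is above 500 ms
              thresholdMilliseconds: 500
```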

Application Load Balancer Error Rate

The error rate is calculated as the percentage of responses with a 4xx or 5xx status code (4xx and 5xx response count / total response count) during the evaluation period (1 minute by default).

ApplicationLoadBalancerErrorRateTrigger API reference

type (Required)
properties.thresholdPercent (Required)
properties.comparisonOperator

Application Load Balancer Custom

You can pick any metric from the list below.
The threshold is compared with a value calculated as statistic(CHOSEN_METRIC), where:
statistic - the statistic function applied to the metric values collected during the evaluation period (avg (average) by default)
CHOSEN_METRIC - the metric you choose

The available metrics are:

ActiveConnectionCount - The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
AnomalousHostCount - The number of hosts detected with anomalies.
ClientTLSNegotiationErrorCount - The number of TLS connections initiated by the client that did not establish a session with the load balancer due to a TLS error.
ConsumedLCUs - The number of load balancer capacity units (LCU) used by your load balancer.
DesyncMitigationMode_NonCompliant_Request_Count - The number of requests that do not comply with RFC 7230.
DroppedInvalidHeaderRequestCount - The number of requests where the load balancer removed HTTP headers with header fields that are not valid before routing the request.
MitigatedHostCount - The number of targets under mitigation.
ForwardedInvalidHeaderRequestCount - The number of requests routed by the load balancer that had HTTP headers with header fields that are not valid.
GrpcRequestCount - The number of gRPC requests processed over IPv4 and IPv6.
HTTP_Fixed_Response_Count - The number of fixed-response actions that were successful.
HTTP_Redirect_Count - The number of redirect actions that were successful.
HTTP_Redirect_Url_Limit_Exceeded_Count - The number of redirect actions that couldn't be completed because the URL in the response location header is larger than 8K.
HTTPCode_ELB_3XX_Count - The number of HTTP 3XX redirection codes that originate from the load balancer.
HTTPCode_ELB_4XX_Count - The number of HTTP 4XX client error codes that originate from the load balancer.
HTTPCode_ELB_5XX_Count - The number of HTTP 5XX server error codes that originate from the load balancer.
HTTPCode_ELB_500_Count - The number of HTTP 500 error codes that originate from the load balancer.
HTTPCode_ELB_502_Count - The number of HTTP 502 error codes that originate from the load balancer.
HTTPCode_ELB_503_Count - The number of HTTP 503 error codes that originate from the load balancer.
HTTPCode_ELB_504_Count - The number of HTTP 504 error codes that originate from the load balancer.
IPv6ProcessedBytes - The total number of bytes processed by the load balancer over IPv6.
IPv6RequestCount - The number of IPv6 requests received by the load balancer.
NewConnectionCount - The total number of new TCP connections established from clients to the load balancer and from the load balancer to targets.
NonStickyRequestCount - The number of requests where the load balancer chose a new target because it couldn't use an existing sticky session.
ProcessedBytes - The total number of bytes processed by the load balancer over IPv4 and IPv6.
RejectedConnectionCount - The number of connections that were rejected because the load balancer had reached its maximum number of connections.
RequestCount - The number of requests processed over IPv4 and IPv6.
RuleEvaluations - The number of rules processed by the load balancer given a request rate averaged over an hour.
HealthyHostCount - The number of targets that are considered healthy.
HTTPCode_Target_2XX_Count - The number of HTTP 2XX response codes generated by the targets.
HTTPCode_Target_3XX_Count - The number of HTTP 3XX response codes generated by the targets.
HTTPCode_Target_4XX_Count - The number of HTTP 4XX response codes generated by the targets.
HTTPCode_Target_5XX_Count - The number of HTTP 5XX response codes generated by the targets.
RequestCountPerTarget - The average request count per target, in a target group.
TargetConnectionErrorCount - The number of connections that were not successfully established between the load balancer and target.
TargetResponseTime - The time elapsed, in seconds, after the request leaves the load balancer until the target starts to send the response headers.
TargetTLSNegotiationErrorCount - The number of TLS connections initiated by the load balancer that did not establish a session with the target.
UnHealthyHostCount - The number of targets that are considered unhealthy.
HealthyStateDNS - The number of zones that meet the DNS healthy state requirements.
HealthyStateRouting - The number of zones that meet the routing healthy state requirements.
UnhealthyRoutingRequestCount - The number of requests that are routed using the routing failover action (fail open).
UnhealthyStateDNS - The number of zones that do not meet the DNS healthy state requirements.
UnhealthyStateRouting - The number of zones that do not meet the routing healthy state requirements.
LambdaInternalError - The number of requests to a Lambda function that failed because of an issue internal to the load balancer or AWS Lambda.
LambdaTargetProcessedBytes - The total number of bytes processed by the load balancer for requests to and responses from a Lambda function.
LambdaUserError - The number of requests to a Lambda function that failed because of an issue with the Lambda function.
ELBAuthError - The number of user authentications that could not be completed due to an internal error.
ELBAuthFailure - The number of user authentications that could not be completed because the IdP denied access to the user.
ELBAuthLatency - The time elapsed, in milliseconds, to query the IdP for the ID token and user info.
ELBAuthRefreshTokenSuccess - The number of times the load balancer successfully refreshed user claims using a refresh token provided by the IdP.
ELBAuthSuccess - The number of authenticate actions that were successful.
ELBAuthUserClaimsSizeExceeded - The number of times that a configured IdP returned user claims that exceeded 11K bytes in size.

ApplicationLoadBalancerCustomTrigger API reference

type (Required)
properties.metric (Required)
properties.threshold (Required)
properties.statistic (Default: avg)
properties.comparisonOperator
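
To make the statistic(CHOSEN_METRIC) formula concrete, the sketch below compares avg(TargetResponseTime) with a threshold of 1. Since TargetResponseTime is reported in seconds, this corresponds to an average target response time of one second. The resource type application-load-balancer and the trigger type application-load-balancer-custom are assumed names. The default comparison direction is not stated for this trigger, so the sketch also assumes "greater than", as with most other triggers on this page.

```yaml
resources:
  myLoadBalancer:
    type: application-load-balancer # assumed resource type name
    properties:
      # ...other load balancer properties omitted
      alarms:
        - trigger:
            type: application-load-balancer-custom # assumed type name
            properties:
              # CHOSEN_METRIC from the list of available metrics
              metric: TargetResponseTime
              # Statistic applied to the collected values (avg is the default)
              statistic: avg
              # Compared with avg(TargetResponseTime), which is in seconds
              threshold: 1
```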

SQS Queue Received Messages

By default, the trigger fires when the average (avg) number of messages received by consumers during the evaluation period (1 minute by default) is greater than the threshold.
The average is calculated from the NumberOfMessagesReceived metric exposed by SQS.
You can optionally customize the trigger behaviour by modifying the statistic and comparisonOperator properties.

SqsQueueReceivedMessagesCountTrigger API reference

type (Required)
properties.thresholdCount (Required)
properties.comparisonOperator
properties.statistic (Default: avg)

SQS Queue Not Empty

By default, the trigger fires when the queue is not empty.
For the queue to be considered empty, all of the following metrics exposed by the SQS queue must be 0 during the evaluation period:
- ApproximateNumberOfMessagesVisible
- ApproximateNumberOfMessagesNotVisible
- NumberOfMessagesReceived
- NumberOfMessagesSent

SqsQueueNotEmptyTrigger API reference

type (Required)
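
Since the only listed property of this trigger is type, a not-empty alarm can presumably be declared without a properties block. The sketch below assumes the resource type sqs-queue and the trigger type sqs-queue-not-empty, neither of which is confirmed on this page. The email addresses are the placeholders used in the notification examples.

```yaml
resources:
  myQueue:
    type: sqs-queue # assumed resource type name
    properties:
      # ...other queue properties omitted
      alarms:
        - trigger:
            # Assumed type name; no trigger properties are listed for it
            type: sqs-queue-not-empty
          notificationTargets:
            - type: email
              properties:
                sender: alarm@company.com
                recipient: support@company.com
```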