Stacktape


Web scraper with Puppeteer

Web scraper with Puppeteer

  • This project deploys a web link scraper implemented in Typescript using headless chrome - Puppeteer.
  • The scraper runs in a Lambda function

Pricing

  • The infrastructure required for this application uses exclusively "serverless", pay-per-use infrastructure. If your load won't get high, these costs will be close to $0.

Prerequisites

If you're deploying from your local machine (not from a CI/CD pipeline), you need the following prerequisites:

1. Generate your project

The command below will bootstrap the project with pre-built application code and pre-configured stacktape.yml config file.

Copy

stp init --projectId lambda-web-scraper-puppeteer

2. Deploy your stack

  • To provision all the required infrastructure and to deploy your application to the cloud, all you need is a single command.
  • The deployment will take ~5-15 minutes. Subsequent deploys will be significantly faster.

Copy

stp deploy --stage <<stage>> --region <<region>>

stage is an arbitrary name of your environment (for example staging, production or dev-john)

region is the AWS region, where your stack will be deployed to. All the available regions are listed below.


Region name & Locationcode
Europe (Ireland)eu-west-1
Europe (London)eu-west-2
Europe (Frankfurt)eu-central-1
Europe (Milan)eu-south-1
Europe (Paris)eu-west-3
Europe (Stockholm)eu-north-1
US East (Ohio)us-east-2
US East (N. Virginia)us-east-1
US West (N. California)us-west-1
US West (Oregon)us-west-2
Canada (Central)ca-central-1
Africa (Cape Town)af-south-1
Asia Pacific (Hong Kong)ap-east-1
Asia Pacific (Mumbai)ap-south-1
Asia Pacific (Osaka-Local)ap-northeast-3
Asia Pacific (Seoul)ap-northeast-2
Asia Pacific (Singapore)ap-southeast-1
Asia Pacific (Sydney)ap-southeast-2
Asia Pacific (Tokyo)ap-northeast-1
China (Beijing)cn-north-1
China (Ningxia)cn-northwest-1
Middle East (Bahrain)me-south-1
South America (São Paulo)sa-east-1

3. Test your application

After a successful deployment, some information about the stack will be printed to the console (URLs of the deployed services, links to logs, metrics, etc.).

  • Navigate to <<mainApiGateway-url>>/scrape-links/<<url-of-the-website-to-scrape-without-http://>>. The URL of your mainApiGateway is printed to the terminal as mainApiGateway->url.

4. Run the application in development mode

To run functions in the development mode (remotely on AWS), you can use the dev command. For example, to develop and debug lambda function scrapeLinks, you can use

Copy

stp dev --region <<your-region>> --stage <<stage>> --resourceName scrapeLinks

The command will:

  • quickly re-build and re-deploy your new function code
  • watch for the function logs and pretty-print them to the terminal

The function is rebuilt and redeployed, when you either:

  • type rs + enter to the terminal
  • use the --watch option and one of your source code files changes

5. Hotswap deploys

  • Stacktape deployments use AWS CloudFormation under the hood. It brings a lot of guarantees and convenience, but can be slow for certain use-cases.

  • To speed up the deployment, you can use the --hotSwap flag that avoids Cloudformation.

  • Hotswap deployments work only for source code changes (for lambda function, containers and batch jobs) and for content uploads to buckets.

  • If the update deployment is not hot-swappable, Stacktape will automatically fall back to using a Cloudformation deployment.

Copy

stacktape deploy --hotSwap --stage <<stage>> --region <<region>>

6. Delete your stack

  • If you no longer want to use your stack, you can delete it.
  • Stacktape will automatically delete every infrastructure resource and deployment artifact associated with your stack.

Copy

stp delete --stage <<stage>> --region <<region>>

Stack description

Stacktape uses a simple stacktape.yml configuration file to describe infrastructure resources, packaging, deployment pipeline and other aspects of your services.

You can deploy your services to multiple environments (stages) - for example production, staging or dev-john. A stack is a running instance of a service. It consists of your application code (if any) and the infrastructure resources required to run it.

The configuration for this service is described below.

1. Service name

You can choose an arbitrary name for your service. The name of the stack will be constructed as {service-name}-{stage}.

Copy

serviceName: lambda-web-scraper-puppeteer

2. Resources

  • Every resource must have an arbitrary, alphanumeric name (A-z0-9).
  • Stacktape resources consist of multiple (sometimes more than 15) underlying AWS or 3rd party resources.

2.1 HTTP API Gateway

API Gateway receives requests and invokes our scraper function.

You can specify more properties on the gateway, in our case we are going with the default setup.

Copy

resources:
mainApiGateway:
type: http-api-gateway

2.2 Function

Core of our application is a lambda function scrapeLinks which can scrape links from arbitrary webpage.

The function is configured as follows:

  • Packaging - determines how the lambda artifact is built. The easiest and most optimized way to build the lambda from Typescript/Javascript is using stacktape-lambda-buildpack. We only need to configure entryfilePath. Stacktape automatically transpiles and builds the application code with all of its dependencies, creates the lambda zip artifact, and uploads it to a pre-created S3 bucket on AWS. You can also use other types of packaging.
  • Events - Events determine how is function triggered. In this case, we are triggering the function when an event (HTTP request) is delivered to the HTTP API gateway to URL path /scrape-links/{url}, where {url} is a path parameter(url can be arbitrary URL of page, i.e stacktape.com/blog). The event(request) including the path parameter is passed to the function handler as an argument.

Copy

scrapeLinks:
type: function
properties:
packaging:
type: stacktape-lambda-buildpack
properties:
entryfilePath: src/scrape-links.ts
excludeDependencies:
- puppeteer
events:
- type: http-api-gateway
properties:
httpApiGatewayName: mainApiGateway
path: /scrape-links/{url}
method: GET
Need help? Ask a question on SlackDiscord or info@stacktape.com.