0.0.1 • Published 4 years ago

actor-jomashop-crawler v0.0.1

License
ISC

Apify Actor - Jomashop

Overview

We use the Apify SDK to build and run our scrapers. This template uses Puppeteer as its crawler.

Setup

First, be sure you have Node v12 installed.

  • Generate a GitHub Access Token.

  • Set a new shell env variable, GH_TOKEN, to your new GitHub Access Token.

  • Copy .npmrc-dist to .npmrc

  • Then run npm install.
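The prerequisites above can be checked before running npm install. The following is a minimal sketch, not part of the template: the helper name checkSetup is hypothetical, and it assumes that exactly Node v12 is wanted, since that is the only version named here. Relax the version check if newer releases turn out to work.

```typescript
// Hypothetical preflight helper; not shipped with this template.
type Env = Record<string, string | undefined>;

function checkSetup(nodeVersion: string, env: Env): string[] {
    const problems: string[] = [];

    // nodeVersion is expected in the form reported by process.version, e.g. "v12.22.1".
    const match = /^v?(\d+)/.exec(nodeVersion);
    const major = match ? Number(match[1]) : NaN;
    if (major !== 12) {
        // Assumption: the README asks for Node v12 specifically.
        problems.push('Node v12 is required, found ' + nodeVersion);
    }

    // GH_TOKEN must hold the GitHub Access Token before npm install runs.
    const token = env['GH_TOKEN'];
    if (typeof token !== 'string' || token.trim() === '') {
        problems.push('GH_TOKEN is not set');
    }

    return problems;
}
```

In a Node script this could be called as checkSetup(process.version, process.env). An empty array means both prerequisites are met.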

You'll need a login to our Apify account.

  • Log into Apify platform: npm run apify:login

Proxies

Out of the box, this actor template has Apify proxy usage enabled.

You will need to go to your Apify Proxy dashboard, copy the password, and set it as the environment variable APIFY_PROXY_PASSWORD.

If you choose not to use a proxy, you'll need to comment out proxyConfiguration in the instantiation parameters of PuppeteerCrawler in the file src/main.ts.

To use custom proxies, check out the documentation.

Note: If you use a trial account, you'll get an error when attempting to use Apify Proxy. During the trial period, access to Apify Proxy is allowed only from actors running on the Apify platform.
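Instead of commenting proxyConfiguration in and out by hand, the decision could be driven by whether APIFY_PROXY_PASSWORD is set. This is a sketch under assumptions: the helper name shouldUseApifyProxy is hypothetical, and the wiring shown in the comment is a guess, since the contents of src/main.ts are not reproduced here.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: true only when a non-empty proxy password is available.
function shouldUseApifyProxy(env: Env): boolean {
    const password = env['APIFY_PROXY_PASSWORD'];
    return typeof password === 'string' && password.trim() !== '';
}

// Assumed wiring in src/main.ts (not taken from the actual file):
//   const crawlerOptions = {
//       ...baseOptions,
//       ...(shouldUseApifyProxy(process.env) ? { proxyConfiguration } : {}),
//   };
```

With this approach, a local run on a trial account can simply leave APIFY_PROXY_PASSWORD unset and the crawler runs without a proxy.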

Running

You can run the actor locally via: npm run dev

Note: To run locally, you'll need to copy .env-dist to .env and add the appropriate User ID and API token, both of which can be found on the Apify account page.
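A quick way to catch an incomplete .env before a local run is to check that the expected keys have values. The sketch below is illustrative only: the key names APIFY_USER_ID and APIFY_TOKEN are assumptions, so use whatever names .env-dist actually lists, and note that the parser handles only simple KEY=VALUE lines.

```typescript
// Parse simple KEY=VALUE lines, skipping blank lines and # comments.
function parseDotEnv(text: string): Record<string, string> {
    const values: Record<string, string> = {};
    for (const rawLine of text.split(/\r?\n/)) {
        const line = rawLine.trim();
        if (line === '' || line.charAt(0) === '#') continue;
        const eq = line.indexOf('=');
        if (eq <= 0) continue; // not a KEY=VALUE line
        values[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
    }
    return values;
}

// Return the required keys that are absent or empty in the .env text.
function missingKeys(text: string, required: string[]): string[] {
    const values = parseDotEnv(text);
    return required.filter((key) => !values[key]);
}

// Example (key names are assumptions, see .env-dist for the real ones):
//   missingKeys(envFileText, ['APIFY_USER_ID', 'APIFY_TOKEN'])
```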

Or you can push the actor to the Apify Cloud Platform: npm run apify:push

Scheduling

The Apify SDK does not currently handle scheduling. However, scheduling is available as a CLI tool within this template.

Getting started

You must first build this script by running:

npm run build

In order to use it from any path you must then run:

npm link

This adds apify-schedule as a command on your command line.

Usage

You can view the usage help menu by running:

apify-schedule

The following commands are available via this tool:

Usage: apify-schedule [options]

Options:
  -l, --list [number]        list schedules (default 100)
  -g, --get <scheduleId>     get schedule using schedule id
  -c, --create               run schedule creation wizard
  -u, --update <scheduleId>  update specified schedule
  -d, --delete <scheduleId>  delete specified schedule
  --asc                      sort by createdAt field in ascending order
  -h, --help                 display help for command

Currently, Actors and Actor Tasks must be assigned to a schedule manually via the Apify Cloud Platform.
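To make the option list above concrete, the sketch below maps each operation to the arguments it would pass to apify-schedule. It is illustrative only: the ScheduleCommand type and toArgs helper are hypothetical, and combining --asc with --list is an assumption based on the option descriptions.

```typescript
// Hypothetical model of the operations listed in the help output.
type ScheduleCommand =
    | { kind: 'list'; limit?: number; ascending?: boolean }
    | { kind: 'get'; scheduleId: string }
    | { kind: 'create' }
    | { kind: 'update'; scheduleId: string }
    | { kind: 'delete'; scheduleId: string };

// Build the argument list for one apify-schedule invocation.
function toArgs(cmd: ScheduleCommand): string[] {
    switch (cmd.kind) {
        case 'list': {
            const args = ['--list'];
            if (cmd.limit !== undefined) args.push(String(cmd.limit));
            if (cmd.ascending) args.push('--asc'); // assumed to apply to listing
            return args;
        }
        case 'get':
            return ['--get', cmd.scheduleId];
        case 'create':
            return ['--create'];
        case 'update':
            return ['--update', cmd.scheduleId];
        case 'delete':
            return ['--delete', cmd.scheduleId];
    }
}
```

For example, toArgs({ kind: 'list', limit: 10, ascending: true }) corresponds to running apify-schedule --list 10 --asc.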

Puppeteer Extras (plugins)

puppeteer-extra has been added for plugin support. For example, to add captcha solving, use puppeteer-extra-plugin-recaptcha, which natively supports 2captcha.
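As a rough illustration, the recaptcha plugin takes a provider entry naming 2captcha and an API token. The helper recaptchaOptions below is hypothetical, and how this template hands the puppeteer-extra instance to PuppeteerCrawler is not shown here, so the wiring in the comments is an assumption.

```typescript
// Options shape used by puppeteer-extra-plugin-recaptcha for its built-in 2captcha provider.
interface RecaptchaOptions {
    provider: { id: '2captcha'; token: string };
    visualFeedback: boolean;
}

// Hypothetical helper: builds the plugin options from a 2captcha API token.
function recaptchaOptions(token: string): RecaptchaOptions {
    if (token.trim() === '') {
        throw new Error('A 2captcha API token is required');
    }
    return { provider: { id: '2captcha', token }, visualFeedback: true };
}

// Assumed wiring, outside this sketch:
//   import puppeteer from 'puppeteer-extra';
//   import RecaptchaPlugin from 'puppeteer-extra-plugin-recaptcha';
//   puppeteer.use(RecaptchaPlugin(recaptchaOptions(token)));
//   ...
//   await page.solveRecaptchas(); // inside the page handler
```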

PuppeteerCrawler Template

This template is a production-ready boilerplate for developing with PuppeteerCrawler. Use it to bootstrap your projects with up-to-date code.

If you're looking for examples or want to learn more, visit:

Documentation reference