1.0.2 • Published 3 years ago

microfrontier v1.0.2

License
ISC

MicroFrontier

MicroFrontier is a web crawler frontier written in TypeScript and backed by Redis. It is a scalable, distributed frontier built on Redis queues.

  • Fast Ingestion & High throughput
  • Multiple priority queues
  • Custom priority strategy
  • Per-Hostname crawl rate limit or default delay fallback
  • Easy-to-use HTTP microservice
  • Multi-processing support
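
To make the per-hostname rate limit with a default delay fallback concrete, here is a minimal in-memory sketch. It is not MicroFrontier's Redis-backed implementation; the `HostRateLimiter` class, its method names, and the use of milliseconds are assumptions made for illustration.

```typescript
// Illustrative only: an in-memory politeness tracker, not MicroFrontier's own code.
class HostRateLimiter {
    // hostname -> earliest timestamp (ms) at which the host may be crawled again
    private nextAllowed: Record<string, number> = {}
    // hostname -> custom delay in ms (hypothetical per-host override)
    private hostDelays: Record<string, number> = {}
    private defaultCrawlDelay: number

    constructor(defaultCrawlDelay: number = 1000) {
        this.defaultCrawlDelay = defaultCrawlDelay
    }

    // Register a host-specific delay that takes precedence over the default.
    setDelay(hostname: string, delayMs: number): void {
        this.hostDelays[hostname] = delayMs
    }

    // Returns true and reserves the next slot if `hostname` may be crawled at `now`.
    tryAcquire(hostname: string, now: number = Date.now()): boolean {
        const readyAt = this.nextAllowed[hostname] ?? 0
        if (now < readyAt) return false
        // Fall back to the default delay when the host has no override.
        const delay = this.hostDelays[hostname] ?? this.defaultCrawlDelay
        this.nextAllowed[hostname] = now + delay
        return true
    }
}
```

A caller would check `tryAcquire(hostname)` before fetching a URL and skip or postpone the URL when it returns `false`.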

Figure: Example of a Mercator frontier [1]

Usage

MicroFrontier can be used as a JavaScript library (SDK), from the command line, or as a Docker deployment.

Command Line

Install microfrontier with:

npm i -g microfrontier

Run microfrontier:

microfrontier --port 3035 --redis:host localhost # see Configuration for other parameters

As a package

Npm:

npm i microfrontier

Yarn:

yarn add microfrontier

Docker

docker pull adileo/microfrontier

Configuration

| ENV VAR | CLI PARAMS | Description |
| --- | --- | --- |
| host | --host | Host name to start the microservice HTTP server. Default value: 127.0.0.1 |
| port | --port | Port to start the microservice HTTP server. Default value: 8090 |
| redis_host | --redis:host | Redis server host. Default value: 127.0.0.1 |
| redis_port | --redis:port | Redis server port. Default value: 6379 |
| redis_* | --redis:* | Parameters are interpreted by nconf and passed to ioredis as the client config. |
| config_frontierName | --config:frontierName | Prefix used for Redis keys. |
| config_* | --config:* | Parameters are interpreted by nconf. Default values are shown below. |

{
    frontierName: 'frontier',
    priorities: {
        'high':     {probability: 0.6},
        'normal':   {probability: 0.3},
        'low':      {probability: 0.1},
    },
    defaultCrawlDelay: 1000
}
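
The `probability` values in the default configuration suggest a weighted choice between the priority queues. The README does not describe the exact selection logic, so the following is only a sketch under that assumption; `FrontierConfig` and `pickPriority` are hypothetical names, not part of the package.

```typescript
// Shape of the default configuration shown above (type name is hypothetical).
interface FrontierConfig {
    frontierName: string
    priorities: Record<string, { probability: number }>
    defaultCrawlDelay: number
}

const defaultConfig: FrontierConfig = {
    frontierName: "frontier",
    priorities: {
        high: { probability: 0.6 },
        normal: { probability: 0.3 },
        low: { probability: 0.1 },
    },
    defaultCrawlDelay: 1000,
}

// Pick a priority name by weighted random choice.
// `rand` is injectable so the choice can be made deterministic in tests.
function pickPriority(
    priorities: FrontierConfig["priorities"],
    rand: () => number = Math.random
): string {
    const names = Object.keys(priorities)
    if (names.length === 0) throw new Error("No priorities configured")
    let total = 0
    for (const name of names) total += priorities[name].probability
    // Walk the cumulative distribution until the random threshold is used up.
    let threshold = rand() * total
    for (const name of names) {
        threshold -= priorities[name].probability
        if (threshold < 0) return name
    }
    // Guard against floating-point rounding at the upper edge.
    return names[names.length - 1]
}
```

With the default weights, roughly six out of ten picks would come from the `high` queue, which keeps lower priorities from starving while still favoring important URLs.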

How to

Adding a URL to the frontier

Via HTTP

curl --location --request POST 'http://127.0.0.1:8090/frontier' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "http://www.example.com",
    "priority": "normal",
    "meta": {
        "foo": "bar"
    }
}'
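
The same request can be issued from TypeScript. The sketch below mirrors the curl call above; the `addUrl` helper and the `PostFn` type are hypothetical, and the response handling is an assumption because the README does not document the response body.

```typescript
// Request body as used in the curl example above.
interface AddUrlBody {
    url: string
    priority: string
    meta?: Record<string, unknown>
}

// Minimal structural type for a fetch-like function, so a fake can be passed in tests.
type PostFn = (
    url: string,
    init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number }>

// POST one URL to the frontier microservice at `baseUrl`.
async function addUrl(post: PostFn, baseUrl: string, item: AddUrlBody): Promise<void> {
    const res = await post(`${baseUrl}/frontier`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(item),
    })
    // Assumption: any non-2xx status is treated as a failure.
    if (!res.ok) throw new Error(`Frontier responded with HTTP ${res.status}`)
}
```

On Node 18 or later, the global `fetch` can be passed as `post`, for example `addUrl(fetch, "http://127.0.0.1:8090", { url: "http://www.example.com", priority: "normal", meta: { foo: "bar" } })`.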

Via SDK

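import { URLFrontier } from "microfrontier"

// `config` is your frontier configuration object (see the Configuration section)
const frontier = new URLFrontier(config)

// Add a URL with priority "normal" and arbitrary metadata
frontier.add("http://www.example.com", "normal", {"foo": "bar"}).then(() => {
    console.log('URL added')
})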

Getting a URL from the frontier

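Via HTTP

curl --location --request GET 'http://127.0.0.1:8090/frontier'

Via SDK

import { URLFrontier } from "microfrontier"

// `config` is your frontier configuration object (see the Configuration section)
const frontier = new URLFrontier(config)

// Fetch the next URL from the frontier
frontier.get().then((item) => {
    // item: {url: "http://www.example.com", meta: {"foo":"bar"}}
})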
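
Combining `add` and `get` gives a basic crawl loop. The sketch below is written against a small hypothetical `FrontierLike` interface that covers only the two methods shown in this README. It assumes `get()` resolves to an empty value when nothing is available, which the README does not confirm.

```typescript
interface FrontierItem {
    url: string
    meta?: Record<string, unknown>
}

// Only the two methods used in this README; the real URLFrontier may expose more.
interface FrontierLike {
    add(url: string, priority: string, meta: Record<string, unknown>): Promise<unknown>
    get(): Promise<FrontierItem | null | undefined>
}

// Process up to `maxItems` URLs. `handle` fetches a page and returns the links it found,
// which are queued again with "normal" priority.
async function crawl(
    frontier: FrontierLike,
    handle: (item: FrontierItem) => Promise<string[]>,
    maxItems: number
): Promise<number> {
    let processed = 0
    while (processed < maxItems) {
        const item = await frontier.get()
        // Assumption: an empty result means no URL is currently available.
        if (!item) break
        const links = await handle(item)
        for (const link of links) {
            await frontier.add(link, "normal", {})
        }
        processed++
    }
    return processed
}
```

A long-running crawler would probably wait and retry instead of stopping on an empty result, since URLs may only be held back temporarily by the crawl delay.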

Citations

[1]: High-Performance Web Crawling - Marc Najork, Allan Heydon