4.0.2 • Published 4 years ago

mini-crawler v4.0.2

Weekly downloads
1
License
MIT
Repository
github
Last release
4 years ago

mini-crawler

A really small crawler based on p-queue and got.

Installation

npm install mini-crawler

Usage

const { Crawler } = require('..')
const crawler = new Crawler({
  // All of p-queue options are accepted,
  // see https://www.npmjs.com/package/p-queue#options.
  concurrency: 10,
  interval: 1000,
  intervalCap: 500
})

crawler.crawl({
  // The crawling will begin at this URL.
  url: 'https://www.google.com/search?q=Tame+Impala',
  // All of got's options are accepted,
  // except isStream and resolveBodyOnly.
  // See https://www.npmjs.com/package/got#options.
  gotOptions: {
    headers: {
      'Accept-Language': 'en-US'
    }
  },
  callback: (err, { body, options: { url } }) => {
    if (err) {
      console.error(`Error occured at ${url}!`)
      return
    }

    console.log(`Fetched ${url}!`)

    const urls = [...body.matchAll(/href="(.*?)"/g)]
      .map(match => match[1]
        .replace(/&/g, '&')
      )
    console.log(`Found ${urls.length} new URLs!`)

    // The returned value will be used for further crawls.
    // It will be transformed into a CrawlOptions array,
    // then crawl() will be called for all of the array's items.
    // To understand the transforming mechanism,
    // see https://github.com/trustedtomato/mini-crawler/blob/master/src/result-to-crawloptions-array.ts
    return urls
  }
})

Output:

Fetched https://www.google.com/search?q=Tame+Impala!
Found 145 new URLs!
Fetched https://www.google.com/?sa=X&ved=0ahUKEwiM39zjjuHqAhUUdCsKHf7xAX8QOwgC!
Found 19 new URLs!
Fetched https://www.google.com/?output=search&ie=UTF-8&sa=X&ved=0ahUKEwiM39zjjuHqAhUUdCsKHf7xAX8QPAgE!
Found 19 new URLs!
Fetched https://www.google.com/advanced_search!
Found 22 new URLs!
Fetched https://www.google.com/search?q=Tame+Impala&ie=UTF-8&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM39zjjuHqAhUUdCsKHf7xAX8Q_AUICSgC!
Found 59 new URLs!
…

API

For the auto-generated typedoc, see the project's GitHub Page.

Examples

See the examples folder. There lies a crawler which tries to find Hitler's Wikipedia article by navigating from article to article, using the links in them. You know, the classic clicks to Hitler game.

To try out the example, clone this respository, cd into it, run npm install && npm run build then node examples/clicks-to-hitler.js Barack_Obama. (At the time of writing, you can get there with two clicks.)

4.0.2

4 years ago

4.0.1

4 years ago

4.0.0

4 years ago

3.0.0

5 years ago

2.0.0

5 years ago

1.1.0

5 years ago

1.0.0

5 years ago