
MrCrawler

MrCrawler is a simple but powerful crawler engine that you can use for broad crawls with no effort. MrCrawler is built on top of Puppeteer and Playwright. You can also use Redis with a built-in worker distributor, allowing you to run MrCrawler concurrently on different machines.

Installation

yarn add mrcrawler
npm install mrcrawler

Usage

const { MRCrawler } = require('mrcrawler')

class DemoCrawler extends MRCrawler {
  constructor (options) {
    super(options)
    this.startUrl = options.startUrl

    this.linksToVisit = []
  }

  // Called with the freshly launched page: navigate to the start URL and
  // kick off the crawl (hook name kept exactly as the package spells it)
  async customBeforeLauchPage (page, browser) {
    try {
      this.emit('log', 'Launching first page')
      await page.goto(this.startUrl, { waitUntil: 'networkidle2', timeout: 0 })
      await this.crawl(page, browser)
    } catch (error) {
      this.emit('error', error)
    }
  }

  async crawlPage (page, browser) {
    // Read the page title; fall back to an empty string if <title> is missing
    const pageTitle = await page.evaluate(() => {
      const titleTag = document.querySelector('title')
      return titleTag ? titleTag.textContent : ''
    })
    // Collect every non-empty href that is not a javascript: pseudo-link
    const pageLinks = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('a'))
        .map((link) => link.href)
        .filter((href) => href !== '' && !href.includes('javascript'))
    })

    // Queue the discovered links for later visits
    this.linksToVisit.push(...pageLinks)

    this.emit('log', `Title: ${pageTitle}`)
  }

  async afterPageCrawl (page, browser) {
    const nextLinkToVisit = this.linksToVisit.shift()
    if (!nextLinkToVisit) {
      // Queue exhausted; stop instead of calling page.goto(undefined)
      this.emit('done', 'No more links to visit')
      return
    }
    await page.goto(nextLinkToVisit)
    await this.crawl(page, browser)
  }
}

After the setup, instantiate your crawler, run it, and subscribe to its events:

const myDemoCrawler = new DemoCrawler({
  startUrl: 'http://books.toscrape.com',
  headless: false
})

myDemoCrawler.runCrawler()

myDemoCrawler.on('log', (event) => console.log(event))
myDemoCrawler.on('error', (event) => console.log(event))
myDemoCrawler.on('done', (event) => console.log(event))

Extensions

RedisCache: Enables a Redis list to manage all visited URLs and URLs to visit. It can be used for distributed crawling - you can run your crawler on more than one machine - which makes broad crawls fast. To use RedisCache, check the examples folder and take a look at the withRedis.js file.
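
As a rough sketch of what distributed use could look like, the snippet below reuses the DemoCrawler from the Usage section. The option names useRedis, redisHost and redisPort are assumptions made for illustration, not the extension's confirmed API; withRedis.js in the examples folder is the authoritative reference.

// A minimal sketch, assuming RedisCache is enabled through crawler options.
// useRedis, redisHost and redisPort are hypothetical names; see
// examples/withRedis.js for the real configuration.
const myDistributedCrawler = new DemoCrawler({
  startUrl: 'http://books.toscrape.com',
  headless: true,
  useRedis: true,         // hypothetical flag enabling the RedisCache extension
  redisHost: '127.0.0.1', // hypothetical: Redis instance shared by every worker
  redisPort: 6379
})

// Run the same script on several machines pointed at the same Redis instance;
// the shared visited/to-visit lists keep workers from crawling the same URL twice.
myDistributedCrawler.runCrawler()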

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
