# MrCrawler

MrCrawler (v1.0.1) is a simple but powerful crawler engine that lets you run broad crawls with no effort. It is built on top of Puppeteer and Playwright. You can also use Redis with a built-in worker distributor, allowing you to run MrCrawler concurrently on different machines.
## Installation

```shell
yarn add mrcrawler
# or
npm install mrcrawler
```
## Usage
```javascript
const { MRCrawler } = require('mrcrawler')

class DemoCrawler extends MRCrawler {
  constructor (options) {
    super(options)
    this.startUrl = options.startUrl
    this.linksToVisit = []
  }

  async customBeforeLauchPage (page, browser) {
    try {
      this.emit('log', 'Launching first page')
      await page.goto(this.startUrl, { waitUntil: 'networkidle2', timeout: 0 })
      await this.crawl(page, browser)
    } catch (error) {
      console.log(error)
    }
  }

  async crawlPage (page, browser) {
    // Read the page title, falling back to an empty string if <title> is missing
    const pageTitle = await page.evaluate(() => {
      const titleTag = document.querySelector('title')
      return titleTag ? titleTag.textContent : ''
    })
    // Collect every non-empty, non-javascript: link on the page
    const pageLinks = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('a'))
        .map((link) => link.href)
        .filter((href) => href !== '' && !href.includes('javascript'))
    })
    this.linksToVisit.push(...pageLinks)
    this.emit('log', `Title: ${pageTitle}`)
  }

  async afterPageCrawl (page, browser) {
    const nextLinkToVisit = this.linksToVisit.shift()
    if (!nextLinkToVisit) return // nothing left to visit
    await page.goto(nextLinkToVisit)
    await this.crawl(page, browser)
  }
}
```
To run your crawler after the setup, instantiate it and call `runCrawler()`:
```javascript
const myDemoCrawler = new DemoCrawler({
  startUrl: 'http://books.toscrape.com',
  headless: false
})

myDemoCrawler.runCrawler()
myDemoCrawler.on('log', (event) => console.log(event))
myDemoCrawler.on('error', (event) => console.log(event))
myDemoCrawler.on('done', (event) => console.log(event))
```
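Note that the `crawlPage` hook above queues every link it finds, so the same URL can be visited many times and the crawl can wander off the start domain. One way to tame that is a small helper (this is a hypothetical sketch, not part of MrCrawler's API) that keeps only same-origin links and drops URLs that were already seen:

```javascript
// Hypothetical helper (not part of MrCrawler): keeps only links on the same
// origin as startUrl and skips URLs that have already been queued or visited.
function filterLinks (links, startUrl, visited = new Set()) {
  const origin = new URL(startUrl).origin
  const result = []
  for (const href of links) {
    let url
    try {
      url = new URL(href)
    } catch {
      continue // skip malformed URLs
    }
    if (url.origin !== origin) continue // stay on the start domain
    const key = url.origin + url.pathname
    if (visited.has(key)) continue // already seen
    visited.add(key)
    result.push(url.href)
  }
  return result
}
```

Inside `crawlPage` you would then push `filterLinks(pageLinks, this.startUrl, this.visited)` instead of the raw list, with `this.visited` initialized as a `Set` in the constructor.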
## Extensions

### RedisCache

Enables a Redis list to manage all visited URLs and URLs to visit. It can be used for distributed crawling - you can run your crawler on more than one machine - and it gives you a fast way to do broad crawls.

To use RedisCache, check the `examples` folder and take a look at the `withRedis.js` file.
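The idea behind RedisCache is a shared URL frontier: a set of visited URLs plus a queue of URLs still to visit, which every worker machine pushes to and pops from. Here is a minimal in-process sketch of that data structure (the real extension keeps both collections in Redis; the `UrlFrontier` class name is made up for illustration):

```javascript
// In-process sketch of the shared frontier RedisCache provides via Redis:
// a visited set (think Redis SADD/SISMEMBER) plus a queue of URLs to visit
// (think Redis LPUSH/RPOP).
class UrlFrontier {
  constructor () {
    this.visited = new Set() // URLs any worker has ever queued
    this.toVisit = []        // URLs waiting to be crawled
  }

  // Queue a URL unless some worker has already seen it
  push (url) {
    if (this.visited.has(url)) return false
    this.visited.add(url)
    this.toVisit.push(url)
    return true
  }

  // Each worker pops its next URL from the shared queue
  pop () {
    return this.toVisit.shift()
  }
}
```

With Redis backing the set and the list, any number of machines can push and pop against the same frontier without revisiting URLs, which is what makes distributed broad crawls possible.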
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.