ngrab v1.0.2

License: MIT

Intro

A lightweight Node.js spider. It supports:

  1. Following links
  2. Custom headers
  3. Bloom filter
  4. Retry mechanism
  5. Proxy requests
  6. Routing
  7. Resuming from the last visited link
  8. Free choice of parser and storage
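
Install

Assuming the package is published on npm under the name shown above, install it with:

npm install ngrab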

Usage

import { Crawler, userAgent } from 'ngrab'
import cheerio from 'cheerio'

// For example, crawling the hottest projects on Github
let crawler = new Crawler({
    // required && unique
    name: 'myCrawler',
    // enable bloom filter
    bloom: true,
    // set a random interval (ms) between requests
    interval: () => (Math.random() * 16 + 4) * 1000, // [4s, 20s]
    // initial links
    startUrls: ['https://github.com/trending'],
})

// download(name, cb)
crawler.download('trending', async ({ req, res, followLinks, resolveLink }) => {
    if (!res) return
    // parsing HTML strings
    let $ = cheerio.load(res.body.toString())
    // extract data
    let repoList: Array<{ name: string; href: string }> = [],
        $rows = $('.Box-row')
    if ($rows.length) {
        $rows.each(function (index) {
            let $item = $(this)

            repoList.push({
                name: $('.lh-condensed a .text-normal', $item)
                    .text()
                    .replace(/\s+/g, ' ')
                    .trim(),
                href: $('.lh-condensed a', $item).attr('href') as string,
            })
        })
        // print
        console.log(repoList) // or store in your Database
        // follow links
        // repoList.forEach((v) => followLinks(resolveLink(v.href)))
    }
})

// start crawling
crawler.run()
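
The console.log above is a stand-in for real persistence. As a minimal sketch (not part of ngrab itself), the extracted rows could be appended to a newline-delimited JSON file with Node's fs module; swap in your own database client as needed:

import { appendFileSync } from 'fs'

// call this inside the download callback once repoList is filled
function saveRepos(repoList: Array<{ name: string; href: string }>) {
    // one JSON object per line; replace with an insert into your database
    let lines = repoList.map((repo) => JSON.stringify(repo)).join('\n')
    appendFileSync('trending.jsonl', lines + '\n')
}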

Custom Headers

The request hook will execute before each request:

// request(name, cb)
crawler.request('headers', async (context) => {
    // set custom headers
    Object.assign(context.req.headers, {
        'Cache-Control': 'no-cache',
        'User-Agent': userAgent(), // set random UserAgent
        Accept: '*/*',
        'Accept-Encoding': 'gzip, deflate, compress',
        Connection: 'keep-alive',
    })
})
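
The hook receives the outgoing request via context.req, so any header can be set there. A hedged variation for pages behind a login, where the SESSION_COOKIE environment variable is a placeholder you would supply yourself:

// request(name, cb)
crawler.request('auth', async (context) => {
    Object.assign(context.req.headers, {
        // placeholder: provide your own cookie value via the environment
        Cookie: process.env.SESSION_COOKIE ?? '',
        'User-Agent': userAgent(),
    })
})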

Routes

Instead of parsing everything in 'crawler.download()', you can split the parsing code into different routes:

crawler.route({
    url: 'https://github.com/trending', // for the trending page (URL patterns are matched with minimatch)
    async download({ req, res }) {
        // parsing ...
    },
})

crawler.route({
    url: 'https://github.com/*/*', // for repository pages
    async download({ req, res }) {
        // parsing ...
    },
})

crawler.route({
    url: 'https://github.com/*/*/issues', // for issues pages
    async download({ req, res }) {
        // parsing ...
    },
})
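
Routes pair naturally with followLinks: one route discovers links and enqueues them, and the matching route parses each followed page. A minimal sketch, assuming the route's download callback receives the same context (res, followLinks, resolveLink) as crawler.download() above:

crawler.route({
    url: 'https://github.com/trending',
    async download({ res, followLinks, resolveLink }) {
        if (!res) return
        let $ = cheerio.load(res.body.toString())
        // enqueue every repository link found on the trending page
        $('.Box-row .lh-condensed a').each(function () {
            let href = $(this).attr('href')
            if (href) followLinks(resolveLink(href))
        })
    },
})

crawler.route({
    url: 'https://github.com/*/*', // matches the followed repository pages
    async download({ res }) {
        if (!res) return
        let $ = cheerio.load(res.body.toString())
        console.log($('title').text().trim())
    },
})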

Proxy

You can provide a proxy getter when initializing the crawler:

let crawler = new Crawler({
    name: 'myCrawler',
    startUrls: ['https://github.com/trending'],
    async proxy() {
        let url = await getProxyUrlFromSomeWhere()
        // The return value will be used as a proxy when sending a request
        return url
    },
})
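
getProxyUrlFromSomeWhere() above is a placeholder. A minimal sketch that round-robins over a static pool (the proxy URLs are hypothetical; a real setup would query your proxy provider):

// hypothetical static pool; replace with your proxy provider's API
const proxyPool = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']
let nextProxy = 0

async function getProxyUrlFromSomeWhere(): Promise<string> {
    // rotate through the pool on each request
    let url = proxyPool[nextProxy % proxyPool.length]
    nextProxy += 1
    return url
}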