# Web Scraper

This is a simple web scraping helper module that I threw together to help me set up a web crawler whenever I needed one.

## Setup/configuration

The web scraper just needs a config object with the following keys:

  • name: The name of the crawler

  • type: The type of crawler (currently just WebCrawler; more types will be added over time)

  • params: Parameters for the crawl's starting point. Uses request-promise library params

  • delay: How many seconds to wait between page hops

  • settings: How to crawl the website; currently only RegExp objects are used. Supports both individual RegExp objects and arrays of them. Any link that is not followed or scraped will be ignored.

    • follow: Object/Array of regular expressions for links to follow ('click' on)
    • scrape: Object/Array of regular expressions for links to scrape data from
    • ignore: (OPTIONAL) Object/Array of regular expressions for links to ignore. These will not be checked against the scrape or follow rules.
  • parse: Directory of parsers defining what to scrape from each page. Example in the next section

  • output: Where to put the data when the crawl completes

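Putting it all together:
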
```js
const Scraper = require('web-crawl')

let exampleScraper = new Scraper({
    name: 'Example Crawler',
    type: 'WebCrawler',
    params: {
        uri: 'https://www.example.com',
        headers: {
            'User-Agent': 'Some Way To Identify Me'
        }
    },
    delay: 3,
    settings: {
        follow: new RegExp('https:\/\/www.example.com\/data'),
        scrape: new RegExp('\/data\/specific\/'),
        ignore: [new RegExp('comments'), new RegExp('about-us')]
    },
    parse: require('./ScraperModules'),
    output: require('scraper-writer')
})
exampleScraper.start()
```

## Parser Setup

Parsers are very simple modules that contain an xPath string and a process function.

  • xPath: xPath for selecting what to scrape
  • process: Function that parses the scraped result. The result is a wrapped response that has both extract() and extract_first() functions. The extract() function returns all matching results in an array, and the extract_first() function returns the first item of that array.

Your parser directory currently needs an index.js file, like the one below, that exports your parsers:

```js
module.exports = {
    name: require('./name.js'),
    description: require('./description.js')
}
```

Example parser file:

```js
module.exports = {
    xPath: '//h1[@id=\'huge-feature-box-title\']/text()',
    process: result => {
        return result.extract_first()
    }
}
```
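
For selectors that match more than one node, a parser can return every match instead of just the first. A minimal sketch (the selector here is made up for illustration):

```js
// Hypothetical parser that keeps all matches rather than only the first one.
module.exports = {
    xPath: '//ul[@id=\'feature-list\']/li/text()',
    process: result => {
        // extract() returns every matching result in an array
        return result.extract()
    }
}
```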

New: You can also use an array of xPath strings as the xPath value if you want more than one item parsed for a given file; see the sketch below.
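
A minimal sketch of the array form. The selectors are made up, and the assumption that process still receives a single wrapped result covering all matches is mine:

```js
// Hypothetical parser using the array form of xPath; how matches from the
// two selectors are combined in `result` is an assumption.
module.exports = {
    xPath: [
        '//h1[@id=\'huge-feature-box-title\']/text()',
        '//span[@id=\'huge-feature-box-subtitle\']/text()'
    ],
    process: result => {
        return result.extract()
    }
}
```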

## Output Setup

The output is just a simple module that has a write function. I have a basic file writer that can be used; users are welcome to create their own as well.

```js
let fs = require('fs')

module.exports = {
    write: data => {
        fs.writeFile('results.json', JSON.stringify(data, null, 1), err => {
            if (err)
                console.error(err)
        })
    }
}
```
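
To use a custom writer like the one above, point the config's output key at it instead of the packaged writer (the path here is hypothetical):

```js
let exampleScraper = new Scraper({
    // ...same config as in the setup example above...
    output: require('./MyWriter') // any module exposing a write(data) function
})
```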