# Web Scraper

This is a simple web scraping helper module that I threw together to help me set up a web crawler whenever I needed one.

## Setup/configuration

The web scraper just needs a config object with the following keys:

  • name: The name of the crawler

  • type: The type of crawler (currently just WebCrawler; more types will be added over time)

  • params: Parameters for the crawl's starting point. Uses request-promise library params

  • delay: How many seconds to wait between page hops

  • settings: How to crawl the website; currently only RegExp objects are used. Supports both individual RegExp objects and arrays of them. Any link that is not followed or scraped will be ignored.

    • follow: Object/Array of regular expressions for links to follow ('click' on)
    • scrape: Object/Array of regular expressions for links to scrape data from
    • ignore: (OPTIONAL) Object/Array of regular expressions for links to ignore. These will not be checked against the scrape or follow rules.
  • parse: Directory of parsers defining what to scrape from each page. Example in the next section

  • output: Where to put the data when the crawl completes

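Putting it all together:
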
```js
const Scraper = require('web-crawl')

let exampleScraper = new Scraper({
    name: 'Example Crawler',
    type: 'WebCrawler',
    params: {
        uri: 'https://www.example.com',
        headers: {
            'User-Agent': 'Some Way To Identify Me'
        }
    },
    delay: 3,
    settings: {
        follow: new RegExp('https:\/\/www.example.com\/data'),
        scrape: new RegExp('\/data\/specific\/'),
        ignore: [new RegExp('comments'), new RegExp('about-us')]
    },
    parse: require('./ScraperModules'),
    output: require('scraper-writer')
})
exampleScraper.start()
```

## Parser Setup

Parsers are very simple modules that contain an xPath string and a process function.

  • xPath: xPath for selecting what to scrape
  • process: Function that parses the scraped result. The result is a wrapped response that has both extract() and extract_first() functions. The extract() function returns all matching results in an array, and the extract_first() function returns the first item of that array.

Your parser directory currently needs an index.js file, like the one below, that exports your parsers:

```js
module.exports = {
    name: require('./name.js'),
    description: require('./description.js')
}
```

Example parser file:

```js
module.exports = {
    xPath: '//h1[@id=\'huge-feature-box-title\']/text()',
    process: result => {
        return result.extract_first()
    }
}
```
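
For selectors that match more than one node, a parser can return every match instead of just the first. A minimal sketch (the selector here is made up for illustration):

```js
// Hypothetical parser that keeps all matches rather than only the first one.
module.exports = {
    xPath: '//ul[@id=\'feature-list\']/li/text()',
    process: result => {
        // extract() returns every matching result in an array
        return result.extract()
    }
}
```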

New: You can also use an array of xPath strings as the xPath value if you want more than one item parsed for a given file; see the sketch below.
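
A minimal sketch of the array form. The selectors are made up, and the assumption that process still receives a single wrapped result covering all matches is mine:

```js
// Hypothetical parser using the array form of xPath; how matches from the
// two selectors are combined in `result` is an assumption.
module.exports = {
    xPath: [
        '//h1[@id=\'huge-feature-box-title\']/text()',
        '//span[@id=\'huge-feature-box-subtitle\']/text()'
    ],
    process: result => {
        return result.extract()
    }
}
```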

## Output Setup

The output is just a simple module that has a write function. I have a basic file writer that can be used; users are welcome to create their own as well.

```js
let fs = require('fs')

module.exports = {
    write: data => {
        fs.writeFile('results.json', JSON.stringify(data, null, 1), err => {
            if (err)
                console.error(err)
        })
    }
}
```
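
To use a custom writer like the one above, point the config's output key at it instead of the packaged writer (the path here is hypothetical):

```js
let exampleScraper = new Scraper({
    // ...same config as in the setup example above...
    output: require('./MyWriter') // any module exposing a write(data) function
})
```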