0.4.2 • Published 9 years ago

cheers2 v0.4.2

Cheers

Scrape a website efficiently, block by block, page by page.

Motivations

Cheers is a Cheerio-based scraper that extracts data from a website using CSS selectors. It aims to be a simple tool that divides a page into blocks and transforms each block into a JSON object.

Built on top of the excellent:

  • https://github.com/cheeriojs/cheerio
  • https://github.com/chriso/curlrequest
  • https://github.com/kriskowal/q

CSS mapping syntax inspired by:

https://github.com/dharmafly/noodle

Getting Started

Install the module with: npm install cheers

Usage

Configuration options:

  • config.url: the URL to scrape.
  • config.blockSelector: the CSS selector applied to the page to divide it into scraping blocks. This field is optional and defaults to "body".
  • config.scrape: the definition of what to extract from each block. Each key has two mandatory attributes: selector (a CSS selector, or "." to stay on the current node) and extract. The possible values for extract are "text", "html", "outerHTML", a RegExp, or the name of an attribute of the HTML element (e.g. "href").
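
As a rough illustration of the options above, here is a sketch of what a configuration object might look like. The URL, selectors and field names are made up for the example, and normalizeConfig is a hypothetical helper, not part of this package. It only checks the rules listed above and applies the documented "body" default. The call that hands the config to the library is not described in this section, so it is left out.

```javascript
// Illustrative config following the documented options.
// The URL, selectors and field names are assumptions for this example.
const config = {
  url: "https://example.com/articles",
  blockSelector: "article", // optional; "body" is used when omitted
  scrape: {
    title: { selector: "h2 a", extract: "text" },    // text content
    link: { selector: "h2 a", extract: "href" },     // attribute name
    year: { selector: ".date", extract: /\d{4}/ },   // RegExp
    raw: { selector: ".", extract: "outerHTML" }     // "." stays on the current node
  }
};

// Hypothetical helper (not provided by the package): validates the
// documented rules and fills in the default blockSelector.
function normalizeConfig(cfg) {
  if (typeof cfg.url !== "string" || cfg.url.length === 0) {
    throw new Error("config.url is required");
  }
  const scrape = cfg.scrape || {};
  for (const [key, rule] of Object.entries(scrape)) {
    if (!rule || typeof rule.selector !== "string") {
      throw new Error("scrape." + key + " needs a selector");
    }
    const validExtract =
      typeof rule.extract === "string" || rule.extract instanceof RegExp;
    if (!validExtract) {
      throw new Error("scrape." + key + " needs an extract (string or RegExp)");
    }
  }
  return { ...cfg, blockSelector: cfg.blockSelector || "body", scrape };
}

const normalized = normalizeConfig(config);
console.log(normalized.blockSelector);
```

With a config like this, each element matching blockSelector would be expected to yield one JSON object with the keys title, link, year and raw.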

Roadmap

  • Option to use request instead of curl
  • Option to change the user agent
  • Command line tool
  • Website pagination
  • Option to use a headless browser
  • Unit tests

Contributors

Cheers!

License

Copyright (c) 2014 Fabien Allanic
Licensed under the MIT license.