0.0.15 • Published 8 years ago

scrapr v0.0.15

Weekly downloads
5
License
ISC
Repository
github
Last release
8 years ago

Scrapr

"Why should I use this"?

There are websites that rely on javascript frameworks (like jQuery or AngularJS) and process dynamic data after the page load. For these type of websites, you should use a browser to interpert the javascript code and then get the data. Which is what this tool does: provides methods for this task using SlimerJS (Firefox's Gecko engine) in the background.

Installation

  $ npm install scrapr --save

Methods


getHtmlViaBrowser(url, loadImages)
  • url: string (required)
  • loadImages: bool (optional)

Opens a browser under the hood, waits for the page load and then gets the data. Returns a promise with a jQuery object ($). This is useful if the page relies on javascript and updates the html content after the page load.

var scrapr = require('scrapr');

// Opens a browser (loading images), goes to google, gets html tag, finds "feeling lucky button" and prints it
scrapr.getHtmlViaBrowser('http://www.google.com', true).then(function($){
    $('html').filter(function(){  
      var htmlTag = $(this);
      var luckyButton = htmlTag.find('input.lsb')[0];
      console.log($(luckyButton).attr('value'));
    });
}, function(err){
    console.log('Could not request page. Error: ' + err.message);
});

getHtmlViaRequest(url)
  • url: string (required)

Makes a direct GET request to the url and returns a promise with a jQuery object ($). This is useful if the page does not rely on javascript and updates the html content after the page load.

var scrapr = require('scrapr');

// Gets google's html tag, finds "feeling lucky button" and prints it
scrapr.getHtmlViaRequest('http://www.google.com').then(function($){
    $('html').filter(function(){  
        var htmlTag = $(this);
        var luckyButton = htmlTag.find('input.lsb')[0];
        console.log($(luckyButton).attr('value'));
    });
}, function(err){
    console.log('Could not request page. Error: ' + err.message);
});

parseListIntoSlices(arrayToParse, length)
  • arrayToParse: string (required)
  • length: number (required)

A helper function that splits a large array into slices with the specified length. Useful for throttling large amount of requests while doing parallel requests. For example: scraping 50 pages into slices of 5 with a minute interval for each slice.

var scrapr = require('scrapr');

// Creates an array of 50 elements and then split it into slices of 6
var largeArray = new Array();
for(var i = 0; i < 50; i++) {
  largeArray.push(i);
}

var slices = scrapr.parseListIntoSlices(largeArray, 6);
console.log(slices);

Author

Renan Caldas de Oliveira

0.0.15

8 years ago

0.0.14

8 years ago

0.0.13

8 years ago

0.0.12

8 years ago

0.0.11

8 years ago

0.0.10

8 years ago

0.0.9

8 years ago

0.0.8

8 years ago

0.0.7

8 years ago

0.0.6

8 years ago

0.0.5

8 years ago

0.0.4

8 years ago

0.0.3

8 years ago

0.0.2

8 years ago

0.0.1

8 years ago