google-crawler v0.1.0

Weekly downloads: 2 · License: ISC · Repository: bitbucket · Last release: 10 years ago

Google Crawler

This project is an effort to turn a publicly available paste into an NPM package. The original paste is available at the following URL:

It's an Express middleware that serves raw HTML to Google's crawler according to their specification:

It allows indexing of JavaScript-heavy applications (SPAs) by serving an HTML rendering of each page when it is requested with the special _escaped_fragment_ query parameter.

It relies on a PhantomJS backend to execute the frontend's JavaScript.
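Under Google's AJAX crawling scheme, a "pretty" URL such as http://example.com/#!/products is requested by the crawler as http://example.com/?_escaped_fragment_=/products. Here is a minimal sketch (illustrative only, not this package's actual internals) of how such a middleware can spot those requests and rebuild the original URL:

function isCrawlerRequest(req) {
  // Google's crawler rewrites #!/path into ?_escaped_fragment_=/path
  return typeof req.query._escaped_fragment_ !== 'undefined';
}

function originalUrl(req) {
  // Reconstruct the pretty URL the crawler is actually asking about
  var fragment = req.query._escaped_fragment_;
  return req.protocol + '://' + req.get('host') + req.path + '#!' + fragment;
}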

Installation

This module is available through NPM:

npm install --save google-crawler

Usage

var express = require('express');
var google_crawler = require('google-crawler');

var server = express();

// Register the middleware before your routes so crawler
// requests are intercepted first.
server.use(google_crawler({
  scraper: 'http://scraper.example.com/img/'
}));

// Continue setting things up..

On your frontend, you'll want to include the following element, which tells Google's crawler that the page supports the _escaped_fragment_ scheme even when its URL contains no #! fragment:

<meta name="fragment" content="!">

Configuration

The middleware accepts the following parameters (see the sketch below):

  • shebang: a boolean determining whether or not to build URLs with a shebang (#!).
  • scraper: a URL pointing to the PhantomJS backend.
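A sketch passing both options (the values shown are illustrative, not defaults):

server.use(google_crawler({
  shebang: true,                            // rebuild crawled URLs as /#!/path
  scraper: 'http://scraper.example.com/'    // PhantomJS backend (hypothetical URL)
}));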

Sample backend

PhantomJS backends are expected to be built with phantom-crawler:

Here's a sample crawler:

// Load the phantom-crawler library into the PhantomJS context.
phantom.injectJs('crawler/crawler.js');

new Crawler()
  .chrome()   // presumably identifies as a Chrome-like user agent
  .debug()    // presumably enables verbose logging
  .crawl(function () {

    // Runs in the page context once it has rendered: serialize
    // the resulting DOM back to an HTML string.
    return [
      '<!DOCTYPE html>',
      '<html>',
        document.head.outerHTML,
        document.body.outerHTML,
      '</html>'
    ].join('\n');

  })
  .serve(require('system').env.PORT || 8888);
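To make the flow concrete, here is a simplified, hypothetical sketch of the round trip: when a crawler request arrives, the middleware asks the PhantomJS backend to render the original page and relays the resulting HTML. The ?url= query convention below is an assumption for illustration, not this package's documented protocol.

var http = require('http');

// Hypothetical helper: ask the scraper backend to render pageUrl and
// hand back the serialized HTML (the ?url= convention is assumed).
function renderWithScraper(scraperUrl, pageUrl, callback) {
  http.get(scraperUrl + '?url=' + encodeURIComponent(pageUrl), function (res) {
    var html = '';
    res.on('data', function (chunk) { html += chunk; });
    res.on('end', function () { callback(null, html); });
  }).on('error', callback);
}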