# icrawl

v0.1.0 • Published 7 years ago
Crawl pages and generate HTML files whose save paths correspond to the URL paths.
## Features
- With nginx, you can enable SEO for front-end rendered pages
- Built-in server: you can crawl pages directly from your built output folder
- The HTML save path corresponds to the URL path
- Does not depend on any front-end framework
- Provides both a Node API and a command-line interface
## Examples

### Node API
```js
const path = require('path')
const Crawl = require('icrawl')

const crawl = new Crawl({
  requestTimeout: 10000,
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')
})

crawl.start()
```

## Configuration
`.icrawlrc.js` in your project root:
```js
const path = require('path')

module.exports = {
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')
}
```

`package.json`:
"scripts": {
"build": "icrawl"
}options
- `options` <Object>
  - `viewport` <Object> viewport size
    - `width` <Number>
    - `height` <Number>
  - `maxPageCount` <Number> Number of pages that can be opened in parallel, default: `10`
  - `isNormalizeSourceURL` <Boolean | Object> Whether to convert the relative paths of images, anchors, links, and scripts in the crawled HTML to absolute paths. For example, when the crawled page URL is `http://www.example.com/example`, `/favicon.ico` becomes `http://www.example.com/favicon.ico`. Each kind can also be set individually. default: `false`
    - `links` <Boolean>
    - `images` <Boolean>
    - `scripts` <Boolean>
    - `anchors` <Boolean>
  - `requestTimeout` <Number> Request timeout in milliseconds, default: `30000`; set to `0` to wait indefinitely
  - `host` <String> default: `''`
  - `routes` <Array<String>> The list of routes to crawl; relative paths require the `host` option to be set
  - `outputPath` <String> Directory where the HTML files are saved
  - `saveHTML` <Boolean> Whether to save each crawled page as HTML, default: `true`
  - `depth` <Number | Object> Page depth. If it is a <Number>: page A is configured in `routes` (depth: 0), page A contains a link to page B (depth: 1), and page B contains a link to page C (depth: 2). default: `0`
    - `value` <Number> page depth
    - `include` <RegExp> Links to include, default: `null`
    - `exclude` <RegExp> Links to exclude, default: `null`
    - `after` <Function(Array<PageRoute>)> Callback invoked after page link collection is complete, default: `null`
  - `serverConfig` <String | Object> If the pages to crawl are not already served, specify this option to start a local server. If it is a <String>, it is the directory where the pages are located. default: `null`
    - `path` <String> The directory where the pages are located, e.g. your `build` directory; you can then run `icrawl` after the `build` command, or chain the two commands in `scripts`
    - `port` <Number> default: `3333`
    - `public` <String> Must be specified when `isNormalizeSourceURL` is `true`; relative paths are converted relative to this option
    - `isFallback` <Boolean> For SPAs, always rewrite the requested location to `index.html`
  - `requestInterception` <Object> Filter requests; used well, this speeds up crawling. For example, there is usually no need to wait for images, CSS, fonts, or third-party scripts to load, since most of the time only the rendered HTML needs to be saved
    - `include` <RegExp>
    - `exclude` <RegExp>
  - `progressBarStyle` <Object> Progress bar style
    - `prefix` <String> default: `''`
    - `suffix` <String> default: `''`
    - `remaining` <String> default: `'░'`
    - `completed` <String> default: `'█'`
### crawl.start()

- returns: <Promise>
### PageRoute

- `url` <String> The page URL to crawl
- `root` <PageRoute> The root of the chain
- `referer` <PageRoute> The parent of this URL
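To make the `root`/`referer` chain concrete, here is a minimal sketch (the factory function and field handling are illustrative assumptions, not icrawl's internals) of how a link collected from a configured route relates to its parent:

```javascript
// Illustrative sketch of the PageRoute chain; not icrawl's internal code.
// A route from `routes` has depth 0; links found on it get depth 1, and so on.
function makePageRoute(url, referer = null) {
  return {
    url,                                  // the page url to crawl
    referer,                              // parent PageRoute (null for roots)
    root: referer ? referer.root : null   // roots point to themselves, set below
  }
}

const a = makePageRoute('https://example.com/a')     // configured in routes
a.root = a                                           // a root is its own root
const b = makePageRoute('https://example.com/b', a)  // link found on page A

console.log(b.referer.url)  // https://example.com/a
console.log(b.root === a)   // true
```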
## Tips
- By configuring nginx, you can enable SEO for front-end rendered pages.
- If you use nginx, you will need to install the set-misc-nginx-module module, or install OpenResty directly.
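As a sketch of the nginx idea: because icrawl saves each HTML file at a path matching its URL, nginx can serve the pre-rendered file when it exists and fall back to the SPA entry point otherwise. The directory below is an assumption standing in for your `outputPath`; adapt it to your setup:

```nginx
# Illustrative only: serve icrawl's pre-rendered HTML when available.
# /var/www/static is assumed to be icrawl's outputPath.
location / {
    root /var/www/static;
    # Try the pre-rendered file first, then the directory's index.html,
    # then fall back to the SPA entry point.
    try_files $uri $uri/index.html /index.html;
}
```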