icrawl v0.1.0 • MIT • last released 5 years ago

icrawl

Crawl pages and save the rendered HTML to paths that mirror the URL paths.

Features

  • With nginx, you can serve the pre-rendered pages to search engines, enabling SEO for front-end-rendered apps.
  • Built-in static server, so you can crawl pages directly from your build folder
  • The HTML save path mirrors the URL path
  • Does not depend on any front-end framework
  • Provides both a Node API and a command-line interface

Examples

Node API

const path = require('path')
const Crawl = require('icrawl')

const crawl = new Crawl({
  requestTimeout: 10000,                  // fail a request after 10s
  isNormalizeSourceURL: true,             // rewrite relative asset paths to absolute URLs
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static') // save crawled HTML under ./static
})
crawl.start()
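The isNormalizeSourceURL: true above rewrites relative asset paths against the crawled page's URL. The effect is equivalent to standard URL resolution, sketched here with Node's built-in URL class (this shows the effect, not icrawl's internals):

```javascript
// Resolve a relative asset path against the crawled page's URL,
// as isNormalizeSourceURL does for links, images, scripts and anchors.
const normalize = (relative, pageURL) => new URL(relative, pageURL).href

console.log(normalize('/favicon.ico', 'http://www.example.com/example'))
// http://www.example.com/favicon.ico
```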

Configuration

Create .icrawlrc.js in your project root:

const path = require('path')

module.exports = {
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',  
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')
}

package.json

"scripts": {
  "build": "icrawl"
}
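When icrawl crawls your own build output (via the serverConfig option below), the build and crawl steps can be chained in scripts. The build tool here is illustrative:

```json
"scripts": {
  "build": "webpack --mode production && icrawl"
}
```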

options

  • options <Object>
    • viewport <Object> viewport size
      • width <Number>
      • height <Number>
    • maxPageCount <Number> Number of pages that can be opened in parallel, default: 10
    • isNormalizeSourceURL <Boolean | Object> Whether to convert the relative paths of images, anchors, links, and scripts in the crawled HTML to absolute paths. For example, when the crawled page URL is http://www.example.com/example, /favicon.ico is rewritten to http://www.example.com/favicon.ico. Each kind can also be set individually. default: false
      • links <Boolean>
      • images <Boolean>
      • scripts <Boolean>
      • anchors <Boolean>
    • requestTimeout <Number> Number of milliseconds for request timeout, default: 30000ms, set to 0 to wait indefinitely
    • host <String> default: ''
    • routes <Array<String>> The list of routes to be crawled; relative paths require the host option to be set
    • outputPath <String> Directory where the crawled HTML is saved
    • saveHTML <Boolean> Whether to save the crawl page as html, default: true
    • depth <Number | Object> Crawl depth. Pages listed in routes are at depth 0; if page A (depth 0) links to page B, B is at depth 1, and a link from B to page C puts C at depth 2. default: 0
      • value <Number> page depth
      • include <RegExp> Only follow links matching this pattern, default: null
      • exclude <RegExp> Skip links matching this pattern, default: null
      • after <Function(Array<PageRoute>)> Callback invoked after page-link collection completes, default: null
    • serverConfig <String | Object> If the pages to be crawled are not already served somewhere, specify this option to start a local server. If it is a String, it is the directory where the pages are located. default: null
      • path <String> The directory where the pages are located, for example your build output directory; you can then run icrawl after your build command, or chain the two commands in scripts
      • port <Number> default: 3333
      • public <String> Required when the isNormalizeSourceURL option is enabled at the same time; relative paths are converted relative to this value
      • isFallback <Boolean> For SPAs, always rewrite the requested location to index.html
    • requestInterception <Object> Filter requests; used sensibly, this speeds up crawling. For example, there is usually no need to wait for images, CSS, fonts, or third-party scripts to load when all you need is the rendered HTML
      • include <RegExp>
      • exclude <RegExp>
    • progressBarStyle <Object> Progress bar style
      • prefix <String> default: ''
      • suffix <String> default: ''
      • remaining <String> default: '░'
      • completed <String> default: '█'

crawl.start()

Returns: Promise, resolved when the crawl completes

PageRoute

  • url <String> The URL of the page to crawl
  • root <PageRoute> The root of the referer chain
  • referer <PageRoute> The parent route that linked to this page
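As an illustration of how the referer chain fits together, the sketch below walks a PageRoute back to its root. The chain here is hand-built for the example; icrawl constructs these objects itself during link collection:

```javascript
// Walk a PageRoute's referer chain from the root down to the route itself.
function chainOf (route) {
  const urls = [route.url]
  let current = route
  while (current.referer) {   // follow parent links until the root
    current = current.referer
    urls.unshift(current.url)
  }
  return urls
}

// A tiny hand-built chain: root -> child -> grandchild
const root = { url: 'https://example.com/', root: null, referer: null }
const child = { url: 'https://example.com/a', root, referer: root }
const grandchild = { url: 'https://example.com/a/b', root, referer: child }

console.log(chainOf(grandchild))
// ['https://example.com/', 'https://example.com/a', 'https://example.com/a/b']
```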

Tips

  • By configuring nginx to serve the crawled HTML, you can enable SEO for front-end-rendered pages.
  • If you use nginx, you will need to install the set-misc-nginx-module module, or install OpenResty directly.

License

MIT licensed.
