# icrawl

v0.1.0 • Published 7 years ago
Crawl pages and generate HTML files whose save paths correspond to the URL paths.
## Features
- With nginx, you can enable SEO for front-end rendered pages
- Built-in server: you can crawl pages directly from your built output folder
- The HTML save path corresponds to the URL path
- Does not depend on any front-end framework
- Provides both a Node API and a command-line interface
## Examples

### Node API
```js
const path = require('path')
const Crawl = require('icrawl')

const crawl = new Crawl({
  requestTimeout: 10000,
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')
})

crawl.start()
```

## Configuration
`.icrawlrc.js` in your project root:
```js
const path = require('path')

module.exports = {
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')
}
```

`package.json`:
"scripts": {
"build": "icrawl"
}options
- `options` <Object>
  - `viewport` <Object> viewport size
    - `width` <Number>
    - `height` <Number>
  - `maxPageCount` <Number> Number of pages that can be opened in parallel, default: `10`
  - `isNormalizeSourceURL` <Boolean | Object> Whether to convert the relative paths of images, anchors, links, and scripts in the crawled HTML to absolute paths. For example, when the crawled page URL is `http://www.example.com/example`, `/favicon.ico` becomes `http://www.example.com/favicon.ico`. Each kind can also be set individually. default: `false`
    - `links` <Boolean>
    - `images` <Boolean>
    - `scripts` <Boolean>
    - `anchors` <Boolean>
  - `requestTimeout` <Number> Request timeout in milliseconds, default: `30000`; set to `0` to wait indefinitely
  - `host` <String> default: `''`
  - `routes` <Array<String>> The list of routes to crawl; relative paths require the `host` option to be set
  - `outputPath` <String> Directory where the HTML files are saved
  - `saveHTML` <Boolean> Whether to save each crawled page as HTML, default: `true`
  - `depth` <Number | Object> Page depth. If it is a <Number>: page A is configured in `routes` (depth: 0), page A contains a link to page B (depth: 1), and page B contains a link to page C (depth: 2). default: `0`
    - `value` <Number> page depth
    - `include` <RegExp> Links to include, default: `null`
    - `exclude` <RegExp> Links to exclude, default: `null`
    - `after` <Function(Array<PageRoute>)> Callback invoked after page link collection is complete, default: `null`
  - `serverConfig` <String | Object> If the pages to crawl are not already served, specify this option to start a local server. If it is a <String>, it is the directory where the pages are located. default: `null`
    - `path` <String> The directory where the pages are located, e.g. your `build` directory; you can then run `icrawl` after the `build` command, or chain the two commands in `scripts`
    - `port` <Number> default: `3333`
    - `public` <String> Must be specified when `isNormalizeSourceURL` is `true`; relative paths are converted relative to this option
    - `isFallback` <Boolean> For SPAs, always rewrite the requested location to `index.html`
  - `requestInterception` <Object> Filter requests; used well, this speeds up crawling. For example, there is usually no need to wait for images, CSS, fonts, or third-party scripts to load, since most of the time only the rendered HTML needs to be saved
    - `include` <RegExp>
    - `exclude` <RegExp>
  - `progressBarStyle` <Object> Progress bar style
    - `prefix` <String> default: `''`
    - `suffix` <String> default: `''`
    - `remaining` <String> default: `'░'`
    - `completed` <String> default: `'█'`
### crawl.start()

- returns: <Promise>
### PageRoute

- `url` <String> The page URL to crawl
- `root` <PageRoute> The root of the chain
- `referer` <PageRoute> The parent of this URL
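To make the `root`/`referer` chain concrete, here is a minimal sketch (the factory function and field handling are illustrative assumptions, not icrawl's internals) of how a link collected from a configured route relates to its parent:

```javascript
// Illustrative sketch of the PageRoute chain; not icrawl's internal code.
// A route from `routes` has depth 0; links found on it get depth 1, and so on.
function makePageRoute(url, referer = null) {
  return {
    url,                                  // the page url to crawl
    referer,                              // parent PageRoute (null for roots)
    root: referer ? referer.root : null   // roots point to themselves, set below
  }
}

const a = makePageRoute('https://example.com/a')     // configured in routes
a.root = a                                           // a root is its own root
const b = makePageRoute('https://example.com/b', a)  // link found on page A

console.log(b.referer.url)  // https://example.com/a
console.log(b.root === a)   // true
```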
## Tips
- By configuring nginx, you can enable SEO for front-end rendered pages.
- If you use nginx, you will need to install the set-misc-nginx-module module, or install OpenResty directly.
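As a sketch of the nginx idea: because icrawl saves each HTML file at a path matching its URL, nginx can serve the pre-rendered file when it exists and fall back to the SPA entry point otherwise. The directory below is an assumption standing in for your `outputPath`; adapt it to your setup:

```nginx
# Illustrative only: serve icrawl's pre-rendered HTML when available.
# /var/www/static is assumed to be icrawl's outputPath.
location / {
    root /var/www/static;
    # Try the pre-rendered file first, then the directory's index.html,
    # then fall back to the SPA entry point.
    try_files $uri $uri/index.html /index.html;
}
```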