1.0.3 • Published 8 years ago
configurable-html-parser v1.0.3
HTML Body Parser
Parses an HTML document according to a configuration object.
Install
npm install configurable-html-parser
Usage
Parse my personal GitHub repositories as an example.
var request = require('request');
var parser = require('../lib/parser.js');

request('https://github.com/Herobs?tab=repositories', function(err, res, body) {
  console.log(parser(body, {
    author: {
      selector: 'h1.vcard-names > span.vcard-username'
    },
    repositories: {
      selector: 'ul.repo-list.js-repo-list > li',
      children: {
        name: {
          selector: 'h3.repo-list-name',
          regexp: /<a[\s\S]*?>([\s\S]*?)<\/a>/
        },
        url: {
          selector: 'h3.repo-list-name',
          regexp: /<a\s+href="([\s\S]*?)">([\s\S]*?)<\/a>/
        },
        desc: {
          selector: 'p.repo-list-description'
        }
      }
    }
  }));
});
The parser returns an object whose structure matches the configuration object. The first parameter is the HTML body to be parsed, and the second is the configuration object. The configuration rules are as follows:
- Each key-value pair produces one output item.
- selector is the CSS selector applied to the current DOM context.
- regexp is a regular expression used to extract the exact data.
- childMatch is the regular expression child match (capture group) to return. If it is not specified and there is a child match, the first child match is returned; otherwise the full match string is returned.
- children is a context flag. If specified, the current context is passed to the children, which are parsed recursively.
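The regexp and childMatch rule can be illustrated with a small standalone sketch. This is not the package's source: the helper name extract is hypothetical, and it assumes childMatch is a numeric capture-group index, which the description suggests but does not state outright.

```javascript
// Hypothetical helper illustrating the regexp / childMatch rule.
// Assumption: childMatch is a numeric index into the match array.
function extract(html, regexp, childMatch) {
  var match = regexp.exec(html);
  if (!match) {
    return null; // nothing matched
  }
  if (childMatch !== undefined) {
    return match[childMatch]; // explicitly requested child match
  }
  // Default: first child match if the pattern has one, else the full match.
  return match.length > 1 ? match[1] : match[0];
}

var link = '<a href="/Herobs/body-parser">body-parser</a>';
var urlRule = /<a\s+href="([\s\S]*?)">([\s\S]*?)<\/a>/;

console.log(extract(link, urlRule));     // first child match: the href value
console.log(extract(link, urlRule, 2));  // second child match: the link text
console.log(extract(link, /body-\w+/));  // no groups: the full match string
```

Under this reading, the url rule in the usage code returns the href because it is the first child match, and setting childMatch to 2 on the same pattern would return the link text instead.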
Example output of the usage code above:
{
  author: 'Herobs',
  repositories: [{
    name: 'body-parser',
    url: '/Herobs/body-parser',
    desc: 'html body parser using a configuration object.'
  }, {
    name: 'GhostTheme',
    url: '/Herobs/GhostTheme',
    desc: 'Theme for Ghost'
  }]
}
License
This package is released under the MIT license.