0.2.1 • Published 8 years ago
html-explorer v0.2.1
html-explorer - HTML page explorer
html-explorer extracts main information from a HTML page.
Currently it extracts:
- Page meta:
title
description
keywords
canonical
feeds
- Main images - a ordered list of images;
- Main videos - a ordered list of videos;
- Page content - main page content/article;
- Page encoding;
Usage
var explorer = require('html-explorer');
explorer.explore('http://edition.cnn.com/')
.then(function(page){
// page object
});
Result structure
url
(String) - inputurl
param;href
(String) - server response url;canonical
(String) - page canonical;title
(String);description
(String);keywords
(String);content
(String);encoding
(String): utf8, windows-1251, iso-8859-2, etc.;feeds
(Feed) - list of feeds:title
(String);href
(String) - feed url;
images
(Image) - a list of images:src
(String) - image src;viewWidth
(Number) - image view width if founded;viewHeight
(Number);width
(Number) - real image width;height
(Number);alt
(String);title
(String);rating
(Number) - count of words matching page title words;type
(String) - (only ifidentify
option is true) - can be:bmp
,gif
,jpg
,png
,psd
,svg
,tiff
orwebp
;data
(Buffer) - (only ifidentify
option is true) - image data.
videos
(Video) - a list of videos:sourceType
(String) - video source type:URL
,YOUTUBE
,VIMEO
orIFRAME
;sourceId
(String) - depends ofsourceType
: url or source id;width
(Number) - video width;height
(Number) - video height;
API
explorer.explore(url, [options])
Explores an url.
Options
page
- html page options:timeout
(Number) 5000 - request timeout;headers
(Object) {}- request headers;canonical
(Boolean) true - find or not;feeds
(Boolean|Function) - find or not, function for validating a feed;validator
(Function) noop - Validates page after exploring info, throw an error if invalid;html
(Boolean|String) false - Return HTML text or not. If is string it will be used as remote HTML body;lang
(String) - page language 2 chars code;
content
(Boolean|Object) - content options:filter
(Boolean|Object):minLine
: (Number) 50 - accepted minimum line length;minPhrase
: (Number) 100 - accepted minimum phrase length;phraseEndRegex
: (Regex) default: /.!?:;¡¿%$/ - end phrase puctuation regex;phraseEnd
: (Boolean) false - require phrase to end with a puctuation;maxInvalidLines
: (Number) 3 - maximum consecutive invalid lines;minScore
: (Number) 0.3 - min in text search score: 0 to 1;
images
(Boolean|Object) - images explorer options:limit
(Number) 5 - maximum number of images to return;filter
(Object):minViewHeight
(Number) 180 - accepted minimum image view height;minViewWidth
(Number) 220 - accepted minimum image view width;minHeight
(Number) 200 - accepted minimum image height;minWidth
(Number) 250 - accepted minimum image width;minRating
(Number) 0 - accepted minimum image rating(...);minRatio
(Number) null - accepted minimum image ratio (ratio
=width
/height
);maxRatio
(Number) null - accepted maximum image ratio;invalidRatio
(Number | Number) 1 - example: value 1 will exclude all images with width=height;invalidExtensions
(String) gif, png - invalid image extensions;src
(RegExp) see source code - invalidate image by SRC;extraSrc
(RegExp) - invalidate image by SRC;cssClass
(RegExp) - filter image by its css class;types
(String|String) - accepted image types (bmp
,gif
,jpg
,png
,psd
,svg
,tiff
,webp
), default:['jpg']
;invalidTypes
(String|String) - invalid image types;
identify
(Boolean) false - identify imagewidth
,height
andtype
by downloading data;data
(Boolean) false - set imagedata
property. Works only ifidentify
is true.timeout
(Number) 1000 - image downloading timeout, in ms.
video
(Boolean|Object) - video explorer options:limit
(Number) 1 - maximum number or videos to return;filter
(Object):minHeight
(Number) 200 - accepted minimum image height;minWidth
(Number) 250 - accepted minimum image width;minRatio
(Number) null - accepted minimum image ratio (ratio
=width
/height
);maxRatio
(Number) null - accepted maximum image ratio;invalidRatio
(Number | Number) 1 - example: value 1 will exclude all images with width=height;src
(RegExp) see source code - invalidate image by SRC;extraSrc
(RegExp) - invalidate image by SRC;
priority
(String) - video source type priority - default:['YOUTUBE', 'VIMEO', 'URL', 'IFRAME']
;customFinders
(Finder) - a list of custom video fiders.
Changelog
v0.1.12 - July 16, 2016
- filter page content by relevancy score option;
- added
lang
option; - using ascripe module instead of readability-js;
- using in-text-search module;
v0.1.11 - August 16, 2016
- find videos from known iframes
v0.1.9 - August 15, 2015
- explore content with
readability-js
- fix videos explore bug
v0.1.6 - August 3, 2015
- explore videos from microdata
v0.1.5 - August 3, 2015
- filter page content
- better encoding detection & add to the response object
v0.1.4 - August 2, 2015
- tests
- extracting page content
- editorconfig, eslint
v0.1.2 - June 17, 2015
- custom video finders
- sort videos by priority option
- head(og:video) video finder
v0.1.1 - June 13, 2015
- decode page urls
- image downloading timeout
v0.1.0 - May 30, 2015
- detect embedded videos
- better images order
v0.0.8 - May 29, 2015
- detect charset from content-type response header
- image filter:
invalidRatio
v0.0.7 - May 22, 2015
- filter images by view size - width & heigth detected in image attributes
- merge images with same src
0.2.1
8 years ago
0.2.0
8 years ago
0.1.12
9 years ago
0.1.11
9 years ago
0.1.10
9 years ago
0.1.9
10 years ago
0.1.8
10 years ago
0.1.7
10 years ago
0.1.6
10 years ago
0.1.5
10 years ago
0.1.4
10 years ago
0.1.2
10 years ago
0.1.1
10 years ago
0.1.0
10 years ago
0.0.8
10 years ago
0.0.7
10 years ago
0.0.6
10 years ago
0.0.5
10 years ago
0.0.4
10 years ago
0.0.3
10 years ago
0.0.2
10 years ago