promise-parser v0.0.4
#promise-parser
Promise-based HTML/XML parser and web scraper for Node.js.
##Features
- Fast: uses libxml C bindings
- Lightweight: no dependencies like jQuery, cheerio, or jsdom
- Clean: promise-based interface, no more nested callbacks
- Flexible: supports both CSS and XPath selectors
##Example
    var pp = require('promise-parser');
    var parser = new pp();

    // scrape all craigslist listings
    parser
        .get('www.craigslist.org/about/sites')
        .find('h1 + div a')
        .set('location')
        .follow('@href')
        .find('header + table a')
        .set('category')
        .follow('@href')
        .find('p > a')
        .follow('@href', { next: '.button.next' })
        .set({
            'title': 'section > h2',
            'description': '#postingbody',
            'subcategory': 'div.breadbox > span[4]',
            'time': 'time@datetime',
            'latitude': '#map@data-latitude',
            'longitude': '#map@data-longitude',
            'images[]': 'img@src'
        })
        .data(function(listing) {
            // do something with listing data
        })
##Install
npm install promise-parser
##Usage
new promise-parser([opts])
###opts [object]
- opts.http object - HTTP options given to needle instance
- opts.http.timeout int - Timeout in milliseconds
- opts.http.proxy string - Forward requests through HTTP(s) proxy
- opts.http.concurrency int - Number of simultaneous HTTP requests
- opts.http.tries int - Number of tries before giving up on a request
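
As a rough sketch of how these options fit together, the snippet below builds an `opts` object that sets every documented `opts.http` field. The `buildOptions` helper, the numeric values, and the proxy placeholder are assumptions made for illustration, not defaults taken from the library. The constructor call is left as a comment so the snippet runs without the package installed.

```javascript
// Hypothetical helper: merge caller overrides into an illustrative opts object.
// Field names come from the list above; every value here is an assumption.
function buildOptions(overrides) {
  const http = Object.assign(
    {
      timeout: 30000,   // timeout in milliseconds
      proxy: null,      // e.g. 'http://localhost:8080' (placeholder)
      concurrency: 5,   // number of simultaneous HTTP requests
      tries: 3          // number of tries before giving up on a request
    },
    overrides || {}
  );
  return { http: http };
}

const opts = buildOptions({ timeout: 10000, tries: 2 });
console.log(opts);

// With the package installed, the object would be passed to the constructor:
//   var pp = require('promise-parser');
//   var parser = new pp(opts);
```

Keeping all HTTP settings under one `http` key mirrors the option list above, where every documented field lives under `opts.http`.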
##Promises
####.parse(string)
Parse an HTML or XML string
####.get(url)
HTTP GET request
####.post(url, data)
HTTP POST request
####.find(selector, opts)
Find elements based on `selector` within the current context
####.follow(selector, opts)
Follow URLs found within the element text or attribute
####.set(args)
Find and set values for `context.data`

    // set 'title' to current element text
    pp.set('title')

    // set 'title' to text of 'a.title'
    pp.set('title', 'a.title')

    // set multiple
    pp.set({
        // set 'title' to text of 'a.title'
        'title': 'a.title',
        // set 'description' to text of 'p.description'
        'description': 'p.description',
        // set 'url' to 'a.permalink' href attribute
        'url': 'a.permalink @href',
        // set 'images[]' to the 'src' attribute of each '<img>'
        'images[]': 'img @src',
    });
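
To make the notation in the `.set` examples concrete, here is a small sketch of how a value such as `'a.permalink @href'` can be read as a selector plus an attribute, and how a key such as `'images[]'` marks an array. The functions `splitSelector` and `splitKey` are hypothetical and only illustrate the notation; they are not the library's parsing code, and they do not handle XPath expressions that contain `@`.

```javascript
// Illustrative only: split a `.set` value such as 'a.permalink @href'
// into its CSS selector and optional attribute name.
function splitSelector(value) {
  const at = value.lastIndexOf('@');
  if (at === -1) {
    // No '@': the value is a plain selector and the element text is used.
    return { selector: value.trim(), attribute: null };
  }
  return {
    selector: value.slice(0, at).trim(),
    attribute: value.slice(at + 1).trim()
  };
}

// Illustrative only: a key ending in '[]' is read as "collect every match".
function splitKey(key) {
  const isArray = key.slice(-2) === '[]';
  return { name: isArray ? key.slice(0, -2) : key, isArray: isArray };
}

console.log(splitSelector('a.permalink @href'));
console.log(splitSelector('a.title'));
console.log(splitKey('images[]'));
```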
####.then(callback(next))
Calls `callback` from the context of the current element. To continue, the callback must call `next([context])` at least once. The `context` argument can optionally be a new context.

    pp.then(function(next) {
        var links = this.find('a');
        this.log('found '+links.length+' links');
        links.forEach(function(link) {
            next(link);
        });
    })
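
The effect of calling `next` more than once can be pictured with a toy model. The sketch below is not the library's implementation: `runThen` is a hypothetical function, plain objects stand in for libxmljs elements, and it assumes that calling `next()` with no argument forwards the current context unchanged.

```javascript
// Toy model of one `.then` step: run `callback` once per incoming context and
// collect every context handed to `next` for the following step.
function runThen(contexts, callback) {
  const forwarded = [];
  contexts.forEach(function (context) {
    const next = function (newContext) {
      // Assumption: next() without an argument keeps the current context.
      forwarded.push(newContext === undefined ? context : newContext);
    };
    // `this` inside the callback is the current context.
    callback.call(context, next);
  });
  return forwarded;
}

// Plain objects stand in for libxmljs elements here.
const page = { name: 'page', links: [{ name: 'a1' }, { name: 'a2' }] };

const out = runThen([page], function (next) {
  // Calling next once per link forks the chain into one branch per link.
  this.links.forEach(function (link) {
    next(link);
  });
});

console.log(out.map(function (c) { return c.name; }));
```

In this model, a callback that never calls `next` forwards nothing, which matches the requirement that `next` be called at least once for the chain to continue.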
#####context
The `this` value of the `.then` callback function is set to the current context. The context is a libxmljs `Element` object representing the current HTML/XML element. In addition to all of the libxmljs `Element` functions, each context also supports these functions:
- context.request(url, data, callback(context))
- context.post(url, data, callback(context))
- context.log(msg)
- context.debug(msg)
- context.error(msg)
- context.data object
####.data(callback(data))
Get data stored in context.data
####.done(callback)
Calls `callback` when parsing has completely finished
####.log(callback(msg))
Calls `callback` when any log messages are received
####.error(callback(msg))
Calls `callback` when any error messages are received
####.debug(callback(msg))
Calls `callback` when any debug messages are received
##CSS helpers
These CSS helper selectors are provided to simplify complex CSS expressions and to add jQuery-like functionality.
####:contains(string)
Select elements whose contents contain string
####:starts-with(string)
Select elements whose contents start with string
####:ends-with(string)
Select elements whose contents end with string
####:first
Select first element (shortcut for `:first-of-type`)
####:first(n), :limit(n)
Select first `n` elements
####:last
Select last element (shortcut for `:last-of-type`)
####:last(n)
Select last `n` elements
####:even
Select even elements
####:odd
Select odd elements
####:skip(n), :skip-first(n)
Skip first `n` elements
####:skip-last(n)
Skip last `n` elements
####:range(n1, n2)
Select elements `n1` through `n2`, inclusive
####[n]
Select `n`th element (shortcut for `:nth-of-type`)
####@attribute
Select attribute
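
The positional helpers above (`:first(n)`, `:last(n)`, `:skip(n)`, `:skip-last(n)`, `:range(n1, n2)`) can be pictured as operations on an ordered list of matched elements. The sketch below models them with plain arrays. It is an approximation rather than the library's selector engine: the `helpers` object is hypothetical, all matches are assumed to share one parent, and positions in `range` are assumed to be 1-based as in CSS `:nth-of-type`.

```javascript
// Illustrative array equivalents of the positional helpers.
// Assumption: `range` positions are 1-based and inclusive.
const helpers = {
  // :first and :first(n), also :limit(n)
  first: function (items, n) {
    return items.slice(0, n === undefined ? 1 : n);
  },
  // :last and :last(n)
  last: function (items, n) {
    const count = n === undefined ? 1 : n;
    return count > 0 ? items.slice(-count) : [];
  },
  // :skip(n), also :skip-first(n)
  skip: function (items, n) {
    return items.slice(n);
  },
  // :skip-last(n)
  skipLast: function (items, n) {
    return n > 0 ? items.slice(0, -n) : items.slice();
  },
  // :range(n1, n2)
  range: function (items, n1, n2) {
    return items.slice(n1 - 1, n2);
  }
};

const items = ['a', 'b', 'c', 'd', 'e'];
console.log(helpers.first(items, 2));
console.log(helpers.last(items, 2));
console.log(helpers.skip(items, 2));
console.log(helpers.skipLast(items, 2));
console.log(helpers.range(items, 2, 4));
```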
##Dependencies