1.2.6 • Published 2 years ago

chowdown v1.2.6

chowdown

A JavaScript library that allows for the quick transformation of DOM documents into useful formats.

Installation

$ npm install chowdown

Basic Usage

Let's suppose there's a webpage, http://somewebpage.com, with the following markup:

<div>
  <div class="author">
    <a href="/dennis" class="name">Dennis Reynolds</a>
    <span class="age">41</span>
    <img src="dennis.jpg"/>
    <div class="book">
      <span class="title">The Dennis System</span>
      <span class="year">2009</span>
    </div>
    <div class="book">
      <span class="title">Chardee MacDennis: A Guide</span>
      <span class="year">2011</span>
    </div>
  </div>
  <div class="author">
    <a href="/stephen" class="name">Stephen King</a>
    <span class="age">69</span>
    <img src="stephen.jpg"/>
    <div class="book">
      <span class="title">Clown Town</span>
      <span class="year">1990</span>
    </div>
  </div>
  <a class="next" href="/search?page=2"/>
</div>

To quickly pull out the name and age of each author into an array of objects, we can do the following:

const chowdown = require('chowdown');

// Returns a promise
chowdown('http://somewebpage.com')
  .collection('.author', {
    name: '.name',
    age: '.age'
  });

This will resolve to:

[
  { name: 'Dennis Reynolds', age: '41'},
  { name: 'Stephen King', age: '69'}
]

When executed, all chowdown queries return an instance of a bluebird Promise.

Attributes

Chowdown is built on top of cheerio and hence it uses the familiar jQuery selector format. However, chowdown's selectors also make it possible to get a DOM element's attribute by appending the attribute's name to the end of a selector (following a /).

This makes getting the src attribute of each author's image easy:

chowdown('http://somewebpage.com')
  .collection('.author', {
    name: '.name',
    age: '.age',
    image: 'img/src'
  });

This will resolve to:

[
  { name: 'Dennis Reynolds', age: '41', image: 'dennis.jpg'},
  { name: 'Stephen King', age: '69', image: 'stephen.jpg'}
]

If no attribute is specified in the selector for simple types of queries (i.e. string or number queries), then chowdown will automatically grab an element's inner text.
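
The selector/attribute convention can be pictured as a split on the final slash. The helper below, splitSelector, is a hypothetical illustration and not part of chowdown's API; chowdown's own parsing may differ, particularly for selectors that themselves contain a slash.

```javascript
// Hypothetical helper (not part of chowdown): splits 'img/src' into its
// selector and attribute parts, mirroring the convention described above.
function splitSelector(input) {
  const slash = input.lastIndexOf('/');
  if (slash === -1) {
    // No attribute given: a string or number query would use the inner text.
    return { selector: input, attribute: undefined };
  }
  return {
    selector: input.slice(0, slash),
    attribute: input.slice(slash + 1)
  };
}

console.log(splitSelector('img/src'));
console.log(splitSelector('.name'));
```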

Nesting

Using chowdown, we can construct much more complex queries. It's possible to construct queries for use inside of other queries.

If we wanted to retrieve each of the author's books, we could do the following:

chowdown('http://somewebpage.com')
  .collection('.author', {
    name: '.name',
    age: '.age',
    books: chowdown.query.collection('.book', {
      title: '.title',
      year: '.year'
    })
  });

// or, alternatively:

chowdown('http://somewebpage.com')
  .collection('.author', {
    name: '.name',
    age: '.age',
    books: (author) => author.collection('.book', {
      title: '.title',
      year: '.year'
    })
  });

These will both resolve to:

[
  { 
    name: 'Dennis Reynolds',
    age: '41',
    books: [
      {
        title: 'The Dennis System',
        year: '2009'
      },
      {
        title: 'Chardee MacDennis: A Guide',
        year: '2011'
      }
    ]
  },
  { 
    name: 'Stephen King',
    age: '69',
    books: [
      {
        title: 'Clown Town',
        year: '1990'
      }
    ]
  }
]

Querying

As seen above, it's possible to take shortcuts to describe queries. Anywhere a string is found in place of a query, it will be used as the selector parameter in a string query:

let scope = chowdown('http://somewebpage.com');

scope.collection('.author', '.name')
// => Resolves to: ['Dennis Reynolds', 'Stephen King']

scope.collection('.author', chowdown.query.string('.name'))
// => Resolves to: ['Dennis Reynolds', 'Stephen King']

Likewise, anywhere an object is found in place of a query, it will be used as the pick parameter in an object query.

let scope = chowdown('http://somewebpage.com');

scope.collection('.author', {name: '.name'})
// => Resolves to: [{name: 'Dennis Reynolds'}, {name: 'Stephen King'}]

scope.collection('.author', chowdown.query.object({name: '.name'}))
// => Resolves to: [{name: 'Dennis Reynolds'}, {name: 'Stephen King'}]

Finally, anywhere a function is found in place of a query, it will be used as the fn parameter in a callback query.

let scope = chowdown('http://somewebpage.com');

scope.collection('.author', (author) => author.string('.name'))
// => Resolves to: ['Dennis Reynolds', 'Stephen King']

scope.collection('.author', chowdown.query.callback((author) => author.string('.name')))
// => Resolves to: ['Dennis Reynolds', 'Stephen King']
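
The three shortcut rules above amount to a dispatch on the type of the value. The function below, shortcutType, is a hypothetical sketch of that rule and not chowdown's source; a real implementation would presumably first check whether the value is already a constructed query.

```javascript
// Hypothetical sketch (not chowdown's source): names the query type that a
// shortcut value stands for, following the three rules described above.
function shortcutType(value) {
  if (typeof value === 'string') {
    return 'string'; // the value becomes the `selector` of a string query
  }
  if (typeof value === 'function') {
    return 'callback'; // the value becomes the `fn` of a callback query
  }
  if (value !== null && typeof value === 'object') {
    return 'object'; // the value becomes the `pick` of an object query
  }
  throw new TypeError('Not a recognised query shortcut: ' + String(value));
}

console.log(shortcutType('.name'));
console.log(shortcutType({ name: '.name' }));
console.log(shortcutType((author) => author.string('.name')));
```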

Manually created queries can also be executed directly on a Scope like this:

let scope = chowdown('http://somewebpage.com');

scope.execute(chowdown.query.string('.author:nth-child(1) .name'))
// => Resolves to: 'Dennis Reynolds'

Creating Scopes

The library's main function is actually an alias for chowdown.request; this is one of three functions that allow for the creation of Scope objects:

chowdown.request(request, options)


Issues a request using request-promise with the given request object or uri string and returns a Scope created from the response.

Parameters

  • request {string|object} Either a uri or a request object that will be passed to request-promise.
  • [options] {object} An object of configuration options.
    • [client=rp] {function} A client function to use in place of request-promise. It will be passed a request object or uri and should return a promise that resolves to a string or cheerio object.

Returns

  • Scope A scope wrapping the response of the request.
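
As an illustration of the client option, here is a minimal sketch of a client that serves canned markup instead of issuing a network request, which can be handy in tests. The name staticClient and the canned markup are assumptions made for this example; the only contract taken from the description above is that the function receives a request object or uri and returns a promise resolving to a string.

```javascript
// Canned markup keyed by URI. The markup is made up for illustration.
const cannedPages = {
  'http://somewebpage.com':
    '<div class="author"><a class="name">Dennis Reynolds</a></div>'
};

// A client receives a request object or a URI string and returns a promise
// that resolves to the document body as a string.
function staticClient(request) {
  // Request objects for the request library carry the address in `uri` or `url`.
  const uri = typeof request === 'string' ? request : request.uri || request.url;
  if (!Object.prototype.hasOwnProperty.call(cannedPages, uri)) {
    return Promise.reject(new Error('No canned page for ' + uri));
  }
  return Promise.resolve(cannedPages[uri]);
}

staticClient('http://somewebpage.com').then((body) => console.log(body));
```

Following the parameter list above, such a function would be passed as chowdown.request('http://somewebpage.com', { client: staticClient }).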

chowdown.file(file)


Reads from the file located at file and returns a Scope created from the contents of the file.

Parameters

  • file {string} The filename.

Returns

  • Scope A scope wrapping the file's contents.

chowdown.body(body)


Loads a DOM document directly from a cheerio object or string and returns a Scope created from this document.

Parameters

  • body {cheerio|string} Either an existing cheerio object or a DOM string.

Returns

  • Scope A scope wrapping the body.

Using Scopes

Scope instances have methods that allow you to query directly on a document (or part of a document):

  • scope.string: creates and executes a string query within the scope.
  • scope.number: creates and executes a number query within the scope.
  • scope.collection: creates and executes a collection query within the scope.
  • scope.object: creates and executes an object query within the scope.
  • scope.raw: creates and executes a raw query within the scope.
  • scope.regex: creates and executes a regex query within the scope.
  • scope.context: creates and executes a context query within the scope.
  • scope.uri: creates and executes a uri query within the scope.
  • scope.follow: creates and executes a follow query within the scope.
  • scope.paginate: creates and executes a paginate query within the scope.
  • scope.callback: creates and executes a callback query within the scope.
  • scope.execute: executes an existing query within the scope.

scope.execute(query)


Executes the given query on the document used by this scope.

Parameters

  • query {Query<T>} The query to execute within this scope.

Returns

  • Promise<T> A promise resolving to the result of the query.

Example

let scope = chowdown.request('http://somewebpage.com');

let query = chowdown.query.string('.author:nth-child(1) .name');

scope.execute(query);

This will resolve to:

'Dennis Reynolds'

Creating Queries

The main chowdown function has a query property containing methods that allow for the creation of different types of queries.

All of the following examples use the same sample uri and markup as before.

chowdown.query.string(selector, options)


Creates a query to find a string at the given selector in a document. Any retrieved non-string value will be coerced into a string.

Parameters

  • selector {string} A selector to find the string in a document.
  • [options] {object} An object of configuration options.
    • [default=''] {string} The default value to return if no string is found.
    • [throwOnMissing=false] {boolean} A flag that dictates whether or not to throw an error if no string is found.
    • [format=[]] {function|function[]} A function or array of functions used to format the retrieved string.

Returns

  • Query<string> The constructed string query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.string('.author:nth-child(1) .name');

scope.execute(query);

This will resolve to:

'Dennis Reynolds'
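
The format option accepts either one function or an array of functions. The sketch below shows one plausible reading of that option using a hypothetical helper, applyFormat; applying the functions left to right is an assumption, not something the description above guarantees.

```javascript
// Hypothetical sketch (not chowdown's source): applies a single formatter or
// an array of formatters to a value, left to right.
function applyFormat(value, format = []) {
  const formatters = Array.isArray(format) ? format : [format];
  return formatters.reduce((current, fn) => fn(current), value);
}

const trim = (s) => s.trim();
const upper = (s) => s.toUpperCase();

console.log(applyFormat('  Dennis Reynolds ', [trim, upper])); // 'DENNIS REYNOLDS'
console.log(applyFormat('  Dennis Reynolds ', trim)); // 'Dennis Reynolds'
```

With chowdown itself, the same formatters would be supplied through the options object, for example chowdown.query.string('.name', { format: [trim, upper] }).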

chowdown.query.number(selector, options)


Creates a query to find a number at the given selector in a document. Any retrieved non-number value will be coerced into a number.

Parameters

  • selector {string} A selector to find the number in a document.
  • [options] {object} An object of configuration options.
    • [default=NaN] {number} The default value to return if no number is found.
    • See chowdown.query.string for other possible options.

Returns

  • Query<number> The constructed number query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.number('.author:nth-child(1) .age');

scope.execute(query);

This will resolve to:

41

chowdown.query.collection(selector, inner, options)


Creates a query to find an array of values such that each value in the array is the result of the inner query executed on a child document. The set of child documents is pointed to by the selector parameter.

Parameters

  • selector {string} A selector to find the child documents in a document.
  • inner {Query<T>} The inner query to execute on each child document.
  • [options] {object} An object of configuration options.
    • [default=[]] {any[]} The default value to return if no child documents are found.
    • [filter] {function} A function used to filter the resulting array. Every item in the array is passed through this function and the values for which the function is truthy are kept.
    • See chowdown.query.string for other possible options.

Returns

  • Query<T[]> The constructed collection query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.collection('.author', chowdown.query.number('.age'));

scope.execute(query);

This will resolve to:

[41, 69]
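
The filter option keeps the items for which the function is truthy, which is the same rule as Array.prototype.filter. The sketch below applies that rule to a plain array standing in for the collection's result; the predicate olderThanFifty is made up for illustration, and whether chowdown passes extra arguments to the function is not stated above.

```javascript
// Stand-in for the array the collection query above resolves to.
const ages = [41, 69];

// Illustrative predicate: keep only the ages above fifty.
const olderThanFifty = (age) => age > 50;

const kept = ages.filter(olderThanFifty);
console.log(kept); // [ 69 ]
```

Following the parameter list above, the predicate would be passed as chowdown.query.collection('.author', chowdown.query.number('.age'), { filter: olderThanFifty }).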

chowdown.query.object(pick, options)


Creates a query that will find an object in a document such that each value in the object is the result of the corresponding query in the pick parameter.

Parameters

  • pick {object} The object of queries to map.
  • [options] {object} An object of configuration options.

Returns

  • Query<object> The constructed object query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.object({
  name: chowdown.query.string('.author:nth-child(1) .name'),
  age: chowdown.query.number('.author:nth-child(1) .age')
});

scope.execute(query);

This will resolve to:

{
  name: 'Dennis Reynolds',
  age: 41
}

chowdown.query.raw(fn, options)


Creates a query that calls fn with the underlying cheerio function and cheerio context. The result of this query will be the result of this call.

Parameters

  • fn {function} The raw function to be called with the cheerio instance.
  • [options] {object} An object of configuration options.
    • [default=undefined] {any} The default value to return if undefined is returned from the function.
    • See chowdown.query.string for other possible options.

Returns

  • Query<any> The constructed raw query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.raw(($, context) => $('.author:nth-child(2) .name').text());

scope.execute(query);

This will resolve to:

'Stephen King'

chowdown.query.regex(selector, pattern, group, options)


Creates a query that will find a string in a document using the given selector and perform a regex match on it using pattern.

Parameters

  • selector {string} A selector to find the string in a document.
  • pattern {RegExp} The pattern used to match on the retrieved string.
  • [group] {number} The index of a matched group to return.
  • [options] {object} An object of configuration options.
    • [default=[]] {any[]} The default value to return if no matches are made.
    • See chowdown.query.string for other possible options.

Returns

  • Query<string|string[]> The constructed regex query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.regex('.author:nth-child(2)', /(Stephen) (.*)/);

scope.execute(query);

This will resolve to (roughly):

['Stephen King', 'Stephen', 'King']

If we want a specific group:

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.regex('.author:nth-child(2)', /(Stephen) (.*)/, 2);

scope.execute(query);

This will resolve to:

'King'
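
The two results above correspond to what JavaScript's own String.prototype.match returns for a non-global pattern. The sketch below reproduces them without chowdown, using a plain string in place of the text retrieved from the document.

```javascript
// Plain string standing in for the text found at the selector.
const text = 'Stephen King';
const pattern = /(Stephen) (.*)/;

const match = text.match(pattern);

// Copy the match into a plain array: the full match, then each group.
const groups = Array.from(match);
console.log(groups); // [ 'Stephen King', 'Stephen', 'King' ]

// The group parameter corresponds to an index into that array.
console.log(match[2]); // 'King'

// Without a match, String.prototype.match returns null; per the options
// above, the regex query would return its default value instead.
const missing = 'Dennis Reynolds'.match(pattern);
console.log(missing); // null
```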

chowdown.query.context(selector, inner, options)


Creates a query that executes the inner query within the context of a child document pointed to by the given selector.

Parameters

  • selector {string} A selector to find the child document.
  • inner {Query<T>} The inner query to execute on the child document.
  • [options] {object} An object of configuration options.
    • [default=undefined] {any} The default value to return if the context can't be found.
    • See chowdown.query.string for other possible options.

Returns

  • Query<T> The constructed context query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.context('.author:nth-child(1) .book:nth-of-type(1)',
  chowdown.query.object({
    title: '.title',
    year: (book) => book.number('.year')
  })
);

scope.execute(query);

This will resolve to:

{
  title: 'The Dennis System',
  year: 2009
}

chowdown.query.uri(selector, base, options)


Creates a query that finds a URI in a document using the given selector and resolves it relative to the given base URI. Will automatically attempt to grab the href attribute of the element specified by selector.

If no URI is retrieved from the document, chowdown will not attempt to resolve the default value against the base URI.

Parameters

  • selector {string} A selector to find the URI.
  • [base] {string} The base URI for the retrieved URI.
  • [options] {object} An object of configuration options.

Returns

  • Query<string> The constructed URI query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.uri('.author:nth-child(1) .name', 'http://somewebpage.com');

scope.execute(query);

This will resolve to:

'http://somewebpage.com/dennis'
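
Resolving a relative href against a base URI can be reproduced with Node's built-in URL class. This is only a sketch of the idea; chowdown may use a different resolver internally, and resolveAgainst is a name made up for this example.

```javascript
const base = 'http://somewebpage.com';

// Resolves an href found in the document against the base URI.
const resolveAgainst = (href) => new URL(href, base).href;

console.log(resolveAgainst('/dennis')); // 'http://somewebpage.com/dennis'
console.log(resolveAgainst('/search?page=2')); // 'http://somewebpage.com/search?page=2'

// An href that is already absolute is returned unchanged.
console.log(resolveAgainst('http://example.org/x')); // 'http://example.org/x'
```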

chowdown.query.follow(uri, inner, options)


Creates a query that follows the URI pointed to by the uri query and executes the inner query on the document at this URI.

Parameters

  • uri {string|object|function} A query to find the URI.
  • inner {Query<T>} A query to execute on the document at the URI.
  • [options] {object} An object of configuration options.
    • [default=undefined] {any} The default value to return if there's an error accessing the page.
    • [client=rp] {function} A client function to use in place of request-promise. It will be passed a request object or URI and should return a promise that resolves to a string or cheerio object.
    • [request] {object} An object of other request options to pass to client.
    • See chowdown.query.string for other possible options.

Returns

  • Query<T> The constructed follow query.

Example

In the sample markup (for the uri http://somewebpage.com), we can see the first author's div contains a link to http://somewebpage.com/dennis. Let's assume the markup at this uri is as follows:

<a id="favourite-food">DeVitos</a>

We can use a follow query to read this information:

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.follow(
  (doc) => doc.uri('.author:nth-child(1) .name'),
  (otherPage) => otherPage.string('#favourite-food')
);

scope.execute(query);

This will resolve to:

'DeVitos'

chowdown.query.paginate(inner, uri, max, options)


Creates a query that executes the inner query on multiple pages. The link to the next page is pointed to by the uri query. Pagination will stop after max pages have been requested. If max is a function, pagination will stop whenever it returns false.

Parameters

  • inner {Query<T>} A query to execute on each document.
  • uri {string|object|function} A query to find the next page's URI in each document.
  • [max=Infinity] {number|function} The maximum number of pages to retrieve or a function that takes the current number of pages and the last page and returns false when it's desirable to stop.
  • [options] {object} An object of configuration options.
    • [default=undefined] {any} The default value to return if there's an error accessing a page.
    • [client=rp] {function} A client function to use in place of request-promise. It will be passed a request object or URI and should return a promise that resolves to a string or cheerio object.
    • [request] {object} An object of other request options to pass to client.
    • [merge=flatten] {function} The function used to merge the paginated results. Takes one argument pages - an array of all page results. Uses lodash.flatten by default.
    • See chowdown.query.string for other possible options.

Returns

  • Query<any> The constructed paginate query.

Example

In the sample markup, there is a link to the next page of results, http://somewebpage.com/search?page=2, at the bottom of the page. Let's assume the markup at this page is as follows:

<div>
  <div class="author">
    <a href="/william" class="name">William Shakespeare</a>
    <span class="age">453</span>
    <img src="william.jpg"/>
    <div class="book">
      <span class="title">Hamlet</span>
      <span class="year">1600</span>
    </div>
  </div>
  <a class="next" href="/search?page=3"/>
</div>

We can execute queries on both the first page and this page (and as many more as we'd like) with the following query:

let scope = chowdown('http://somewebpage.com');

let names = chowdown.query.collection('.author', '.name');

// The last argument is the maximum number of pages to read.
let pages = chowdown.query.paginate(names, '.next', 2);

scope.execute(pages);

This will resolve to:

['Dennis Reynolds', 'Stephen King', 'William Shakespeare']
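
The max and merge options can be illustrated with plain functions. In the sketch below, pageResults stands in for the per-page results of the inner query on the two sample pages; the names flattenMerge, keepPages and keepGoing are made up for this example, and Array.prototype.flat is used here as a stand-in for lodash.flatten.

```javascript
// Per-page results of the inner query, one array per page.
const pageResults = [
  ['Dennis Reynolds', 'Stephen King'],
  ['William Shakespeare']
];

// Default-style merge: flatten the pages by one level.
const flattenMerge = (allPages) => allPages.flat();

// Custom merge: keep the pages separate and label each with its number.
const keepPages = (allPages) =>
  allPages.map((items, index) => ({ page: index + 1, items }));

// A max function receives the current number of pages (and the last page);
// returning false stops pagination. This one stops once two pages are read.
const keepGoing = (count) => count < 2;

console.log(flattenMerge(pageResults));
console.log(keepPages(pageResults));
console.log(keepGoing(1), keepGoing(2));
```

Following the signature above, these would be passed as chowdown.query.paginate(names, '.next', keepGoing, { merge: keepPages }).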

chowdown.query.callback(fn, options)


Creates a query that calls fn with a Scope that wraps a document (or part of a document) and returns the result of this call.

Parameters

  • fn {function} A function to call with a Scope for a document.
  • [options] {object} An object of configuration options.

Returns

  • Query<any> The constructed callback query.

Example

let scope = chowdown('http://somewebpage.com');

let query = chowdown.query.callback((document) => document.string('.author:nth-child(2) .name'));

scope.execute(query);

This will resolve to:

'Stephen King'

Testing

If you have cloned this repository, it's possible to run the tests by executing the following command from the root of the repository:

$ npm test

License (ISC)

See the LICENSE file for details.
