cull v0.0.1

[api] [github] [npm]

  • An RxJS-based library built to help me scrape data in my side projects.
  • Very much a work in progress - I'm new to all of this and learn a lot every day. The API is far from stable.
  • Built from the desire to simplify scraping and remove all that boilerplate.
    • I now know that what I perceived as boilerplate was just me being inefficient.
    • With time this has become more and more of a playground for me to try and grok all the cool things I hear about.
  • This project aims to build on the awesome ideas I got from looking at libraries like x-ray and node-osmosis.
    • I try to thoroughly document everything, as I couldn't grasp what was going on in those libraries. This has turned out to be not as simple as I imagined - a lot of my current documentation reads like "parameter name: the name of the object".

Example

  • As x-ray served as the main inspiration, it only seems fitting to use the same example. Note that the API produces longer code and is less expressive than x-ray's - I don't expect this to change.

    import Cull from 'cull';

    var cull = new Cull();
    cull
      // sources describe the website; each value follows
      // 'source | selection | transformation | transformation arguments'
      .setSources({
        list: 'https://dribbble.com',
        // follow the .next_page link, crawling at most 3 pages
        page: 'list   | .next_page                     | paginate | 3',
        // [ ] selects all matching elements instead of a single one
        item: ['page  | ol.dribbbles > li.group']
      })
      // tables describe the output rows built from the sources
      .setTables({
        item: {
          pindex: 'page | -                              | index',
          eindex: 'item | -                              | index',
          title: 'item  | .dribbble-img strong           | text',
          image: 'item  | .dribbble-img div:nth-child(2) | data     | src'
        }
      })
      // execute the crawl and log the results
      .log();

Features

  • Flexible schema: The schema is independent of the structure of the scraped pages. Selections can resolve single elements or arrays thereof, and nesting allows complex data structures. Output and input are independent - define the input once and use it throughout your output.
  • Pluggable & Configurable: All default classes can be swapped out for custom classes exposing the same interface. For most cases, however, it should be enough to configure the default classes to your liking by providing custom configuration objects to the Cull constructor.
  • Crawler:
    • Pagination: Crawl through websites easily by specifying the next-page link and the limit of pages to grab (see the sketch after this list).
    • Concurrency & Delay: Scrape responsibly - adapt the number of concurrent requests and the delay between issuing requests.
  • Based on popular libraries: The default classes are based on request, knex and cheerio.
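
As a minimal sketch of the crawler in isolation (the paginate selector is taken straight from the example above; nothing else is configured):

    import Cull from 'cull';

    var cull = new Cull();
    // crawl up to 3 pages by following the element matched by '.next_page'
    cull.setSources({
      list: 'https://dribbble.com',
      page: 'list | .next_page | paginate | 3'
    });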

Introduction

  • For now the API hasn't been stable enough to allow writing more complete documentation, so this is merely a short overview of the general gist.

Schema

The schema differentiates between Sources and Tables.

  • Sources
    • The schema of the website. To crawl through the website, source values are categorized into Documents and non-Documents.
      • Any non-Document value returned from a source is crawled into a new Document. This includes non-selector values provided directly (e.g. an array of urls set on a source such as 'pages').
  • Tables
    • The schema of your data. Tables reference sources and transform the data into a fitting representation.
    • The values of all of a table's sources are combined according to their relationships. Parent source values are combined with all of their children, but not with unrelated parent values. More specifically, the cross product of sources grouped by their most direct common ancestor is formed (see the sketch below).
      • Using the dribbble example above:
        • Each value from the item source that is a child of page 1 is combined with the parent page 1, but not with page 2.
        • If multiple unrelated children of page were included, e.g. item and menu, each item for page 1 would be grouped with each menu for page 1.
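
To make the grouping concrete, here is a hedged sketch of the rows the dribbble example would produce (the field values are invented placeholders):

    // each row carries its own index (eindex) and the index of its parent
    // page (pindex) - items are never combined with unrelated pages
    var rows = [
      { pindex: 0, eindex: 0, title: '...', image: '...' }, // item 0 of page 0
      { pindex: 0, eindex: 1, title: '...', image: '...' }, // item 1 of page 0
      { pindex: 1, eindex: 0, title: '...', image: '...' }  // item 0 of page 1
    ];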

Selection & Manipulation

Selectors select & transform the data.

  • Selectors are defined as string|string[] values of the format 'source | selection | transformation | transformation arguments'.
    • 'source | selection | transformation | transformation arguments' scrapes a single element
    • ['source | selection | transformation | transformation arguments'] scrapes all elements
  • Pages/Selections are modeled as Documents.

    • Transformations have access to the document (exposing the page/selection index, request & response), the selector & the cull instance.

      • A variety of default transformations is defined:

        // scrape a single item as text
        'list  | li   | text'
        // scrape all items as html
        ['list | li   | html']
        // scrape a link's absolute url
        'list  | li a | href'
        // scrape a link's data-src attribute
        'list  | li a | data | src'
        // scrape a source document's (in this case page's) request url
        // '-' denotes a complete selection (no scope; the complete original model is selected)
        // as the url transformation reads from the document's request attribute, the scope is actually meaningless here and could be set to anything
        'list  | - | url'
        // ...
      • Once the API has settled, a complete list will be provided - for now, have a look at the source code of the Document class.

    • Return values of transformations can be simple values or observables. Observables are flatMapped into simple values to allow async operations in transformations.
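
As a sketch, an async transformation can return an observable, which cull then flattens into a plain value (the RxJS v4 style Rx import and the fetchAuthor helper are assumptions for illustration, not part of the documented API):

    import Cull from 'cull';
    import Rx from 'rx';

    // hypothetical helper - stands in for any promise-returning lookup
    var fetchAuthor = (url) => Promise.resolve('author of ' + url);

    var transformations = {
      // returns an observable; it is flatMapped into a simple value
      // before the result reaches the output
      author: (document, selector, cull) =>
        Rx.Observable.fromPromise(fetchAuthor(document.request.url))
    };
    var cull = new Cull({ transformations });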

Output

Console

  • The output can be logged using cull.log().
  • cull.connect() executes your crawl but does not log any output apart from errors.

Database

  • The default Store is based on knex. To store rows to the database, provide a storeConfig (a knex config) object to Cull & execute the crawl with cull.save() (see the sketch below).
    • Tables are created automatically if they do not exist.
      • Columns of automatically created tables (for now) have the type text.
      • All of a table's columns that are named 'id' or end in 'id' are set as the table's primary key using union().
    • Duplicate rows (as defined by the table's primary key) are not inserted and are logged as errors.
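
A hedged sketch of persisting to a local SQLite file (assuming the storeConfig option is handed straight through to knex, as described above):

    import Cull from 'cull';

    var cull = new Cull({
      // storeConfig is a plain knex config - here a local SQLite database
      storeConfig: {
        client: 'sqlite3',
        connection: { filename: './cull.sqlite' },
        useNullAsDefault: true
      }
    });
    // ...setSources()/setTables() as usual, then persist instead of logging
    cull.save();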

Pluggable & Configurable

Everything is configurable/swappable through the Cull constructor. A few examples:

  • To provide custom transformations, just provide a transformations object containing transformation functions to Cull.

    var transformations = {
      // transform away - the document exposes request, response, index & the model (the $ / cheerio instance)
      answerToEverything: (document, selector, cull) => 42
    };
    var cull = new Cull({ transformations });
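    • The custom transformation can then be referenced by name in a selector - e.g. a hypothetical table column using it:

      cull.setTables({ item: { answer: 'item | - | answerToEverything' } });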
  • To replace '|' as the separator for selectors, provide a Selector class with a different separator defined on Selector.seperator.

    var Selector = Cull.classes.Selector;
    Selector.seperator = '!';
    var cull = new Cull({ Selector });
  • To replace the default Document class, provide a class to Cull({ Document }) that is instantiated with (request, response, index) and can be resolved into an observable emitting the resolved value by providing a selector instance: new Document().resolve(selector). The same goes for the other classes - look up their interfaces in the API documentation.
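
As a sketch, a drop-in Document replacement that only logs what it resolves (the constructor and resolve() signatures follow the description above; the RxJS v4 style Rx.Observable.just placeholder is an assumption):

    import Cull from 'cull';
    import Rx from 'rx';

    class LoggingDocument {
      // instantiated with (request, response, index) as described above
      constructor(request, response, index) {
        this.request = request;
        this.response = response;
        this.index = index;
      }
      // must return an observable emitting the resolved value
      resolve(selector) {
        console.log('resolving', selector);
        return Rx.Observable.just(null); // placeholder resolved value
      }
    }
    var cull = new Cull({ Document: LoggingDocument });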