1.1.0 • Published 4 years ago

extract-from-document v1.1.0

License
ISC

extract-from-document

This is a utility function that simplifies data extraction from a document.

Installing

Via npm:

$ npm install [-g] extract-from-document

Usage

To use this function you need two things: an instance of a document or element (IScope), for example the global document when running inside a browser, and a recipe (IRecipe) that describes what to extract from it. A recipe can be a Source, a Scope, or an IMap.

Source is the simplest recipe. It has an element selector, an optional attribute name (default: 'innerText') and an optional isSingle flag (default: true).

export class Source {
  constructor(public selector: string, public attribute: string = 'innerText', public isSingle: boolean = true) {}
}
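
As a small illustration of the defaults described above, the sketch below builds three Source recipes. The class here is a local stand-in that mirrors the constructor signature shown above so the snippet runs on its own, and the selectors (h1.title, nav a) are hypothetical.

```typescript
// Local stand-in mirroring the Source constructor from this README,
// so the snippet runs without installing the package.
class Source {
  selector: string;
  attribute: string;
  isSingle: boolean;
  constructor(selector: string, attribute: string = 'innerText', isSingle: boolean = true) {
    this.selector = selector;
    this.attribute = attribute;
    this.isSingle = isSingle;
  }
}

// Only a selector: attribute defaults to 'innerText', isSingle to true.
const title = new Source('h1.title');

// Read the href of the matching link instead of its text.
const firstLink = new Source('nav a', 'href');

// isSingle = false: presumably collects every match rather than one
// (an assumption, inferred from how Scope behaves in the Puppeteer example).
const allLinks = new Source('nav a', 'href', false);

console.log(title, firstLink, allLinks);
```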

IMap is an object of key-value pairs in which each key is a string and each value is an IRecipe.

export interface IMap {
  [key: string]: IRecipe;
}

Scope is a map (IMap) combined with a selector that sets the context for the recipes inside it. It is useful, for example, when you want to extract data from a specific table or row.

export class Scope {
  constructor(public map: IMap, public selector: string, public isSingle: boolean = true) {}
}
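
To make the table-row case concrete, here is a minimal sketch of a recipe that combines an IMap with a Scope. The classes and types are local stand-ins that follow the definitions above, and the table markup it targets (table#products, td.name, td.price) is hypothetical.

```typescript
// Local stand-ins following the definitions in this README.
class Source {
  selector: string;
  attribute: string;
  isSingle: boolean;
  constructor(selector: string, attribute: string = 'innerText', isSingle: boolean = true) {
    this.selector = selector;
    this.attribute = attribute;
    this.isSingle = isSingle;
  }
}

interface IMap {
  [key: string]: IRecipe;
}

class Scope {
  map: IMap;
  selector: string;
  isSingle: boolean;
  constructor(map: IMap, selector: string, isSingle: boolean = true) {
    this.map = map;
    this.selector = selector;
    this.isSingle = isSingle;
  }
}

type IRecipe = Source | Scope | IMap;

// One entry per table row; the inner selectors are meant to be
// resolved relative to each row rather than to the whole document.
const rows = new Scope(
  {
    name: new Source('td.name'),
    price: new Source('td.price'),
  },
  'table#products tbody tr',
  false,
);

// A plain IMap at the top level groups several recipes under named keys.
const recipe: IMap = {
  heading: new Source('h1'),
  rows,
};

console.log(Object.keys(recipe), Object.keys(rows.map));
```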

The example below uses a function getDocument(), which you must implement or replace, that returns a document or an HTML element.

import { extractFromDocument, IRecipe, Source } from 'extract-from-document';

const document: HTMLElement | Document = getDocument(); // obtain a document or element instance somehow
const recipe: IRecipe = new Source('.some-class-selector');
const result = extractFromDocument(recipe, document);
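
To show how the three recipe kinds could fit together, here is a rough sketch of one way a recipe might be interpreted. It is not the package's implementation: extractSketch, readAttribute, fakeNode and the simplified IScope interface are all hypothetical, and reading values as element properties is an assumption based on the absolute URLs in the example output further down.

```typescript
// Local stand-ins for the recipe types described in this README.
class Source {
  selector: string; attribute: string; isSingle: boolean;
  constructor(selector: string, attribute: string = 'innerText', isSingle: boolean = true) {
    this.selector = selector; this.attribute = attribute; this.isSingle = isSingle;
  }
}
interface IMap { [key: string]: IRecipe; }
class Scope {
  map: IMap; selector: string; isSingle: boolean;
  constructor(map: IMap, selector: string, isSingle: boolean = true) {
    this.map = map; this.selector = selector; this.isSingle = isSingle;
  }
}
type IRecipe = Source | Scope | IMap;

// Hypothetical minimal view of a document or element.
interface IScope {
  querySelector(selector: string): IScope | null;
  querySelectorAll(selector: string): ArrayLike<IScope>;
}
type Extracted = string | null | Extracted[] | { [key: string]: Extracted };

// Assumption: values are read as properties (el.innerText, el.href).
function readAttribute(node: IScope, attribute: string): string | null {
  const value = (node as unknown as Record<string, unknown>)[attribute];
  return typeof value === 'string' ? value : null;
}

function extractSketch(recipe: IRecipe, scope: IScope): Extracted {
  if (recipe instanceof Source) {
    const { selector, attribute, isSingle } = recipe;
    if (isSingle) {
      const node = scope.querySelector(selector);
      return node ? readAttribute(node, attribute) : null;
    }
    return Array.from(scope.querySelectorAll(selector), (n) => readAttribute(n, attribute));
  }
  if (recipe instanceof Scope) {
    const { map, selector, isSingle } = recipe;
    if (isSingle) {
      const node = scope.querySelector(selector);
      return node ? extractSketch(map, node) : null;
    }
    // Each matched element becomes the context for the inner map.
    return Array.from(scope.querySelectorAll(selector), (n) => extractSketch(map, n));
  }
  // Plain IMap: evaluate every entry against the same scope.
  const out: { [key: string]: Extracted } = {};
  for (const [key, sub] of Object.entries(recipe)) {
    out[key] = extractSketch(sub, scope);
  }
  return out;
}

// Fake DOM node: children are looked up by exact selector string only.
function fakeNode(props: Record<string, string>, children: Record<string, IScope[]> = {}): IScope {
  return {
    ...props,
    querySelector: (s) => (children[s] ?? [])[0] ?? null,
    querySelectorAll: (s) => children[s] ?? [],
  };
}

const page = fakeNode({}, {
  h1: [fakeNode({ innerText: 'Products' })],
  li: [
    fakeNode({}, { a: [fakeNode({ innerText: 'First', href: '/first' })] }),
    fakeNode({}, { a: [fakeNode({ innerText: 'Second', href: '/second' })] }),
  ],
});

const demoRecipe: IRecipe = {
  heading: new Source('h1'),
  items: new Scope({ title: new Source('a'), url: new Source('a', 'href') }, 'li', false),
};

const demoResult = extractSketch(demoRecipe, page);
console.log(JSON.stringify(demoResult));
// {"heading":"Products","items":[{"title":"First","url":"/first"},{"title":"Second","url":"/second"}]}
```

The point of the sketch is the shape of the output: a Source yields a value, a Scope with isSingle set to false yields an array with one object per matched element, and a plain IMap yields an object with the same keys as the recipe.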

Example usage with Puppeteer

The following example shows how to use it with Puppeteer.

Implement a helper function called extract in the file ./util/extract.ts. It launches a browser, opens the page at the given URL, and passes extractFromDocument together with the provided recipe to the page's evaluate function, which runs the extraction inside the page.

import { launch } from 'puppeteer';

import { extractFromDocument, IRecipe } from 'extract-from-document';

export async function extract(recipe: IRecipe, url: string) {
  const browser = await launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  try {
    const page = await browser.newPage();
    await page.goto(url);

    // Run extractFromDocument inside the page, with the recipe as its argument.
    return await page.evaluate(extractFromDocument, recipe);
  } finally {
    // Close the browser even if navigation or extraction fails.
    await browser.close();
  }
}
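
One detail worth keeping in mind with this helper: Puppeteer serializes the arguments of page.evaluate before handing them to the page, so class instances arrive there as plain objects without their prototype. The sketch below approximates that effect with a JSON round trip, which is an assumption made for illustration and not Puppeteer's exact mechanism; the Source class is again a local stand-in. How extractFromDocument itself tells recipe kinds apart is not stated in this README.

```typescript
// Local stand-in mirroring the Source constructor from this README.
class Source {
  selector: string;
  attribute: string;
  isSingle: boolean;
  constructor(selector: string, attribute: string = 'innerText', isSingle: boolean = true) {
    this.selector = selector;
    this.attribute = attribute;
    this.isSingle = isSingle;
  }
}

const original = new Source('a', 'href');

// Approximation of what crossing the Node/browser boundary does to a value.
const roundTripped = JSON.parse(JSON.stringify(original));

console.log(original instanceof Source);     // true
console.log(roundTripped instanceof Source); // false
// The data itself survives: selector, attribute and isSingle are intact.
console.log(roundTripped.selector, roundTripped.attribute, roundTripped.isSingle); // a href true
```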

Now import this extract function, define a recipe describing what to extract, and give the URL to extract it from. The stringified result is logged to the console.

import { Scope, Source } from 'extract-from-document';
import { extract } from './util/extract';

const recipe = {
  hotNetworkQuestions: new Scope({
    title: new Source('a'),
    url: new Source('a', 'href'),
  }, '#hot-network-questions li', false),
  related: new Scope({
    answer: {
      url: new Source('a[title^="Vote score"]', 'href'),
      votes: new Source('.answer-votes'),
    },
    title: new Source('.question-hyperlink'),
    url: new Source('.question-hyperlink', 'href'),
  }, '.module.sidebar-related .spacer', false),
};
const url = 'https://stackoverflow.com/questions/24825860/code-coverage-for-jest';

extract(recipe, url).then((result: any) => console.info(JSON.stringify(result, null, 2)));

The result is:

{
  "hotNetworkQuestions": [
    {
      "title": "What computer would be fastest for Mathematica Home Edition?",
      "url": "https://mathematica.stackexchange.com/questions/195184/what-computer-would-be-fastest-for-mathematica-home-edition"
    },
    {
      "title": "Slither Like a Snake",
      "url": "https://codegolf.stackexchange.com/questions/183153/slither-like-a-snake"
    },
    {
      "title": "How is simplicity better than precision and clarity in prose?",
      "url": "https://writing.stackexchange.com/questions/44589/how-is-simplicity-better-than-precision-and-clarity-in-prose"
    }
  ],
  "related": [
    {
      "answer": {
        "url": "https://stackoverflow.com/q/336859?rq=1",
        "votes": "6394"
      },
      "title": "var functionName = function() {} vs function functionName() {}",
      "url": "https://stackoverflow.com/questions/336859/var-functionname-function-vs-function-functionname?rq=1"
    },
    {
      "answer": {
        "url": "https://stackoverflow.com/q/40465047?rq=1",
        "votes": "173"
      },
      "title": "How can I mock an ES6 module import using Jest?",
      "url": "https://stackoverflow.com/questions/40465047/how-can-i-mock-an-es6-module-import-using-jest?rq=1"
    }
  ]
}
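
Note that every extracted value in this output is a string, including the vote counts. If numbers are needed, they can be converted afterwards. The sketch below does this for the related entries shown above; the RelatedQuestion interface and the toNumericVotes helper are hypothetical names, not part of the package.

```typescript
// Shape of one "related" entry, inferred from the output above.
interface RelatedQuestion {
  answer: { url: string; votes: string };
  title: string;
  url: string;
}

// Sample data copied from the result shown in this README.
const related: RelatedQuestion[] = [
  {
    answer: { url: 'https://stackoverflow.com/q/336859?rq=1', votes: '6394' },
    title: 'var functionName = function() {} vs function functionName() {}',
    url: 'https://stackoverflow.com/questions/336859/var-functionname-function-vs-function-functionname?rq=1',
  },
  {
    answer: { url: 'https://stackoverflow.com/q/40465047?rq=1', votes: '173' },
    title: 'How can I mock an ES6 module import using Jest?',
    url: 'https://stackoverflow.com/questions/40465047/how-can-i-mock-an-es6-module-import-using-jest?rq=1',
  },
];

// Hypothetical helper: returns copies with votes parsed as base-10 integers.
function toNumericVotes(items: RelatedQuestion[]) {
  return items.map((item) => ({
    ...item,
    answer: { ...item.answer, votes: Number.parseInt(item.answer.votes, 10) },
  }));
}

const numeric = toNumericVotes(related);
console.log(numeric.map((item) => item.answer.votes)); // [ 6394, 173 ]
```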

Versions

1.1.0, 4 years ago
1.0.0, 5 years ago