0.0.1 • Published 4 years ago

nescavater v0.0.1

License
MIT
Repository
gitlab

Nescavater

JSON-driven site scraper (Scavater). It is useful when you want to extract information from a site and the extraction needs no complex logic: XPath patterns for the elements that contain the information are sufficient.

Install Package

Installing the package also downloads the Chromium binary (~120 MB) that puppeteer needs for the headless engine.

npm install nescavater --save

Usage Example

If you use mongoose as the store engine, set up the connection first.

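const mongoose = require('mongoose');
// Assumption: Crawler is a named export of this package; adjust if it is exported differently.
const { Crawler } = require('nescavater');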

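// The function must be async because it uses await.
(async () => {
  // These options were written for Mongoose 5; Mongoose 6 and later removed them.
  const connectionOption = {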
    useNewUrlParser: true,
    useUnifiedTopology: true,
    useCreateIndex: true,
    useFindAndModify: false,
  };

  await mongoose.connect(process.env.MONGODB_URI, connectionOption);

  const crawler = new Crawler({
    options: { connection: mongoose.connection },
  });

  // preferably, keep this config in a JSON file
  const htmlConfig = {
    name: 'example',
    sites: [
      'https://example.com',
    ],
    engine: {
      type: 'html',
      options: {},
    },
    attributes: {
      name: {
        target: 'string',
        output: 'single',
        type: 'xpath',
        selectors: [
          {
            type: 'text',
            selector: '//x:h1[@class="page-title"]',
          },
        ],
      },
    },
  };

  const jsonConfig = JSON.stringify(htmlConfig);

  // set the config only once; it is then stored in MongoDB
  await crawler.setConfig(jsonConfig);

  const url = 'https://example.com';
  const config = await crawler.getConfigByUrl(url);
  const result = await crawler.fetch(url, config);
  console.log(result);
})();

Sample output:

{
  "name": "some extracted value"
}

Configuration

  • name: (any) -- Unique identifier of the configuration
  • sites: (array of strings) -- Site patterns that use these extraction patterns. A group of site patterns should exist only once, e.g. ["https://example.com", "https://m.example.com"].
  • engine: (shape)
    • type: (one of)
      • html: -- Lightweight engine for plain HTML sites only. For JavaScript-rendered sites, use headless instead.
      • headless: -- Heavier engine that uses puppeteer to render the site in headless mode. It works for both plain HTML and JavaScript sites.
    • options: (any of)
      • waitForXPath: -- Tells the engine to wait until the given XPath is visible before extracting (headless type only)
  • attributes: (shape)
    • target attribute key: (shape) -- The target's attribute key or value container variable.
      • target: (one of)
        • number -- Converts the value found by the engine to a number
        • string -- Converts the value found by the engine to a string
        • boolean -- Converts the value found by the engine to a boolean
      • output: (one of)
        • single: -- A single (non-array) value whose type is determined by the target
        • multiple: -- An array of values whose type is determined by the target
      • type: (one of)
        • xpath: -- Use an XPath selector
      • *selectors: (array of shape)
        • type: (one of)
          • text -- Get text value from the selected element
          • html -- Get HTML from the selected element
          • attr -- Get attribute value from the selected element
        • selector: (string) -- XPath selector of the target element
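
The interplay of target and output can be illustrated with a small sketch. This is not nescavater's implementation: convertValue and shapeOutput are hypothetical helpers, the boolean rule (only the text "true" counts as true) is an assumption, and returning null for an empty single result is an assumption as well.

```javascript
// Hypothetical helpers, not part of nescavater's API. They illustrate the
// documented `target` and `output` options on raw strings already extracted.
function convertValue(raw, target) {
  switch (target) {
    case 'number':
      return Number(raw);
    case 'boolean':
      // Assumption: only the literal text "true" (any case) maps to true.
      return String(raw).trim().toLowerCase() === 'true';
    case 'string':
      return String(raw);
    default:
      throw new Error(`Unsupported target: ${target}`);
  }
}

function shapeOutput(rawValues, { target, output }) {
  const converted = rawValues.map((raw) => convertValue(raw, target));
  if (output === 'multiple') return converted; // array of converted values
  if (output === 'single') {
    // Assumption: the first match wins, and no match yields null.
    return converted.length > 0 ? converted[0] : null;
  }
  throw new Error(`Unsupported output: ${output}`);
}

console.log(shapeOutput(['19.90', '24.50'], { target: 'number', output: 'multiple' }));
console.log(shapeOutput(['Example Domain'], { target: 'string', output: 'single' }));
```

With output set to multiple every matched value is kept; with single only one value is returned, which is why the sample output earlier shows a plain string rather than an array.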

Sample JSON config with HTML engine:

{
  "name": "example",
  "sites": [
    "https://example.com"
  ],
  "engine": {
    "type": "html",
    "options": {}
  },
  "attributes": {
    "name": {
      "target": "string",
      "output": "single",
      "type": "xpath",
      "selectors": [
        {
          "type": "text",
          "selector": "//x:h1[@class=\"page-title\"]"
        }
      ]
    }
  }
}
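
A config like this one can live in its own file and be read at start-up. The sketch below assumes a hypothetical helper name, loadConfigJson, and uses a temporary file only to keep the demo self-contained; it relies on Node's built-in fs, os and path modules and nothing from nescavater itself.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: reads a config file and returns it as a JSON string,
// the form passed to crawler.setConfig() in the usage example.
function loadConfigJson(filePath) {
  const text = fs.readFileSync(filePath, 'utf8');
  // Parse once so a malformed file fails here rather than inside the crawler.
  const parsed = JSON.parse(text);
  return JSON.stringify(parsed);
}

// Demo: a temporary file stands in for a real config path.
const demoPath = path.join(os.tmpdir(), 'nescavater-example-config.json');
fs.writeFileSync(
  demoPath,
  JSON.stringify({ name: 'example', sites: ['https://example.com'] }, null, 2),
);

const jsonConfig = loadConfigJson(demoPath);
console.log(jsonConfig);
```

The returned string could then be handed to the crawler with await crawler.setConfig(jsonConfig), as in the usage example.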

Sample JSON config with Headless engine:

{
  "name": "example",
  "sites": [
    "https://example.com"
  ],
  "engine": {
    "type": "headless",
    "options": {
      "waitForXPath": "//div[@class=\"fotorama__stage\"]"
    }
  },
  "attributes": {
    "name": {
      "target": "string",
      "output": "single",
      "type": "xpath",
      "selectors": [
        {
          "type": "text",
          "selector": "//h1[@class=\"page-title\"]"
        }
      ]
    }
  }
}
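
Before storing a config, it may help to check it against the options listed under Configuration. The following is a minimal sketch under stated assumptions, not a validator shipped with the package: validateConfig is a hypothetical name, and requiring a non-empty selectors array is an assumption.

```javascript
const ENGINE_TYPES = ['html', 'headless'];
const TARGETS = ['number', 'string', 'boolean'];
const OUTPUTS = ['single', 'multiple'];
const SELECTOR_TYPES = ['text', 'html', 'attr'];

// Hypothetical helper: returns a list of problems; an empty list means the
// config matches the documented options.
function validateConfig(config) {
  const errors = [];
  if (config.name === undefined) errors.push('name is missing');
  if (!Array.isArray(config.sites) || !config.sites.every((s) => typeof s === 'string')) {
    errors.push('sites must be an array of strings');
  }
  const engine = config.engine || {};
  if (!ENGINE_TYPES.includes(engine.type)) {
    errors.push('engine.type must be html or headless');
  }
  const options = engine.options || {};
  if (options.waitForXPath !== undefined && engine.type !== 'headless') {
    errors.push('engine.options.waitForXPath is only available for headless');
  }
  for (const [key, attr] of Object.entries(config.attributes || {})) {
    if (!TARGETS.includes(attr.target)) errors.push(`${key}.target is invalid`);
    if (!OUTPUTS.includes(attr.output)) errors.push(`${key}.output is invalid`);
    if (attr.type !== 'xpath') errors.push(`${key}.type must be xpath`);
    // Assumption: at least one selector is required per attribute.
    if (!Array.isArray(attr.selectors) || attr.selectors.length === 0) {
      errors.push(`${key}.selectors must be a non-empty array`);
      continue;
    }
    attr.selectors.forEach((sel, i) => {
      if (!SELECTOR_TYPES.includes(sel.type)) {
        errors.push(`${key}.selectors[${i}].type is invalid`);
      }
      if (typeof sel.selector !== 'string') {
        errors.push(`${key}.selectors[${i}].selector must be a string`);
      }
    });
  }
  return errors;
}

// The headless sample from this README, as a plain object.
const headlessConfig = {
  name: 'example',
  sites: ['https://example.com'],
  engine: {
    type: 'headless',
    options: { waitForXPath: '//div[@class="fotorama__stage"]' },
  },
  attributes: {
    name: {
      target: 'string',
      output: 'single',
      type: 'xpath',
      selectors: [{ type: 'text', selector: '//h1[@class="page-title"]' }],
    },
  },
};

console.log(validateConfig(headlessConfig));
```

Switching the engine type in this sample to html while keeping waitForXPath would be reported, since that option is documented as headless only.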