0.0.1 • Published 4 years ago
Nescavater
JSON-driven site scraper (Scavater). It is useful when you want to extract information from a site and the extraction needs no complex logic: XPath patterns for the elements that contain the information are sufficient.
Install Package
This also installs the Chromium binary (~120 MB) required by puppeteer, which powers the headless engine.
npm install nescavater --save
Usage Example
If you use mongoose as the store engine, set up the connection first.
const mongoose = require('mongoose');
// Assumption: Crawler is a named export of the package; adjust if it is exported differently.
const { Crawler } = require('nescavater');

// The wrapper must be async because the body uses await.
(async () => {
  const connectionOption = {
    useNewUrlParser: true,
    useUnifiedTopology: true,
    useCreateIndex: true,
    useFindAndModify: false,
  };
  await mongoose.connect(process.env.MONGODB_URI, connectionOption);
  const crawler = new Crawler({
    options: { connection: mongoose.connection },
  });
  // Preferably, keep this configuration in a JSON file.
  const htmlConfig = {
    name: 'example',
    sites: [
      'https://example.com',
    ],
    engine: {
      type: 'html',
      options: {},
    },
    attributes: {
      name: {
        target: 'string',
        output: 'single',
        type: 'xpath',
        selectors: [
          {
            type: 'text',
            selector: '//x:h1[@class="page-title"]',
          },
        ],
      },
    },
  };
  const jsonConfig = JSON.stringify(htmlConfig);
  // Set the config only once; it is stored in MongoDB afterwards.
  await crawler.setConfig(jsonConfig);
  const url = 'https://example.com';
  const config = await crawler.getConfigByUrl(url);
  const result = await crawler.fetch(url, config);
  console.log(result);
})();
Sample output:
{
  "name": "some extracted value"
}
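The usage example suggests keeping the configuration in a JSON file. A minimal sketch of that step follows; `loadConfigJson` is a hypothetical helper, not part of nescavater, and the check for the four top-level keys is only a convenience based on the fields used in the example.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of nescavater): reads a config file and
// returns the JSON string that the usage example passes to setConfig().
function loadConfigJson(filePath) {
  const raw = fs.readFileSync(filePath, 'utf8');
  const parsed = JSON.parse(raw); // throws on malformed JSON

  // Fail early if one of the top-level keys used in the example is absent.
  for (const key of ['name', 'sites', 'engine', 'attributes']) {
    if (!(key in parsed)) {
      throw new Error(`config is missing "${key}"`);
    }
  }
  return JSON.stringify(parsed);
}
```

With a helper like this, the call in the usage example could become `await crawler.setConfig(loadConfigJson(configPath))`, where `configPath` points at wherever you keep the file.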
Configuration
- name: (any) -- Unique identifier of the configuration
- sites: (array of string) -- Site patterns which will use the extraction patterns. A group of site patterns should only exist once, e.g. ["https://example.com", "https://m.example.com"].
- engine: (shape)
  - type: (one of)
    - html -- Light engine for plain HTML sites only. For JavaScript sites, use headless instead.
    - headless -- Heavy engine. It uses puppeteer to render the site in headless mode. It can be used for plain HTML or JavaScript sites.
  - options: (any of)
    - waitForXPath -- Tells the engine to wait for a certain XPath to be visible before doing the extraction (only available for the headless type)
- attributes: (shape)
  - target attribute key: (shape) -- The target's attribute key or value container variable.
    - target: (one of)
      - number -- Convert the value found by the engine into a number
      - string -- Convert the value found by the engine into a string
      - boolean -- Convert the value found by the engine into a boolean
    - output: (one of)
      - single -- Non-array value whose type is determined by the target
      - multiple -- Array value whose element type is determined by the target
    - type: (one of)
      - xpath -- Use an XPath selector
    - *selectors: (array of shape)
      - type: (one of)
        - text -- Get the text value from the selected element
        - html -- Get the HTML from the selected element
        - attr -- Get an attribute value from the selected element
      - selector: (string) -- XPath selector of the target element
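The shape described in the list above can be checked before a config is stored. The following is a rough sketch only: `validateConfig` is a hypothetical helper that is not part of nescavater, and the rule that `selectors` must be non-empty is an assumption rather than something the package documents.

```javascript
// Allowed values, taken from the Configuration list.
const ENGINE_TYPES = ['html', 'headless'];
const TARGETS = ['number', 'string', 'boolean'];
const OUTPUTS = ['single', 'multiple'];
const SELECTOR_TYPES = ['text', 'html', 'attr'];

// Hypothetical helper (not part of nescavater): returns a list of problems,
// which is empty when the config matches the documented shape.
function validateConfig(config) {
  const errors = [];
  if (config.name === undefined || config.name === null) {
    errors.push('name is required');
  }
  const sitesOk = Array.isArray(config.sites)
    && config.sites.every((site) => typeof site === 'string');
  if (!sitesOk) errors.push('sites must be an array of strings');

  const engine = config.engine || {};
  if (!ENGINE_TYPES.includes(engine.type)) {
    errors.push('engine.type must be one of: ' + ENGINE_TYPES.join(', '));
  }
  const options = engine.options || {};
  if ('waitForXPath' in options && engine.type !== 'headless') {
    errors.push('engine.options.waitForXPath is only available for headless');
  }

  for (const [key, attr] of Object.entries(config.attributes || {})) {
    if (!TARGETS.includes(attr.target)) errors.push(`${key}.target is invalid`);
    if (!OUTPUTS.includes(attr.output)) errors.push(`${key}.output is invalid`);
    if (attr.type !== 'xpath') errors.push(`${key}.type must be xpath`);
    // Assumption: at least one selector is needed for an attribute to be useful.
    const selectors = Array.isArray(attr.selectors) ? attr.selectors : [];
    if (selectors.length === 0) {
      errors.push(`${key}.selectors must be a non-empty array`);
    }
    selectors.forEach((sel, i) => {
      if (!SELECTOR_TYPES.includes(sel.type)) {
        errors.push(`${key}.selectors[${i}].type is invalid`);
      }
      if (typeof sel.selector !== 'string') {
        errors.push(`${key}.selectors[${i}].selector must be a string`);
      }
    });
  }
  return errors;
}
```

Returning a list of messages instead of throwing on the first problem makes it easier to fix a hand-written JSON file in one pass.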
Sample JSON config with HTML engine:
{
  "name": "example",
  "sites": [
    "https://example.com"
  ],
  "engine": {
    "type": "html",
    "options": {}
  },
  "attributes": {
    "name": {
      "target": "string",
      "output": "single",
      "type": "xpath",
      "selectors": [
        {
          "type": "text",
          "selector": "//x:h1[@class=\"page-title\"]"
        }
      ]
    }
  }
}
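To illustrate how `target` and `output` in a config like this one might shape the extracted result, here is a small sketch. It is not the package's implementation: `convertValue` and `shapeOutput` are hypothetical names, and the boolean rule (any non-empty text counts as true) and the `null` fallback for a missing match are guesses.

```javascript
// Hypothetical helper: converts one raw extracted string to the target type.
function convertValue(raw, target) {
  switch (target) {
    case 'number':
      return Number(raw);
    case 'boolean':
      // Assumption: any non-empty text counts as true.
      return Boolean(raw);
    case 'string':
      return String(raw);
    default:
      throw new Error(`unknown target: ${target}`);
  }
}

// Hypothetical helper: applies the output mode to all values found by the
// selectors. "multiple" keeps the array, "single" keeps the first value.
function shapeOutput(rawValues, target, output) {
  const converted = rawValues.map((value) => convertValue(value, target));
  if (output === 'multiple') {
    return converted;
  }
  // Assumption: no match yields null in single mode.
  return converted.length > 0 ? converted[0] : null;
}
```

For instance, `shapeOutput(['42', '7'], 'number', 'multiple')` gives an array of numbers, while the same input with `'single'` gives only the first number.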
Sample JSON config with Headless engine:
{
  "name": "example",
  "sites": [
    "https://example.com"
  ],
  "engine": {
    "type": "headless",
    "options": {
      "waitForXPath": "//div[@class=\"fotorama__stage\"]"
    }
  },
  "attributes": {
    "name": {
      "target": "string",
      "output": "single",
      "type": "xpath",
      "selectors": [
        {
          "type": "text",
          "selector": "//h1[@class=\"page-title\"]"
        }
      ]
    }
  }
}
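The `sites` field is what ties a URL to a configuration, which the usage example retrieves with `getConfigByUrl`. How the package matches site patterns is not documented here, so the sketch below simply assumes a match on the URL origin (scheme, host and port). `findConfigForUrl` is a hypothetical in-memory stand-in, not the package's lookup.

```javascript
// Hypothetical stand-in for a config lookup. Assumption: a site pattern
// matches when it has the same origin as the requested URL.
function findConfigForUrl(configs, url) {
  const origin = new URL(url).origin;
  const match = configs.find((config) =>
    config.sites.some((site) => new URL(site).origin === origin));
  return match || null;
}

// Two illustrative configs; the second name and site are made up.
const configs = [
  { name: 'example', sites: ['https://example.com', 'https://m.example.com'] },
  { name: 'other', sites: ['https://shop.example.org'] },
];

const found = findConfigForUrl(configs, 'https://m.example.com/products/1');
console.log(found ? found.name : 'no config for this URL');
```

Because each group of site patterns should only exist once, returning the first match is enough in this sketch.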