0.0.7 • Published 7 years ago
MrCrowley
Retrieve data from different websites using html elements to gather the information you need.
Installation
- Install node
npm install -g mrcrowley
mrcrowley --config="/home/user/.crawl.json" --save="/home/user/crawlResults.json"
Usage
Core usage
I still have to document how you can require and use the core directly, but you can do it, and the results are based on promises.
CLI
Set up a .crawl.json and pass it to mrcrowley to run all the tasks you want.
Note: Any path should be absolute or relative to the directory from which the script is called.
mrcrowley --config=<config_json_src> --output=<file_to_save_src> --force=<false|true>
Notes:
- <config_json_src>: Path to the config JSON for crawling. Required.
- <file_to_save_src>: Path to the file where the results are saved. For now, only json is supported. Required.
- force: Forces the creation of a new output file. If false and the output file exists, the file is just updated. Defaults to false.
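The path note above can be taken care of before the CLI is invoked. The following is a minimal sketch, assuming Node.js is used to launch the command; the helper name buildMrcrowleyArgs is hypothetical and not part of mrcrowley, and only the three flags documented above are used.

```javascript
const path = require("path");

// Hypothetical helper (not part of mrcrowley): builds the argument list for
// the documented flags, resolving both paths to absolute ones first.
function buildMrcrowleyArgs({ config, output, force = false }) {
  if (!config || !output) {
    throw new Error("Both config and output are required");
  }
  return [
    `--config=${path.resolve(config)}`,
    `--output=${path.resolve(output)}`,
    `--force=${force ? "true" : "false"}`,
  ];
}

// Example: print the command line that would be run.
const args = buildMrcrowleyArgs({ config: ".crawl.json", output: "results.json" });
console.log(["mrcrowley", ...args].join(" "));
```

The resulting array could then be handed to Node's child_process.spawn("mrcrowley", args, { stdio: "inherit" }) if the package is installed globally.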
Configuration
{
  "projectId": "<project_id>",
  "projectName": "<project_name>",
  "data": [{
    "src": "<url_path>",
    "name": "<request_name>",
    "throttle": 2000,
    "enableJs": false,
    "waitFor": "<html_selector>",
    "wait": {
      "selector": "<html_selector>",
      "for": 5000
    },
    "modifiers": {
      "<query_var_in_url>": ["<var_to_replace>"]
    },
    "retrieve": {
      "<name>": {
        "selector": "<html_selector>",
        "attribute": "<attribute_to_retrieve>",
        "ignore": ["<regex_pattern_to_ignore>"]
      }
    }
  }]
}
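To make the template more concrete, here is a small sketch that fills in a few of its keys and serializes the result as it might be saved to .crawl.json. Every value (the project names, the example.com URL, the selectors and the ignore pattern) is a placeholder chosen for illustration, not something taken from the project.

```javascript
// All values below are placeholders chosen for illustration.
const config = {
  projectId: "example-project",
  projectName: "Example project",
  data: [
    {
      src: "https://example.com/",
      name: "front-page",
      throttle: 2000,
      enableJs: false,
      retrieve: {
        // No "attribute" key: text content is what gets returned.
        headings: { selector: "h1" },
        // With "attribute": the href value is retrieved instead.
        // The "ignore" pattern is meant to skip in-page anchors.
        links: { selector: "a", attribute: "href", ignore: ["^#"] },
      },
    },
  ],
};

// Serialize to the JSON text that would go into .crawl.json.
const serialized = JSON.stringify(config, null, 2);
console.log(serialized);
```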
Notes:
- retrieve: Besides the simplified version, you may also nest it to get contained data:
  {
    "src": "...",
    "retrieve": {
      "<name>": {
        "selector": "<parent_html_selector>",
        "retrieve": {
          "<name>": {
            "selector": "<child_html_selector>",
            "attribute": "<attribute_to_retrieve>",
            "ignore": ["<regex_pattern_to_ignore>"]
          }
        }
      }
    }
  }
- attribute: If not provided, text content will be returned. Optional key.
- ignore: Ignores results that match a regex pattern. Optional key.
- enableJs: JavaScript isn't enabled by default for security reasons. Use this only if you really need it.
- wait: Usually used with enableJs. If the source uses JavaScript to render, you may wait for the selector to be present.
- <var_to_replace>: It can also be an object with the keys min (defaults to 0) and max (defaults to 10):
  {
    "src": "...",
    "modifiers": {
      "<query_var_in_url>": ["<var_to_replace>"],
      "<limit_var_in_url>": [{ "min": 0, "max": 10 }]
    },
    "retrieve": {}
  }
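The ignore and min/max notes above can be illustrated with a rough sketch. This is not mrcrowley's implementation: both helper names are hypothetical, treating max as inclusive is an assumption, and so is matching the ignore pattern anywhere in the result text.

```javascript
// Hypothetical helper: expands one modifier list into concrete values.
// Plain values are kept as strings; { min, max } objects become integers,
// using the documented defaults of 0 and 10.
// Assumption: max is treated as inclusive here.
function expandModifierValues(values) {
  const result = [];
  for (const value of values) {
    if (value !== null && typeof value === "object") {
      const min = value.min ?? 0;
      const max = value.max ?? 10;
      for (let i = min; i <= max; i += 1) {
        result.push(String(i));
      }
    } else {
      result.push(String(value));
    }
  }
  return result;
}

// Hypothetical helper: drops results that match any "ignore" pattern.
// Assumption: a pattern may match anywhere in the text (RegExp.test).
function applyIgnore(results, patterns = []) {
  const regexes = patterns.map((pattern) => new RegExp(pattern));
  return results.filter((text) => !regexes.some((regex) => regex.test(text)));
}

console.log(expandModifierValues(["news", { min: 1, max: 3 }]));
// prints [ 'news', '1', '2', '3' ]
console.log(applyIgnore(["Home", "#top", "About"], ["^#"]));
// prints [ 'Home', 'About' ]
```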
Examples
Go to the src/_test/data folder and check the *.json files.