# microcrawler

v0.1.30 • Published 10 years ago
## Status

## Screenshots
## Available Official Crawlers

List of official, publicly available crawlers.

Missing something? Feel free to open an issue.
- craiglist.com - microcrawler-crawler-craiglist.com
- firmy.cz - microcrawler-crawler-firmy.cz
- google.com - microcrawler-crawler-google.com
- news.ycombinator.com - microcrawler-crawler-news.ycombinator.com
- sreality.cz - microcrawler-crawler-sreality.cz
- xkcd.com - microcrawler-crawler-xkcd.com
- yelp.com - microcrawler-crawler-yelp.com
- youjizz.com - microcrawler-crawler-youjizz.com
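The package names above follow a single pattern, `microcrawler-crawler-<site>`. Assuming each crawler installs like any other npm package (an assumption, not documented behaviour), fetching one could look like this sketch:

```shell
# Hypothetical: derive the official crawler package name for a site.
# The naming pattern is taken from the list above.
SITE="xkcd.com"
PKG="microcrawler-crawler-${SITE}"
echo "$PKG"
# → microcrawler-crawler-xkcd.com

# npm install -g "$PKG"   # assumption: installs like any npm package
```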
## Prerequisites

## Installation

### From npmjs.org (the easy way)

This is the easiest way. The prerequisites still need to be satisfied.

```shell
npm install -g microcrawler
```

### From Sources
This is useful if you want to tweak the source code, implement a new crawler, etc.
```shell
# Clone the repository
git clone https://github.com/ApolloCrawler/microcrawler.git

# Enter the folder
cd microcrawler

# Install required packages (dependencies)
npm install

# Install from local sources
npm install -g .
```

## Usage
### Show available commands

```
$ microcrawler

  Usage: microcrawler [options] [command]

  Commands:

    collector [args]  Run data collector
    config [args]     Run config
    exporter [args]   Run data exporter
    worker [args]     Run crawler worker
    crawl [args]      Crawl specified site
    help [cmd]        display help for [cmd]

  Options:

    -h, --help     output usage information
    -V, --version  output the version number
```

### Check microcrawler version
```
$ microcrawler --version
0.1.27
```

### Initialize config file
```
$ microcrawler config init
2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
{
  "client": "superagent",
  "timeout": 10000,
  "throttler": {
    "enabled": false,
    "active": true,
    "rate": 20,
    "ratePer": 1000,
    "concurrent": 8
  },
  "retry": {
    "count": 2
  },
  "headers": {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "From": "googlebot(at)googlebot.com"
  },
  "proxy": {
    "enabled": false,
    "list": [
      "https://168.63.20.19:8145"
    ]
  },
  "natFaker": {
    "enabled": true,
    "base": "192.168.1.1",
    "bits": 16
  },
  "amqp": {
    "uri": "amqp://localhost",
    "queues": {
      "collector": "collector",
      "worker": "worker"
    },
    "options": {
      "heartbeat": 60
    }
  },
  "couchbase": {
    "uri": "couchbase://localhost:8091",
    "bucket": "microcrawler",
    "username": "Administrator",
    "password": "Administrator",
    "connectionTimeout": 60000000,
    "durabilityTimeout": 60000000,
    "managementTimeout": 60000000,
    "nodeConnectionTimeout": 10000000,
    "operationTimeout": 10000000,
    "viewTimeout": 10000000
  },
  "elasticsearch": {
    "uri": "localhost:9200",
    "index": "microcrawler",
    "log": "debug"
  }
}
```

### Edit config file
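Hand-editing JSON is error-prone, so it can be worth checking that the file is still well-formed before running any command. A minimal sketch, assuming `python3` is on the PATH and the default config location created by `config init`:

```shell
# Check that the config file is still well-formed JSON.
CONFIG="${CONFIG:-$HOME/.microcrawler/config.json}"
if [ -f "$CONFIG" ] && python3 -m json.tool "$CONFIG" > /dev/null 2>&1; then
  STATUS="ok"
else
  STATUS="missing or invalid"
fi
echo "config: $STATUS"
```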
```shell
$ vim ~/.microcrawler/config.json
```

### Show config file
```
$ microcrawler config show
{
  "client": "superagent",
  "timeout": 10000,
  "throttler": {
    "enabled": false,
    "active": true,
    "rate": 20,
    "ratePer": 1000,
    "concurrent": 8
  },
  "retry": {
    "count": 2
  },
  "headers": {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "From": "googlebot(at)googlebot.com"
  },
  "proxy": {
    "enabled": false,
    "list": [
      "https://168.63.20.19:8145"
    ]
  },
  "natFaker": {
    "enabled": true,
    "base": "192.168.1.1",
    "bits": 16
  },
  "amqp": {
    "uri": "amqp://example.com",
    "queues": {
      "collector": "collector",
      "worker": "worker"
    },
    "options": {
      "heartbeat": 60
    }
  },
  "couchbase": {
    "uri": "couchbase://example.com:8091",
    "bucket": "microcrawler",
    "username": "Administrator",
    "password": "Administrator",
    "connectionTimeout": 60000000,
    "durabilityTimeout": 60000000,
    "managementTimeout": 60000000,
    "nodeConnectionTimeout": 10000000,
    "operationTimeout": 10000000,
    "viewTimeout": 10000000
  },
  "elasticsearch": {
    "uri": "example.com:9200",
    "index": "microcrawler",
    "log": "debug"
  }
}
```

### Start Couchbase
TBD
### Start Elasticsearch

TBD

### Start Kibana

TBD

### Query Elasticsearch

TBD
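Until this section is filled in, here is a minimal sketch of how one might query the index defined in the config above, assuming Elasticsearch is listening on `localhost:9200` and crawled documents went into the `microcrawler` index:

```shell
# Hypothetical query; values taken from the "elasticsearch" section of config.json.
ES_URI="localhost:9200"
ES_INDEX="microcrawler"
QUERY_URL="http://${ES_URI}/${ES_INDEX}/_search?q=*&size=10"
echo "$QUERY_URL"

# Run the actual query once Elasticsearch is up:
# curl -s "$QUERY_URL"
```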
## Example usage

### Craiglist

```shell
microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/
```

### Firmy.cz

```shell
microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="
```

### Google

```shell
microcrawler crawl google.index http://google.com/search?q=Buena+Vista
```

### Hacker News

```shell
microcrawler crawl hackernews.index https://news.ycombinator.com/
```

### xkcd

```shell
microcrawler crawl xkcd.index http://xkcd.com
```

### Yelp

```shell
microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"
```

### Youjizz

```shell
microcrawler crawl youjizz.com.index http://youjizz.com
```

## Credits
- @pavelbinar - for QA and much more.