2.0.2 • Published 8 years ago
bot-marvin
A highly scalable, distributed web crawler.
Feature list:
- Asynchronous crawling
- Distributed breadth-first crawls
- Scales horizontally as well as vertically
- URL partitioning for better scheduling
- Scheduling using fetch interval and priority
- Supports robots.txt and sitemap.xml parsing
- Uses Apache Tika for file parsing
- Web app for viewing crawled data and analytics
- Fault tolerant, with automatic recovery on failures
- Broad support for meta tags and HTTP status codes
- Supports the tags recommended by Google's crawl guide
- Creates web graph
- Collects RSS feeds and author info
- Pluggable parsers
- Pluggable indexers (currently MongoDB supported)
Install
sudo npm install bot-marvin
Starting your first crawl
//You need to create a seed.json file first
//it looks like this
[
{
"_id": "http://www.imdb.com",
"parseFile": "nutch",
"priority": 1,
"fetch_interval": "monthly",
"limit_depth": -1
},
{
"_id": "http://www.elastic.co",
"parseFile": "nutch",
"priority": 1,
"fetch_interval": "monthly",
"limit_depth": -1
},
{
"_id": "http://www.rottentomatoes.com",
"parseFile": "nutch",
"priority": 1,
"fetch_interval": "monthly",
"limit_depth": 10
}
]
/*
_id            : the seed URL
parseFile      : name of a parser file in the parsers dir (default: 'nutch')
priority       : 1-100, the percentage of a single crawl batch allotted to this domain's URLs.
                 Number of URLs of a domain in a batch = (priority / 100) * batch_size
fetch_interval : the recrawl interval; supported values are always|weekly|monthly|yearly.
                 Custom time intervals can be added in the config.
limit_depth    : restricts crawling by depth; -1 means no depth limit
*/
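The priority formula above can be made concrete with a small calculation. This is only a sketch: the function name `urlsPerBatch` and the choice to round down are assumptions for illustration, not bot-marvin's actual code.

```javascript
// Sketch only: `urlsPerBatch` and the Math.floor rounding are assumptions,
// not taken from bot-marvin's source.
function urlsPerBatch(priority, batchSize) {
  if (!Number.isInteger(priority) || priority < 1 || priority > 100) {
    throw new RangeError('priority must be an integer from 1 to 100');
  }
  if (!Number.isInteger(batchSize) || batchSize < 0) {
    throw new RangeError('batchSize must be a non-negative integer');
  }
  // Same value as (priority / 100) * batch_size; multiplying first keeps
  // the intermediate result an integer and avoids floating-point drift.
  return Math.floor((priority * batchSize) / 100);
}

console.log(urlsPerBatch(1, 1000));  // 10
console.log(urlsPerBatch(25, 1000)); // 250
```

With a batch size of 1000, a seed with priority 1 would contribute about 10 URLs per batch under this reading of the formula.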
# Step 1 Set your db configuration
sudo bot-marvin-db
# Step 2 Set your bot config
sudo bot-marvin --config
# Step 3 Load your seed file
sudo bot-marvin --loadSeedFile <path_to_your_seed_file>
# Step 4 Run your crawler
sudo bot-marvin
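The seed file passed to --loadSeedFile in Step 3 can also be generated from a script instead of written by hand. The sketch below uses only the fields and default values shown in the sample seed.json. The helper `makeSeed`, its validation rules, and the output location (the OS temp directory) are hypothetical choices for illustration and are not part of bot-marvin.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Built-in intervals named in this README. Custom intervals added in the
// bot config would need to be appended here (this check is an assumption).
const FETCH_INTERVALS = ['always', 'weekly', 'monthly', 'yearly'];

// Hypothetical helper: builds one seed entry with the sample file's defaults.
function makeSeed(url, overrides = {}) {
  const seed = Object.assign(
    { _id: url, parseFile: 'nutch', priority: 1, fetch_interval: 'monthly', limit_depth: -1 },
    overrides
  );
  new URL(seed._id); // throws a TypeError if the URL is malformed
  if (!Number.isInteger(seed.priority) || seed.priority < 1 || seed.priority > 100) {
    throw new RangeError('priority must be an integer from 1 to 100');
  }
  if (!FETCH_INTERVALS.includes(seed.fetch_interval)) {
    throw new RangeError('unknown fetch_interval: ' + seed.fetch_interval);
  }
  // Assumption: -1 (no limit) or any non-negative integer is acceptable.
  if (!Number.isInteger(seed.limit_depth) || seed.limit_depth < -1) {
    throw new RangeError('limit_depth must be -1 or a non-negative integer');
  }
  return seed;
}

const seeds = [
  makeSeed('http://www.imdb.com'),
  makeSeed('http://www.rottentomatoes.com', { limit_depth: 10 }),
];

// Written to the temp directory here; any path can be given to --loadSeedFile.
const seedPath = path.join(os.tmpdir(), 'seed.json');
fs.writeFileSync(seedPath, JSON.stringify(seeds, null, 2));
console.log('Wrote ' + seeds.length + ' seeds to ' + seedPath);
```

Validating entries before loading them catches typos such as an out-of-range priority early, before the crawler is started.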
Contributing
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D
Documentation is available at http://tilakpatidar.github.io/bot-marvin
Stuff used to make this:
- request for making HTTP requests
- mongodb for MongoDB connectivity
- underscore for JS utility functions
- immutable for advanced JS data structures
- check-types for strict type checking
- cheerio for parsing HTML pages
- robots for parsing robots.txt files
- colors for colored console output
- crypto for encryption
- death for handling graceful exit
- minimist for command-line argument parsing
- progress for download progress bars
- string-editor for a nano-like editor to edit the config from the terminal
- node-static for serving the web app
- feed-read for parsing RSS feeds