gumo v1.0.7 • MIT license • Repository: GitHub • Last release: 2 years ago

πŸ•ΈοΈGumo

"Gumo" (θœ˜θ››) is Japanese for "spider".


Overview 👓

A web crawler (get it?) and scraper that extracts data from a family of nested, dynamic web pages, with added enhancements to assist knowledge-mining applications. Written in Node.js.


Features 🌟

  • Crawl hyperlinks present on the pages of any domain and its subdomains.
  • Scrape meta-tags and body text from every page.
  • Store entire sitemap in a GraphDB (currently supports Neo4J).
  • Store page content in ElasticSearch for easy full-text lookup.

Installation 🏗️

Install the package from npm:

    npm install gumo

Usage 👨‍💻

From code:

// 1: import the module
const gumo = require('gumo')

// 2: instantiate the crawler
let cron = new gumo()

// 3: call the configure method and pass the configuration options
cron.configure({
    'neo4j': { // replace with your details or remove if not required
        'url' : 'neo4j://localhost',
        'user' : 'neo4j',
        'password' : 'gumo123'
    },
    'elastic': { // replace with your details or remove if not required
        'url' : 'http://localhost:9200',
        'index' : 'myIndex'
    },
    'crawler': {
        'url': 'https://www.example.com',
    }
});

// 4: start crawling
cron.insert()

Note: The config params passed to cron.configure above are the default values. Please refer to the Configuration section below to learn more about the customization options that are available.

Configuration ⚙️

The behavior of the crawler can be customized by passing a custom configuration object to the configure() method. The following attributes can be configured:

| Attribute (* = mandatory) | Type | Accepted Values | Description | Default Value | Default Behavior |
|---|---|---|---|---|---|
| * crawler.url | string | | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string | | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes"/"No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes"/"No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second | 5000 | |
| crawler.maxConcurrentRequests | int | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000 | |
| crawler.whiteList | Array(string) | | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) | | If populated, these URLs will be ignored | [] (empty array) | |
| crawler.depth | int | range: 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 | |
| * elastic.url | string | | URI of the ElasticSearch instance to connect to | "http://localhost:9200" | |
| * elastic.index | string | | The name of the ElasticSearch index to store results in | "myIndex" | |
| * neo4j.url | string | | The URI of a running Neo4j instance (uses the Bolt driver to connect) | "neo4j://localhost" | |
| * neo4j.user | string | | Neo4j server username | "neo4j" | |
| * neo4j.password | string | | Neo4j server password | "gumo123" | |
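As a sketch, a customized configuration combining the attributes above might look like the following. The option names come from the table; the specific values and the blacklisted URL are illustrative only:

```javascript
// Illustrative gumo configuration built from the documented attributes.
// Values here are examples, not recommendations.
const config = {
    crawler: {
        url: 'https://www.example.com',   // mandatory: base URL to start scanning from
        saveOutputAsJson: 'Yes',          // also write scraped pages to output/json/
        maxRequestsPerSecond: 100,        // throttle well below the 5000 default
        maxConcurrentRequests: 50,
        blackList: ['https://www.example.com/login'], // example URL to skip
        depth: 2                          // follow nested hyperlinks two levels deep
    },
    elastic: {
        url: 'http://localhost:9200',
        index: 'myIndex'
    },
    neo4j: {
        url: 'neo4j://localhost',
        user: 'neo4j',
        password: 'gumo123'
    }
};

module.exports = config; // pass to the crawler with cron.configure(config)
```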

ElasticSearch ⚡

Each page's content is stored along with its URL and a hash. The ElasticSearch index is selected via the index attribute of the configuration (elastic.index). If the index already exists in ElasticSearch it is reused; otherwise it is created. Each page is indexed as a document of the shape:

id: hash, index: config.index, type: 'pages', body: JSON.stringify(page content)

GraphDB ☋

The sitemap of all the traversed pages is stored in a convenient graph. The following structure of nodes and relationships is followed:

Nodes

  • Label: Page
  • Properties:
| Property Name | Type | Description |
|---|---|---|
| pid | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |

Relationships

| Name | Direction | Condition |
|---|---|---|
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
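Given that structure, a lookup of all pages linked from a given page could be sketched as the following Cypher query. It is held as a plain string here; running it with the official neo4j-driver (an assumption, not something gumo exposes) is shown only in a comment, since this snippet does not open a database connection:

```javascript
// Cypher query over the sitemap graph described above:
// find every page that a given page links to.
const cypher = `
    MATCH (a:Page)-[r:links_to]->(b:Page)
    WHERE a.link = $link
    RETURN b.pid, b.link, b.title
`;

// With the official neo4j-driver this could run as (sketch):
// const session = driver.session();
// const result = await session.run(cypher, { link: 'https://www.example.com' });
```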

TODO ☑️

  • Make it executable from CLI
  • Enable to send config parameters while invoking the gumo
  • Write more tests