0.0.3 • Published 7 years ago

the-scraping-machine v0.0.3

Weekly downloads
2
License
GPL-3.0
Repository
github


The scraping machine

Under development - More news soon


This is just the beginning of a long journey

Let's make web scraping fun again!

From a JSON config file, you can generate a web scraping script and see its output.

Concept

  1. Define your needs in a JSON file, like demo.json
  2. Then execute node index demo.json to start the process in index.js
    • First it validates the arguments and data
    • Then it decides which language to use. For now only Python 3+ (Beautiful Soup) and Node.js (X-ray) are supported
    • Then it renders all the info into a Handlebars template, like templates/python.hbs or templates/node.hbs
  3. The script file is generated, like google.py or google.js
  4. Node executes the script as a child process, which generates the final output, like google.json
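
As a rough illustration of steps 2 to 4, the sketch below validates a parsed config, works out which template and file names a run would use, and shows how the generated script could be launched as a child process. It is not the code in index.js: the function names (validateConfig, planRun, runScript), the validation rules and the python3 interpreter name are all assumptions made for this example. Template rendering itself is left out.

```javascript
'use strict';

const path = require('path');
const { execFile } = require('child_process');

// Template paths and file extensions come from the steps above.
// The interpreter commands are assumptions for this sketch.
const LANGUAGES = {
  python: { template: 'templates/python.hbs', extension: '.py', command: 'python3' },
  node: { template: 'templates/node.hbs', extension: '.js', command: 'node' },
};

// Step 2, first bullet: minimal validation of the parsed JSON config.
// Returns a list of error messages; an empty list means the config is usable.
function validateConfig(config) {
  if (!config || typeof config !== 'object') {
    return ['config must be an object'];
  }
  const errors = [];
  if (typeof config.url !== 'string' || config.url === '') {
    errors.push('url is required');
  }
  if (typeof config.file_name !== 'string' || config.file_name === '') {
    errors.push('file_name is required');
  }
  if (!Array.isArray(config.data) || config.data.length === 0) {
    errors.push('data must be a non-empty array');
  }
  return errors;
}

// Steps 2 and 3: decide which template to render and which files a run produces.
function planRun(config, language) {
  const settings = LANGUAGES[language];
  if (!settings) {
    throw new Error(`Unsupported language: ${language}`);
  }
  return {
    command: settings.command,
    template: settings.template,
    script: config.file_name + settings.extension,
    output: config.file_name + '.json',
  };
}

// Step 4: run the generated script as a child process.
// The callback receives (error, stdout, stderr), as with any execFile call.
function runScript(plan, callback) {
  execFile(plan.command, [path.resolve(plan.script)], callback);
}

module.exports = { validateConfig, planRun, runScript };
```

With the demo config, planRun(config, 'python') would point at templates/python.hbs, google.py and google.json, matching the file names mentioned in the steps.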

Demo

Inside demo.json:

{
	"source_type": "url",
	"url": "http://google.es",
	"file_name": "google",
	"data": [
		{
			"name": "web-title",
			"type": "selector",
			"query": "title"
		}, {
			"name": "web2",
			"type": "selector",
			"query": "title"
		}
	]
}
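
To make the structure of the data array more concrete, here is a small sketch that reduces it to a plain name-to-query map. The helper name buildSelectorMap is hypothetical and is not part of the project. Ignoring entries whose type is not "selector" is also an assumption, since "selector" is the only type shown in the demo.

```javascript
'use strict';

// Turn the "data" array of a config into a { name: query } map.
// Only entries of type "selector" are kept in this sketch.
function buildSelectorMap(config) {
  const selectors = {};
  for (const item of config.data) {
    if (item.type === 'selector') {
      selectors[item.name] = item.query;
    }
  }
  return selectors;
}

// Same content as demo.json above.
const demoConfig = {
  source_type: 'url',
  url: 'http://google.es',
  file_name: 'google',
  data: [
    { name: 'web-title', type: 'selector', query: 'title' },
    { name: 'web2', type: 'selector', query: 'title' },
  ],
};

console.log(buildSelectorMap(demoConfig));
// { 'web-title': 'title', web2: 'title' }
```

A map of this shape is close to what a template would need in order to emit one lookup per named field, whichever target language is chosen.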

Start the machine

  • For Python script output:
    node index.js demo.json
    node index.js demo.json python
  • For Node script output:
    node index.js demo.json js
    node index.js demo.json node
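
The commands above suggest that the second argument selects the target language, with Python as the default. A minimal sketch of that argument handling could look like the following. The names resolveLanguage and parseArgs and the error messages are assumptions for illustration, and the actual index.js may handle this differently.

```javascript
'use strict';

// Aliases taken from the commands above: "js" and "node" both mean Node.js.
const LANGUAGE_ALIASES = new Map([
  ['python', 'python'],
  ['js', 'node'],
  ['node', 'node'],
]);

// Map the optional language argument to a target language.
// No argument falls back to Python, as in "node index.js demo.json".
function resolveLanguage(arg) {
  if (arg === undefined) {
    return 'python';
  }
  const language = LANGUAGE_ALIASES.get(arg);
  if (!language) {
    throw new Error(`Unknown language "${arg}". Use python, js or node.`);
  }
  return language;
}

// argv has the shape of process.argv: [node, index.js, demo.json, language]
function parseArgs(argv) {
  const [configPath, languageArg] = argv.slice(2);
  if (!configPath) {
    throw new Error('Usage: node index.js <config.json> [python|js|node]');
  }
  return { configPath, language: resolveLanguage(languageArg) };
}

module.exports = { resolveLanguage, parseArgs };
```

Calling parseArgs(process.argv) at the top of an entry script would then yield the config path and the resolved language in one step.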

Output

[
    {
        "web-title": "Google",
        "web2": "Google"
    }
]

Testing

You can test your changes with:

npm test

Future Implementations

  • Support for Node.js (X-Ray).
  • Support for CSS3 Selectors.
  • Support for recursive queries.
  • Support for "follow links", like a crawler.
  • Implementation as CLI
  • Basic Testing
  • ESLint Support
  • JSDoc Support
  • Basic Gulp Tasks
  • Example Folder

Achievements

v.0.0.3

Features:

  • Added support for JSDoc
  • Added Gulp Tasks
  • Added Basic Testing with Mocha, Chai and Istanbul
  • Added .editorconfig
  • Added ESLint support
  • Added example folder
  • Added support for Node.js

Notes: Main target: Improved Proof of concept

v.0.0.2

Features:

  • Roadmap added
  • Added file structure
  • Defined a minimal JSON structure
  • Added minimal validation
  • Added a template engine
  • Added support for Python
  • Added dynamic information from the setup config file

Notes: Main target: Proof of concept

v.0.0.1

Features:

Notes: Just a "Hello world"