als-scraper v0.2.1
Als-Scraper
If something is wrong or not working properly, please write to me at: sh.mashkanta@gmail.com
Als-scraper is a library with 3 classes:
1. Document - takes html text, builds a DOM tree and lets you query elements (similar to Cheerio, but different)
2. Request - sends web requests (get/post/put/delete)
3. Scraper - sends a request (with Request) and grabs elements or their properties (with Document)
What's new?
- classList methods (add, remove)
- id is not in attributes
- new method build(filePath) for building html from the DOM tree
- Some minor fixes
Document
Document is a class which takes html as a string and returns a new object with a DOM tree. Using this DOM tree, a Document object can select elements and their properties, either as collections or as single elements. Collections and elements have additional methods for selecting.
Creating new object
const { readFileSync } = require("fs")
let html = readFileSync('test.html','utf-8')
let {Document} = require('als-scraper')
let document = new Document(html)
QuerySelector for single element
Once the document object has been created, you can select elements or collections.
To select a single element, use $(selector); to select a collection, use $$(selector).
Selecting element
document.$('div') // select first div in document
document.$('div.some') // select first div element with class some
At this time, the selector supports the following:
- Selects all elements - *
- element - div
- class - .some-class
- id - #some-id
- attribute - [some-attribute="some value"], [prop], [prop~=value], [prop|=value], [prop^="value"], [prop$="value"], [prop*="value"]
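The attribute forms listed above use the standard CSS operators. As a reminder of what each operator means, here is a small sketch. It follows the CSS selector rules, not als-scraper's source code, and matchAttribute is a hypothetical helper name that is not part of the library.

```javascript
// Standard CSS meaning of each attribute operator.
// Hypothetical helper for illustration; not taken from als-scraper.
function matchAttribute(actual, operator, expected) {
  if (actual === undefined || actual === null) return false // attribute is absent
  if (operator === undefined) return true                   // [prop]: presence only
  const value = String(actual)
  switch (operator) {
    case '=':  return value === expected                    // exact match
    case '~=': return expected !== '' && value.split(/\s+/).includes(expected) // one whitespace-separated word
    case '|=': return value === expected || value.startsWith(expected + '-')   // exact, or prefix followed by "-"
    case '^=': return expected !== '' && value.startsWith(expected)            // starts with
    case '$=': return expected !== '' && value.endsWith(expected)              // ends with
    case '*=': return expected !== '' && value.includes(expected)              // contains
    default:   throw new Error('Unknown operator: ' + operator)
  }
}
```

For instance, matchAttribute('btn primary', '~=', 'primary') is true because primary is one of the space-separated words, while matchAttribute('english', '|=', 'en') is false because the value is neither en nor starts with en-.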
Selectors that combine multiple elements are not supported yet (planned for a future version). This means the following won't work:
- div p
- div > p
- div + p
- p ~ ul
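Until combinators are supported, the descendant (div p) and child (div > p) forms can be approximated with two simple selectors. The sketch below is a possible workaround, not a library feature. selectNested is a hypothetical helper, and it assumes that an element's $$(selector) searches inside that element and that parent points to the direct parent.

```javascript
// Emulates "outer inner" (descendant) and, with directOnly, "outer > inner" (child)
// using only simple selectors. Hypothetical helper, not part of als-scraper.
function selectNested(root, outer, inner, directOnly = false) {
  const found = new Set() // a Set avoids duplicates when outer elements are nested
  for (const container of root.$$(outer)) {
    // Assumption: container.$$ only returns elements inside container
    for (const el of container.$$(inner)) {
      if (!directOnly || el.parent === container) found.add(el)
    }
  }
  return [...found]
}
```

With a Document object this would be called as selectNested(document, 'div', 'p') for div p, and selectNested(document, 'div', 'p', true) for div > p.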
Each returned element has the following:
element = {
parent, // parent element
prev, // previous element (null if none exists)
next, // next element (null if none exists)
innerText, // inner text of the element and its childNodes, separated by |
children, // array of childNodes (elements and text nodes)
tagName, // tag name of element
id, // id of element if exists
attributes, // object of attributes (id not included)
classList, // array of classes and add and remove methods
$(selector),
$$(selector),
}
Text node has the following:
textElement = {
text,
prev,
next
}
You can add or remove classes with the classList methods. Example:
let element = document.$('div')
element.classList.remove('some')
element.classList.add('another')
element.classList.add('onemore')
You can also change an element's id:
let element = document.$('div')
element.id = 'new-id'
QuerySelector for Collection $$()
To select several elements, use the $$(selector) method.
document.$$('div') // return collection of all div elements
The collection is an array which holds the elements and has two methods: each and parse.
The each method takes a callback function with 3 parameters: the element itself, the index of the element in the collection, and the collection itself.
Here is an example:
let array = []
document.$$('div').each((element,index,collection) => {
if(element.innerText.includes('some text'))
array.push(element)
})
The parse method takes two parameters, part and fn, and returns an array with the results.
part is a part of the element. It can be innerText, id, tagName or any property inside attributes.
fn is a filter function which receives the content of part. If it returns true, the content is included.
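To picture what parse does, the description above can be written as a plain function over an array of elements. This is only an illustration of the documented behaviour, not the library's implementation. parseCollection is a hypothetical name, and the lookup order (element property first, then attributes) is an assumption.

```javascript
// Illustration of the behaviour described for parse; not als-scraper's own code.
function parseCollection(elements, part, fn) {
  const results = []
  for (const element of elements) {
    // Assumption: look on the element itself first, then fall back to its attributes
    const content = part in element ? element[part] : (element.attributes || {})[part]
    // Keep the content only when it exists and the filter function accepts it
    if (content !== undefined && fn(content)) results.push(content)
  }
  return results
}
```

The result holds the selected contents (for example the innerText strings), not the elements themselves.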
Example:
new Document(htmlText).$$('div')
  .parse('innerText', content => content.length > 0)
Building html
To build the html again, use the build method.
Example:
let element = document.$('div')
element.classList.add('another')
element.classList.remove('some')
element.id = 'new-id'
document.build() // returns the new html text
document.build([__dirname,'new-index.html']) // creates a file with the new html text
Request
Request is a class which sends web requests. It has a constructor and 4 request methods:
new Request(url)
.get(fn)
.post(fn,data='')
.put(fn,data)
.delete(fn,data)
// fn = function(data/error,statusCode)
The url parameter has to include http:// or https://.
fn is a function which receives 2 parameters: the response's data or an error, and the status code.
data has to be a string; it is the data to send with the post/put/delete methods.
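Since the url has to carry its scheme, it can help to check it before creating a Request. The sketch below uses the URL class built into Node.js. hasHttpScheme is a hypothetical helper and not part of als-scraper.

```javascript
// Hypothetical helper: checks the "http:// or https://" requirement up front.
function hasHttpScheme(url) {
  try {
    const { protocol } = new URL(url) // WHATWG URL, available globally in Node.js
    return protocol === 'http:' || protocol === 'https:'
  } catch {
    return false // the string is not an absolute URL at all
  }
}
```

A value like www.example.com fails the check because it has no scheme, so it would need http:// or https:// in front before being passed to new Request(url).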
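Example:
let {Request} = require('als-scraper')
new Request('http://www.columbia.edu/~fdc/sample.html').get((data, statusCode) => {
  console.log(data, statusCode)
})
Scraper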
Scraper.
parse(url,selector,fn,part) // fn(data/error,statusCode)
write(url,selector,filePath,part)
Example:
let {Scraper} = require('als-scraper')
let url = 'http://www.columbia.edu/~fdc/sample.html'
let selector = 'div'
let pathForFile = 'example.json'
let part = 'innerText'
Scraper.writeHtml(url,selector,pathForFile,part)
Scraper.parseHtml(url,selector,function(data,status) {
console.log(data,status)
},part)
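If you prefer async/await over callbacks, the callback call from the example above can be wrapped in a Promise. This is a sketch under assumptions. parseHtmlAsync is a hypothetical helper, the method name parseHtml is taken from the example, and it is assumed that a failure arrives as an Error instance in the first callback argument.

```javascript
// Wraps the callback style from the example in a Promise.
// `scraper` is any object exposing parseHtml(url, selector, fn, part).
function parseHtmlAsync(scraper, url, selector, part) {
  return new Promise((resolve, reject) => {
    scraper.parseHtml(url, selector, (data, status) => {
      // Assumption: failures are passed as Error instances in the first argument
      if (data instanceof Error) reject(data)
      else resolve({ data, status })
    }, part)
  })
}
```

It would be used as parseHtmlAsync(Scraper, url, selector, part).then(({data, status}) => console.log(data, status)). If the library reports errors in a different form, the instanceof check has to be adapted.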