puppeteer-for-crawling v0.0.2
Puppeteer for crawling
Adds methods such as attr, html, and getElementsAttribute that Puppeteer lacks by default.
Example:
- Puppeteer:
let description = await page.evaluate( () => {
return document.querySelector('[itemprop="about"]').innerText
})
- Puppeteer + puppeteer-for-crawling:
let description = await page.q('[itemprop="about"]').text();
Install
npm install puppeteer-for-crawling
Usage
const puppeteer = require('puppeteer');
// required only for its side effects; the return value is not used
require('puppeteer-for-crawling');

// await needs an async context, so the example runs inside an async function
(async () => {
    const browser = await puppeteer.launch({
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
        ]
    });
    const page = await browser.newPage();
    await page.goto('https://github.com/tetreum/puppeteer-for-crawling');

    // you can use q for faster coding
    let description = await page.q('[itemprop="about"]').text();
    let readme = await page.q('#readme').html();
    let id = await page.q('meta[name="octolytics-dimension-repository_id"]').attr("content");
    let inputNames = await page.getElementsAttribute('input', "name");
    console.log("Q method:", id, description, inputNames, readme);

    // or keep using puppeteer's selector ($) and call the new methods
    // (reassigned, since description and readme are already declared above)
    description = await (await page.$('[itemprop="about"]')).text();
    readme = await (await page.$('#readme')).html();
    console.log("$ method:", id, description);

    await page.goto('https://github.com/login');
    // fill the login form only if it is visible
    if (await page.q('#login form').isVisible()) {
        await page.fill('#login form', {
            'login': "test",
            'password': "test_password"
        });
    }

    await browser.close();
})();
Added methods
Documentation
class: ElementSelector
New class introduced by this package. An ElementSelector is created with the page.q method.
Unlike page.$, calling page.q alone does not perform any evaluate action, so the element is not requested. It is meant for requesting parts of an element, such as attributes or inner content, rather than the element itself.
This makes crawling code faster to write and easier to maintain:
let $el = await (await page.$('title')).text()
let $el2 = await page.q('title').text()
console.log($el, $el2)
Once you call a method (such as text or attr) on the result of q, the elementHandle inside is cached, so it is not requested again every time you call another method on it.
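The lazy lookup and caching described above can be pictured with a small stand-in. This is only a sketch of the idea, not the package's actual code: LazySelector, fakePage and the lookups counter are hypothetical names, and fakePage mimics just enough of puppeteer's page.$ to run without a browser.

```javascript
// Hypothetical stand-in for a puppeteer page: counts how often $ is called.
let lookups = 0;
const fakePage = {
    async $(selector) {
        lookups++;
        // Stand-in element handle exposing evaluate(), like puppeteer's ElementHandle
        const element = { innerText: 'Hello', title: 'greeting' };
        return { evaluate: async (fn, ...args) => fn(element, ...args) };
    }
};

// Hypothetical sketch of a lazy selector: nothing is requested until a method is called.
class LazySelector {
    constructor(page, selector) {
        this.page = page;
        this.selector = selector;
        this.handlePromise = null; // cached lookup, filled on first use
    }
    handle() {
        if (this.handlePromise === null) {
            this.handlePromise = this.page.$(this.selector);
        }
        return this.handlePromise;
    }
    async text() {
        return (await this.handle()).evaluate(el => el.innerText);
    }
    async prop(name) {
        return (await this.handle()).evaluate((el, key) => el[key], name);
    }
}

const selector = new LazySelector(fakePage, 'h1');
const lookupsAfterCreate = lookups; // creating the selector has requested nothing yet

const done = (async () => {
    const text = await selector.text();         // first call performs the lookup
    const title = await selector.prop('title'); // second call reuses the cached handle
    console.log(text, title, lookups);          // prints: Hello greeting 1
    return { text, title, lookups };
})();
```

Caching the promise rather than the resolved handle means that two methods called back to back still share a single lookup.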
elementSelector.isVisible()
- returns: <boolean>
Checks whether the selected element is visible.
elementSelector.attr(name, val)
Gets or sets the requested attribute.
elementSelector.text()
- returns: <string>
Returns the element's innerText.
elementSelector.prop(name)
Returns the value of the element's property with the given name.
class: Page
From puppeteer.
page.q(selector)
- selector <string> A selector to query frame for
- returns: <ElementSelector>
Prepares a query of the frame for the selector.
page.exists(selector)
Checks whether an element matching the given selector exists.
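For comparison, a similar check can be written with stock Puppeteer, whose page.$ resolves to null when nothing matches. The helper below is a sketch, not this package's implementation: exists and fakePage are hypothetical names, and fakePage stands in for a real page so the snippet runs without a browser.

```javascript
// Hypothetical helper: true when page.$ finds an element for the selector.
async function exists(page, selector) {
    const handle = await page.$(selector);
    return handle !== null;
}

// Stand-in page that only knows one selector.
const fakePage = {
    async $(selector) {
        return selector === '#readme' ? {} : null;
    }
};

const checks = Promise.all([
    exists(fakePage, '#readme'),
    exists(fakePage, '#missing'),
]);
checks.then(([found, missing]) => console.log(found, missing)); // prints: true false
```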
page.getElementsAttribute(selector, attribute)
- selector <string> A selector to query frame for
- attribute <string> Name of the attribute to get from the selector
- returns: <Array<String>>
Gets the attribute values from the elements matching the selector.
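One way to get a comparable result with stock Puppeteer is page.$$eval, which runs a function over every element matching a selector. The snippet is a sketch under that assumption and does not show how this package implements the method; the standalone getElementsAttribute function, makeInput and fakePage are hypothetical, with fakePage replacing a real page so the code runs without a browser.

```javascript
// Hypothetical helper built on puppeteer's page.$$eval(selector, pageFunction, ...args).
function getElementsAttribute(page, selector, attribute) {
    return page.$$eval(
        selector,
        (elements, name) => elements.map(el => el.getAttribute(name)),
        attribute
    );
}

// Stand-in element: a plain object with a DOM-like getAttribute().
const makeInput = attrs => ({
    getAttribute: name => (name in attrs ? attrs[name] : null)
});

// Stand-in page: runs the page function over plain objects instead of DOM nodes.
const fakePage = {
    async $$eval(selector, pageFunction, ...args) {
        const elements = [makeInput({ name: 'login' }), makeInput({ name: 'password' })];
        return pageFunction(elements, ...args);
    }
};

const names = getElementsAttribute(fakePage, 'input', 'name');
names.then(list => console.log(list)); // prints: [ 'login', 'password' ]
```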
page.fill(selector, fields)
- selector <string> A selector to query frame for the form
- fields <Object> Key (field name) <-> value object of fields to set
- returns: <Array<String>>
Sets the given fields in the form matching the selector.
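A form fill of this kind could be approximated with stock Puppeteer's page.type, as sketched below. It assumes each key matches a field's name attribute inside the form, which is a guess based on the parameter description rather than the package's confirmed behaviour. The standalone fill function, fakePage and the typed array are hypothetical names; fakePage records calls instead of driving a browser.

```javascript
// Hypothetical helper: types each value into the field whose name attribute matches the key.
async function fill(page, formSelector, fields) {
    for (const [name, value] of Object.entries(fields)) {
        await page.type(`${formSelector} [name="${name}"]`, value);
    }
}

// Stand-in page that records the calls instead of typing into a browser.
const typed = [];
const fakePage = {
    async type(selector, text) {
        typed.push([selector, text]);
    }
};

const filled = fill(fakePage, '#login form', {
    login: 'test',
    password: 'test_password'
});
filled.then(() => console.log(typed));
```

Typing the fields one after another, rather than in parallel, keeps keystrokes from interleaving between inputs.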
class: ElementHandle
From puppeteer.
elementHandle.isVisible()
- returns: <boolean>
Checks whether the element is visible.
elementHandle.attr(name, val)
Gets or sets the requested attribute.
elementHandle.text()
- returns: <string>
Returns the element's innerText.
elementHandle.prop(name)
Returns the value of the element's property with the given name.
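Adding methods to an existing class is commonly done by extending its prototype. The sketch below illustrates that general technique on a stand-in class and makes no claim about how this package patches Puppeteer. FakeElementHandle is hypothetical; it only mimics the evaluate(pageFunction, ...args) method that puppeteer's ElementHandle provides.

```javascript
// Stand-in for puppeteer's ElementHandle: wraps a plain object and exposes evaluate().
class FakeElementHandle {
    constructor(element) {
        this.element = element;
    }
    async evaluate(pageFunction, ...args) {
        return pageFunction(this.element, ...args);
    }
}

// Hypothetical extension: helpers added to the prototype are available on every handle.
FakeElementHandle.prototype.text = function () {
    return this.evaluate(el => el.innerText);
};
FakeElementHandle.prototype.prop = function (name) {
    return this.evaluate((el, key) => el[key], name);
};

const handle = new FakeElementHandle({ innerText: 'Sign in', tagName: 'BUTTON' });
const values = Promise.all([handle.text(), handle.prop('tagName')]);
values.then(([text, tag]) => console.log(text, tag)); // prints: Sign in BUTTON
```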
You may also want
For fast debugging/printing: https://github.com/tetreum/perfect-print-js