Puppeteer-for-crawling

Puppeteer for crawling

Adds methods such as attr, html and getElementsAttribute that puppeteer lacks by default.

Example:

Puppeteer:

let description = await page.evaluate( () => {
    return document.querySelector('[itemprop="about"]').innerText
})

Puppeteer + puppeteer-for-crawling:

let description = await page.q('[itemprop="about"]').text();

Install

npm install puppeteer-for-crawling

Usage

        const puppeteer = require('puppeteer');

        // Requiring the package is enough, there is nothing to assign
        require('puppeteer-for-crawling');

        // await is only valid inside an async function in CommonJS
        (async () => {
            const browser = await puppeteer.launch({
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox',
                ]
            });
            const page = await browser.newPage();

            await page.goto('https://github.com/tetreum/puppeteer-for-crawling');

            // you can use q for faster coding
            let description = await page.q('[itemprop="about"]').text();
            let readme = await page.q('#readme').html();
            let id = await page.q('meta[name="octolytics-dimension-repository_id"]').attr("content");
            let inputNames = await page.getElementsAttribute('input', "name");

            console.log("Q method:", id, description, inputNames, readme);

            // or keep using puppeteer's selector ($) and call the new methods
            // (reassign instead of redeclaring, and await the method call too)
            description = await (await page.$('[itemprop="about"]')).text();
            readme = await (await page.$('#readme')).html();

            console.log("$ method:", id, description);

            await page.goto('https://github.com/login');

            if (await page.q('#login form').isVisible()) {
                await page.fill('#login form', {
                    'login': "test",
                    'password': "test_password"
                });
            }

            await browser.close();
        })();

Added methods

Documentation

class: ElementSelector

New class introduced by this package. ElementSelector can be created with the page.q method.

Unlike page.$, calling page.q on its own does not run any evaluate action, so the element is not requested. It is meant for requesting parts of an element, such as attributes or inner content, rather than the element itself. This makes crawling code faster and easier to maintain:

let $el = await (await page.$('title')).text()
let $el2 = await page.q('title').text()
console.log($el, $el2)

Once you call a method (such as text or attr) on the result of q, the elementHandle inside is cached, so it is not requested again every time you call another method on it.
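The lazy lookup and caching described above can be illustrated with a minimal sketch. This is not the package's actual ElementSelector: the class name LazySelector and its internals are hypothetical, and it assumes a Puppeteer version where elementHandle.evaluate is available.

```javascript
// Hypothetical sketch of a lazy, caching selector (not the package's code).
class LazySelector {
  constructor(page, selector) {
    this.page = page;
    this.selector = selector;
    this.handlePromise = null; // nothing is requested at construction time
  }

  // Request the element handle on first use and reuse it afterwards.
  handle() {
    if (this.handlePromise === null) {
      this.handlePromise = this.page.$(this.selector);
    }
    return this.handlePromise;
  }

  // Read innerText through the cached handle.
  async text() {
    const el = await this.handle();
    return el.evaluate((node) => node.innerText);
  }

  // Read an attribute through the same cached handle.
  async attr(name) {
    const el = await this.handle();
    return el.evaluate((node, attrName) => node.getAttribute(attrName), name);
  }
}
```

In this sketch, calling text() and then attr() triggers a single page.$ lookup. It does not handle the case where nothing matches the selector, in which page.$ resolves to null.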

elementSelector.isVisible()

returns: <boolean>

Checks whether the selected element is visible.

elementSelector.attr(name, val)

name <string> Attribute's name
val <mixed> Optional attribute's value to set
returns: <null>

Gets the requested attribute, or sets it when val is given.

elementSelector.text()

returns: <string>

Returns the element's innerText.

elementSelector.prop(name)

name <string> Property name
returns: <string>

Returns the value of the element's property with the given name.

class: Page

From puppeteer.

page.q(selector)

selector <string> A selector to query frame for
returns: <ElementSelector>

Prepares a query for the selector without running it against the frame yet.

page.exists(selector)

selector <string> A selector to query frame for
returns: <boolean>

Checks whether the given selector matches an element.
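For comparison, a similar check can be written with plain Puppeteer. The helper name selectorExists is hypothetical and this is only a sketch of the idea, not how the package implements page.exists.

```javascript
// Hypothetical helper, not part of puppeteer-for-crawling.
async function selectorExists(page, selector) {
  // page.$ resolves to null when nothing matches the selector.
  const handle = await page.$(selector);
  return handle !== null;
}
```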

page.getElementsAttribute(selector, attribute)

selector <string> A selector to query frame for
attribute <string> Attribute's name to get from selector
returns: <Array<String>>

Gets the attribute's value from every element matching the selector.
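A rough plain-Puppeteer equivalent can be sketched with page.$$eval. The function name collectAttribute is hypothetical, and the package's own implementation may differ.

```javascript
// Hypothetical helper, not part of puppeteer-for-crawling.
async function collectAttribute(page, selector, attribute) {
  // $$eval runs the callback in the page with all matching elements;
  // extra arguments (here the attribute name) are passed through to it.
  return page.$$eval(
    selector,
    (elements, name) => elements.map((el) => el.getAttribute(name)),
    attribute
  );
}
```

Elements that lack the attribute yield null in the resulting array, since that is what getAttribute returns for them.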

page.fill(selector, fields)

selector <string> A selector to query frame for the form
fields <Object> Key (field name)<->Value object of fields to set
returns: <Array<String>>

Fills the form matched by the selector, setting each named field to the given value.
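Filling a form by field name could be approximated in plain Puppeteer as below. This is a sketch under assumptions: the helper name fillForm is hypothetical, fields are located by their name attribute inside the form, and values are typed with page.type, which may not match what the package does internally.

```javascript
// Hypothetical helper, not part of puppeteer-for-crawling.
async function fillForm(page, formSelector, fields) {
  for (const [name, value] of Object.entries(fields)) {
    // Target the field inside the form by its name attribute.
    const fieldSelector = `${formSelector} [name="${name}"]`;
    // page.type focuses the element and sends the text as key presses.
    await page.type(fieldSelector, String(value));
  }
}
```

Because page.type sends key presses, any text already in a field is kept and the new value is typed in addition to it.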

class: ElementHandle

From puppeteer.

elementHandle.isVisible()

returns: <boolean>

Checks whether the element is visible.

elementHandle.attr(name, val)

name <string> Attribute's name
val <mixed> Optional attribute's value to set
returns: <null>

Gets the requested attribute, or sets it when val is given.

elementHandle.text()

returns: <string>

Returns the element's innerText.

elementHandle.prop(name)

name <string> Property name
returns: <string>

Returns the value of the element's property with the given name.

You may also want

For fast debugging/printing: https://github.com/tetreum/perfect-print-js

