0.0.2 • Published 5 years ago

puppeteer-for-crawling v0.0.2

Puppeteer for crawling

Adds methods such as attr, html, and getElementsAttribute that Puppeteer lacks by default.

Example:

  • Puppeteer:
let description = await page.evaluate( () => {
    return document.querySelector('[itemprop="about"]').innerText
})
  • Puppeteer + puppeteer-for-crawling:
let description = await page.q('[itemprop="about"]').text();

Install

npm install puppeteer-for-crawling

Usage

        const puppeteer = require('puppeteer');

        // loading the package adds the methods documented below to puppeteer's classes
        require('puppeteer-for-crawling');

        // await needs an async function in CommonJS scripts
        (async () => {
            const browser = await puppeteer.launch({
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox',
                ]
            });
            const page = await browser.newPage();

            await page.goto('https://github.com/tetreum/puppeteer-for-crawling');

            // you can use q for faster coding
            let description = await page.q('[itemprop="about"]').text();
            let readme = await page.q('#readme').html();
            let id = await page.q('meta[name="octolytics-dimension-repository_id"]').attr("content");
            let inputNames = await page.getElementsAttribute('input', "name");

            console.log("Q method:", id, description, inputNames, readme);

            // or keep using puppeteer's selector ($) and call the new methods;
            // both the $ call and the new method return promises, so await each
            let description2 = await (await page.$('[itemprop="about"]')).text();
            let readme2 = await (await page.$('#readme')).html();

            console.log("$ method:", description2, readme2);

            await page.goto('https://github.com/login');

            if (await page.q('#login form').isVisible()) {
                await page.fill('#login form', {
                    'login': "test",
                    'password': "test_password"
                });
            }

            await browser.close();
        })();

Added methods

Documentation

class: ElementSelector

New class introduced by this package. ElementSelector can be created with the page.q method.

Unlike page.$, calling page.q on its own does not run an evaluate call or request the element. It is meant for reading parts of an element, such as attributes or inner content, rather than the element itself. This makes crawling code faster to write and easier to maintain:

let $el = await (await page.$('title')).text()
let $el2 = await page.q('title').text()
console.log($el, $el2)

Once you call a method (such as text or attr) on the result of q, the underlying elementHandle is cached, so it is not requested again each time you call another method on it.
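
The lazy lookup and caching described above can be pictured roughly as follows. This is only an illustrative sketch, not the package's source: LazySelector is a hypothetical name, and it assumes nothing beyond Puppeteer's page.$ and elementHandle.evaluate.

```javascript
// Illustrative sketch only: LazySelector is a hypothetical stand-in,
// not the ElementSelector class shipped by the package.
class LazySelector {
    constructor(page, selector) {
        this.page = page;
        this.selector = selector;
        this.handlePromise = null; // nothing is requested at construction time
    }

    // Request the ElementHandle on first use and reuse it afterwards.
    getHandle() {
        if (this.handlePromise === null) {
            this.handlePromise = this.page.$(this.selector);
        }
        return this.handlePromise;
    }

    async text() {
        const handle = await this.getHandle();
        // elementHandle.evaluate runs the callback in the page, passing the element
        return handle.evaluate(el => el.innerText);
    }

    async attr(name) {
        const handle = await this.getHandle();
        return handle.evaluate((el, attrName) => el.getAttribute(attrName), name);
    }
}
```

With this shape, calling text() and then attr() on the same object triggers a single page.$ lookup. The sketch does not handle the case where the selector matches nothing (page.$ resolves to null).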

elementSelector.isVisible()

Checks whether the element matched by the selector is visible.

elementSelector.attr(name, val)

  • name <string> Attribute's name
  • val <mixed> Optional attribute's value to set
  • returns: <null>

Gets the requested attribute, or sets it when val is provided.

elementSelector.text()

Returns element's innerText

elementSelector.prop(name)

Returns the value of the element's property called name.

class: Page

From puppeteer.

page.q(selector)

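Prepares a query for the selector and returns an ElementSelector. The element itself is not requested until a method such as text or attr is called on it.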

page.exists(selector)

Checks whether an element matching the given selector exists.
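
For comparison, an existence check in plain Puppeteer might look like the sketch below. selectorExists is a hypothetical helper written for illustration; it relies only on page.$ resolving to null when nothing matches, and the package's own implementation may differ.

```javascript
// Hypothetical helper, shown only to illustrate what an existence check involves.
async function selectorExists(page, selector) {
    // page.$ resolves to null when no element matches the selector
    const handle = await page.$(selector);
    return handle !== null;
}
```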

page.getElementsAttribute(selector, attribute)

Gets the values of the given attribute from every element matching the selector.
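
A rough plain-Puppeteer equivalent, assuming the method collects one attribute from every matching element, is sketched below. collectAttribute is a hypothetical name used for illustration and is not part of the package; it uses only page.$$eval.

```javascript
// Hypothetical helper, not part of the package.
async function collectAttribute(page, selector, attribute) {
    // page.$$eval runs the callback in the browser over all matching elements;
    // extra arguments (here: attribute) are passed through to the callback
    return page.$$eval(
        selector,
        (elements, attr) => elements.map(el => el.getAttribute(attr)),
        attribute
    );
}
```

Elements that lack the attribute yield null in the resulting array, since getAttribute returns null in that case.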

page.fill(selector, fields)

Fills in the form matched by the selector. fields maps each field name to the value to enter, as in the Usage example.
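
Judging by the Usage example, fill enters values into the fields of a form. A minimal sketch of that idea in plain Puppeteer follows. fillForm is a hypothetical helper, and locating each field by its name attribute is an assumption; the package's fill may resolve fields differently.

```javascript
// Hypothetical helper; the package's own fill may work differently.
async function fillForm(page, formSelector, fields) {
    for (const [name, value] of Object.entries(fields)) {
        // Assumption: each field is identified by its name attribute inside the form
        await page.type(`${formSelector} [name="${name}"]`, String(value));
    }
}
```

Typing the fields one after another, rather than in parallel, keeps focus changes from interleaving keystrokes between inputs.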

class: ElementHandle

From puppeteer.

elementHandle.isVisible()

Checks whether the element is visible.

elementHandle.attr(name, val)

  • name <string> Attribute's name
  • val <mixed> Optional attribute's value to set
  • returns: <null>

Gets the requested attribute, or sets it when val is provided.

elementHandle.text()

Returns element's innerText

elementHandle.prop(name)

Returns the value of the element's property called name.

You may also want

For fast debugging/printing: https://github.com/tetreum/perfect-print-js