1.0.3 • Published 4 years ago
@cpclermont/hq v1.0.3
hq
jq's less sophisticated cousin, but for HTML.
Installation
# with npm
npm install -g @cpclermont/hq
# with yarn
yarn global add @cpclermont/hq
Usage
HTML isn't as well structured as JSON. But we can do some fun stuff with CSS selectors and attribute selectors.
hq commands all start with a CSS selector, whose matches can then optionally be piped into a transformation:
# example settings
url="https://www.webpagetest.org"
cssSelector="head > script[src]:not([defer]):not([async])"
# examples
curl -s "$url" | hq "$cssSelector"
curl -s "$url" | hq "$cssSelector | html"
curl -s "$url" | hq "$cssSelector | text"
curl -s "$url" | hq "$cssSelector | innerHTML"
curl -s "$url" | hq "$cssSelector | attr(src)"
curl -s "$url" | hq "$cssSelector | attr(src, href)"
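To experiment without hitting the network, the same commands can be pointed at a local file. The following is a minimal sketch: the file name and the page contents are made up for illustration, and the hq call is skipped when hq is not installed.

```shell
# Hypothetical fixture: a tiny page with one blocking and two non-blocking scripts.
sample="${TMPDIR:-/tmp}/hq-sample.html"
cat > "$sample" <<'EOF'
<!doctype html>
<html>
<head>
  <script src="/js/blocking.js"></script>
  <script src="/js/deferred.js" defer></script>
  <script src="/js/async.js" async></script>
  <link rel="stylesheet" href="/css/site.css">
</head>
<body><p>Hello</p></body>
</html>
EOF

cssSelector="head > script[src]:not([defer]):not([async])"

# Only /js/blocking.js satisfies the selector: the other two scripts
# carry defer or async and are excluded by the :not() clauses.
if command -v hq >/dev/null 2>&1; then
  hq "$cssSelector | attr(src)" < "$sample"
fi
```

Swapping attr(src) for html, text or innerHTML on the same file is a quick way to compare the transformations side by side.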
Transformations
html (alias: outerHTML) returns the outerHTML of the node.
text returns the text in the node.
innerHTML returns the innerHTML of the node :grimacing:.
attr(attrs) returns the values of the attributes (comma separated). The output of unmatched attributes is skipped; every attribute is output on a new line. This exists so you can scrape URLs from different node types that might live in different attributes.
For example, to list render-blocking scripts and stylesheets in the order they are found:
curl -s https://www.webpagetest.org | hq 'head > script[src]:not([defer]):not([async]), head > link[rel=stylesheet] | attr(src, href)'
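Because attr prints one value per line, its output composes with ordinary line-oriented tools. As a sketch, the helper below (absolutize is a hypothetical name, not part of hq) prefixes root-relative paths with a base URL and leaves everything else untouched. It assumes https for protocol-relative URLs and does not try to resolve other relative forms such as ../foo.js.

```shell
# absolutize BASE: read URLs on stdin, one per line (the shape attr() prints),
# and turn root-relative paths into absolute URLs.
# Hypothetical helper, not part of hq.
absolutize() {
  base="${1%/}"   # drop a single trailing slash from the base URL
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      //*) printf 'https:%s\n' "$line" ;;       # protocol-relative: assume https
      /*)  printf '%s%s\n' "$base" "$line" ;;   # root-relative path
      *)   printf '%s\n' "$line" ;;             # already absolute, or left as-is
    esac
  done
}
```

It can be appended to any hq pipeline that ends in attr, for instance:
curl -s https://www.webpagetest.org | hq 'head > link[rel=stylesheet] | attr(href)' | absolutize https://www.webpagetest.org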
Examples
# Get URLs of scripts that are render blocking
curl -s https://www.webpagetest.org | hq '
head > script[src]:not([defer]):not([async])
| attr(src)'
# Get URLs of scripts and stylesheets that are render blocking
curl -s https://www.webpagetest.org | hq '
head > script[src]:not([defer]):not([async]),
head > link[rel=stylesheet]
| attr(src, href)'
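If a query like this is run often, it can be wrapped in a shell function. A minimal sketch, where render_blocking is a hypothetical name: it reads HTML on stdin and hands hq the same query as the example above.

```shell
# render_blocking: read HTML on stdin and print the URLs of render-blocking
# scripts and stylesheets, one per line. Hypothetical wrapper around hq.
render_blocking() {
  hq '
head > script[src]:not([defer]):not([async]),
head > link[rel=stylesheet]
| attr(src, href)'
}
```

Since the output is one URL per line, counting the render-blocking resources of a page is then:
curl -s https://www.webpagetest.org | render_blocking | wc -l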
Notes
This is scrappy and nowhere near as good as jq. But it's better than nothing :)
License
MIT