webparsy v0.8.4
Fast and lightweight Node.js library and CLI to scrape and interact with websites using Puppeteer (or plain HTTP requests) and YAML definitions
```yaml
version: 1
jobs:
  main:
    steps:
      - goto: https://github.com/marketplace?category=code-quality
      - pdf:
          path: Github_Tools.pdf
          format: A4
      - many:
          as: github_tools
          event: githubTool
          selector: main .col-lg-9.mt-1.mb-4.float-lg-right a.col-md-6.mb-4.d-flex.no-underline
          element:
            - property:
                selector: a
                type: string
                property: href
                as: url
                transform: absoluteUrl
            - text:
                selector: h3.h4
                type: string
                transform: trim
                as: name
            - text:
                selector: p
                type: string
                transform: trim
                as: description
```
This job returns an array with GitHub's tools and creates a PDF. Example output:
```json
{
  "github_tools": [
    {
      "url": "https://github.com/marketplace/codelingo",
      "name": "codelingo",
      "description": "Your Code, Your Rules - Automate code reviews with your own best practices"
    },
    {
      "url": "https://github.com/marketplace/codebeat",
      "name": "codebeat",
      "description": "Code review expert on demand. Automated for mobile and web"
    },
    ...
  ]
}
```
Don't panic. There are examples for all WebParsy features in the examples folder. These are as basic as possible to help you get started.
Contributors ✨
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!
Table of Contents
- Overview
- Browser config
- Output
- Transform
- Types
- Multi-jobs
- Steps
- setContent Sets the HTML markup to assign to the page.
- goto Navigate to a URL
- run Runs a group of steps by its name.
- goBack Navigate to the previous page in history
- screenshot Takes a screenshot of the page
- pdf Takes a pdf of the page
- text Gets the text for a given CSS selector
- many Returns an array of elements given their CSS selectors
- title Gets the title for the current page.
- form Fill and submit forms
- html Return HTML code for the page or a DOM element
- click Click on an element (CSS and xPath selectors)
- url Return the current URL
- type Types text (key events) into a given selector
- waitFor Wait for selectors, time, functions, etc before continuing
- keyboardPress Simulates the press of a keyboard key
- scrollTo Scroll to bottom, top, x, y, selector, xPath before continuing
- scrollToEnd Scrolls to the very bottom (infinite scroll pages)
Overview
You can use WebParsy either as cli from your terminal or as a NodeJS library.
CLI
Install webparsy and run a job definition:
```shell
$ npm i webparsy -g
$ webparsy example/_weather.yml --customFlag "custom flag value"
```
Result:
```json
{
  "title": "Madrid, España Pronóstico del tiempo y condiciones meteorológicas - The Weather Channel | Weather.com",
  "city": "Madrid, España",
  "temp": 18
}
```
Library
```javascript
const webparsy = require('webparsy')
const parsingResult = await webparsy.init({
  file: 'jobdefinition.yml',
  flags: { ... } // optional
})
```
Methods
init(options)
options:
One of `yaml`, `file` or `string` is required.
- `yaml`: A yaml npm module instance of the scraping definition.
- `string`: The YAML definition, as a plain string.
- `file`: The path to the YAML file containing the scraping definition.
Additionally, you can pass a `flags` object property to input additional values to your scraping process.
Browser config
You can set up Chrome's details in the browser property within the main job.
None of the following settings are required.
```yaml
jobs:
  main:
    browser:
      width: 1200
      height: 800
      scaleFactor: 1
      timeout: 60
      delay: 0
      headless: true
      executablePath: ''
      userDataDir: ''
      keepOpen: false
```
- executablePath: If provided, webparsy will launch Chrome from the specified path.
- userDataDir: If provided, webparsy will launch Chrome with the specified user profile.
Output
In order for WebParsy to extract contents, it needs some very basic details. These are:
- `as`: the property you want to be returned
- `selector`: the CSS selector to extract the HTML or text from
Other optional settings are:
- `parent`: get the parent of the element, filtered by a selector.
Example
```yaml
- text:
    selector: .entry-title
    as: entryLink
    parent: a
```
Transform
When you extract texts from a web page, you might want to transform the data before returning it. example
You can use the following `transform` methods:
- `uppercase`: transforms the result to uppercase
- `lowercase`: transforms the result to lowercase
- `absoluteUrl`: returns the absolute URL for a link
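For instance, a minimal sketch of a step that extracts a relative link and returns it as an absolute URL (the selector and property names here are illustrative, not from the original examples):
```yaml
- property:
    selector: a.next-page   # hypothetical selector
    property: href
    transform: absoluteUrl  # resolve the href against the current page URL
    as: nextPage
```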
Types
When extracting details from a page, you might want them to be returned in different formats, for example as a number when grabbing temperatures. example
You can use the following values for `type`:
- `string`
- `number`
- `integer`
- `float`
- `fcd`: transforms to float a string-number that uses commas for thousands
- `fdc`: transforms to float a string-number that uses dots for thousands
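As a sketch, a text step that returns a temperature as a number rather than a string (the `.current-temp` selector is hypothetical):
```yaml
- text:
    selector: .current-temp  # hypothetical selector
    type: number             # cast the extracted text to a number
    as: temp
```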
Multi-jobs support
You can define groups of steps (jobs) that you can reuse at any moment during a scraping process.
For example, let's say you want to sign up twice on a website. You will have a "main" job (which executes by default), but you can create an additional one called "signup" that you can reuse in the "main" one.
```yaml
version: 1
jobs:
  main:
    steps:
      - goto: https://example.com/
      - run: signup
      - click: '#logout'
      - run: signup
  signup:
    steps:
      - goto: https://example.com/register
      - form:
          selector: "#signup-user"
          submit: true
          fill:
            - selector: '[name="username"]'
              value: jonsnow@example.com
```
Steps
Steps are the list of things the browser must do.
setContent
Sets the HTML markup to assign to the page.
Setting a string:
```yaml
- setContent:
    html: Hello!
```
Loading the HTML from a file:
```yaml
- setContent:
    file: myMarkup.html
```
Loading the HTML from an environment variable:
```yaml
- setContent:
    env: MY_MARKUP_ENVIRONMENT_VARIABLE
```
Loading the HTML from a flag:
```yaml
- setContent:
    flag: markup
```
goto
URL to navigate the page to. The URL should include the scheme, e.g. https://. example
```yaml
- goto: https://example.com
```
You can also tell WebParsy not to use Puppeteer to browse, and instead make a direct HTTP request via got. This performs much faster, but it may not be suitable for websites that require JavaScript. simple example / extended example
Note that some methods (for example: form, click and others) will not be available if you are not browsing using puppeteer.
```yaml
- goto:
    url: https://google.com
    method: got
```
You can also tell WebParsy which URLs it should visit via flags (available via CLI and library). Example:
```yaml
- goto:
    flag: websiteUrl
```
You can then call webparsy as:
```shell
webparsy definition.yaml --websiteUrl "https://google.com"
```
or
```javascript
webparsy.init({
  file: 'definition.yml',
  flags: { websiteUrl: 'https://google.com' }
})
```
Authentication
You can perform basic HTTP authentication by providing the user and password as in the following example:
```yaml
- goto:
    url: http://example.com
    method: got
    authentication:
      type: basic
      username: my_user
      password: my_password
```
run
Runs a group of steps by its name.
```yaml
- run: signupProcess
```
goBack
Navigate to the previous page in history. example
```yaml
- goBack
```
screenshot
Takes a screenshot of the page. This triggers puppeteer's page.screenshot. example
```yaml
- screenshot:
    path: Github.png
```
If you are using WebParsy as a NodeJS module, you can also get the screenshot returned as a Buffer by using the as property.
```yaml
- screenshot:
    as: myScreenshotBuffer
```
pdf
Takes a pdf of the page. This triggers puppeteer's page.pdf.
```yaml
- pdf:
    path: Github.pdf
```
If you are using WebParsy as a NodeJS module, you can also get the PDF file returned as a Buffer by using the as property.
```yaml
- pdf:
    as: pdfFileBuffer
```
title
Gets the title for the current page. If no output.as property is defined, the page's title will be returned as { title }. example
```yaml
- title
```
many
Returns an array of elements given their CSS selectors. example
Example:
```yaml
- many:
    as: articles
    selector: main ol.articles-list li.article-item
    element:
      - text:
          selector: .title
          as: title
```
When you scrape large amounts of content, you might end up consuming hordes of RAM, your system might become slow, and the scraping process might fail.
To prevent this, WebParsy lets you use process events so you can receive the scraped contents as they are scraped, instead of storing them in memory and waiting for the whole process to finish.
To do this, simply add an event property to many, with the name of the event you want to listen to. The event will contain each scraped item.
event will give you the data as it's being scraped. To prevent it from being stored in memory, set eventMethod to discard.
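A sketch of the articles example above, modified so each item is emitted through an event and discarded from memory instead of being accumulated (the event name `articleScraped` is illustrative):
```yaml
- many:
    as: articles
    event: articleScraped    # hypothetical event name to listen to
    eventMethod: discard     # don't keep scraped items in memory
    selector: main ol.articles-list li.article-item
    element:
      - text:
          selector: .title
          as: title
```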
form
Fill and submit forms. example
Form filling can use values from environment variables. This is useful if you want to keep a user's login details secret. If this is your case, instead of specifying the value as a string, set it as the env property for value. Check the example below or refer to the banking example.
Example:
```yaml
- form:
    selector: "#tsf" # form selector
    submit: true # Submit after filling all details
    fill: # array of inputs to fill
      - selector: '[name="q"]' # input selector
        value: test # input value
```
Using environment variables
```yaml
- form:
    selector: "#login" # form selector
    submit: true # Submit after filling all details
    fill: # array of inputs to fill
      - selector: '[name="user"]' # input selector
        value:
          env: USERNAME # process.env.USERNAME
      - selector: '[name="pass"]'
        value:
          env: PASSWORD # process.env.PASSWORD
```
html
Gets the HTML code. If no selector is specified, it returns the page's full HTML code. If no output.as property is defined, the result will be returned as { html }. example
Example:
```yaml
- html:
    as: divHtml
    selector: div
```
click
Click on an element. example
Example:
Default behaviour (CSS selector)
```yaml
- click: button.click-me
```
Same as
```yaml
- click:
    selector: button.click-me
```
By xPath (clicks on the first match)
```yaml
- click:
    xPath: '/html/body/div[2]/div/div/div/div/div[3]/span'
```
type
Sends a keydown, keypress/input, and keyup event for each character in the text.
Example:
```yaml
- type:
    selector: input.user
    text: jonsnow@example.com
    options:
      delay: 4000
```
url
Return the current URL.
Example:
```yaml
- url:
    as: currentUrl
```
waitFor
Wait for specified CSS or XPath selectors, a function, or a specific amount of time before continuing. example
Examples:
```yaml
- waitFor:
    selector: "#search-results"
```
```yaml
- waitFor:
    xPath: "/html/body/div[1]/header/div[1]/a/svg"
```
```yaml
- waitFor:
    function: "console.log(Date.now())"
```
```yaml
- waitFor:
    time: 1000 # Time in milliseconds
```
keyboardPress
Simulates the press of a keyboard key. extended docs
```yaml
- keyboardPress:
    key: 'Enter'
```
scrollTo
Scroll to specified CSS or XPath selectors, to bottom/top, or to a specified x/y value before continuing. example
Examples:
```yaml
- scrollTo:
    top: true
```
```yaml
- scrollTo:
    bottom: true
```
```yaml
- scrollTo:
    x: 340
```
```yaml
- scrollTo:
    y: 500
```
```yaml
- scrollTo:
    selector: "#search-results"
```
```yaml
- scrollTo:
    xPath: "/html/body/div[1]/header/div[1]/a/svg"
```
scrollToEnd
Scrolls to the very bottom (infinite scroll pages). example
This accepts three settings:
- step: how many pixels to scroll each time. Default is 10.
- max: the maximum number of pixels to scroll down, so you are not waiting forever on never-ending infinite scroll pages. Default is 9999999.
- sleep: how long to wait between scrolls, in milliseconds. Default is 100.
Examples:
```yaml
- scrollToEnd
```
```yaml
- scrollToEnd:
    step: 300
    sleep: 1000
    max: 300000
```
License
MIT © Jose Constela
