scrape-them-all v2.0.0

Weekly downloads: 8
License: MIT
Repository: github
Last release: 3 years ago

Scrape-Them-All is a Cheerio layer which improves your scraping experience.

This package is recent. If you have any suggestions or notice that something is not working, feel free to open an issue or a pull request; we will be happy to respond as soon as possible.


📦 Installation

# Using NPM
npm install --save scrape-them-all
npm install --save fetch-cookie #optional

# Using Yarn
yarn add scrape-them-all
yarn add fetch-cookie #optional

fetch-cookie is only required if you plan to use the cookieJar option on requests.

⚠ If you get a "too many redirects" error when scraping, we recommend installing fetch-cookie and setting the option cookieJar: true in your request. You can also pass an instance of tough.CookieJar to this option.

Example:

scrapeTA({ url: 'https://google.com', cookieJar: true }, ...)

📚 Documentation

scrapeTA(query, schema)

Params:

  • query (String or Object): the page URL, or an object containing the page URL and node-fetch options.
  • schema (Object): the elements to scrape and the corresponding selectors.

Returns:

  • Promise<Object>: a promise that resolves with the scraped result as a JSON-compatible object.
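
As a rough illustration of the two shapes query can take, here is a minimal sketch. splitQuery is a hypothetical helper written for this example, not part of scrape-them-all, and it assumes (as the cookieJar example suggests) that the object form carries the page address under url next to the other request options.

```javascript
// Hypothetical helper for illustration only; not part of scrape-them-all.
// Assumption: the object form carries the page address under `url`.
function splitQuery(query) {
  if (typeof query === 'string') {
    // String form: just the page URL, no extra request options.
    return { url: query, options: {} }
  }
  // Object form: pull out `url`, keep everything else as request options.
  const { url, ...options } = query
  return { url, options }
}

const fromString = splitQuery('https://example.com')
const fromObject = splitQuery({
  url: 'https://example.com',
  headers: { 'User-Agent': 'my-scraper' } // `headers` is a standard node-fetch option
})

console.log(fromString)
console.log(fromObject)
```

Either form points at the same page; the object form is only needed when request options such as headers or cookieJar have to be set.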

Schema options

| Option | Type | Description |
| --- | --- | --- |
| selector | String or Object | Can be a string expression, DOM element, array of DOM elements, or Cheerio object. |
| trim | Boolean | Trim whitespace in the result. Defaults to true. |
| attribute | String | Return the value of the indicated attribute on the selected element. |
| accessor | String or Function | Cheerio access method name (such as html to return the HTML code) or a custom function that takes a Cheerio instance as its first parameter. |
| transformer | Function | Receives the current value of the selected item as its first parameter. Can return a Promise. |
| listModel | Object | Contains the options stated above, applied to each item when the selection is a list. |
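
To make the interplay between attribute, accessor, trim and transformer more concrete, here is a minimal sketch. It is not the library's implementation: resolveField and fakeSelection are hypothetical names, the order in which the options are applied is an assumption, and so is the use of text as the default accessor. The stub stands in for a Cheerio selection so that the example runs without any dependency.

```javascript
// Illustrative sketch only: NOT the library's implementation.
// `selected` is anything exposing Cheerio-like text(), html() and attr().
// Assumptions: options are applied in this order, and `text` is the default accessor.
async function resolveField(selected, options = {}) {
  const { attribute, accessor = 'text', trim = true, transformer } = options

  let value
  if (attribute) {
    value = selected.attr(attribute) // read an attribute of the element
  } else if (typeof accessor === 'function') {
    value = accessor(selected) // custom accessor receives the selection
  } else {
    value = selected[accessor]() // accessor given as a method name
  }

  if (trim && typeof value === 'string') value = value.trim()

  // The transformer may be sync or async; await covers both cases.
  return transformer ? await transformer(value) : value
}

// Stub standing in for a Cheerio selection.
const fakeSelection = {
  text: () => '  10.99  ',
  html: () => '<b> 10.99 </b>',
  attr: (name) => (name === 'data-currency' ? 'EUR' : undefined)
}

resolveField(fakeSelection, { transformer: (value) => parseFloat(value) })
  .then((price) => console.log(price))
```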

Example output

{
    "title": "An amazing game",
    "description": "<p>With an amazing description</p>",
    "image": "https://amazing.game/image.jpg",
    "price": 10.99,
    "users": [
        {
            "username": "Tanuki",
            "badges": [
                { "name": "An amazing player" },
                ...
            ]
        },
        ...
    ]
}

The code that goes with it

const { ScrapeTA } = require('scrape-them-all')
ScrapeTA('url_or_https_options', {
  title: '.header h1',
  description: {
    selector: '.header p',
    accessor: 'html',
    //  accessor: selected => selected.html(),
    trim: false
  },
  image: {
    selector: 'img',
    attribute: 'src'
  },
  price: {
    selector: '.footer #price',
    transformer: (value) => parseFloat(value)
  },
  users: {
    selector: '.body .users',
    listModel: {
      username: '.username',
      badges: {
        selector: '.badges',
        listModel: {
          name: '.badgeName'
        }
      }
    }
  }
})
  .then((data) => console.log(data))
  .catch((error) => console.error(error))
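
One thing to watch with the price transformer above: parseFloat returns NaN when the text starts with a non-numeric character such as a currency symbol. A slightly more defensive transformer could look like the sketch below. parsePrice is a hypothetical name, not part of the library, and the sketch assumes prices use a dot as the decimal separator.

```javascript
// Hypothetical transformer for illustration; not part of scrape-them-all.
// Assumption: prices use a dot as the decimal separator (e.g. "$1,299.00").
function parsePrice(value) {
  // Keep only digits, dots and minus signs; this drops currency symbols,
  // spaces, letters and thousands separators written as commas.
  const cleaned = String(value).replace(/[^0-9.-]/g, '')
  const parsed = parseFloat(cleaned)
  // Return null instead of NaN so missing prices are easy to detect.
  return Number.isNaN(parsed) ? null : parsed
}

console.log(parseFloat('$10.99'))
console.log(parsePrice('$10.99'))
console.log(parsePrice('Free'))
```

It can then be passed in the schema in place of the inline function, for example transformer: parsePrice.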

💪 Contributions

TODO


📜 License

MIT © Tanuki, Aperrix.